Building a More Efficient NetworkTransform


I’ve been using Unity’s free networking solution, UNET, in an RTS. On the whole, it works, but it doesn’t work especially well. Since UNET has to support many different types of games, the choices made by the developers lean towards versatility and flexibility, rather than efficiency.

Case in point: the NetworkTransform. It supports many things out-of-the-box, including interpolation, rigidbodies, and variable send rate, but it makes tradeoffs in the efficiency department. Every time it syncs a transform, it’s sending position and rotation uncompressed. With 3 floats for position and 3 floats for rotation, that’s 24 bytes every time a sync happens. The entire Unity networking library is open source, so you can analyze the NetworkTransform yourself.

There are two reasons for wanting to reduce the bandwidth of the NetworkTransform:

  1. The Unity Matchmaking Service enforces a per-second bandwidth limit of 4 KB (4096 bytes) in pre-production mode. \(4096 \div 24\) is about 170. Assuming 10 updates a second, that means only 17 NetworkTransforms syncing at once - with no other traffic at all.
  2. Reducing the amount of bandwidth allows us to push the send rate higher than 10x a second, reducing the amount of interpolation needed along with the perceived latency.

A couple of notes before we get started. First, a lot of this article is based on Glenn Fiedler’s (Gaffer’s) snapshot compression article, which is applicable whether or not you’re using Unity. I’m going to explain a few concepts from the article for completeness, but you should familiarize yourself with it before reading on.

Second, you should be familiar with bitwise operations. Since we’re trying to save as much bandwidth as possible, we’ll be hand-packing bits.

Initial Measurements

Just like in Gaffer’s article, we’ll want to measure our results. Here’s the code I’m using to measure the bandwidth being sent:

int oldBytes = 0;
// One sample per frame; at roughly 60 fps, 60 samples cover
// about the last second of traffic
private int[] buffer = new int[60];
int index = 0;

int bytesSentInLastSecond()
{
    int newBytes = NetworkTransport.GetOutgoingFullBytesCount();
    int newSample = newBytes - oldBytes;
    oldBytes = newBytes;

    buffer[index] = newSample;
    index = (index + 1) % buffer.Length;

    int sum = 0;
    foreach (int sample in buffer)
    {
        sum += sample;
    }

    return sum;
}

Let’s run this measurement against the NetworkTransform to get a baseline. Here are the settings I have on my NetworkTransform:

Network transform inspector

Note that I’ve changed the send rate to 20 times a second - I’m hoping to get an acceptable bandwidth at this send rate. With these settings, here’s what kind of traffic I’m getting:


Looks like we’re getting about 1 KB/s. Accounting for the ambient ~50 bytes/sec, that’s around 950 bytes/sec for just one unit! That’s obviously a lot, but the number isn’t quite accurate; it includes the HLAPI overhead for sending a packet. When multiple units’ updates are sent in a single packet, it’s a bit more efficient:


We can try to estimate what that overhead is by solving some equations:

\[x + c = 1000 \\ 4x + c = 3000\]

Subtracting the first equation from the second gives \(3x = 2000\). It looks like the overhead is about 300 bytes/sec, and the bandwidth taken by a single unit is around 650 bytes/sec. Let’s see how much we can improve on that.

Quantization and Compression

First, I’ll explain the quantization technique from Gaffer’s article. Quantization is a little tricky to understand if you’ve never seen it before. In essence, we assume that a value lies within some range, and then approximate the value by representing it with a discrete point within that range.

Let’s take an example. Say we wanted to quantize a value from sin. We know the function sin can only produce values within the range \([-1, 1]\). Furthermore, say we want a precision of 0.1; that is, we want our result values (remember they’re approximate) to be within 0.1 of the actual value we sent. That means we need at least \(2 \div 0.1 = 20\) discrete values, represented by integers. We’ll make 0 represent -1, 1 represent -0.9, 2 represent -0.8, and so on.

So what if we wanted to represent a number like \(0.12\), which can’t be represented exactly using those values? Here’s the procedure we use.

  1. Subtract the bottom of the range from the number: \(0.12 - (-1) = 1.12\)
  2. Divide by the size of the range: \(1.12 \div 2 = 0.56\)
  3. Multiply by the number of discrete values: \(0.56 \times 20 = 11.2\)
  4. Round to the nearest integer: \(11\)

To get back the approximate value, we apply the operations in reverse:

  1. Divide the quantized number by the number of discrete values: \(11 \div 20 = 0.55\)
  2. Multiply by the size of the range: \(0.55 \times 2 = 1.1\)
  3. Add the bottom of the range: \(1.1 + (-1) = 0.1\)

The value we got back, \(0.1\), is not exactly the value we started with, \(0.12\), but it meets our precision requirements (within 0.1 of the initial value).
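That round trip can be sketched in a few lines. (Python here, purely for illustration, and the `quantize`/`dequantize` names are mine; the C# version used in the project appears below.)

```python
def quantize(value, lo, hi, steps):
    """Map a float in [lo, hi] onto an integer in 0..steps."""
    return round((value - lo) / (hi - lo) * steps)

def dequantize(q, lo, hi, steps):
    """Map the integer back to an approximate float in [lo, hi]."""
    return q / steps * (hi - lo) + lo

q = quantize(0.12, -1.0, 1.0, 20)       # the example above: 11
approx = dequantize(q, -1.0, 1.0, 20)   # approximately 0.1
```

As promised, the reconstructed value lands within our 0.1 precision target of the original.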

So what does this give us? We know that our approximated value only has 20 possible values, meaning we can represent it in just 5 bits (\(2^4 = 16\), which is too few, but \(2^5 = 32\)) - even less than a single byte. Compare that to four bytes for a single float value. Of course, if we wanted more precision, we’d have to use more bits. 8 bits would give us a precision of 0.01 (\(2 \div 2^8 = 0.008\)).

Now, back to our position value. Unfortunately, we can’t save as much with the position vector, since the ranges are larger. For position values in our RTS, the range \([-512.0, 512.0]\) is acceptable. 16 bits almost achieves a precision of 0.01: \(1024 / 2^{16} = 0.015625\). That means we’ll store each float of a Vector3 in two bytes instead of four - a 50% savings.
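As a quick sanity check on those numbers (a throwaway Python calculation, nothing project-specific):

```python
span = 512.0 - (-512.0)   # world-unit range we chose: 1024 units
steps = 2 ** 16           # values representable in 16 bits
step_size = span / steps  # 0.015625 world units per step
# rounding to the nearest step halves the worst-case error
worst_case_error = step_size / 2
```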

Here’s my code for quantization:

// Converts a float value into an unsigned integer value.
// You can think of this as linearly interpolating the range [min, max]
// onto [0, 2^bitLength - 1]; rounding to the nearest step keeps the
// maximum value within the bit budget.
uint Quantize(float value, float min, float max, int bitLength)
{
    float normalized = (value - min) / (max - min);
    uint steps = (1u << bitLength) - 1;
    return (uint)Math.Round(normalized * steps);
}

To quantize a Vector3, we can just quantize all of its components. To unquantize, we just do the reverse of those operations, and I’ll leave that up to you to implement.

Compressing Quaternions

We could go ahead and apply quantization to our quaternion components too; after all, a quaternion is nothing but four floats bounded to \([-1.0, 1.0]\). However, Gaffer points out that there are even better ways to compress a quaternion.

His first trick is to shrink four floats to three. Since a quaternion must satisfy \(x^2 + y^2 + z^2 + w^2 = 1\), we can drop one of the components and recompute it on the fly when reconstructing the quaternion. For example, if we drop \(x\), we can send \(y\), \(z\), and \(w\) from the serverside, and then compute \(x = \sqrt{1 - (y^2 + z^2 + w^2)}\) on the clientside.

One wrinkle in that plan: sqrt always returns a positive result, but \(x\) may be negative. To remove the need for a sign bit, Gaffer notes that quaternions have a special property: negating all four components of a quaternion yields a quaternion representing the same rotation. That means if \(x\) is negative, we can just negate the entire quaternion and send that instead.

The second trick is, instead of always leaving off \(x\), to leave off the component with the greatest magnitude. This costs us some bits up front, because we now need to tell the client whether it was \(x\), \(y\), \(z\), or \(w\) we dropped. However, it buys back precision: since the dropped component has the greatest magnitude, each remaining component can be at most \(1/\sqrt{2} \approx 0.707\) in absolute value (again by \(x^2 + y^2 + z^2 + w^2 = 1\)), so we can bound them to \([-0.707, 0.707]\). Telling the client which component we removed takes two bits, since there are four possibilities: we represent \(x\) with 0, \(y\) with 1, \(z\) with 2, and \(w\) with 3.

With these tricks, we’ve now encoded four floats as three components quantized to 9 bits each, plus a two-bit index for the dropped component. The total comes to 29 bits - four bytes, down from sixteen!
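Putting the two tricks together, the encode/decode pair can be sketched like this. (Python for illustration only; the function names are mine, and the 0.707 bound and 9-bit width match the C# code below.)

```python
import math

def encode_smallest_three(x, y, z, w, bits=9):
    """Drop the largest-magnitude component (negating the whole
    quaternion first if that component is negative), then quantize
    the remaining three to [-0.707, 0.707]."""
    comps = [x, y, z, w]
    largest = max(range(4), key=lambda i: abs(comps[i]))
    if comps[largest] < 0:
        comps = [-c for c in comps]
    rest = [c for i, c in enumerate(comps) if i != largest]
    steps = (1 << bits) - 1
    return largest, [round((c + 0.707) / 1.414 * steps) for c in rest]

def decode_smallest_three(largest, quantized, bits=9):
    """Dequantize the three sent components, then recompute the
    dropped one from x^2 + y^2 + z^2 + w^2 = 1 (always non-negative)."""
    steps = (1 << bits) - 1
    rest = [q / steps * 1.414 - 0.707 for q in quantized]
    dropped = math.sqrt(max(0.0, 1.0 - sum(c * c for c in rest)))
    rest.insert(largest, dropped)
    return tuple(rest)
```

Round-tripping a unit quaternion through this pair reproduces it to within a few thousandths per component - up to the overall sign, which doesn’t matter for rotations.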

Here’s my code for all that:

// Encodes a quaternion into a bit array.
// This avoids having to encode the entire quaternion by using tricks outlined in
// the compression article linked above. This reduces the size from 16 bytes to 4.
public List<bool> EncodeQuaternion(Quaternion quat)
{
    // Figure out which component is the greatest
    List<float> components = new List<float> { quat.x, quat.y, quat.z, quat.w };
    int greatestIndex = 0;
    for (int i = 1; i < components.Count; i++)
    {
        if (Math.Abs(components[i]) > Math.Abs(components[greatestIndex]))
        {
            greatestIndex = i;
        }
    }

    // Get rid of it and figure out if we need to negate the quaternion
    float dComp = components[greatestIndex];
    components.RemoveAt(greatestIndex);

    if (dComp < 0)
    {
        for (int i = 0; i < components.Count; i++)
        {
            components[i] = -components[i];
        }
    }

    // Quantize each of the remaining components
    // 9 bits gives us a precision of 0.002, which might be more than we need;
    // feel free to play around with that value
    uint a = Quantize(components[0], -0.707f, 0.707f, 9);
    uint b = Quantize(components[1], -0.707f, 0.707f, 9);
    uint c = Quantize(components[2], -0.707f, 0.707f, 9);

    // EncodeUnsigned is defined above, in the previous section
    List<bool> x_e = EncodeUnsigned(a, 9);
    List<bool> y_e = EncodeUnsigned(b, 9);
    List<bool> z_e = EncodeUnsigned(c, 9);

    // Create the encoded quaternion, appending the two bits representing the
    // dropped component first (2 bits + 3 components x 9 bits = 29 bits)
    List<bool> arr = new List<bool>(2 + 9 * 3);
    arr.Add((greatestIndex & 1) > 0);
    arr.Add((greatestIndex & 2) > 0);
    arr.AddRange(x_e);
    arr.AddRange(y_e);
    arr.AddRange(z_e);

    return arr;
}

I’ll again leave writing the decoder as an exercise for the reader.

Writing values out to the client

Okay, so we have those values, pretty well compressed. If we send both position and rotation, we’ll be sending 10 bytes every sync, down from 24! Pretty good compression.

Here’s where my article deviates from Gaffer’s, and starts to get into Unity specifics on how to send the resulting values. First, we need a way to encode our quantized integer values from the position and rotation components. Here’s my method for doing so.

// Encodes the lowest `length` bits of an unsigned int into a bit array,
// least-significant bit first.
public List<bool> EncodeUnsigned(uint n, int length)
{
    List<bool> arr = new List<bool>(length);
    uint bits = 1;
    for (int i = 0; i < length; i++)
    {
        arr.Add((n & bits) > 0);
        bits <<= 1;
    }

    return arr;
}
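The matching decoder just reverses this, reading least-significant bit first. Here’s a Python sketch of the idea (names are mine):

```python
def encode_unsigned(n, length):
    """Emit `length` bits of n, least-significant bit first."""
    return [(n >> i) & 1 == 1 for i in range(length)]

def decode_unsigned(bits):
    """Rebuild the integer from a least-significant-bit-first bit list."""
    n = 0
    for i, bit in enumerate(bits):
        if bit:
            n |= 1 << i
    return n
```

For example, `decode_unsigned(encode_unsigned(300, 9))` gives back 300, and any value that fits in the bit length survives the round trip.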

So to encode a Vector3, we first quantize each of its components, and then encode the resulting value. How, then, do we send that to the client?

Structure of a NetworkedTransform

The overall structure of the RtsNetworkTransform is based on the Unity NetworkTransform. The general procedure on the server is this:

  1. In FixedUpdate, if position/rotation has changed, set the appropriate dirty flags.
  2. In OnSerialize, check which (if any) of position/rotation has changed, and write them using the NetworkWriter.

Here’s the FixedUpdate:

void FixedUpdateServer()
{
    uint dirtyBit = syncVarDirtyBits;

    if (Vector3.SqrMagnitude(lastSentPosition - transform.position) >= 0.01)
    {
        dirtyBit |= (uint)DirtyBits.Position;
    }

    if (Quaternion.Angle(lastSentRotation, transform.rotation) >= 0.1f)
    {
        dirtyBit |= (uint)DirtyBits.Rotation;
    }

    SetDirtyBit(dirtyBit);
}

Note that SetDirtyBit, despite its name, just assigns to syncVarDirtyBits, so we can’t call SetDirtyBit(1) in the if statement or else we might overwrite a previous value.

In OnSerialize, we can check if position or rotation needs to get sent out, and write them with the NetworkWriter. The implementation of EncodeQuaternion is above; EncodeVector3 just calls Quantize and EncodeUnsigned three times. Note that the two dirty bits we’re writing are essentially free, since our quaternion is stored in 29 bits and we can only send bytes.

public override bool OnSerialize(NetworkWriter writer, bool initialState)
{
    List<bool> boolArr = new List<bool>();
    bool encodePosition = (initialState || (syncVarDirtyBits & (uint)DirtyBits.Position) > 0);
    bool encodeRotation = (initialState || (syncVarDirtyBits & (uint)DirtyBits.Rotation) > 0);
    boolArr.Add(encodePosition);
    boolArr.Add(encodeRotation);

    if (encodePosition)
    {
        boolArr.AddRange(EncodeVector3(transform.position, positionQuantizeParams));
        lastSentPosition = transform.position;
    }

    if (encodeRotation)
    {
        boolArr.AddRange(EncodeQuaternion(transform.rotation));
        lastSentRotation = transform.rotation;
    }

    // NetworkWriter has no methods for dealing with bits, so we need to transform to
    // a byte array first before sending
    int byteCount = (boolArr.Count + 8 - 1) / 8;
    byte[] bytesToSend = new byte[byteCount];
    BitArray finalBitArray = new BitArray(boolArr.ToArray());
    finalBitArray.CopyTo(bytesToSend, 0);

    writer.Write(bytesToSend, byteCount);

    // OnSerialize returns whether anything was written
    return true;
}

On the client, OnDeserialize does the exact opposite. It reads the first two bits to check what’s in the payload. Then, it calls DecodeVector3 if the first bit is set, and DecodeQuaternion if the second is set. The only tricky thing to keep in mind here is that the NetworkReader offset may not start at zero; to read all the bytes of the payload, you’ll want to call:

byte[] bytes = reader.ReadBytes(reader.Length - (int)reader.Position);
BitArray bitArray = new BitArray(bytes);
List<bool> bits = new List<bool>(bitArray.Cast<bool>());
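From there, deserialization is just slicing fixed-width fields off the front of that bit list. In sketch form (Python for illustration; `split_payload` and the slicing are my own, with the widths - 3 × 16 position bits and 29 rotation bits - matching the serializer above):

```python
def split_payload(bits):
    """Peel the two flag bits off the front, then slice out the
    fixed-width position (3 x 16 bits) and rotation (29 bits) fields."""
    has_position, has_rotation = bits[0], bits[1]
    offset = 2
    position_bits = rotation_bits = None
    if has_position:
        position_bits = bits[offset:offset + 48]
        offset += 48
    if has_rotation:
        rotation_bits = bits[offset:offset + 29]
    return position_bits, rotation_bits
```

Any trailing padding bits left over from the byte conversion are simply ignored.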

One last thing: as I said near the top of the post, the target send rate is 20 times a second. To achieve that with our custom RtsNetworkTransform, we’ll want to override the GetNetworkSendInterval method:

public override float GetNetworkSendInterval()
{
    return 1 / 20.0f;
}

Results

At this point, we’re synchronizing position and rotation, just like the UNET NetworkTransform. How is our bandwidth doing, in the end? Let’s find out.


As you can see, there’s a marked improvement; from 950 bytes/sec, we’re down to about 650 bytes/sec for one unit when we’re sending both rotation and position. This is about in line with what we’d expect; we’re saving \(24 - 10 = 14\) bytes every send, and with 20 sends per second, that’s \(14 \times 20 = 280\) bytes per second. We save even more when the unit’s moving in a straight line, i.e. we don’t need to send rotation.

Here’s that same thing with multiple units. The bandwidth is about half of what it was in the same scene with four units using Unity’s NetworkTransform.


You might wonder if the loss of precision from the quantization produces any noticeable artifacts on the clientside. The answer is no! As long as you pick your quantization parameters (bounds, discrete value count) wisely and throw in some interpolation, it’s invisible. In the (admittedly crappy) following video, the left side is the host, and the right side is the client:


Conclusion

There are a few things I didn’t cover here, most importantly interpolation between frames on the client side. Hopefully at some point in the future I’ll get a chance to write those up.

I will admit it’s disappointing that a single unit is still consuming so much bandwidth, even though we’ve made a lot of progress. There’s some more investigation to be done on reducing the overhead the Unity HLAPI imposes - that is, the ~300 bytes/sec overhead from the “Initial Measurements” section. I think this might require using the LLAPI and creating a custom manager object.

One more parting note, for my own vanity: I’ve made many allusions to an RTS in this article. I know that client/server is not necessarily the right way to go for an RTS, and that we could save much, much more bandwidth by using a lockstep model. However, making Unity deterministic is way more work than I’m willing to put in for a hobby project. So, yeah, I know - I shouldn’t even be using UNET, but hey, it does work.

MathJax used for the equations, OBS used for the recordings, FFmpeg used to edit the recordings.