We all know that one of the reasons RPC frameworks are more efficient than REST APIs is that they use efficient encoding schemes. Compared to the plaintext JSON/XML transmission of REST APIs, exchanging information encoded with binary protocols is far more efficient. For example, Protobuf for gRPC and Avro for Kafka. This article will explain how protobuf/avro encodes and decodes integer types. Why integers specifically? Because they form the foundation for encoding other types and are also the most interesting part of the entire serialization/deserialization process!

Background

Due to work requirements, I recently needed to build a high-availability, low-power-consumption RPC protocol from scratch. Given the widespread adoption of Protobuf and gRPC, along with Google's endorsement, I decided to adopt the Protobuf protocol. So I studied the Protobuf encoding standard and its implementations in various languages. The encoding rules mainly use a variable-length approach (aside from some fixed-length types like float), which is fairly straightforward to implement.

When encoding integers, the ZigZag encoding scheme is used. Since it had been a while since I last worked on algorithm problems, it took me an entire Pomodoro session to prove the correctness of its bitwise operations (note to self: don't let your skills rust!).

Since we need to ensure high performance and low power consumption, we also need to consider compiler and cross-platform optimizations, which I'll document here as well.

Prerequisites

This article requires almost no prior knowledge. The only thing you need to know is Computer Science 101 — how integers are represented:

  • One's complement: For a single digit (binary 0 or 1), the complement operation flips 0 to 1 and 1 to 0. That is, if X=1, the one's complement is 0; if X=0, the one's complement is 1.
  • Two's complement: A method of representing signed numbers in binary, and a way to negate a number's sign, commonly used in computer science. Two's complement is defined using signed-bit binary numbers. The two's complement of positive numbers and 0 is the number itself. The two's complement of a negative number is obtained by inverting all bits of its corresponding positive number and then adding 1.

Main Topic

Encoding

Now let's look at how gRPC encodes Int64. Our focus is on the encoding/decoding portion, so we'll skip the safety checks and unsafe logic for now and look directly at the encoding logic. The code is as follows:

@Override
    public final void writeUInt64NoTag(long value) throws IOException
			...
			while (true) {
        if ((value & ~0x7FL) == 0) {
          buffer[position++] = (byte) value;
          return;
        } else {
          buffer[position++] = (byte) (((int) value & 0x7F) | 0x80);
          value >>>= 7;
        }
      }

Here value is the long integer to be encoded, buffer is the target for the encoded result, and position is the current encoding position. The code is short and primarily consists of bitwise operations. At first glance it may look confusing — what are 0x7FL and 0x80? Why >>>=7? Before answering these questions, we need to understand the encoding methods for integers in RPC — VarInt and ZigZag encoding.

VARINT

VARINT is a common technique for serializing and deserializing integers in remote communication. An integer is represented as one or more bytes. Smaller numbers result in shorter encoded byte sequences.

In VARINT, every byte except the last has its most significant bit (msb) set, indicating that there are more bytes to follow and that reading should continue. The lower 7 bits of each byte store 7 bits of the actual value, while the highest bit stores the msb. For example:

NumberBytes
10000 0001
20000 0010
......
1270111 1111
1281000 0000 0000 0001
......
3001010 1100 0000 0010

As we can see, when the target value is less than 127, the highest bit (msb) is 0 and the rest displays the normal binary representation. When the target value exceeds 127, the lowest 7 bits are placed into the first byte with the high bit (msb) set to 1, then the remaining value is placed into subsequent bytes following the same rule, and so on.

For example:

VarInt(1000 0000 0000 0001) = 000 0001(后7位) ++ 000 0000(前七位)
													  = 1 << 7 + 0
													  = 1000 0000
													  = 128
VarInt(1010 1100 0000 0010) = 000 0010 ++ 010 1100
														= 10 << 7 + 10 1100
														= 256 + 44
														= 300

As we can see, with VarInt encoding, smaller numbers occupy fewer bytes. Moreover, no additional data structures are needed to store the offset — all information is stored in the target buffer itself. This storage method effectively reduces the number of bytes required for encoding, achieving compression. At the same time, this data structure carries its own metadata, eliminating the need for additional data structures to represent length or offset information.

However, there's still one problem — how to represent negative numbers. As we know, negative numbers are stored in two's complement form in computers. This means that the high bits of a negative number serve as the sign bit, and paradoxically, the smaller the absolute value of a negative number, the more bytes it occupies. To solve this problem, we need a compression encoding algorithm that can also apply VarInt representation to negative numbers.

ZigZag

ZigZag encoding is a scheme that represents signed integers as unsigned integers, used by both protobuf and avro for integer encoding. In the section above, we wanted an encoding method that can encode integers continuously and consistently, where smaller integers ideally occupy fewer bytes. ZigZag encoding helps us achieve this goal.

As the name suggests, ZigZag alternately encodes positive and negative numbers into unsigned integers in a "zig-zag" pattern. The specific rule is: -1 -> 1, 1 -> 2, -2 -> 3, 3 -> 4, and so on. The details are shown in the table below:

Signed OriginalEncoded As
00
-11
12
-23
24
21474836474294967294
-21474836484294967295

From the definition above, we can derive the encoding formula for 64-bit integers:

(n << 1) ^ (n >> 63)

I'll skip the detailed derivation and instead offer an intuitive explanation. From the table above, we can see that for positive integers, the encoded result is twice the original number; for negative numbers, the encoded result is twice the absolute value of the original number minus one; for zero, it remains unchanged:

$$ \begin{equation} f(x)=\left{ \begin{array}{rcl} 0 & & {x = 0}\\ 2x & & {x > 0}\\ |2x|-1 & & {x < 0}\\ \end{array} \right. \end{equation} $$

Let's examine the encoding formula:

n << 1

This shifts n left by 1 bit, effectively multiplying n by 2.

n >> 63

This shifts the 64-bit integer right by 63 bits. For positive integers, since the sign bit is 0, the result is 0x 0000 0000; for negative integers, the result is 0x ffff ffff.

(n << 1) ^ (n >> 63)

Finally, we XOR the two results above.

Since XOR with 0 yields the original value, for 0 and positive integers, the result is simply multiplying the original value by 2 (0 remains unchanged).

For negative integers, XORing the doubled negative value with all 1s is equivalent to bitwise negation — flipping 1s to 0s and 0s to 1s. Since in computers, two's complement equals one's complement plus 1, conversely, one's complement equals two's complement minus 1. Therefore the result is exactly |2x|-1.

Thus, the proof is complete.

Encoding Explanation

@Override
    public final void writeUInt64NoTag(long value) throws IOException
			...
			while (true) {
        if ((value & ~0x7FL) == 0) {
          buffer[position++] = (byte) value;
          return;
        } else {
          buffer[position++] = (byte) (((int) value & 0x7F) | 0x80);
          value >>>= 7;
        }
      }

Now let's revisit the encoding code — it should be much clearer:

if ((value & ~0x7FL) == 0) {

This line checks whether the current value is smaller than 127 (the lower 7 bits). If so, it converts the current value to a byte, and the encoding process is complete.

buffer[position++] = (byte) (((int) value & 0x7F) | 0x80);
value >>>= 7;

Otherwise, it converts the lower 7 bits of the current value to a byte (value & 0x7F), then sets the high bit (msb) to 1 (| 0x80). It then right-shifts the value by 7 bits and continues the process.

Finally, by combining the ZigZag encoding step with the VarInt encoding step, the entire integer encoding process is complete.

Decoding

With the encoding process explained above, the decoding process is relatively easy to understand. We'll again break the decoding into two parts.

ZigZag Decoding

Without further ado, here's the code:

(n >> 1) ^ -(n & 1)

The formula is as follows:

$$ \begin{equation} f^{\prime}(x)=\left{ \begin{array}{rcl} 0 & & {x = 0}\\ x/2 & & {x > 0}\\ -x/2 - 1 & & {x < 0}\\ \end{array} \right. \end{equation} $$

n >> 1

This shifts n right by 1 bit, i.e., dividing n by 2.

-(n & 1)

This operation yields the negation of the last bit. For integers whose original value is 0 or positive, the encoded result is always even, so this operation yields 0. For integers whose original value is negative, the encoded result is always odd, so this operation yields 0x ffff ffff.

(n >> 1) ^ -(n & 1)

We XOR the above results. For 0 and even numbers, the result is the same as dividing by 2. For odd numbers, it inverts the bits after dividing by 2. Due to the relationship between one's complement and two's complement, this is equivalent to subtracting one from the original result, which matches the formula above. This completes the proof of ZigZag decoding.

VarInt Decoding

The code is as follows:

public long ReadLong()
        {
            byte b = read();
            ulong n = b & 0x7FUL;
            int shift = 7;
            while ((b & 0x80) != 0)
            {
                b = read();
                n |= (b & 0x7FUL) << shift;
                shift += 7;
            }
            long value = (long)n;
            return (-(value & 0x01L)) ^ ((value >> 1) & 0x7fffffffffffffffL);
        }

Let's briefly walk through the decoding process.

while ((b & 0x80) != 0)

This checks whether the highest bit is 1. If so, it continues; otherwise, it exits the loop.

n |= (b & 0x7FUL) << shift;
shift += 7;

This reads the lower 7 bits of the current byte, converts them to an integer, and left-shifts by shift bits. Here shift is a multiple of 7, representing the current offset in the decoding process.

return (-(value & 0x01L)) ^ ((value >> 1) & 0x7fffffffffffffffL);

This performs ZigZag decoding on the result and returns it.

And with that, the entire encoding/decoding process is complete. Done!

Optimization

Optimization Goals

Since our encoding/decoding library needs to run on smart devices across different platforms and operating systems, we have to come back and perform performance optimizations. Here we'll focus primarily on optimizing the encoding portion. Why? Because decoding happens on the server side!

As we know, CPUs perform branch prediction and prefetching. Since VarInt already encodes at the byte level, we'll mainly optimize for CPU branch prediction.

How can we improve the accuracy of CPU branch prediction? There are many methods, but here we'll use the simplest one — reducing conditional branches.

Optimization Logic

Let's revisit the VarInt encoding code once more:

@Override
    public final void writeUInt64NoTag(long value) throws IOException
			...
			while (true) {
        if ((value & ~0x7FL) == 0) {
          buffer[position++] = (byte) value;
          return;
        } else {
          buffer[position++] = (byte) (((int) value & 0x7F) | 0x80);
          value >>>= 7;
        }
      }

We need to compare the number to be encoded against 0x7F to determine whether further shifting is needed. Can we eliminate this conditional check?

Let's reconsider our logic — we need to encode the target integer every 7 bits. So if we know how many bits the binary representation of the target number actually uses, we can divide by 7 to determine how many shift operations we need, without any conditional checks. The answer is yes — we can eliminate the conditional branch!

For example: the binary representation of 15 is 0x 0000 1111, with 32 leading zeros on the left. So the effective number of bits for the integer 15 is 63 - 32 = 31. Therefore, the number of shifts needed for VarInt encoding is 31 / 7 = 4. As we can see, by subtracting the count of consecutive leading zeros from the high bits, we can determine the number of shifts needed, thereby eliminating the conditional check. Fortunately, Java/Go/Rust all have leadingZero related functions, making implementation straightforward. The code is as follows:

int varIntLen(long value) {
		return (63 - Long.numberOfLeadingZeros(value)) / 7;
}

With this, we've successfully eliminated the branch condition. However, we've introduced a division operation. This is quite unfriendly to the compiler. Let's convert the division into a static array lookup to optimize execution.

private static final int[] VAR_INT_LENGTHS = new int[65];

static {
		for (int i = 0; i <= 64; ++i) {
		VAR_INT_LENGTHS[i] = (63 - i) / 7;
}
}

int varIntLen(long value) {
		return VAR_INT_LENGTHS[Long.numberOfLeadingZeros(value)];
}

The encoding logic then becomes:

private void writeVarInt(int offset, long value) {
		int length = varIntLength(value);
    for (int i = 0; i < length; ++i) {
    		buffer[offset + i] = ((byte) ((value & 0x7F) | 0x80));
   			value >>>= 7;
    }
    buffer[offset + i] = (byte) value;
}

With this, we've completed the optimization of the encoding logic. But we've introduced a new function Long.numberOfLeadingZeros — could this become the straw that breaks the camel's back in terms of performance? No need to worry! Thanks to the power of modern processors, there are dedicated instructions to compute LeadingZeros: lzcnt on x86_64, and clz on Arm and RISC-V. A single instruction gives us the result we want — perfect!

Optimization Verification

The cross-platform optimizations above are theoretical assumptions. When actually executing on hardware, will the compiler be smart enough to translate Long.numberOfLeadingZeros into the corresponding platform-specific optimized instruction? Or will reality slap us in the face? Let's verify!

(Compilation process omitted here...)

The result — we did get slapped in the face... halfway.

Let's first look at the successful case. On the Arm64 platform, the result is:

        text    "".LzCount(SB), LEAF|NOFRAME|ABIInternal, $0-16
        funcdata        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        funcdata        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        movd    "".x(FP), R0
        clz     R0, R0
        movd    R0, "".~r1+8(FP)
        ret     (R30)
        text    "".main(SB), LEAF|NOFRAME|ABIInternal, $0-0
        funcdata        ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        funcdata        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        ret     (R30)

We can see that on the Arm platform, the compiler successfully optimized it to clz R0, R0, exactly as we expected.

Now let's look at the x86_64 result:

        text    "".LzCount(SB), NOSPLIT|ABIInternal, $0-16
        funcdata        $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        funcdata        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        movq    "".x+8(SP), AX
        bsrq    AX, AX
        nop
        movq    $-1, CX
        cmovqeq CX, AX
        addq    $-63, AX
        negq    AX
        movq    AX, "".~r1+16(SP)
        ret
        text    "".main(SB), NOSPLIT|ABIInternal, $0-0
        funcdata        $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        funcdata        $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        ret

We didn't find lzcnt. So did the optimization fail on Intel processors? Wait — notice bsrq AX, AX. What does this instruction do? Why does the code following it match our predicted logic — addq $-63, AX?

It turns out that the bsrq instruction scans from the high bits to find the index of the first non-zero bit. Isn't that exactly what we want? On x86, the compiler went one step further and directly found the index of the first non-zero bit for us. Different paths, same destination! Kudos to GCC's powerful compilation capabilities!

With this, the compilation results match our expectations. We've eliminated the conditional branch, removed the performance impact of division, and the compiler has used native machine instructions for optimization on different platforms. Our protobuf encoding/decoding journey comes to an end here.

Summary

This article explained how protobuf and avro encode and decode integers. Along the way, we learned about VarInt and ZigZag encoding methods. Finally, we optimized the encoding logic based on the characteristics of modern CPUs and operating system instruction sets, making it more suitable for cross-platform smart devices.

Epilogue

The end is not forever.

After writing the code and running tests, I found that while performance showed some improvement over the native protobuf code, serialization and deserialization remained the bottleneck. Considering that future network infrastructure may adopt InfiniBand, this kind of CPU-intensive operation would still be a major drag on transmission efficiency.

Therefore, I ultimately decided not to use the above code for serialization and deserialization. So the question becomes — how should we handle RPC communication? The answer is — zero-copy transfer protocols.

These protocols unify the encoding format used in memory and in remote communication, eliminating the need for serialization — hence the name "zero copy." Currently, the popular options are Arrow Flight and FlatBuffers. The former targets OLAP scenarios and primarily uses columnar storage formats, while the latter targets OLTP use cases and employs row-based storage formats. After comparing the principles and efficiency of both, I'll follow up with an article on zero-copy transfer protocols — stay tuned!