Group Varint Encoding for Java

Integer compression can improve performance in various applications. For instance, search engines’ inverted indices can benefits from compression. These data structures, in fact, are mainly composed by integer numbers and need to be kept in RAM. Integer compression can help to reduce their size and to maintain more of the structure in main memory.

Varint (also called Variable Byte o VInt) is a byte-aligned, variable-length, encoding technique to compress integer numbers. It uses the lower 7 bits of a byte to store the number value, and the higher bit to signal if another byte is needed (when 7 bits are not enough). When the higher bit is set to 1, we need to read another byte to get the number value. If it is set to 0, we are done decoding the number.

1 is 00000001
15 is 00001111
511 is 11111111 00000011
131071 is 11111111 11111111 00000111
[1, 15, 511, 131071] is 00000001 00001111 11111111 00000011 11111111 11111111 00000111

While it is simple and effective, Varint needs up to four if-then-else’s to decode an integer. This can limit its decoding speed. Now, consider the fact that the possible higher bits configurations can be represented with 2 bits (00: one byte is needed, 01: two, 10: three, 11: four). We can pack four 2 bits configurations in a single byte and encode four integers in the variable-length format. At decode time, we just need to read the first byte to decompress four numbers at once!

[1, 15, 511, 131071] is 00000110 00000001 00001111 11111111 00000001 11111111 11111111 00000001

This techinque is called Group Varint. More about this can be found in:
1) Jeff Dean (Google), “Challenges in Building Large-Scale Information Retrieval Systems”, WSDM’09

As programming exercise, I implemented Group Varint in Java. You can find the groupvarint code on github.

Usage example:

 int a[] = {1, 2, 3 , 4, 5, 6, 7, 8};
 byte b[] = new byte[GroupVarint.getSafeCompressedLength(a.length)];
 int c[] = new int[a.length];

 GroupVarint.compress(a, 0, a.length, b, 0);
 GroupVarint.uncompress(b, 0, c, 0, a.length);

with negative numbers:

 int a[] = {0, 1, 2, -1, -2, Integer.MAX_VALUE, Integer.MIN_VALUE};
 byte b[] = new byte[GroupVarint.getSafeCompressedLength(a.length)];
 int c[] = new int[a.length];
 int support[] = Arrays.copyOf(a, a.length);

 ZigZagGroupVarint.compress(support, 0, support.length, b, 0);
 ZigZagGroupVarint.uncompress(b, 0, c, 0, a.length);