Group Varint Encoding for Java
Integer compression can improve performance in various applications. For instance, search engines’ inverted indices can benefits from compression. These data structures, in fact, are mainly composed by integer numbers and need to be kept in RAM. Integer compression can help to reduce their size and to maintain more of the structure in main memory.
Varint (also called Variable Byte o VInt) is a byte-aligned, variable-length, encoding technique to compress integer numbers. It uses the lower 7 bits of a byte to store the number value, and the higher bit to signal if another byte is needed (when 7 bits are not enough). When the higher bit is set to 1, we need to read another byte to get the number value. If it is set to 0, we are done decoding the number.
Examples:
1 is 00000001
15 is 00001111
511 is 11111111 00000011
131071 is 11111111 11111111 00000111
[1, 15, 511, 131071] is 00000001 00001111 11111111 00000011 11111111 11111111 00000111
While it is simple and effective, Varint needs up to four if-then-else’s to decode an integer. This can limit its decoding speed. Now, consider the fact that the possible higher bits configurations can be represented with 2 bits (00: one byte is needed, 01: two, 10: three, 11: four). We can pack four 2 bits configurations in a single byte and encode four integers in the variable-length format. At decode time, we just need to read the first byte to decompress four numbers at once!
Example:
[1, 15, 511, 131071] is 00000110 00000001 00001111 11111111 00000001 11111111 11111111 00000001
This techinque is called Group Varint. More about this can be found in:
1) Jeff Dean (Google), “Challenges in Building Large-Scale Information Retrieval Systems”, WSDM’09
As programming exercise, I implemented Group Varint in Java. You can find the groupvarint code on github.
Usage example:
int a[] = {1, 2, 3 , 4, 5, 6, 7, 8}; byte b[] = new byte[GroupVarint.getSafeCompressedLength(a.length)]; int c[] = new int[a.length]; GroupVarint.compress(a, 0, a.length, b, 0); GroupVarint.uncompress(b, 0, c, 0, a.length);
with negative numbers:
int a[] = {0, 1, 2, -1, -2, Integer.MAX_VALUE, Integer.MIN_VALUE}; byte b[] = new byte[GroupVarint.getSafeCompressedLength(a.length)]; int c[] = new int[a.length]; int support[] = Arrays.copyOf(a, a.length); ZigZagGroupVarint.compress(support, 0, support.length, b, 0); ZigZagGroupVarint.uncompress(b, 0, c, 0, a.length);
powturbo
Mar 13, 2015 @ 10:47:57
Group Varint is inefficient in term of compression ratio and speed. You may consider others schemes such as “Simple Variable” , “Bitpacking” or “PForDelta” see :https://github.com/powturbo/TurboPFor
Matteo Catena
Mar 13, 2015 @ 18:20:43
Dear powturbo, are you talking about Group Varint in general or about this specific implementation? Of course there are compression schemes more efficient than this, and you have cited some. You should also add Elias-Fano encoding when dealing with inverted indices (quasi-succinct indices). Mine was more a programming exercise, since I couldn’t find Java implementations of Group Varint.
powturbo
Jun 02, 2015 @ 20:01:23
I’m talking in general. Group Varint is only popular, because it was used by google. The scalar TurboVbyte is more efficient. I’ve now an “Elias Fano” implementation and as you can see in the benchmarks “TurboPFor” is more efficient.