31/10/2016

groupvarint 0.0.2 is out – now with black magic!

Matteo Catena Compression, Java 0 Comments

GroupVarint is a technique for compressing arrays of integers. I have written about it here. GroupVarint can be used to compress arrays of monotonically increasing integers. In such case, it is a common practice to compress the gaps between values rather than the original numbers. Gaps are smaller than the original integers, thus they can be compressed using less space. GroupVarint can be efficiently implemented using some bit masking and unaligned memory access [1]. Unluckily, unaligned memory access is not easily achievable in Java. This is why my previous implementation of GroupVarint, instead, used a big switch with 256 entries to decide how to decompress a block of four integers. Such approach is faster than Varint when the original data have low sparsity (i.e., the original numbers are close to each other resulting in small gaps) but it is slower when data are sparse (i.e., we have larger gaps). My conjecture is that this is accountable to branch mis-predictions and cache misses due to the very large switch. Yet, branch mis-prediction is exactly what GroupVarint originally aims to avoid.

For this reason, I have updated my Java implementation of GroupVarint. The new implementation reads integers from the compressed byte array, assuming a little-endian architecture and performing unaligned memory accesses using ~~black magic~~ the Unsafe class. A bit mask is then applied to the read integer to obtain its original value, as illustrated in [1]. This new implementation is slower than Varint for low data sparsity (maybe because of the overhead introduced by the Unsafe native methods?). However, GroupVarint now maintains its decompression speed constant w.r.t. data sparsity and it is faster than Varint for sparse data (confirming the results in [1]). Below you can see the results of a benchmark performed on an Intel i7-4500U. You can find the code on my github.

Compression speed (in integers per microsecond) for Varint (VB) and GroupVarint (GV)

Deompression speed (in integers per microsecond) for Varint (VB) and GroupVarint (GV)

References

http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html

29/10/2016

A simpl(istic) Java implementation of the Elias-Fano compression schema

Matteo Catena Compression, Java 2 Comments

As a coding exercise, I have implemented in Java a simple version of the Elias-Fano compression schema. This is a technique for compressing arrays of monotonically increasing integers. The interesting aspect is that the Elias-Fano compression schema permits to retrieve the i-th element in the compressed data, without decompressing the whole array. Similarly, it permits to find the index of the first element in the compressed data which is greater or equal to a given value — without decompressing the whole array. You can find the code on my github (here).

Usage

 int[] a = ...; //an array of monotonically increasing integers; 
//compress the array 
EliasFano ef = new EliasFano(); 
int u = a[a.length - 1]; //the maximum value in a; 
int size = ef.getCompressedSize(u, a.length); //the size of the compressed array
byte[] compressed = new byte[size]; 
ef.compress(a, 0, a.length, compressed, 0); 
//decompress the array 
int[] b = new int[a.length]; 
int L = ef.getL(u, a.length); //the number of lower bits (see references) 
ef.decompress(compressed, 0, a.length, L, b, 0); 
//get the value of the 4-th element in the compressed data 
int val = ef.get(compressed, 0, a.length, L, 3); 
//get the index of the first element, in the compressed data, greater or equal than 1000 
int idx = ef.select(compressed, 0, a.length, L, 1000);

So, I can’t use it to compress non-increasing arrays, isn’t it?
Sure you can! You just need to transform your array into a monotonically increasing one, by adding to the i-th value the sum of the previous values in the array. Then, you can recompute the original i-th element doing i-th minus (i-1)-th value.

EliasFano ef = new EliasFano();
int[] a1 = ...; //a generic array
//make a2 monotonically increasing
int[] a2 = new int[a1.length];
a2[0] = a1[0];
for (int i = 1; i < a1.length; i++) a2[i]=a1[i]+a2[i-1];
//compress a2
int u = a2[a2.length-1]; //the max value in a2
int size = ef.getCompressedSize(u, a2.length);
byte[] compressed = new byte[size];
ef.compress(a2, 0, a2.length, compressed, 0);
//get the original i-th value of a1, as i-th minus (i-1)-th of a2
int L = ef.getL(u, a2.length);
int val = ef.get(compressed, 0, a2.length, L, i)-ef.get(compressed, 0, a2.length, L, i-1);

Dependecies
– JUnit 4

References
For a general understanding of the Elias-Fano technique, see:
Sebastiano Vigna, “The Revenge of Elias and Fano” (here)

More advanced material:
Sebastiano Vigna, “Quasi-succinct indices”, WSDM’13
Giuseppe Ottaviano and Rossano Venturini, “Partitioned Elias-Fano indexes”, SIGIR’14

26/02/2015

Group Varint Encoding for Java

Matteo Catena Compression, Java 3 Comments

Integer compression can improve performance in various applications. For instance, search engines’ inverted indices can benefits from compression. These data structures, in fact, are mainly composed by integer numbers and need to be kept in RAM. Integer compression can help to reduce their size and to maintain more of the structure in main memory.

Varint (also called Variable Byte o VInt) is a byte-aligned, variable-length, encoding technique to compress integer numbers. It uses the lower 7 bits of a byte to store the number value, and the higher bit to signal if another byte is needed (when 7 bits are not enough). When the higher bit is set to 1, we need to read another byte to get the number value. If it is set to 0, we are done decoding the number.

Examples:
1 is 00000001
15 is 00001111
511 is 11111111 00000011
131071 is 11111111 11111111 00000111
[1, 15, 511, 131071] is 00000001 00001111 11111111 00000011 11111111 11111111 00000111

While it is simple and effective, Varint needs up to four if-then-else’s to decode an integer. This can limit its decoding speed. Now, consider the fact that the possible higher bits configurations can be represented with 2 bits (00: one byte is needed, 01: two, 10: three, 11: four). We can pack four 2 bits configurations in a single byte and encode four integers in the variable-length format. At decode time, we just need to read the first byte to decompress four numbers at once!

Example:
[1, 15, 511, 131071] is 00000110 00000001 00001111 11111111 00000001 11111111 11111111 00000001

This techinque is called Group Varint. More about this can be found in:
1) Jeff Dean (Google), “Challenges in Building Large-Scale Information Retrieval Systems”, WSDM’09

As programming exercise, I implemented Group Varint in Java. You can find the groupvarint code on github.

Usage example:

 int a[] = {1, 2, 3 , 4, 5, 6, 7, 8};
 byte b[] = new byte[GroupVarint.getSafeCompressedLength(a.length)];
 int c[] = new int[a.length];

 GroupVarint.compress(a, 0, a.length, b, 0);
 GroupVarint.uncompress(b, 0, c, 0, a.length);

with negative numbers:

 int a[] = {0, 1, 2, -1, -2, Integer.MAX_VALUE, Integer.MIN_VALUE};
 byte b[] = new byte[GroupVarint.getSafeCompressedLength(a.length)];
 int c[] = new int[a.length];
 int support[] = Arrays.copyOf(a, a.length);

 ZigZagGroupVarint.compress(support, 0, support.length, b, 0);
 ZigZagGroupVarint.uncompress(b, 0, c, 0, a.length);