groupvarint 0.0.2 is out – now with black magic!

GroupVarint is a technique for compressing arrays of integers. I have written about it here. GroupVarint can be used to compress arrays of monotonically increasing integers. In such case, it is a common practice to compress the gaps between values rather than the original numbers. Gaps are smaller than the original integers, thus they can be compressed using less space. GroupVarint can be efficiently implemented using some bit masking and unaligned memory access [1]. Unluckily, unaligned memory access is not easily achievable in Java. This is why my previous implementation of GroupVarint, instead, used a big switch with 256 entries to decide how to decompress a block of four integers. Such approach is faster than Varint when the original data have low sparsity (i.e., the original numbers are close to each other resulting in small gaps) but it is slower when data are sparse (i.e., we have larger gaps). My conjecture is that this is accountable to branch mis-predictions and cache misses due to the very large switch. Yet, branch mis-prediction is exactly what GroupVarint originally aims to avoid.

For this reason, I have updated my Java implementation of GroupVarint. The new implementation reads integers from the compressed byte array, assuming a little-endian architecture and performing unaligned memory accesses using black magic the Unsafe class. A bit mask is then applied to the read integer to obtain its original value, as illustrated in [1]. This new implementation is slower than Varint for low data sparsity (maybe because of the overhead introduced by the Unsafe native methods?). However, GroupVarint now maintains its decompression speed constant w.r.t. data sparsity and it is faster than Varint for sparse data (confirming the results in [1]). Below you can see the results of a benchmark performed on an Intel i7-4500U. You can find the code on my github.

Compression speed (in integers per microsecond) for Varint (VB) and GroupVarint (GV)

Deompression speed (in integers per microsecond) for Varint (VB) and GroupVarint (GV)

References
  1. http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html