29/10/2016

A simpl(istic) Java implementation of the Elias-Fano compression schema

Matteo Catena Compression, Java 2 Comments

As a coding exercise, I have implemented in Java a simple version of the Elias-Fano compression schema. This is a technique for compressing arrays of monotonically increasing integers. The interesting aspect is that the Elias-Fano compression schema permits to retrieve the i-th element in the compressed data, without decompressing the whole array. Similarly, it permits to find the index of the first element in the compressed data which is greater or equal to a given value — without decompressing the whole array. You can find the code on my github (here).

Usage

 int[] a = ...; //an array of monotonically increasing integers; 
//compress the array 
EliasFano ef = new EliasFano(); 
int u = a[a.length - 1]; //the maximum value in a; 
int size = ef.getCompressedSize(u, a.length); //the size of the compressed array
byte[] compressed = new byte[size]; 
ef.compress(a, 0, a.length, compressed, 0); 
//decompress the array 
int[] b = new int[a.length]; 
int L = ef.getL(u, a.length); //the number of lower bits (see references) 
ef.decompress(compressed, 0, a.length, L, b, 0); 
//get the value of the 4-th element in the compressed data 
int val = ef.get(compressed, 0, a.length, L, 3); 
//get the index of the first element, in the compressed data, greater or equal than 1000 
int idx = ef.select(compressed, 0, a.length, L, 1000);

So, I can’t use it to compress non-increasing arrays, isn’t it?
Sure you can! You just need to transform your array into a monotonically increasing one, by adding to the i-th value the sum of the previous values in the array. Then, you can recompute the original i-th element doing i-th minus (i-1)-th value.

EliasFano ef = new EliasFano();
int[] a1 = ...; //a generic array
//make a2 monotonically increasing
int[] a2 = new int[a1.length];
a2[0] = a1[0];
for (int i = 1; i < a1.length; i++) a2[i]=a1[i]+a2[i-1];
//compress a2
int u = a2[a2.length-1]; //the max value in a2
int size = ef.getCompressedSize(u, a2.length);
byte[] compressed = new byte[size];
ef.compress(a2, 0, a2.length, compressed, 0);
//get the original i-th value of a1, as i-th minus (i-1)-th of a2
int L = ef.getL(u, a2.length);
int val = ef.get(compressed, 0, a2.length, L, i)-ef.get(compressed, 0, a2.length, L, i-1);

Dependecies
– JUnit 4

References
For a general understanding of the Elias-Fano technique, see:
Sebastiano Vigna, “The Revenge of Elias and Fano” (here)

More advanced material:
Sebastiano Vigna, “Quasi-succinct indices”, WSDM’13
Giuseppe Ottaviano and Rossano Venturini, “Partitioned Elias-Fano indexes”, SIGIR’14

2 Comments

Craig
Oct 31, 2016 @ 19:47:21

It might be simplistic, but is it fast Matteo?

- Matteo Catena
  Nov 01, 2016 @ 14:49:13
  
  In term of pure compression/decompression speed, my implementation is sure way slower than other codecs out there (e.g., BinaryPacking).
  
  To test the efficiency of the get and select methods, I’ve written a little benchmark to measure the avg. time required to intersect two small arrays (512 elements) using these two methods.
  When data sparsity is low, it requires in avg. ~7 microsecs on my laptop to intersect two uncompressed arrays.
  To intersect the same arrays, compressed w/ EliasFano, my laptop requires in avg. ~150 microsecs .
  The memory footprint, however, is just the 9% of the uncompressed case.
  The code is now on github.
  
  Finally, I didn’t compare my implementation with more sophisticated alternatives, like sux4j (http://sux.di.unimi.it/).

nicecode.eu

A simpl(istic) Java implementation of the Elias-Fano compression schema

2 Comments

Leave a Reply Cancel