At work, we were recently trying to estimate performance differences in keeping some big data structures (> 8Gb) in main memory or in disk memory. Surprisingly, our experiments (run on Linux machines) showed very little discrepancies between the two scenarios, while in-memory experiments were supposed to be much faster than on-disk ones.
A blog post from Changing Bits gave me an hint on what was going on. The “culprit” of this anti-intuitive behavior was the disk caching done by Linux. Due to this caching mechanism, the data structure kept on disk was also in the system cache. Then, accessing to it resulted in reads from the main memory instead than in I/O to the disk.
It is viable, however, to open a file in a way which performs I/O directly on the disk, instead than on the system cache. This is possible using the O_DIRECT flag while opening the file. Nevertheless, this flag it is not supported by Java.
A workaround to this limitation might be to modify the JVM, or to write JNI code. A simpler approach is to use the super cool JNA library. “JNA provides Java programs easy access to native shared libraries without writing anything but Java code – no JNI or native code is required”.
Attached to this post you can find my implementations for InputStream and OutputStream using JNA and the O_DIRECT flag. Please note that these classes are meant to be used under a Linux system with glibc version >= 2.15. To adapt these to aother environments, please look at JNA documentation.
I’ve made some experiments (I’m not going to say it was a proper benchmarking), comparing standard java InputStream/OutputStream, JNAInputStream/JNAOutputStream using and not using direct I/O. I’ve write and red a series of 10000 integers one hundred times, and taken the average time for input and output operations. JNA streams using O_DIRECT are far slower than standard Java’s, while JNA streams with the flag turned off can be even faster. This means that the difference in running time is not due the overhead of using an external library such as JNA. The experimental class is attached as well.
In conclusion, Linux disk caching is really an amazing feature. On an idling machine, it can cache data structure even bigger than 8 GB, drammatically improving software performances. Nevertheless, it can be useful to turn off this feature if required during benchmarking activities.