Java streams for direct I/O

At work, we were recently trying to estimate performance differences in keeping some big data structures (> 8Gb) in main memory or in disk memory. Surprisingly, our experiments (run on Linux machines) showed very little discrepancies between the two scenarios, while in-memory experiments were supposed to be much faster than on-disk ones.

A blog post from Changing Bits gave me an hint on what was going on. The “culprit” of this anti-intuitive behavior was the disk caching done by Linux.  Due to this caching mechanism, the data structure kept on disk was also in the system cache. Then, accessing to it resulted in reads from the main memory instead than in I/O to the disk.

It is viable, however, to open a file in a way which performs I/O directly on the disk, instead than on the system cache. This is possible using the O_DIRECT flag while opening the file. Nevertheless, this flag it is not supported by Java.

A workaround to this limitation might be to modify the JVM, or to write JNI code. A simpler approach is to use the super cool JNA library. “JNA provides Java programs easy access to native shared libraries without writing anything but Java code – no JNI or native code is required”.

Attached to this post you can find my implementations for InputStream and OutputStream using JNA and the O_DIRECT flag. Please note that these classes are meant to be used under a Linux system with glibc version >= 2.15. To adapt these to aother environments, please look at JNA documentation.

I’ve made some experiments (I’m not going to say it was a proper benchmarking), comparing standard java InputStream/OutputStream, JNAInputStream/JNAOutputStream using and not using direct I/O. I’ve write and red a series of 10000 integers one hundred times, and taken the average time for input and output operations. JNA streams using O_DIRECT are far slower than standard Java’s, while JNA streams with the flag turned off can be even faster. This means that the difference in running time is not due the overhead of using an external library such as JNA. The experimental class is attached as well.

In conclusion, Linux disk caching is really an amazing feature. On an idling machine, it can cache data structure even bigger than 8 GB, drammatically improving software performances. Nevertheless, it can be useful to turn off this feature if required during benchmarking activities.

Extract relevant terms and relevant tweets from a collection

Introduction

I’ve developed this application with my friend and colleague Giacomo Lamonaco for the exam of Ottimizzazione Combinatoria 2.
The aim of this software is to extract from a tweet collection the terms that better represent its content.
We’ve tackled the problem modelling it as a set covering problem, thus looking in a tweet collection for the smallest set of terms able to cover the entire collection.

Modeling & implementation

The problem has been modeled with the classical set covering formulation:

$min\sum_{i=0}^{n}x_{i}\&space;\forall&space;x\in&space;dictionary\\*&space;s.t.\\*\\*&space;\sum_{j=0}^{m}x_{j}\&space;\forall&space;x\in&space;tweet\&space;and&space;\&space;\forall&space;\&space;tweet&space;\in&space;collection\\*\\*&space;x\in&space;\left&space;[&space;0,1&space;\right&space;]$

Where xi is the i-th dictionary term, the dictionary is the set of all the terms found in the tweet collection, the tweet is a generic collection tweet and the collection is the set of all the tweet downloaded from Twitter via its API.

We’ve written a Java application using Twitter4J. Our application downloads a set of about 1500 tweets from Twitter. The collection is downloaded searching for a query, selected from the “Trending topics”. The limit of 1500 tweets is imposed by the API limitation.

Once downloaded the collection, this is indexed using our Java classes. Every tweet is analysed this way:

1. It is considered only if it’s not a re-tweet
2. If it’s ‘original’, it’s tokenized
3. Tokens that represent http links, #hashtags*, @mentions are removed.  Remove also tokens that are ‘stopwords’ (words so much used that have little semantic weight)
4. The remaining tokens are stemmed. As result, we have the terms.
5. If the term is shorter than 3 characters, it’s removed. Otherwise save it in the index (which represents the dictionary). Save it also in a term list associated to the ID of the corresponding tweet (every list represents a tweet, the set of all the lists represents the collection).

This process guarantees us to work only with meaningful terms, removing those with little semantic relevance.
Please note that part of this process is language specific (stopword list, stemmer, …). Our application is set for tweet collections written in English.

Once the collection is indexed, the data structures at point 5 are passed as parameters to our set cover solver. The solver has been written using libraries for integer linear programming. The solver implements the described model, returning as result a cover represented by a set of term.The solver has been implemented in two versions. One uses GLPK’s JNI via JavaILP, another one directly uses CPLEX’s JNI.

The benefit in using JavaILP it’s the possibility of switching among JNIs maintaining a single interface. For example, one can use GLPK or lp_solve without modifying the algorithms. Using CPLEX directly, instead, the benefits is in having far better solving performances.

* #hashtags are removed because usually they are already present in the query you use to download the collection. Moreover they may lead to a cover that contains just them.

Usage

Usually I launch the application inside Eclipse, so I’ll refer to that environment.

Create a new run configuration using MainILP as main class.

In VM arguments type: -Djava.library.path=/path/to/your/jni
where /path/to/your/jni is where there are your JNIs for GLPK, lp_solve, CPLEX, etc. (not provided with this software; for GLPK start looking here).

Then you can fill the Program arguments depending on what you’re doing:

-d path/to/the/collection query

Cover a tweet collection:

-c path/to/the/collection

The program will find a term cover for the tweet collection at path/to/the/collection. Ex.: -c /home/matteo/Downloads/columbine.xml

Cover a tweet collection excluding some terms:

-ec path/to/the/collection word_to_exclude

Shortly after the the dramatic shooting at the Batman premier, I’ve downloaded a collection for the term ‘Columbine’. Covering that collection, this is the result:

–SOLUTION–
Number of element in the cover: 15.0
columbin, freq: 895
shoot, freq: 298
peopl, freq: 86
movi, freq: 80
crazi, freq: 45
dark, freq: 17
noth, freq: 10
prai, freq: 7
deep, freq: 1
turnonoth, freq: 1
justic, freq: 1
edit, freq: 1

—MOST VALUABLE TWEETS—
[226359371112792064]First the Columbine shooting and now at the movies. That’s so sad. People are so sick. My heart goes out to those in Colorado.
[226359731848085505]im never going to colorado they had columbine and now niggas shooting people in movie theatres
[226358442917511168]apparently colorado is filled with fucked up people who like to shoot other people for no reason. columbine and now the movie shootings. smh
[226357986409476096]Colorado is the weirdest state in the union. The people there are cracked. The Ramseys, Columbine, and a guy who shoots babies at the movies
[226357081932984321]Remind me to never move to Colorado. Jonbenet Ramsey, columbine high school, the dark knight rises shooting. Crazy people in the Midwest.

A Java TSP solver which uses Google Maps API

During my studies at the Web Technology Master, I had to do a J2EE web application as exam project. I chose to do an application that manages technical interventions of a company. One exam’s request was to mash up the application contents with contents from the web. So I decided to implement a functionality which tells the technicians what is the shortest route in order to visit every customer with technical problems and to go back to the company’s headquarter.

This was clearly a TSP problem. I solved it using Google Maps API and refactoring Dr.René Grothmann’s Travelling Salesman Applet.

A screenshot of the example application

Basically, I wrote a class called GTspService from which you can call the method calculateRoute(). This method takes as its only parameter a collection of String that represent the locations to visit (you can use the name of a city or a complete address). The first element of this collection should be the origin of the route. As result, calculateRoute() returns another String collection that represent the order in which you should visit the locations to have the shortest route.

It’s easy as doing:

Collection<String> locations = new ArrayList<String>();
TspService tsps = new GTspService();
Collection<String> route = tsps.calculateRoute(locations);
System.out.println(route);

The result:

[L'Aquila, Napoli, Palermo, Lecce, Pescara, Bologna, Venezia, Milano, Torino, Firenze]

GTspService use GoogleMaps API to obtain the locations distance matrix and then gives it to Dr.Grothmann solver, which actually calculates the route.

Since you shouldn’t use Google distance matrix service in applications that don’t show a Google Map, I’ve encapsulated my code in a simple web application that you can find attached to this post (you’ll need a servlet container like Apache Tomcat to make it works). Unfortunately, Google Maps direction service permits you to draw up to 8 locations, not much for a TSP problem. However, my application can still work as a proof-of-concept if you find a service to use instead of Google Maps distance matrix. The actual TSP solver, in fact, works just fine until you give it any distance matrix, its source doesn’t matter. You can find the core code under the directory WEB-INF/lib/tspsolver.jar of the attached .war