Extract relevant terms and relevant tweets from a collection

Introduction

I’ve developed this application with my friend and colleague Giacomo Lamonaco for the exam of Ottimizzazione Combinatoria 2.
The aim of this software is to extract, from a tweet collection, the terms that best represent its content.
We’ve tackled the problem by modelling it as a set covering problem: we look in a tweet collection for the smallest set of terms able to cover the entire collection.

Modeling & implementation

The problem has been modeled with the classical set covering formulation:

\min \sum_{i=1}^{n} x_{i}\\* \textrm{s.t.}\quad \sum_{i\,:\,t_{i}\in tweet_{j}} x_{i}\geq 1\quad \forall\ tweet_{j}\in collection\\* x_{i}\in \left \{ 0,1 \right \}\quad i=1,\dots ,n

Here x_i is the binary variable associated with the i-th dictionary term t_i (x_i = 1 if the term belongs to the cover), the dictionary is the set of the n distinct terms found in the tweet collection, tweet_j is a generic tweet of the collection, and the collection is the set of all the tweets downloaded from Twitter via its API. Each constraint requires every tweet to be covered by at least one selected term.
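
As an illustrative toy instance (made-up data): suppose the collection holds three tweets whose term sets are T_{1}=\{shoot, movi\}, T_{2}=\{shoot, sad\}, T_{3}=\{movi, crazi\}. The model then has one covering constraint per tweet:

x_{shoot}+x_{movi}\geq 1,\quad x_{shoot}+x_{sad}\geq 1,\quad x_{movi}+x_{crazi}\geq 1

No single term appears in all three tweets, so every feasible cover needs at least two terms; \{shoot, movi\} attains the optimum with objective value 2.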

We’ve written a Java application using Twitter4J. Our application downloads a set of about 1500 tweets from Twitter by searching for a query, selected from the “Trending topics”. The limit of 1500 tweets is imposed by the Twitter Search API.
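
For reference, the download step can be sketched with Twitter4J’s search API roughly like this (Twitter4J 3.x signatures; the class name and the pagination details of our actual downloader are assumptions):

    import twitter4j.*;

    import java.util.ArrayList;
    import java.util.List;

    public class CollectionDownloader {
        // Pages through the Search API until ~1500 tweets are collected
        // or no further results are available.
        public static List<Status> download(String queryString) throws TwitterException {
            Twitter twitter = new TwitterFactory().getInstance();
            List<Status> collection = new ArrayList<Status>();
            Query query = new Query(queryString);
            query.setCount(100); // maximum page size of the Search API
            QueryResult result;
            do {
                result = twitter.search(query);
                collection.addAll(result.getTweets());
                query = result.nextQuery();
            } while (collection.size() < 1500 && query != null);
            return collection;
        }
    }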

Once the collection has been downloaded, it is indexed using our Java classes. Every tweet is analysed as follows (a sketch of the pipeline appears after the list):

  1. It is considered only if it’s not a re-tweet
  2. If it’s ‘original’, it’s tokenized
  3. Tokens that represent http links, #hashtags*, and @mentions are removed, along with tokens that are ‘stopwords’ (words so common that they carry little semantic weight)
  4. The remaining tokens are stemmed; the results are the terms.
  5. Terms shorter than 3 characters are removed. The others are saved in the index (which represents the dictionary) and in a term list associated with the ID of the corresponding tweet (every list represents a tweet, the set of all the lists represents the collection).
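
A minimal sketch of this pipeline, assuming a simple whitespace tokenizer and placeholder stopword list and stemmer (the real application uses a full English stopword list and a real stemmer):

    import java.util.*;

    public class TweetIndexer {
        // Placeholder stopword list; the real one is far larger.
        private static final Set<String> STOPWORDS = new HashSet<String>(
                Arrays.asList("the", "and", "for", "that", "this", "now"));

        private final Map<String, Integer> dictionary = new HashMap<String, Integer>();      // term -> frequency
        private final Map<Long, List<String>> collection = new HashMap<Long, List<String>>(); // tweet ID -> term list

        public void index(long tweetId, String text) {
            if (text.startsWith("RT ")) return;                      // 1. skip re-tweets (simple heuristic)
            List<String> terms = new ArrayList<String>();
            for (String token : text.toLowerCase().split("\\s+")) {  // 2. tokenize
                if (token.startsWith("http") || token.startsWith("#")
                        || token.startsWith("@") || STOPWORDS.contains(token))
                    continue;                                        // 3. drop links, #hashtags, @mentions, stopwords
                String term = stem(token);                           // 4. stem
                if (term.length() < 3) continue;                     // 5. drop terms shorter than 3 characters
                Integer freq = dictionary.get(term);
                dictionary.put(term, freq == null ? 1 : freq + 1);
                terms.add(term);
            }
            collection.put(tweetId, terms);
        }

        private String stem(String token) {
            // Placeholder: the real application applies an English stemmer here.
            return token.replaceAll("[^a-z]", "");
        }
    }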

This process guarantees that we work only with meaningful terms, removing those with little semantic relevance.
Please note that part of this process is language specific (stopword list, stemmer, …). Our application is set up for tweet collections written in English.

Once the collection is indexed, the data structures built at point 5 are passed as parameters to our set cover solver. The solver, written using libraries for integer linear programming, implements the described model and returns as result a cover represented by a set of terms. It has been implemented in two versions: one uses GLPK’s JNI via JavaILP, the other directly uses CPLEX’s JNI.
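
As an indicative sketch, the GLPK-backed version can be expressed with JavaILP roughly like this (the class name and method signature of our solver are assumptions; the JavaILP calls follow its documented API):

    import net.sf.javailp.*;

    import java.util.*;

    public class SetCoverSolver {
        // Builds and solves the set covering model above. 'collection' maps
        // each tweet ID to its term list (the structures built at point 5).
        public static List<String> cover(Map<Long, List<String>> collection) {
            // The dictionary is the set of distinct terms in the collection.
            Set<String> dictionary = new HashSet<String>();
            for (List<String> terms : collection.values())
                dictionary.addAll(terms);

            Problem problem = new Problem();

            // Objective: minimise the number of selected terms.
            Linear objective = new Linear();
            for (String term : dictionary)
                objective.add(1, term);
            problem.setObjective(objective, OptType.MIN);

            // One constraint per tweet: at least one of its terms is selected.
            for (List<String> terms : collection.values()) {
                if (terms.isEmpty()) continue; // a termless tweet would make the model infeasible
                Linear constraint = new Linear();
                for (String term : new HashSet<String>(terms))
                    constraint.add(1, term);
                problem.add(constraint, ">=", 1);
            }

            // Every x_i is binary.
            for (String term : dictionary)
                problem.setVarType(term, Boolean.class);

            SolverFactory factory = new SolverFactoryGLPK();
            factory.setParameter(Solver.VERBOSE, 0);
            Result result = factory.get().solve(problem);

            // Collect the terms whose variable is set to 1.
            List<String> cover = new ArrayList<String>();
            for (String term : dictionary)
                if (result.get(term).intValue() == 1)
                    cover.add(term);
            return cover;
        }
    }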

The benefit of using JavaILP is the possibility of switching among JNIs while maintaining a single interface: for example, one can use GLPK or lp_solve without modifying the algorithms. Using CPLEX directly, instead, the benefit is far better solving performance.
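
With JavaILP, switching backends means changing only the factory line in the sketch above:

    SolverFactory factory = new SolverFactoryGLPK();       // GLPK via its JNI
    // SolverFactory factory = new SolverFactoryLpSolve(); // lp_solve, same code everywhere else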

* #hashtags are removed because they are usually already present in the query used to download the collection. Moreover, they may lead to a cover that contains just them.

Usage

Usually I launch the application inside Eclipse, so I’ll refer to that environment.

Create a new run configuration using MainILP as the main class.

In VM arguments type: -Djava.library.path=/path/to/your/jni
where /path/to/your/jni is the directory containing your JNI libraries for GLPK, lp_solve, CPLEX, etc. (not provided with this software; for GLPK start looking here).

Then you can fill the Program arguments depending on what you’re doing:

Download a tweet collection:

-d path/to/the/collection query

The program will download from Twitter to path/to/the/collection a tweet collection related to the query. Ex.: -d /home/matteo/Downloads/columbine.xml Columbine

Cover a tweet collection:

-c path/to/the/collection

The program will find a term cover for the tweet collection at path/to/the/collection. Ex.: -c /home/matteo/Downloads/columbine.xml

Cover a tweet collection excluding some terms:

-ec path/to/the/collection word_to_exclude

The program won’t index word_to_exclude and will then find a term cover. Ex.: -ec /home/matteo/Downloads/columbine.xml Colorado

Shortly after the dramatic shooting at the Batman premiere, I downloaded a collection for the term ‘Columbine’. Covering that collection yields this result:

–SOLUTION–
Number of element in the cover: 15.0
columbin, freq: 895
shoot, freq: 298
colorado, freq: 262
peopl, freq: 86
movi, freq: 80
crazi, freq: 45
sad, freq: 27
read, freq: 23
dark, freq: 17
noth, freq: 10
prai, freq: 7
deep, freq: 1
turnonoth, freq: 1
justic, freq: 1
edit, freq: 1

—MOST VALUABLE TWEETS—
[226359371112792064]First the Columbine shooting and now at the movies. That’s so sad. People are so sick. My heart goes out to those in Colorado.
[226359731848085505]im never going to colorado they had columbine and now niggas shooting people in movie theatres
[226358442917511168]apparently colorado is filled with fucked up people who like to shoot other people for no reason. columbine and now the movie shootings. smh
[226357986409476096]Colorado is the weirdest state in the union. The people there are cracked. The Ramseys, Columbine, and a guy who shoots babies at the movies
[226357081932984321]Remind me to never move to Colorado. Jonbenet Ramsey, columbine high school, the dark knight rises shooting. Crazy people in the Midwest.

Download: TTE.jar columbine.xml