This project contains Hadoop code for working with the TREC Knowledge Base Acceleration (TREC KBA) dataset. In particular, it provides classes to read/write topic files, read/write run files, and expose the documents in the Thrift files as Hadoop-readable objects (ThriftFileInputFormat).
The project requires the external jars listed below. Please download them and place them in the lib/ folder.
- CodeModel 2.4.1
- Commons Lang 2.5
- Commons Logging 1.1.1
- Jackson Core 2.0.0
- Jackson Databind 2.0.0
- Jackson Annotations 2.0.0
- hadoop-core-0.20.2-cdh3u3.jar
- libthrift-0.8.0.jar
- log4j 1.2.14
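Before building, it can help to confirm that lib/ actually holds everything in the list above. The following is a minimal sketch, not part of the project: it assumes the jars keep their usual Maven-style file names (name-version.jar), which may not match what you downloaded, and `missing_jars` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report which of the required jars appear to be missing from lib/.
# ASSUMPTION: jars use Maven-style file names (name-version.jar); adjust the
# names below if your downloads are named differently.

REQUIRED_JARS="codemodel-2.4.1 commons-lang-2.5 commons-logging-1.1.1
jackson-core-2.0.0 jackson-databind-2.0.0 jackson-annotations-2.0.0
hadoop-core-0.20.2-cdh3u3 libthrift-0.8.0 log4j-1.2.14"

# Print one line per required jar that is not found in the given directory.
missing_jars() {
  dir="$1"
  for name in $REQUIRED_JARS; do
    if [ ! -f "$dir/$name.jar" ]; then
      echo "$name.jar"
    fi
  done
}

# Check lib/ only if it exists in the current directory.
if [ -d lib ]; then
  missing="$(missing_jars lib)"
  if [ -n "$missing" ]; then
    echo "Missing from lib/:" >&2
    echo "$missing" >&2
  fi
fi
```

Run it from the project root; if it prints nothing, all nine files were found under the assumed names.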
Then, to build, type
ant -f trec-kba.xml
Eclipse project files are included, so you can also directly check out/build the code there.
The project comes with two example Hadoop applications: a simple genre counter and a toy KBA system. Both assume that the KBA files have been downloaded, un-gpg'ed, and un-xz'ed in the official folder structure. Below, I'm assuming the root folder is kba/kba-stream-corpus-2012-cleansed-only-out.
To actually fetch the data (if you haven't already done so), you can write a simple shell script and use Hadoop streaming to fetch, un-gpg, and un-xz the data.
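As a rough illustration of such a script, here is a minimal sketch of a streaming mapper. It is not part of the project and rests on several assumptions: each input line is the URL of one *.xz.gpg file, the file's parent directory is the date-hour folder, the decryption key is already in the local GnuPG keyring on every node, and `KBA_ROOT`, `target_path`, `fetch_one`, and `run_mapper` are hypothetical names.

```shell
#!/bin/sh
# Sketch of a Hadoop streaming mapper that fetches KBA files into HDFS.
# ASSUMPTIONS (not from the project): one *.xz.gpg URL per input line, the
# URL's last directory is the date-hour folder, the GnuPG key is already
# imported, and KBA_ROOT is the HDFS folder to write to.
set -eu

KBA_ROOT="${KBA_ROOT:-kba/kba-stream-corpus-2012-cleansed-only-out}"

# Map a source URL to an HDFS path: keep the last directory and the file
# name, and drop the .xz.gpg suffix.
target_path() {
  file="$(basename "$1" .xz.gpg)"
  dir="$(basename "$(dirname "$1")")"
  printf '%s/%s/%s\n' "$KBA_ROOT" "$dir" "$file"
}

# Download, decrypt, and decompress one file, streaming it straight into HDFS.
# POSIX sh has no pipefail, so a failed download is not detected here and can
# leave an empty file behind; check the results afterwards.
fetch_one() {
  out="$(target_path "$1")"
  curl --silent --show-error --fail "$1" < /dev/null \
    | gpg --quiet --batch --decrypt \
    | xz --decompress --stdout \
    | hadoop fs -put - "$out"
  printf '%s\t%s\n' "$1" "$out"
}

# Mapper loop: one URL per line on stdin, one "url<TAB>path" line on stdout.
run_mapper() {
  while IFS= read -r url; do
    if [ -n "$url" ]; then
      fetch_one "$url"
    fi
  done
}

# Only start reading stdin when asked to, e.g. `fetch_kba.sh map`.
if [ "${1:-}" = "map" ]; then
  run_mapper
fi
```

A script like this could be passed as the `-mapper` of a map-only Hadoop streaming job whose input is a text file of URLs, so that the downloads are spread over the cluster. The streaming jar's location depends on your Hadoop distribution.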
The genre counter merely counts the different genres (web, social, news) in the KBA data. To run, use
hadoop jar trec-kba.jar ilps.hadoop.bin.CountGenres \
-i kba/kba-stream-corpus-2012-cleansed-only-out/*/* \
-o kba/kba-stream-corpus-2012-cleansed-only-genre-counts
The toy KBA system is inspired by the official Python toy KBA system and provides similar functionality. To run, use
hadoop jar trec-kba.jar ilps.hadoop.bin.ToyKbaSystem \
-i kba/tiny-kba-stream-corpus/*/* \
-o kba/tiny-kba-stream-corpus-out \
-q kba/filter-topics.sample-trec-kba-targets-2012.json \
-r toy_02 -t UvA -d "My first run." \
> toy_kba_system.run_1.json
Note that the tiny-kba-stream-corpus can be found in the official toy KBA system.
Type hadoop jar trec-kba.jar ilps.hadoop.bin.ToyKbaSystem --help for all possible options.
Have a bug? Please create an issue here on GitHub!
Copyright 2012 Edgar Meij.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0