trec-kba/kba-2012-hadoop-job

 
 

Hadoop code for TREC KBA

This project contains some Hadoop code for working with the TREC Knowledge Base Acceleration (TREC KBA) dataset. In particular, it provides classes to read/write topic files, read/write run files, and expose the documents in the Thrift files as Hadoop-readable objects (ThriftFileInputFormat).

Installing

Dependencies

The project requires some external jars (see below). Please download them and place them in the lib/ folder.

Building

Then, to build, type

ant -f trec-kba.xml

Eclipse project files are included, so you can also directly check out/build the code there.

Running

The project comes with two example Hadoop applications: a simple genre counter and a toy KBA system. Both assume that the KBA files have been downloaded, un-gpg'ed, and un-xz'ed in the official folder structure. Below, I'm assuming the root folder is kba/kba-stream-corpus-2012-cleansed-only-out.

If you haven't fetched the data yet, you can write a simple shell script and run it with Hadoop streaming to download, un-gpg, and un-xz the files.
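As a rough illustration of such a script, here is a minimal sketch of a streaming mapper. It is not part of this project. It assumes a text file with one corpus URL per line as input, that each URL ends in `<date-hour>/<name>.xz.gpg`, that the decryption key is already imported on every node, and a Hadoop version whose `fs -mkdir` accepts `-p`. The names `fetch_kba.sh`, `relative_path`, `fetch_one`, and `kba/urls.txt` are made up for the example.

```shell
#!/usr/bin/env bash
# fetch_kba.sh (hypothetical name): Hadoop streaming mapper sketch.
# Reads one URL per line on stdin, then downloads, decrypts, and
# decompresses each file straight into HDFS.
#
# Possible invocation (the streaming jar location depends on your install):
#   hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=0 \
#       -input kba/urls.txt -output kba/fetch-log \
#       -mapper "fetch_kba.sh kba/kba-stream-corpus-2012-cleansed-only-out" \
#       -file fetch_kba.sh

# Map a URL to its path relative to the output root, keeping the last
# directory and dropping the .xz.gpg suffix (assumed naming scheme).
relative_path() {
  local url="$1"
  local file="${url##*/}"     # e.g. news.abc.xz.gpg
  local parent="${url%/*}"
  local dir="${parent##*/}"   # e.g. 2011-10-07-14
  printf '%s/%s\n' "$dir" "${file%.xz.gpg}"
}

# Fetch a single URL and store the plain file below the given HDFS root.
fetch_one() {
  local url="$1" out_root="$2" rel
  rel="$(relative_path "$url")"
  hadoop fs -mkdir -p "$out_root/${rel%/*}" < /dev/null
  curl -sf "$url" < /dev/null \
    | gpg --batch --quiet --decrypt \
    | xz --decompress --stdout \
    | hadoop fs -put - "$out_root/$rel"
}

# Process every URL on stdin and emit a status line per URL, which
# ends up in the streaming job's output directory as a simple log.
main() {
  set -o pipefail
  local out_root="$1" url
  while IFS= read -r url; do
    [ -z "$url" ] && continue
    if fetch_one "$url" "$out_root"; then
      printf 'ok\t%s\n' "$url"
    else
      printf 'failed\t%s\n' "$url"
    fi
  done
}

# Only run when an output root is given, so the functions above can
# also be sourced and tried out on their own.
if [ "$#" -ge 1 ]; then
  main "$1"
fi
```

Running the job map-only (zero reducers) keeps the per-URL status lines as the job output, which makes it easy to spot and retry failed downloads.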

Genre counter

This application merely counts the different genres (web, social, news) in the KBA data. To run, use

hadoop jar trec-kba.jar ilps.hadoop.bin.CountGenres \
    -i kba/kba-stream-corpus-2012-cleansed-only-out/*/* \
    -o kba/kba-stream-corpus-2012-cleansed-only-genre-counts
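
To inspect the result, something along the following lines may help. This is only a sketch: it assumes the job writes plain-text part files with one tab-separated `genre<TAB>count` pair per line, which this README does not confirm, and `sum_genres` is a made-up helper name. Summing is only needed when the same genre appears on several lines, for example after running the counter separately on different parts of the corpus.

```shell
# Sum the counts per genre from "genre<TAB>count" lines on stdin and
# print one line per genre, sorted by genre name.
# Assumed usage (output layout is an assumption, see above):
#   hadoop fs -cat 'kba/kba-stream-corpus-2012-cleansed-only-genre-counts/part-*' | sum_genres
sum_genres() {
  awk -F '\t' '
    NF >= 2 { total[$1] += $2 }
    END     { for (g in total) printf "%s\t%d\n", g, total[g] }
  ' | LC_ALL=C sort
}
```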

Toy KBA system

This application is inspired by the Python toy KBA system and provides similar functionality. To run, use

hadoop jar trec-kba.jar ilps.hadoop.bin.ToyKbaSystem \
    -i kba/tiny-kba-stream-corpus/*/* \
    -o kba/tiny-kba-stream-corpus-out \
    -q kba/filter-topics.sample-trec-kba-targets-2012.json \
    -r toy_02 -t UvA -d "My first run." \
    > toy_kba_system.run_1.json

Note that the tiny-kba-stream-corpus can be found in the official toy KBA system.

Type hadoop jar trec-kba.jar ilps.hadoop.bin.ToyKbaSystem --help for all possible options.

Issues

Have a bug? Please create an issue here on GitHub!

License

Copyright 2012 Edgar Meij.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
