GitHub - trec-kba/kba-2012-hadoop-job: This project contains some Hadoop code for working with the TREC Knowledge Base Acceleration dataset. In particular, it provides classes to read/write topic files, read/write run files, and expose the documents in the Thrift files as Hadoop-readable objects.

Hadoop code for TREC KBA

This project contains some Hadoop code for working with the TREC Knowledge Base Acceleration (TREC KBA) dataset. In particular, it provides classes to read/write topic files, read/write run files, and expose the documents in the Thrift files as Hadoop-readable objects (ThriftFileInputFormat).

Installing

Dependencies

The project requires some external jars (see below). Please download them and place them in the lib/ folder.

Building

Then, to build, type

ant -f trec-kba.xml

Eclipse project files are included, so you can also directly check out/build the code there.

Running

The project comes with two example Hadoop applications: a simple genre counter and a toy KBA system. Both assume that the KBA files have been downloaded, un-gpg'ed, and un-xz'ed in the official folder structure. Below, I'm assuming the root folder is kba/kba-stream-corpus-2012-cleansed-only-out.

To actually fetch the data (if you haven't already done so), you can write a simple shell script and use Hadoop streaming to fetch, un-gpg, and un-xz the data.
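As a rough illustration of such a script, here is a minimal sketch of a Hadoop streaming mapper. It is not part of this project. The script name fetch_kba.sh, the OUT_ROOT variable, the helper functions, and the assumption that each URL ends in `<dir>/<name>.xz.gpg` are all hypothetical. It also assumes that curl, gpg, and xz are installed on every node and that the decryption key is already in the gpg keyring.

```shell
#!/usr/bin/env bash
# Hypothetical Hadoop streaming mapper (not part of this project).
# Reads one URL per line on stdin, downloads it, decrypts it with gpg,
# decompresses it with xz, and stores the result in HDFS.
set -euo pipefail

# Root folder in HDFS; defaults to the folder assumed in this README.
OUT_ROOT="${OUT_ROOT:-kba/kba-stream-corpus-2012-cleansed-only-out}"

# Map a URL such as .../<dir>/<name>.xz.gpg to <OUT_ROOT>/<dir>/<name>.
# Assumption: the last two path components mirror the official layout.
hdfs_target() {
    local url="$1"
    local name dir
    name="$(basename "$url")"
    dir="$(basename "$(dirname "$url")")"
    name="${name%.gpg}"   # drop the gpg suffix, if present
    name="${name%.xz}"    # drop the xz suffix, if present
    printf '%s/%s/%s\n' "$OUT_ROOT" "$dir" "$name"
}

# Fetch, un-gpg, and un-xz a single URL, streaming the result into HDFS.
fetch_one() {
    local url="$1"
    local target
    target="$(hdfs_target "$url")"
    # "-mkdir -p" is Hadoop 2+ syntax; older releases may need plain -mkdir.
    hadoop fs -mkdir -p "$(dirname "$target")"
    curl --silent --show-error --fail "$url" \
        | gpg --batch --quiet --decrypt \
        | xz --decompress --stdout \
        | hadoop fs -put - "$target"
    # Emit a status line so the job output records what was stored.
    printf '%s\t%s\n' "$url" "$target"
}

# Run the mapper loop only when invoked as "fetch_kba.sh map".
if [ "${1:-}" = "map" ]; then
    while IFS= read -r url; do
        if [ -n "$url" ]; then
            fetch_one "$url"
        fi
    done
fi
```

Such a script could be run as a map-only streaming job over a text file of URLs. That means passing -input, -output, -mapper "fetch_kba.sh map", -file fetch_kba.sh, and -numReduceTasks 0 to the Hadoop streaming jar. The location of that jar depends on your Hadoop installation.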

Genre counter

This application merely counts the different genres (web, social, news) in the KBA data. To run, use

hadoop jar trec-kba.jar ilps.hadoop.bin.CountGenres \
    -i kba/kba-stream-corpus-2012-cleansed-only-out/*/* \
    -o kba/kba-stream-corpus-2012-cleansed-only-genre-counts

Toy KBA system

This application is inspired by the Python toy KBA system and provides similar functionality. To run, use

hadoop jar trec-kba.jar ilps.hadoop.bin.ToyKbaSystem \
    -i kba/tiny-kba-stream-corpus/*/* \
    -o kba/tiny-kba-stream-corpus-out \
    -q kba/filter-topics.sample-trec-kba-targets-2012.json \
    -r toy_02 -t UvA -d "My first run." \
    > toy_kba_system.run_1.json

Note that the tiny-kba-stream-corpus can be found in the official toy KBA system.

Type hadoop jar trec-kba.jar ilps.hadoop.bin.ToyKbaSystem --help for all possible options.

Issues

Have a bug? Please create an issue here on GitHub!

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
