Support query of multiple HDT files from CLI #166

donpellegrino · 2022-07-20T22:02:17Z

Querying HDT with SPARQL from the CLI only accepts a single HDT file at a time (https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rdfhdt/hdtjena/cmd/HDTSparql.java). It would be a useful enhancement if multiple HDT files could be provided, and the query run over the aggregation.

One candidate implementation might use a Jena DatasetFactory for the aggregation, but I have not seen an example of how that might be used. If anyone can post an example of the correct use of Jena for this, then I should be able to implement the feature in HDTSparql.java.

ate47 · 2022-07-20T23:15:15Z

I think you can achieve that with the ModelFactory#createUnion(Model, Model) method, which returns a Model. Datasets are usually meant for named graphs.

But if you use that, note that the Union implementation works with only 2 models and uses a HashSet to store the triples it has already seen. Chaining multiple Unions, with one union per HDT file, might consume a lot of memory.
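A minimal sketch of the chaining idea, assuming Jena 3 or later (org.apache.jena packages). The class name UnionSketch, the helper unionAll, and the example.org data are hypothetical. In-memory models stand in for the HDT-backed ones here; in HDTSparql each model would presumably come from ModelFactory.createModelForGraph(new HDTGraph(hdt)) instead.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

class UnionSketch {

    // Hypothetical helper: folds N models into a chain of two-way unions.
    // Each createUnion call wraps the previous union, so N models give N-1 nested unions.
    static Model unionAll(List<Model> models) {
        if (models.isEmpty()) {
            return ModelFactory.createDefaultModel();
        }
        Model union = models.get(0);
        for (Model next : models.subList(1, models.size())) {
            union = ModelFactory.createUnion(union, next);
        }
        return union;
    }

    public static void main(String[] args) {
        // Stand-ins for HDT-backed models (assumption: real code would wrap an HDTGraph per file).
        Model a = ModelFactory.createDefaultModel();
        a.add(a.createResource("http://example.org/s1"), a.createProperty("http://example.org/p"), "from A");

        Model b = ModelFactory.createDefaultModel();
        b.add(b.createResource("http://example.org/s2"), b.createProperty("http://example.org/p"), "from B");

        Model union = unionAll(Arrays.asList(a, b));

        // Run one SPARQL query over the aggregated view.
        String sparql = "SELECT ?s ?o WHERE { ?s <http://example.org/p> ?o }";
        try (QueryExecution qe = QueryExecutionFactory.create(sparql, union)) {
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
    }
}
```

The union is a view rather than a copy, so the underlying models are not loaded again, but the memory concern about the per-union set of seen triples still applies to each level of the chain.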

Edit: A method internal to HDT-CORE would be better (and harder) to implement, if you can :)

donpellegrino · 2022-07-21T13:58:13Z

It looks like the Apache Jena API DatasetFactory, Dataset.addNamedModel, and Dataset.getUnionModel could be combined as another approach. @ate47 - Do you have any thoughts on what the consequences or efficiency of ModelFactory.createUnion would be versus Dataset.getUnionModel?

I can take a look at HDT-CORE as well. @ate47 - do you have a class or function point you could suggest for me to use as a starting point?

ate47 · 2022-07-21T14:50:32Z

I'm not sure, but as far as I remember, you need to run store updates on the main dataset to merge the union model. That means you need a Jena model, because the HDT model can't handle updates, and it would take a long time to load and a lot of memory to manage. I'm not an expert on this part, though, so feel free to try it.

To learn the internals of HDT, I would suggest reading this submission about it. Then you can start with the dictionaries: the default implementation (org.rdfhdt.hdt.dictionary.impl.FourSectionDictionary) is the easiest to understand. From there, continue with the org.rdfhdt.hdt.compact packages and the way the bitmaps are used in the org.rdfhdt.hdt.triples.impl.BitmapTriples class, and read org.rdfhdt.hdt.hdt.impl.HDTImpl to see how everything is linked together.
