Support query of multiple HDT files from CLI #166

Open
donpellegrino opened this issue Jul 20, 2022 · 3 comments

Comments

@donpellegrino

Querying HDT with SPARQL from the CLI only accepts a single HDT file at a time (https://github.com/rdfhdt/hdt-java/blob/master/hdt-jena/src/main/java/org/rdfhdt/hdtjena/cmd/HDTSparql.java). It would be a useful enhancement if multiple HDT files could be provided, and the query run over the aggregation.

One candidate implementation might use a Jena DatasetFactory for the aggregation, but I have not seen an example of how that might be used. If anyone can post an example of the correct use of Jena for this, then I should be able to implement the feature in HDTSparql.java.

@ate47
Contributor

ate47 commented Jul 20, 2022

I think you can achieve that with the ModelFactory#createUnion(Model, Model) method, which returns a Model; datasets are usually meant for named graphs.

But if you use that, note that the Union implementation works with only two models and uses a HashSet to store the triples it has already seen. Chaining multiple unions, one per HDT file, might consume a lot of memory.
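A minimal sketch of the chained-union idea, not a tested patch for HDTSparql.java. It uses plain in-memory Jena models as stand-ins; the assumption is that in hdt-java each model would instead wrap an HDT file (for example via ModelFactory.createModelForGraph on an HDTGraph). The class name UnionSketch, the helpers unionAll and sample, and the example.org URIs are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

class UnionSketch {
    // Hypothetical helper: folds a list of models into a single view by
    // chaining the binary ModelFactory.createUnion, one union per extra model.
    static Model unionAll(List<Model> models) {
        if (models.isEmpty()) {
            return ModelFactory.createDefaultModel();
        }
        Model result = models.get(0);
        for (Model next : models.subList(1, models.size())) {
            result = ModelFactory.createUnion(result, next);
        }
        return result;
    }

    // Stand-in for an HDT-backed model: one statement in an in-memory model.
    static Model sample(String subjectUri) {
        Model m = ModelFactory.createDefaultModel();
        m.add(m.createResource(subjectUri),
              m.createProperty("http://example.org/p"),
              "value");
        return m;
    }

    public static void main(String[] args) {
        Model union = unionAll(Arrays.asList(
                sample("http://example.org/a"),
                sample("http://example.org/b"),
                sample("http://example.org/c")));
        // Counts the statements visible through the chained unions.
        System.out.println(union.size());
    }
}
```

With n input models this builds n - 1 nested unions, which is where the memory concern above comes from: each level may track seen triples while iterating.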

Edit: A method internal to HDT-CORE would be a better (though harder) solution, if you can implement it :)

@donpellegrino
Author

It looks like the Apache Jena API DatasetFactory, Dataset.addNamedModel, and Dataset.getUnionModel could be combined as another approach. @ate47 - Do you have any thoughts on what the consequences or efficiency of ModelFactory.createUnion would be versus Dataset.getUnionModel?
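For comparison, a rough sketch of the Dataset-based approach, again with in-memory models standing in for HDT-backed ones. It assumes a Jena version that provides Dataset.getUnionModel, and it uses DatasetFactory.createGeneral on the understanding (from the Jena documentation) that this dataset type links to models added with addNamedModel rather than copying them; whether that holds up for HDT graphs is untested here. The class name DatasetUnionSketch, its helpers, and the urn:example graph names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

class DatasetUnionSketch {
    // Hypothetical helper: registers each model as a named graph in a
    // general-purpose dataset, under a made-up graph name.
    static Dataset asDataset(List<Model> models) {
        Dataset ds = DatasetFactory.createGeneral();
        for (int i = 0; i < models.size(); i++) {
            ds.addNamedModel("urn:example:hdt:" + i, models.get(i));
        }
        return ds;
    }

    // Counts the statements visible through the union of all named graphs.
    static int countUnionStatements(Dataset ds) {
        return ds.getUnionModel().listStatements().toList().size();
    }

    // Stand-in for an HDT-backed model: one statement in an in-memory model.
    static Model sample(String subjectUri) {
        Model m = ModelFactory.createDefaultModel();
        m.add(m.createResource(subjectUri),
              m.createProperty("http://example.org/p"),
              "value");
        return m;
    }

    public static void main(String[] args) {
        Dataset ds = asDataset(Arrays.asList(
                sample("http://example.org/a"),
                sample("http://example.org/b")));
        System.out.println(countUnionStatements(ds));
    }
}
```

Unlike the chained ModelFactory.createUnion, this keeps a flat set of named graphs, so a query could also address an individual file through a SPARQL GRAPH clause.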

I can take a look at HDT-CORE as well. @ate47 - is there a class or function you could suggest as a starting point?

@ate47
Contributor

ate47 commented Jul 21, 2022

I'm not sure, but as far as I remember, you need to run store updates on the main dataset to merge the union model. That means you need a Jena model, because the HDT model can't handle updates, and it will take a long time to load and be costly to keep in memory. I'm not an expert on this part, though, so you can try it if you want.

To learn how HDT works internally, I would suggest reading this submission about it. Then you can start with the dictionaries: the default implementation (org.rdfhdt.hdt.dictionary.impl.FourSectionDictionary) is the easiest to understand. After that, move on to the org.rdfhdt.hdt.compact packages and to how the bitmaps are used in the org.rdfhdt.hdt.triples.impl.BitmapTriples class, and finally read org.rdfhdt.hdt.hdt.impl.HDTImpl to see how everything is linked together.
