hdt::QueryProcessor.searchJoin() gives incorrect results #265

Open
donpellegrino opened this issue Aug 29, 2022 · 6 comments

@donpellegrino
Contributor

Filing this issue for hdt-cpp work on RDFLib/rdflib-hdt#14. See other issue for test case and additional details.

donpellegrino added a commit to DeciSym/hdt-cpp that referenced this issue Aug 29, 2022
… BasicVarBindingString class to its own pair of implementation files to improve readability.
@mielvds
Member

mielvds commented Aug 30, 2022

Strange indeed. Are you following up on this?

@donpellegrino
Contributor Author

I have read through some of the code, but I am still investigating the cause. My next step is to trace through the execution of the test case and see if I can find where the logic breaks. I have a sandbox set up with the Python and C++ repositories working together. So far, I have just run clang-format on the relevant C++ classes and moved BasicVarBindingString to its own .hpp/.cpp files. Based on the comments in the code, it looks like hdt::QueryProcessor.searchJoin() was a work in progress and was never fully implemented.

@mielvds - if you or anyone else knows the history of QueryProcessor.searchJoin(), please let me know.

@mielvds
Member

mielvds commented Aug 30, 2022

Only @MarioAriasGa and, if you're lucky, @LaurensRietveld might know more.

@donpellegrino
Contributor Author

Around https://github.com/rdfhdt/hdt-cpp/blob/develop/libhdt/src/sparql/QueryProcessor.cpp#L90, I suspect the triplePatID variable is assigned "0" in two cases that should be distinct. A "0" is used when a subject, predicate, or object is a variable and will therefore match anything. However, a "0" is also used when the string does not match anything in the dictionary. Thus, strings that have no match are effectively treated as variables that match anything.
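
To make the suspected ambiguity concrete, here is a minimal sketch. It does not use the hdt-cpp API: `ToyDictionary`, `ambiguousId`, and `resolveId` are hypothetical names invented for illustration, and the only thing carried over from the description above is the convention that 0 stands both for "variable" and for "not in the dictionary".

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical toy dictionary (NOT the hdt-cpp Dictionary class).
// Known terms get IDs starting at 1; unknown terms yield 0.
struct ToyDictionary {
    std::map<std::string, std::size_t> ids;

    std::size_t stringToId(const std::string& term) const {
        auto it = ids.find(term);
        return it == ids.end() ? 0 : it->second;
    }
};

// Assumption for this sketch: variables are written with a leading '?'.
bool isVariable(const std::string& term) {
    return !term.empty() && term[0] == '?';
}

// Ambiguous conversion: a variable and an unknown term both become 0,
// so the caller cannot tell "match anything" from "match nothing".
std::size_t ambiguousId(const ToyDictionary& dict, const std::string& term) {
    if (isVariable(term)) {
        return 0;
    }
    return dict.stringToId(term);
}

// One possible way to keep the cases apart: 0 still means wildcard,
// while std::nullopt means the term is unknown and the pattern
// cannot match any triple.
std::optional<std::size_t> resolveId(const ToyDictionary& dict,
                                     const std::string& term) {
    if (isVariable(term)) {
        return 0;
    }
    const std::size_t id = dict.stringToId(term);
    if (id == 0) {
        return std::nullopt;
    }
    return id;
}
```

With a disambiguated result like this, a caller could return an empty result set as soon as any constant in a pattern resolves to "unknown", instead of running the pattern with a wildcard in that position. Whether that is the right fix for QueryProcessor.cpp is still an open question.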

@mielvds
Member

mielvds commented Aug 31, 2022

TBH, I didn't even know that there was a (partial) SPARQL implementation in HDT-CPP. My guess is that it is used nowhere and was probably never finished. In the Java version, the query processing is offloaded to Jena; maybe something similar is possible with oxigraph or even rdflib.

@donpellegrino
Contributor Author

It looks like hdt::QueryProcessor is limited to processing Basic Graph Patterns (BGP), so it is still short of a SPARQL implementation. I understand that one branch of code queries triples via a single BGP at a time. The RDFLib/rdflib-hdt library uses that approach by default. The hdt::QueryProcessor appears to extend that capability to add efficiencies for the case of multiple BGPs at once. This is critical for performance and for leveraging the Dictionary (index). The rdflib-hdt library has an optimize_sparql() function that causes it to use the QueryProcessor for multiple BGPs instead of querying one BGP at a time and then aggregating them in the rdflib SPARQL engine.

I suspect that any pluggable SPARQL engine sitting on top of HDT (e.g., Apache Jena ARQ, Python rdflib, etc.) can interface with the HDT function for querying a single BGP at a time. But whenever that approach is taken, performance will be left on the table, as the Dictionary may remain underutilized for specific optimizations. An HDT function (hdt::QueryProcessor.searchJoin) that accepts a set of BGPs and gives an efficient response does seem like it would be an essential interface underneath any SPARQL engine.
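
As a rough illustration of the kind of saving I mean, here is a toy sketch of a join that is pushed down to the ID level. It is not how hdt::QueryProcessor.searchJoin is implemented, and none of these names exist in hdt-cpp: `Triple`, `searchPattern`, and `joinObjectSubject` are hypothetical. It also assumes a single shared ID space for subjects and objects, which is a simplification.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical ID triple: {subject, predicate, object}. IDs start at 1.
using Triple = std::array<std::size_t, 3>;

// Single-pattern lookup over an in-memory list; 0 acts as a wildcard.
std::vector<Triple> searchPattern(const std::vector<Triple>& data,
                                  const Triple& pattern) {
    std::vector<Triple> out;
    for (const Triple& t : data) {
        bool match = true;
        for (std::size_t i = 0; i < 3; ++i) {
            if (pattern[i] != 0 && pattern[i] != t[i]) {
                match = false;
                break;
            }
        }
        if (match) {
            out.push_back(t);
        }
    }
    return out;
}

// Joins (?x p1 ?y) with (?y p2 ?z) and returns the {x, y, z} bindings.
// Each ?y bound by the first pattern is substituted into the second
// pattern as an ID, so no strings are decoded and the second pattern
// is never fetched in full.
std::vector<Triple> joinObjectSubject(const std::vector<Triple>& data,
                                      std::size_t p1, std::size_t p2) {
    std::vector<Triple> bindings;
    for (const Triple& a : searchPattern(data, Triple{0, p1, 0})) {
        for (const Triple& b : searchPattern(data, Triple{a[2], p2, 0})) {
            bindings.push_back(Triple{a[0], a[2], b[2]});
        }
    }
    return bindings;
}
```

An engine that only has the single-pattern call would typically fetch the results of both patterns, convert the IDs to strings, and join on the strings. Doing the substitution on IDs inside the store avoids that round trip, which is the efficiency I would expect a working searchJoin to provide.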

It would be interesting to compare how the HDT Java version handles this. If anyone familiar with that codebase could confirm my assumptions of how things work and point me to the relevant Java code for comparison, that would be very helpful.
