
Random access pagination or iteration? #59

Open
lhannest opened this issue Oct 15, 2018 · 4 comments

Comments

@lhannest
Member

@cmungall @RichardBruskiewich @vdancik Must we support random access pagination (getting a page of a given offset and size)? Would it be troubling to support only iteration (getting the next page of a given size)?

There may not be a bijective mapping from the knowledge source's records to the statements we want to extract. Sometimes I might infer a single statement from multiple records, or multiple statements from a single record. And sometimes I'm not able to apply filters when getting records from the knowledge source, and I must throw away records that don't match the filters. This is easy if we don't need to support random access, and pretty challenging otherwise. With NDEx we've been caching all results and then returning pages from the cache, but that seems like a pretty impractical solution.

@vdancik
Contributor

vdancik commented Oct 15, 2018

We chose pagination (offset + size) because it allows servers to fulfill requests without needing to internally keep the "status" of the requests.

@lhannest
Member Author

lhannest commented Oct 16, 2018

Maybe I'm not explaining this very well, but I'm talking about a case where the server cannot fulfill the request without keeping some kind of information about the requests: there is no function mapping an offset and size over beacon records to an offset and size over knowledge source records.

NDEx is an example of this (you get pages of networks, and the size of those networks is variable), and so far I think Rhea's SPARQL endpoint is too--though maybe that's just because I'm not very familiar with SPARQL.

Another solution is to allow beacon responses to be larger or smaller than the requested page size. Maybe the requested page size could be treated as a suggestion rather than a requirement.

@lhannest
Member Author

lhannest commented Nov 8, 2018

Instead of an iterator key, the server could pass back a next page token. This would be the best of both worlds: the server wouldn't have to keep track of each client, but it would also have the freedom to use more than just size and offset to get pages. For NDEx the token could encode a network offset and an offset within that network.
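To illustrate the idea, here's a minimal sketch of such an opaque next page token (all names here are hypothetical, not part of any beacon API), assuming the server only needs a network offset and a within-network record offset to resume:

```python
import base64
import json

# Hypothetical sketch: pack the server-side cursor state into an opaque
# string the client echoes back. The client never inspects the contents.

def encode_token(network_offset, record_offset):
    """Serialize the cursor state and base64-encode it as a URL-safe token."""
    state = {"net": network_offset, "rec": record_offset}
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()

def decode_token(token):
    """Recover the cursor state from a token returned by the client."""
    state = json.loads(base64.urlsafe_b64decode(token.encode()))
    return state["net"], state["rec"]

token = encode_token(3, 250)  # resume at network 3, record 250
assert decode_token(token) == (3, 250)
```

Because all of the resume state travels inside the token, the server stays stateless between page requests.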

@RichardBruskiewich
Collaborator

@cmungall had the right idea in talking about the notion of a database "cursor". At the end of the day, it is all about simply retrieving all the relevant knowledge in this wild west of relatively boundless knowledge harvesting (graph processing can be NP-hard!). Streaming rather than random access satisfies this urge.

@lhannest's idea of returning a "next page" token containing "server specific" state is fine (@hsolbrig had a similar thought at the hackathon, albeit expressed slightly differently, more like a HATEOAS "more data" URL).

That said, wouldn't one need to somehow account for all the parameters of the original query, not just some naked index into the data, like the (NDEx) network offset and offset within that network? In other words, what constitutes the total "state" of a given query that informs the server to "continue" the work of retrieving a specific chunk of data?

Unless one encodes all such information into the return values, this still smells a bit like some kind of web server "session" state management. Maybe it doesn't make sense to have the beacon server completely abdicate cursor management responsibility... maybe it still has to have some provisions to keep track of an "ongoing" query.

Even so, as @lhannest suggests, I suspect that it suffices to keep beacons "sequential streaming" rather than "random access paging", and let client software worry about presenting a cleanly behaved paging world to the end users.

For example, KBA, as a representative "client" of beacons, already "harvests" statements into its local Neo4j graph cache, making the data set better behaved with respect to offset/size paging; however, the ordering of the data is then based on Neo4j's ordering, not the original knowledge sources'.
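One way to address the "total state of the query" concern above is to fold the original query parameters into the token itself, so no session management is needed. A hypothetical sketch (all names illustrative):

```python
import base64
import json

# Illustrative sketch: the token carries both the cursor and the query
# parameters it belongs to, so the server can detect a mismatched resume.

def make_token(query_params, cursor):
    """Bundle the query parameters and cursor into one opaque token."""
    payload = {"q": query_params, "cursor": cursor}
    data = json.dumps(payload, sort_keys=True).encode()
    return base64.urlsafe_b64encode(data).decode()

def resume(token, query_params):
    """Validate the token against the current query and return the cursor."""
    payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    if payload["q"] != query_params:
        # The client changed the query mid-pagination; the cursor is invalid.
        raise ValueError("token does not match the current query")
    return payload["cursor"]

query = {"subject": "X", "predicate": "related_to"}
token = make_token(query, {"net": 1, "rec": 10})
assert resume(token, query) == {"net": 1, "rec": 10}
```

A production version would likely sign or encrypt the token so clients can't tamper with it, but the stateless principle is the same.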
