
Component values revisited #47

Open
jstcki opened this issue Nov 27, 2019 · 5 comments

Comments

@jstcki
Contributor

jstcki commented Nov 27, 2019

Follow-up to #37 and #38.

It turns out that in practice, using componentsValues() is ~2x slower than fetching componentValues for each dimension in parallel 😅

I'm not sure whether there's a solution at the level of this library (optimizing the generated query), or whether it depends on how the triple store is set up (is it possible to index this query? specify the dimension values explicitly as rdfs:range? …)
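For context, the generated componentValues query is essentially a DISTINCT scan over the observations. A minimal sketch of that query shape (the predicate names and dataset IRI are made up for illustration, not the library's actual output):

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Collect every distinct value of one dimension: the store has to touch
# all observations of the dataset to guarantee the list is complete.
SELECT DISTINCT ?value WHERE {
  ?obs qb:dataSet <https://example.org/dataset> ;
       <https://example.org/dimension/year> ?value .
}
```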

Currently, we're using this functionality for three things:

  1. To correctly infer the data type for filtering (e.g. gYear). This is clearly a workaround, as discussed in Filtering by gYear #32, and I hope we can eventually get rid of it.
  2. To construct a default filter set for the initial query (for this we actually just use the first dimension value).
  3. To show all values to the user, so they can adjust filters.

As mentioned, for 1. I hope this won't be needed at all in the future. For 3., fetching all values up-front via componentsValues is probably overkill; we should just call componentValues lazily when we actually need the values (i.e. when the UI is shown).

👉 For 2. (and 1.) I think it would be really useful to be able to specify a limit on componentValues/componentsValues to avoid over-fetching, because in these cases we really only need one value. I'm not sure about ordering/sampling, because using anything other than LIMIT results in a much slower query (e.g. see http://yasgui.org/short/EyeZfAUrv).
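Sketched as a query (again with made-up IRIs), the idea is that DISTINCT plus a plain LIMIT lets the store stop after the first binding, whereas ORDER BY or the MIN/MAX/SAMPLE aggregates force it to materialize all distinct values first:

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Fetch a single value to seed the default filter; the store can stop
# scanning as soon as one binding is found.
SELECT DISTINCT ?value WHERE {
  ?obs qb:dataSet <https://example.org/dataset> ;
       <https://example.org/dimension/year> ?value .
}
LIMIT 1
```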

@ktk
Member

ktk commented Nov 27, 2019

@herrstucki I'm very interested in slow queries. What would help me is a list of generated SPARQL queries on our datasets that run slowly, so I can look at the query plan and talk to the triple store vendor about optimizing them.

@jstcki
Contributor Author

jstcki commented Nov 27, 2019

@ktk See e.g. the query I linked in my first message http://yasgui.org/short/EyeZfAUrv … when you remove the LIMIT 1 and/or add any operation like ORDER BY or MIN/MAX/SAMPLE, it becomes super slow (>1.5s).

@jstcki
Contributor Author

jstcki commented May 14, 2020

@ktk For reference, this is a dataset with LOTS of observations, which seems to slow down the componentValues query significantly (even though there are only 3 distinct values in this case).

Example query

@jstcki
Contributor Author

jstcki commented May 14, 2020

The above query is ~10x faster if the labels are removed. It's still pretty slow though (~1s).

@l00mi

l00mi commented May 14, 2020

Currently this is solved with a DISTINCT query, which needs to scan all observations to be sure there isn't one more distinct 'value'. So it takes longer (linearly) for bigger datasets. (It can potentially be optimized by first getting the distinct URIs and afterwards fetching the labels for those URIs, since the URIs normally have a more performant index.)
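The two-step optimization mentioned above could look roughly like this, as two separate requests (IRIs are illustrative):

```sparql
# Request 1 (sketch): distinct value URIs only, no label join over the
# observations.
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT DISTINCT ?value WHERE {
  ?obs qb:dataSet <https://example.org/dataset> ;
       <https://example.org/dimension/canton> ?value .
}

# Request 2 (sketch): labels only for the few URIs found in request 1,
# which hits the usually well-indexed label triples instead of rejoining
# the observations.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?value ?label WHERE {
  VALUES ?value { <https://example.org/canton/ZH> <https://example.org/canton/BE> }
  ?value rdfs:label ?label .
}
```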

But overall this should be solved by the 'shape' from the new cube description, which simply defines the possible values explicitly for ordinal dimensions. That works at least as long as no filters are set; filters need to scan anyway, though potentially only a smaller part of the dataset.
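A hypothetical Turtle sketch of such a shape (the vocabulary and IRIs are assumptions based on a SHACL-style cube constraint, not copied from an actual dataset): the constraint enumerates the allowed values with sh:in, so a client can read the list directly instead of scanning observations.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/cube/shape> a sh:NodeShape ;
  sh:property [
    sh:path <https://example.org/dimension/year> ;
    # Possible values of an ordinal dimension listed explicitly:
    sh:in ( "2017"^^xsd:gYear "2018"^^xsd:gYear "2019"^^xsd:gYear )
  ] .
```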
