
Component values revisited #47

Open
jstcki opened this issue Nov 27, 2019 · 5 comments

Comments

@jstcki
Contributor

jstcki commented Nov 27, 2019

Follow-up to #37 and #38.

It turns out that in practice, using componentsValues() is ~2x slower than fetching componentValues for each dimension in parallel 😅

I'm not sure whether there's a solution at the level of this library (optimizing the generated query), or whether it depends on how the triple store is set up (is it possible to index this query? specify the dimension values explicitly as rdfs:range? …)
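For context, the generated componentValues query is essentially a DISTINCT scan over the observations. A minimal sketch of that query shape (the predicate names and dataset IRI are made up for illustration, not the library's actual output):

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Collect every distinct value of one dimension: the store has to touch
# all observations of the dataset to guarantee the list is complete.
SELECT DISTINCT ?value WHERE {
  ?obs qb:dataSet <https://example.org/dataset> ;
       <https://example.org/dimension/year> ?value .
}
```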

Currently, we're using this functionality for three things:

  1. To correctly infer the data type for filtering (e.g. gYear). This is clearly a workaround, as discussed in Filtering by gYear #32, and I hope we can eventually get rid of it.
  2. To construct a default filter set for the initial query (for this we actually just use the first dimension value).
  3. To show all values to the user, so they can adjust filters.

As mentioned, for 1. I hope this won't be needed at all in the future. For 3., fetching all values up-front via componentsValues is probably overkill; we should just call componentValues lazily when we actually need the values (i.e. when the UI is shown).

👉 For 2. (and 1.) I think it would be really useful to be able to specify a limit on componentValues/componentsValues to avoid over-fetching, because in these cases we really only need one value. I'm not sure about ordering/sampling, because using anything other than LIMIT results in a much slower query (e.g. see http://yasgui.org/short/EyeZfAUrv).
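Sketched as a query (again with made-up IRIs), the idea is that DISTINCT plus a plain LIMIT lets the store stop after the first binding, whereas ORDER BY or the MIN/MAX/SAMPLE aggregates force it to materialize all distinct values first:

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Fetch a single value to seed the default filter; the store can stop
# scanning as soon as one binding is found.
SELECT DISTINCT ?value WHERE {
  ?obs qb:dataSet <https://example.org/dataset> ;
       <https://example.org/dimension/year> ?value .
}
LIMIT 1
```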

@ktk
Member

ktk commented Nov 27, 2019

@herrstucki I'm very interested in slow queries. What would help me is a list of generated SPARQL queries on our datasets that run slowly, so I can look at the query plan and talk to the triple store vendor about optimizing them.

@jstcki
Contributor Author

jstcki commented Nov 27, 2019

@ktk See e.g. the query I linked in my first message http://yasgui.org/short/EyeZfAUrv … when you remove the LIMIT 1 and/or add any operation like ORDER BY or MIN/MAX/SAMPLE, it becomes super slow (>1.5s).

@jstcki
Contributor Author

jstcki commented May 14, 2020

@ktk For reference, this is a dataset with LOTS of observations, which seems to slow down the componentValues query significantly (even though there are only 3 distinct values in this case).

Example query

@jstcki
Contributor Author

jstcki commented May 14, 2020

The above query is ~10x faster if the labels are removed. It's still pretty slow though (~1s).

@l00mi

l00mi commented May 14, 2020

Currently this is solved with a DISTINCT query, which needs to scan all observations to be sure there isn't one more distinct 'value'. So it takes longer (linearly) for bigger datasets. (It can potentially be optimized by first getting the distinct URIs and afterwards fetching the labels for those URIs, since the URIs normally have a more performant index.)
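The two-step optimization mentioned above could look roughly like this, as two separate requests (IRIs are illustrative):

```sparql
# Request 1 (sketch): distinct value URIs only, no label join over the
# observations.
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT DISTINCT ?value WHERE {
  ?obs qb:dataSet <https://example.org/dataset> ;
       <https://example.org/dimension/canton> ?value .
}

# Request 2 (sketch): labels only for the few URIs found in request 1,
# which hits the usually well-indexed label triples instead of rejoining
# the observations.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?value ?label WHERE {
  VALUES ?value { <https://example.org/canton/ZH> <https://example.org/canton/BE> }
  ?value rdfs:label ?label .
}
```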

But overall this should be solved by the 'shape' from the new cube description, which simply defines the possible values explicitly for ordinal dimensions. That works at least as long as no filters are set; filters need to scan anyway, though potentially only a smaller part of the dataset.
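A hypothetical Turtle sketch of such a shape (the vocabulary and IRIs are assumptions based on a SHACL-style cube constraint, not copied from an actual dataset): the constraint enumerates the allowed values with sh:in, so a client can read the list directly instead of scanning observations.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://example.org/cube/shape> a sh:NodeShape ;
  sh:property [
    sh:path <https://example.org/dimension/year> ;
    # Possible values of an ordinal dimension listed explicitly:
    sh:in ( "2017"^^xsd:gYear "2018"^^xsd:gYear "2019"^^xsd:gYear )
  ] .
```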
