Explicitly vs. automatically selected dimensions #51

Open
jstcki opened this issue Feb 4, 2020 · 5 comments

Comments

@jstcki
Contributor

jstcki commented Feb 4, 2020

Hi!

I just noticed a discrepancy between how explicitly and automatically selected dimensions are handled, and another aspect which makes the automatic selects less-than-useful.

  1. Labels are only returned for explicitly selected dimensions (oddly, except for years which have an empty label).
  2. Automatic selects cannot really be used for anything, since they use keys that are derived from the translated dimension label. These slugified keys cannot be re-associated with a dimension (which is necessary to get the dimension's label etc.). So eventually, we end up having to manually select all dimensions anyway.

Point 2 could actually be solved neatly by not generating keys from the label but using the dimension IRI instead. If the behavior in point 1 were consistent (i.e. labels present for auto-selects), this would actually remove the need to explicitly select dimensions at all.

For example:

// Instead of this
[{ forestZone: {...}, canton: {...}}, ...]
// something like this could be returned
[{ "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/1": {...}, "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/2": {...}}, ...]

If IRIs are used as keys, the argument to .select() could simply be an array of components or just their IRIs, instead of me having to specify binding names myself (which is also dangerous since these are not slugified!).
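
To illustrate why IRI keys would help, here is a minimal sketch of re-associating a result row with its dimension metadata. It uses plain objects only and is not the library's API: the row shape, the labels, the placeholder values and the `describeRow` helper are all assumptions made for illustration; only the two IRIs are taken from the example above.

```javascript
// Assumed dimension metadata, e.g. fetched once per cube.
// The IRIs come from the example above; the labels are invented.
const BASE = "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/";
const dimensions = [
  { iri: BASE + "1", label: "Forest zone" },
  { iri: BASE + "2", label: "Canton" },
];

// Index the dimensions by IRI so any row key can be resolved directly.
const dimensionsByIri = new Map(dimensions.map((d) => [d.iri, d]));

// Assumed shape of one IRI-keyed result row (values are placeholders).
const row = {
  [BASE + "1"]: { value: "zone-a" },
  [BASE + "2"]: { value: "canton-a" },
};

// Hypothetical helper: pair every cell of a row with its dimension label.
// Unknown keys get an undefined label instead of throwing.
function describeRow(row, dimensionsByIri) {
  return Object.entries(row).map(([iri, cell]) => {
    const dimension = dimensionsByIri.get(iri);
    return {
      iri,
      label: dimension ? dimension.label : undefined,
      value: cell.value,
    };
  });
}

console.log(describeRow(row, dimensionsByIri));
```

With slugified keys such as `forestZone`, the same lookup would need a reverse mapping from slug to dimension, and that mapping changes with the label's language.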

@vhf
Contributor

vhf commented Feb 10, 2020

Hey, thanks!

  1. Missing labels for automatically selected dimensions: will fix!

  2. To me your suggestion makes sense. It will make a few things uglier, for instance:

    • .groupBy("raum") -> .groupBy("https://ld.stadt-zuerich.ch/statistics/property/RAUM")
    • .filter(({ someDate }) => someDate.not.equals("2019-08-29T07:27:56.241Z")); not possible anymore (no big deal though)

I'll try something and we'll then discuss the details in a PR.

@jstcki
Contributor Author

jstcki commented Feb 10, 2020

Note that querying for labels on all dimensions makes everything much slower, so I wonder if there would be a better way to do this. E.g. by only querying for labels in cube.dimensions() and then stitching them together with a label-less result from cube.query(). Haven't tried though.
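
Roughly what I have in mind for the stitching, as an untested sketch on plain data. Nothing here is the library's API: the `example.org` IRIs, the row shape (dimension IRI mapped to value IRI) and the `stitchLabels` helper are assumptions for illustration.

```javascript
// Assumed lookup from value IRI to label, fetched once and separately
// from the observations.
const valueLabels = new Map([
  ["http://example.org/canton/ZH", "Zürich"],
  ["http://example.org/canton/BE", "Bern"],
]);

// Assumed label-less query result: dimension IRI -> value IRI.
const rows = [
  { "http://example.org/dimension/canton": "http://example.org/canton/ZH" },
  { "http://example.org/dimension/canton": "http://example.org/canton/BE" },
  { "http://example.org/dimension/canton": "http://example.org/canton/XX" },
];

// Hypothetical helper: attach labels to every cell after the fact.
// Values without a known label get an empty string.
function stitchLabels(rows, valueLabels) {
  return rows.map((row) =>
    Object.fromEntries(
      Object.entries(row).map(([dimensionIri, value]) => [
        dimensionIri,
        { value, label: valueLabels.has(value) ? valueLabels.get(value) : "" },
      ])
    )
  );
}

console.log(stitchLabels(rows, valueLabels));
```

The label lookup costs one query per cube (or per dimension) instead of one label join per observation, which is where I would expect the time to be saved.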

@vhf
Contributor

vhf commented Feb 11, 2020

> Note that querying for labels on all dimensions makes everything much slower

Could you please tell us more about this? Would running datacube.components() to fetch all labels be too costly?

@jstcki
Contributor Author

jstcki commented Feb 11, 2020

I meant that currently, cube.select(allDimensions).query() is much slower than cube.select([]).query() because selecting dimensions queries for all dimension value labels on each observation.

This is probably related to #47 … adding labels to the query unfortunately makes it much slower.

BTW, we're currently also always setting all potential languages on the entrypoint, e.g. ["de", "fr", "it", "en", ""], because some datasets may only be available in one of these and it's not clear what the fallback should be. Does adding more languages make the query slower? This could probably be optimized if the datasets declared their available languages correctly.

@vhf
Contributor

vhf commented Feb 12, 2020

Yeah, adding labels definitely makes things slower, and yes, adding more languages makes it even slower.

I think not fetching labels for automatically selected dimensions and using dimension IRIs as keys would solve most of the issue. Users could fetch dimensions and their labels independently and possibly cache them.
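
On the caching part, a user-side cache could be as small as the sketch below. It is only an illustration under assumptions: `createLabelCache` is a hypothetical helper, and `fakeFetchLabels` stands in for whatever call actually fetches dimension labels; neither exists in the library.

```javascript
// Hypothetical helper: memoize a label fetcher per (cube IRI, language).
// The promise itself is cached, so concurrent callers share one request.
function createLabelCache(fetchLabels) {
  const cache = new Map();
  return function getLabels(cubeIri, language) {
    const key = JSON.stringify([cubeIri, language]);
    if (!cache.has(key)) {
      cache.set(key, fetchLabels(cubeIri, language));
    }
    return cache.get(key);
  };
}

// Stand-in for the real label query; counts how often it is called.
let fetchCount = 0;
const fakeFetchLabels = async (cubeIri, language) => {
  fetchCount += 1;
  return new Map([[`${cubeIri}/dimension/1`, `label (${language})`]]);
};

const getLabels = createLabelCache(fakeFetchLabels);
const first = getLabels("http://example.org/cube", "de");
const second = getLabels("http://example.org/cube", "de"); // served from cache
const other = getLabels("http://example.org/cube", "fr"); // new language, new fetch

first.then((labels) => console.log(labels, "fetches:", fetchCount));
```

One caveat of this design: a rejected promise stays in the cache, so a real version would probably want to evict failed entries.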

> This could probably be optimized if the datasets declared available languages correctly.

@ktk what do you think about this, is it possible to declare the languages somewhere?
