
Support multiple tables / star schema #3

Open
pudo opened this issue Dec 4, 2015 · 10 comments

pudo (Contributor) commented Dec 4, 2015

Right now, only a single fact table is supported, and aggregations over dimensions assume that all columns for a given dimension are fully denormalised onto that table. The application should instead support JOINs across a star schema.
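To make the distinction concrete, here is a minimal sketch of what a star-schema aggregation would look like. Table and column names (`fact_budget`, `dim_program`) are hypothetical, not babbage's actual model:

```python
# Minimal star-schema sketch; table and column names are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")  # any SQL backend would do

with engine.begin() as conn:
    # Dimension attributes live in their own table; the fact table only
    # holds foreign keys and measures.
    conn.execute(text(
        "CREATE TABLE dim_program (id INTEGER PRIMARY KEY, code TEXT, label TEXT)"))
    conn.execute(text(
        "CREATE TABLE fact_budget (program_id INTEGER REFERENCES dim_program (id), "
        "amount NUMERIC)"))
    conn.execute(text(
        "INSERT INTO dim_program VALUES (1, 'P01', 'Health'), (2, 'P02', 'Education')"))
    conn.execute(text("INSERT INTO fact_budget VALUES (1, 100), (1, 50), (2, 75)"))

    # Aggregating by a dimension attribute now requires a JOIN, where the
    # current denormalised layout can simply GROUP BY a column in place.
    rows = conn.execute(text("""
        SELECT d.label, SUM(f.amount) AS amount_sum
        FROM fact_budget f
        JOIN dim_program d ON d.id = f.program_id
        GROUP BY d.label
    """)).fetchall()
    print(rows)  # e.g. [('Education', 75), ('Health', 150)]
```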

pwalsh (Member) commented Dec 7, 2015

@akariv first we need to understand whether we need this. Assigning to you to investigate as part of the general deep dive on babbage.

pudo (Contributor, Author) commented Dec 7, 2015

It might be useful to distinguish between loading from multiple resources and keeping multiple tables for dimensions internally. The two are almost certainly a "false friend" equality (i.e. they look like the same thing but in practice turn out not to be).

pwalsh (Member) commented Dec 7, 2015

@pudo totally agree. It seems to me that single resources, in FDP-speak, could (should?) load into multiple tables (a star schema).

I definitely do not want babbage to change to reflect the various modelling strategies in an FDP mapping: rather, the loader should work with the mapping to write to a consistent (normalised?) babbage model.

I guess you are currently in the best position to advise on a way forward around that.

pudo (Contributor, Author) commented Dec 7, 2015

There was an interesting thought from @Stiivi (also relevant to #4): he proposed loading the source facts as-is into a JSON column on the fact table, and then generating the actual columns and star schema inside the database, via PostgreSQL commands, as the model is defined.

Otherwise you end up in the spot where, each time someone changes the label for an attribute, you have to re-load 900 MB of source CSV.
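A rough sketch of that load-then-project idea. The proposal targets a PostgreSQL JSONB column; SQLite's json_extract() stands in here only so the snippet is self-contained, and all names are hypothetical:

```python
# Load facts as-is into a JSON column, then derive typed columns inside
# the database once the model is known. Stiivi's proposal targets
# PostgreSQL JSONB; SQLite's json_extract() is used here only so the
# sketch runs anywhere.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE fact_raw (id INTEGER PRIMARY KEY, doc TEXT)"))
    conn.execute(text("""
        INSERT INTO fact_raw (doc) VALUES
        ('{"program": "Health", "amount": 100}'),
        ('{"program": "Health", "amount": 50}')
    """))

    # When the model (re-)defines an attribute, project it out of the JSON
    # in-database; there is no need to re-load the source CSV.
    conn.execute(text("ALTER TABLE fact_raw ADD COLUMN amount NUMERIC"))
    conn.execute(text(
        "UPDATE fact_raw SET amount = json_extract(doc, '$.amount')"))
    print(conn.execute(text("SELECT SUM(amount) FROM fact_raw")).scalar())  # 150
```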

pwalsh (Member) commented Dec 15, 2015

I'm going to backlog this now. We don't need it yet, and while there are great ideas in here, let's focus on our core tasks first.

pwalsh modified the milestones: Backlog, Current (Dec 15, 2015)
larjohn commented Apr 21, 2016

I am not sure if now is the time to pop this out of the backlog.

We are interested in using such a feature in openbudgets.eu. Leaving aside the backend implementation, on which I can't offer an opinion yet, what is the suggested way to support this API-wise?

Would it be a call to the aggregates/facts etc. endpoints without a cube path part?

akariv (Member) commented Apr 22, 2016

From an API point of view, I would say nothing needs to change.
The API should present the user with a denormalized data table, while using normalized tables in the backend.
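One way to honour that contract, sketched with hypothetical names: expose a view that joins the star schema back into the flat shape the query layer already expects. (On PostgreSQL, a materialised view could serve the same purpose with better aggregation performance.)

```python
# Sketch: normalised tables behind a denormalised view, so queries built
# against flat column names keep working unchanged. Names are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE dim_program (id INTEGER PRIMARY KEY, label TEXT)"))
    conn.execute(text("CREATE TABLE fact_budget (program_id INTEGER, amount NUMERIC)"))
    conn.execute(text("""
        CREATE VIEW flat_budget AS
        SELECT d.label AS program_label, f.amount
        FROM fact_budget f JOIN dim_program d ON d.id = f.program_id
    """))

    # The API layer can keep issuing the same "single wide table" queries:
    conn.execute(text(
        "SELECT program_label, SUM(amount) FROM flat_budget GROUP BY program_label"))
```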

larjohn commented Apr 22, 2016

Yes, but how do you imagine a call to this service? This call:

http://next.openspending.org/api/3/cubes/405f5e1681d4b6673fbfc0ceb196fe47:boost-armenia-2006-2014/aggregate?drilldown=activity.program&pagesize=30&cut=classification.econ_func%3A%22Economic%22&order=adjusted.sum%3Adesc

already contains the cube name. Will "cube-less" queries be supported somehow?

pudo (Contributor, Author) commented Apr 22, 2016

Just out of interest: what are the semantics of a cube-less call? I imagine it would be quite hard to aggregate budgetary data properly across even slightly different origins. If we're talking RDF here: wouldn't it be less dangerous for a data integration tool to actually materialise merged cubes with well-defined semantics? (e.g. "here's the merged cube of all cohesion fund data sub-cubes")

larjohn commented Apr 22, 2016

Well, a cube-less call is a fictional, ideal call, intended to force us to explore alternative solutions.

We are indeed working with RDF at openbudgets.eu, but we don't want to suggest any direction for openspending based on our expectations of RDF.

This means we don't even care whether openspending's loader materialises the merged cubes. In rudolf, we have to tackle the same issues: determine which dimensions are similar, determine which values of those dimensions are similar (in order to support common cuts), and find a way to get normalized values (mostly currency-related normalization).

Regarding the dimensions: we can either cross our fingers and rely on dimensions inheriting from common parents (with values also linked by some kind of similarity predicate), or go with CONSTRUCT queries, which do materialize the datasets into a new cube. Both ways are implementable on our side; I was just wondering whether, API-wise, there would be some actual super-cube to query in the same way as the ordinary cubes.
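For what it's worth, a rough sketch of the CONSTRUCT route, assuming the W3C RDF Data Cube vocabulary (qb:) and entirely hypothetical endpoint and cube URIs. The query naively re-attaches observations to a new dataset URI; a real merge would first need the dimension-alignment step described above:

```python
# Materialise a merged cube by re-attaching observations from two source
# cubes to a new dataset URI. Endpoint and URIs are hypothetical, and the
# query assumes the cubes' dimensions are already aligned.
from SPARQLWrapper import SPARQLWrapper

MERGE_QUERY = """
PREFIX qb: <http://purl.org/linked-data/cube#>
CONSTRUCT {
  ?obs ?p ?o .
  ?obs qb:dataSet <http://example.org/cube/merged> .
}
WHERE {
  VALUES ?src { <http://example.org/cube/a> <http://example.org/cube/b> }
  ?obs qb:dataSet ?src ;
       ?p ?o .
  FILTER (?p != qb:dataSet)
}
"""

sparql = SPARQLWrapper("http://localhost:3030/ds/sparql")  # hypothetical endpoint
sparql.setQuery(MERGE_QUERY)
graph = sparql.query().convert()  # an rdflib Graph holding the merged cube
```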
