Add support for Median and other Quantiles #367

Open
joshua-oss opened this issue May 26, 2021 · 0 comments
The SQL layer requires all aggregates to be computed precisely by the backend engine. This enables us to deal with very large datasets that cannot efficiently be stored in memory. However, it makes it tricky to compute differentially private quantiles such as median.

We are considering an approach where the backend engine computes 100 exact quantiles for each aggregate row in the output. This raises several questions:

Accuracy

It's unclear how the accuracy of this approach will compare with traditional methods that access the whole rowset at once. The distribution of data could have an impact on accuracy.

Rewritten query size

This would greatly increase the size of the rewritten query, since each MEDIAN in the original SELECT would expand into 100 QUANTILE expressions on the backend. Large queries requesting many quantiles could run into query-size limits with the rewritten query. This could be worked around by issuing multiple queries, but the maximum query size differs by engine.
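To make the blow-up concrete, here is a minimal sketch of the proposed rewrite. The helper name `rewrite_median` and the use of `PERCENTILE_CONT ... WITHIN GROUP` as the target syntax are illustrative assumptions, not the project's actual rewriter:

```python
# Sketch: one MEDIAN(col) in the original query becomes 100 exact-quantile
# expressions on the backend. Helper name and target syntax are assumptions.

def rewrite_median(column: str, n_quantiles: int = 100) -> list[str]:
    """Expand MEDIAN(column) into n_quantiles exact-quantile expressions."""
    return [
        f"PERCENTILE_CONT({i / n_quantiles}) WITHIN GROUP (ORDER BY {column})"
        f" AS {column}_q{i}"
        for i in range(1, n_quantiles + 1)
    ]

exprs = rewrite_median("income")
print(len(exprs))   # 100 expressions for a single MEDIAN
print(exprs[49])    # the 0.5 quantile, i.e. the exact median
```

A SELECT with several medians multiplies this list once per column, which is where the size concern comes from.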

Engine performance

Most modern database engines can optimize all 100 quantile calls over the same column to share a single sort, so the query performance hit should not be large compared to requesting one exact quantile. However, some engines may struggle with this, in which case we could see unacceptable slowdowns on some queries.

Result set complexity

This will add many extra columns to the response from the database. For example, a query with 5 MEDIAN columns would produce 500 columns in each response row, and more complex queries could exceed column limits, which vary by engine. We can work around this by asking the engine to aggregate each set of 100 quantiles into a single comma-separated list.
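With the comma-separated workaround, the client has to split each string column back into its 100 quantiles. A minimal sketch of that client-side step (the function name is hypothetical):

```python
# Parse one comma-separated quantile column back into floats on the client.
# Assumes the engine returned a single string per original MEDIAN instead
# of 100 numeric columns.

def parse_quantiles(csv_value: str) -> list[float]:
    return [float(v) for v in csv_value.split(",")]

# Toy example with 3 quantiles rather than 100.
print(parse_quantiles("1.0,2.5,3.0"))  # [1.0, 2.5, 3.0]
```

This keeps the result set at one column per MEDIAN regardless of how many quantiles the rewrite requested.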

Engine cross-compatibility

Most modern engines support percentile_cont and percentile_disc from the SQL standard, which can be used to calculate exact quantiles. However, two big gaps are MySQL and SQLite. We could hand-implement quantiles for these engines by sorting and ranking, but this would greatly increase the complexity of the rewritten query. Since SQLite is our Pandas solution, we could also just extract the in-memory rows and compute quantiles that way, leaving only MySQL as a gap. Alternately, we could simply disable support on one or both of these engines.
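For reference, the sort-and-rank computation that the MySQL/SQLite fallback would encode in SQL is simple when written out directly. This sketch mirrors percentile_disc semantics (smallest value whose cumulative rank reaches the requested fraction); it is an illustration, not the project's implementation:

```python
# Hand-rolled discrete quantile, equivalent to what a sort-and-rank SQL
# rewrite would compute for engines lacking percentile_disc.
import math

def percentile_disc(values: list[float], fraction: float) -> float:
    ordered = sorted(values)
    rank = math.ceil(fraction * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

data = [10, 20, 30, 40, 50]
print(percentile_disc(data, 0.5))   # 30
```

Expressing the same ranking logic in portable SQL (self-join or window function, per engine) is where the rewritten-query complexity comes from.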

Additionally, there is no standardized SQL way to aggregate the 100 quantiles into a comma-separated string, so we would need to use engine-specific syntax: ARRAY_AGG (U-SQL), STRING_AGG (T-SQL), GROUP_CONCAT (MySQL, SQLite), array_join (SparkSQL), ListAgg (Oracle, Redshift). The Reader architecture supports this in a straightforward way, but we would need to test to ensure there aren't differences in how each engine implements the functionality. This would also require us to update the grammar to support all of these functions, if we want the output queries to always round-trip.
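Dispatching on engine could be a one-table lookup. The function names below come straight from the list above; the exact quoting and separator syntax per engine are simplified assumptions:

```python
# Engine-specific string aggregation, dispatched from one table.
# Per-engine quoting details are simplified assumptions.
STRING_AGG_SYNTAX = {
    "tsql":     "STRING_AGG({col}, ',')",
    "mysql":    "GROUP_CONCAT({col} SEPARATOR ',')",
    "sqlite":   "GROUP_CONCAT({col}, ',')",
    "oracle":   "LISTAGG({col}, ',')",
    "redshift": "LISTAGG({col}, ',')",
}

def string_agg(engine: str, col: str) -> str:
    return STRING_AGG_SYNTAX[engine].format(col=col)

print(string_agg("mysql", "q_val"))   # GROUP_CONCAT(q_val SEPARATOR ',')
```

The grammar change is the harder part: every variant in the table would need to parse, not just emit, for round-tripping.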

Alternate methods

For queries with only a few MEDIAN requests, it might be better to uniformly sample some number of rows (e.g. 10,000) and return them to the client for in-memory computation of a differentially private quantile. This has the disadvantage of seriously limiting the number of quantile columns per query, and would have an unknown impact on utility. But it would have the advantage of using the same rewritten query on all database engines.
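One standard choice for the in-memory step is the exponential mechanism over the intervals between the sample's order statistics. The sketch below is an assumption about how the client-side computation might look, not a description of existing code in this project:

```python
# Sketch: uniform sample -> DP quantile via the exponential mechanism.
# Intervals between consecutive order statistics are weighted by width
# times exp(eps * utility / 2), where utility penalizes distance from
# the target rank. All names here are illustrative.
import math
import random

def dp_quantile(sample, q, epsilon, lower, upper):
    ordered = sorted(min(max(v, lower), upper) for v in sample)
    ordered = [lower] + ordered + [upper]
    target = q * len(sample)
    weights = []
    for i in range(len(ordered) - 1):
        width = ordered[i + 1] - ordered[i]
        utility = -abs(i - target)          # distance from target rank
        weights.append(width * math.exp(epsilon * utility / 2))
    r = random.uniform(0, sum(weights))
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return random.uniform(ordered[i], ordered[i + 1])
    return upper

rows = list(range(10_000))            # stands in for the full table
sample = random.sample(rows, 1_000)   # uniform sample returned to client
print(dp_quantile(sample, 0.5, epsilon=1.0, lower=0, upper=10_000))
```

Note the result is always clamped to the stated bounds, which the client would need to know up front.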

For tables with few rows, we could return all rows and do in-memory processing. There might also be cases where it makes the most sense for the private_reader to issue multiple backend queries in sequence, for example, to perform a binary search. Both of these options would result in different accuracy and privacy usage across different engines, though, which may not be desirable.
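A sketch of the multi-query idea: a binary search for the median where each probe corresponds to one backend COUNT query with noise added. The Laplace noise, the naive per-step budget split, and the query shape are all assumptions for illustration:

```python
# Sketch: DP median via binary search over noisy threshold counts.
# Each probe stands in for one backend query like
#   SELECT COUNT(*) FROM t WHERE col < threshold
import random

def noisy_count_below(values, threshold, epsilon):
    exact = sum(1 for v in values if v < threshold)
    # Difference of two Exp(eps) draws is Laplace with scale 1/eps.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return exact + noise

def dp_median_search(values, lower, upper, epsilon, steps=20):
    n = len(values)
    eps_per_step = epsilon / steps    # naive sequential composition
    for _ in range(steps):
        mid = (lower + upper) / 2
        if noisy_count_below(values, mid, eps_per_step) < n / 2:
            lower = mid               # median is above mid
        else:
            upper = mid               # median is at or below mid
    return (lower + upper) / 2
```

Each step spends privacy budget, which is why accuracy and privacy usage would differ from the single-query rewrites above.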

Finally, there might be some approaches we aren't aware of.
