Summary and count causes performance issues on large datasets #37

markbrough · 2022-11-20T15:34:05Z

With very large datasets (e.g. 13m rows), summary and count appear to significantly slow down the response:

Lines 89 to 96 of babbage/cube.py in 9416105:

    # Count
    count = count_results(self, prep(cuts,
                                     drilldowns=drilldowns,
                                     columns=[1])[0])
    # Summary
    summary = first_result(self, prep(cuts,
                                      aggregates=aggregates)[0].limit(1))

Without generating summary and count, it's 2-3 times faster to return the response.

It would be useful to make returning these properties optional, e.g. by adding an optional &simple parameter to the request.
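As a rough illustration of the idea, here is a minimal sketch of a response builder that skips the two extra queries when a `simple` flag is set. It does not use the real babbage code: the function `build_response`, its callable arguments, and the response keys are all hypothetical stand-ins for the queries quoted above.

```python
from typing import Any, Callable, Dict, List


def build_response(
    fetch_cells: Callable[[], List[dict]],
    fetch_count: Callable[[], int],
    fetch_summary: Callable[[], dict],
    simple: bool = False,
) -> Dict[str, Any]:
    """Assemble a response, running the count and summary queries only on demand.

    All names here are hypothetical; the callables stand in for the
    count_results(...) and first_result(...) calls in cube.py.
    """
    # The main query always runs.
    response: Dict[str, Any] = {"cells": fetch_cells()}
    if not simple:
        # These two extra queries are the ones reported as slow on large datasets.
        response["count"] = fetch_count()
        response["summary"] = fetch_summary()
    return response
```

With `simple=True` the count and summary callables are never invoked, so their cost is avoided entirely rather than computed and discarded.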


jbothma · 2022-11-20T15:45:36Z

Nice find! Alternatively, "include_fields" might be more maintainable than "simple", so there is no need to come up with more parameters for different partial response combinations. Are you able to make a pull request for something like this?


markbrough · 2022-11-20T16:41:30Z

Maybe include_fields would be confusing, e.g. when set against the different dimensions etc.? Presuming that the default should be to include all properties, how about exclude, with an optional list, e.g. &exclude="count";"summary"
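To make the proposed format concrete, here is a minimal sketch of how a value such as `"count";"summary"` could be parsed. The helper `parse_exclude` and the `EXCLUDABLE` set are hypothetical names for illustration, not part of babbage, and rejecting unknown names is an assumption rather than agreed behaviour.

```python
from typing import Optional, Set

# Hypothetical: the only properties this thread proposes making optional.
EXCLUDABLE = {"count", "summary"}


def parse_exclude(raw: Optional[str]) -> Set[str]:
    """Parse an exclude value like '"count";"summary"' into a set of names.

    A missing or empty value means nothing is excluded, which keeps the
    default of returning all properties.
    """
    if not raw:
        return set()
    # Split on ';' and drop surrounding whitespace and optional double quotes.
    names = {part.strip().strip('"') for part in raw.split(";")}
    names.discard("")  # tolerate trailing or doubled separators
    unknown = names - EXCLUDABLE
    if unknown:
        raise ValueError("Unknown exclude value(s): " + ", ".join(sorted(unknown)))
    return names
```

The resulting set can then gate each query, e.g. running the count only when `"count" not in excluded`.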

markbrough · 2022-11-20T21:16:09Z

Because I need this for something quickly, I implemented it already the way I described above - but happy to hear your feedback on the above and then I can adjust:
https://github.com/markbrough/babbage/tree/37-simple-request
