[jdbc.read] Don't track realized keys in TransientRow #185

alexander-yakushev · 2024-10-05T09:41:37Z

Tracking the hashset of realized keys for each row in the resultset has noticeable overhead. Since this is only needed to print the transient row during debugging (as far as I can tell), it makes sense to shave the overhead here. All column names can be computed upfront and then used to print the values available in transient row. Please tell me if I'm wrong or missing something.

codecov · 2024-10-05T11:36:09Z

Codecov Report

Attention: Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.

Project coverage is 83.17%. Comparing base (86929ea) to head (76a77f9).

Files with missing lines	Patch %	Lines
src/toucan2/jdbc/row.clj	66.66%	2 Missing and 1 partial ⚠️
src/toucan2/jdbc/result_set.clj	94.73%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #185      +/-   ##
==========================================
- Coverage   83.20%   83.17%   -0.03%     
==========================================
  Files          37       37              
  Lines        2524     2526       +2     
  Branches      214      215       +1     
==========================================
+ Hits         2100     2101       +1     
  Misses        210      210              
- Partials      214      215       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

camsaul · 2024-10-07T19:44:33Z

src/toucan2/jdbc/result_set.clj

@@ -142,6 +145,7 @@
        col-names         (get builder :cols (next.jdbc.rs/get-modified-column-names
                                              (.getMetaData rset)
                                              combined-opts))
+        col-names-kw      (vec (keep #(column-name->keyword % label-fn) col-names))


If we're really worried about performance, (into [] (keep ...) ...) would be faster, right?

This code only runs once per ResultSet, so it's not as critical to make optimizations here. But sure, there's no harm in switching to a transducer.

camsaul · 2024-10-07T19:48:00Z

test/toucan2/jdbc/row_test.clj

-(defn- realized-keys [^TransientRow row]
-  @(.realized_keys row))


Couldn't this still be calculated by looking at .column_names and removing that ones that are ::not-found? That way we don't need to delete a ton of test code below

camsaul · 2024-10-07T19:52:06Z

src/toucan2/jdbc/result_set.clj

@@ -142,6 +145,7 @@
        col-names         (get builder :cols (next.jdbc.rs/get-modified-column-names
                                              (.getMetaData rset)
                                              combined-opts))
+        col-names-kw      (vec (keep #(column-name->keyword % label-fn) col-names))
        col-name->index   (make-column-name->index col-names label-fn)]


Seems like it will be bad to performance to calculate all the column name keywords ahead of time in the line above and then create a memorized function that has to recalculate all those names a second time here... surely if you want to go down this road it would make sense to have make-column-name->index take col-names-kw so it doesn't need to call column-name->keyword again. Right?

As an added bonus I guess we'd avoid the overhead of having to do column-name->keyword calculations on the key passed to the without method since presumably it should already be a keyword that is formatted correctly

Like I've said above, column names here are calculated only once per ResultSet. The current realized-keys update is triggered for every individual cell access (N columns * M rows). That's why optimizing the once per ResultSet part is not as important. But if there are easy wins there, it makes sense to do them as well.

The reason why I left the current memoized make-column-name->index conversion is because, if I read the code correctly, the user may pass a string, a keyword, or a symbol as a column name, and they all should work. ~~We can perhaps calculate a map for all possible valid column names. I am not against it but it feels a bit dirty. We can go that route if you wish.~~ We can't precalculate all valid column names because of arbitrary label-fn. I see that it transforms names to kebab case for example, so :my_good_column, "my-good_column", 'my_good-column are all valid keys that should be transformed to :my-good-coumn in the end. We can't precalculate that.

camsaul · 2024-10-07T20:05:32Z

It would be nice to see something like Criterium benchmarks of before and after these changes to get a sense of how significant of a performance impact this really makes. I would be interested in seeing whether doing away with realized-keys but always calculating all of the column names ahead of time even when we do something like only realize the value of a single column from the transient row actually speeds things up significantly or not.

We lose a bit of info by taking this stuff out (as evidenced by the number of lines you had to delete from the tests) so if we're talking like a performance improvement of a few microseconds then I'm not sure it's really worth it or not.

If column-names is really only needed for printing a transient row, could we only calculate it on demand only when printing a transient row, that way we avoid the overhead of calculating it for every single row when 99.99% of the time we never print the transient rows? That seems like it would be an even bigger performance improvement, right? Is there a way to do that?

camsaul

Is there a way we can stop pre-calculating all of the keyword column names and only calculate them on demand if and when we print a transient row?
If we actually do need to pre-calculate all of the keyword column names, make-column-name->index can probably be updated to use those so we're not calculating them twice for every row.
If we're getting really crazy with performance optimizations, (into [] ...) is faster than (vec (keep ...))
If we remove realized-keys but still have column-names, you should still be able to calculate the realized keys by looking at column names and seeing what is not ::not-found. If we write a function for that in row-test then you don't need to remove all the tests you've removed

alexander-yakushev · 2024-10-08T09:41:25Z

UPD regarding tests: the current tests are written using :venues model that has columns [:id :name :category :created-at :updated-at], but e.g. the assoc test uses :k as a key and then checks that it is among realized-keys. Is this testing the intentional behavior that you want to preserve?

alexander-yakushev · 2024-10-08T13:22:12Z

Here's the benchmark:

;; preparation - insert 10k rows into people table
(toucan2.insert/insert!
 ::test/people
 (for [id (range 10 10000)]
   (instance/instance ::test/people
                      {:id id
                       :name       "Somebody"
                       :created-at (LocalDateTime/parse "2017-01-01T00:00")})))

;; the benchmark - I use time+ to get allocations too, but criterium would work just as well
(time+ (select/select ::test/people {:select [:*], :from [[:people]]}))

Results:

Before:
Time per call: 18.89 ms   Alloc per call: 17,484,278b
Time per call: 18.90 ms   Alloc per call: 17,483,998b
Time per call: 18.87 ms   Alloc per call: 17,483,817b

After:
Time per call: 15.17 ms   Alloc per call: 16,771,048b
Time per call: 15.14 ms   Alloc per call: 16,770,815b
Time per call: 15.18 ms   Alloc per call: 16,770,697b

alexander-yakushev · 2024-10-08T13:26:17Z

And another follow-up on having to calculate all columns ahead of time: I'm mostly looking at Metabase benchmarks (not microbenchmarks of Toucan2 itself) and I see that in Metabase fetch-all-columns! is usually called; hence, all column names are eventually calculated anyway.

alexander-yakushev requested a review from camsaul as a code owner October 5, 2024 09:41

alexander-yakushev force-pushed the no-realized-keys branch 3 times, most recently from 84ff99e to 7319b51 Compare October 5, 2024 11:35

alexander-yakushev force-pushed the no-realized-keys branch 2 times, most recently from 7330740 to 76a77f9 Compare October 5, 2024 11:39

camsaul reviewed Oct 7, 2024

View reviewed changes

camsaul requested changes Oct 7, 2024

View reviewed changes

alexander-yakushev force-pushed the no-realized-keys branch 2 times, most recently from 463e943 to 3e6e748 Compare October 8, 2024 09:25

This was referenced Oct 8, 2024

[log] Cache logger object in a private var #184

Open

[jdbc.read] Properly pre-compute column fetching thunk #189

Merged

[jdbc.read] Don't track realized keys in TransientRow

262b2c0

alexander-yakushev force-pushed the no-realized-keys branch from 3e6e748 to 262b2c0 Compare October 15, 2024 08:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[jdbc.read] Don't track realized keys in TransientRow #185

[jdbc.read] Don't track realized keys in TransientRow #185

alexander-yakushev commented Oct 5, 2024 •

edited

Loading

codecov bot commented Oct 5, 2024 •

edited

Loading

camsaul Oct 7, 2024

alexander-yakushev Oct 8, 2024

camsaul Oct 7, 2024

camsaul Oct 7, 2024

camsaul Oct 7, 2024 •

edited

Loading

alexander-yakushev Oct 8, 2024 •

edited

Loading

camsaul commented Oct 7, 2024

camsaul left a comment

alexander-yakushev commented Oct 8, 2024 •

edited

Loading

alexander-yakushev commented Oct 8, 2024

alexander-yakushev commented Oct 8, 2024

		(defn- realized-keys [^TransientRow row]
		@(.realized_keys row))

[jdbc.read] Don't track realized keys in TransientRow #185

Are you sure you want to change the base?

[jdbc.read] Don't track realized keys in TransientRow #185

Conversation

alexander-yakushev commented Oct 5, 2024 • edited Loading

codecov bot commented Oct 5, 2024 • edited Loading

Codecov Report

camsaul Oct 7, 2024

Choose a reason for hiding this comment

alexander-yakushev Oct 8, 2024

Choose a reason for hiding this comment

camsaul Oct 7, 2024

Choose a reason for hiding this comment

camsaul Oct 7, 2024

Choose a reason for hiding this comment

camsaul Oct 7, 2024 • edited Loading

Choose a reason for hiding this comment

alexander-yakushev Oct 8, 2024 • edited Loading

Choose a reason for hiding this comment

camsaul commented Oct 7, 2024

camsaul left a comment

Choose a reason for hiding this comment

alexander-yakushev commented Oct 8, 2024 • edited Loading

alexander-yakushev commented Oct 8, 2024

alexander-yakushev commented Oct 8, 2024

alexander-yakushev commented Oct 5, 2024 •

edited

Loading

codecov bot commented Oct 5, 2024 •

edited

Loading

camsaul Oct 7, 2024 •

edited

Loading

alexander-yakushev Oct 8, 2024 •

edited

Loading

alexander-yakushev commented Oct 8, 2024 •

edited

Loading