Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-evaluate values canonicalization in DictArray::into_canonical #1041

Closed
lwwmanning opened this issue Oct 15, 2024 · 1 comment · Fixed by #1146
Closed

Re-evaluate values canonicalization in DictArray::into_canonical #1041

lwwmanning opened this issue Oct 15, 2024 · 1 comment · Fixed by #1146

Comments

@lwwmanning
Copy link
Member

In #1032 we always canonicalize the dictionary, since that is optimal under the assumption that codes.len() >= values.len(). This is always true in the absence of shared dictionaries.

After we address #252, we may want to take a more nuanced approach. In particular, a #1038 may be a better approach

@lwwmanning lwwmanning changed the title re-evaluate #1032 Re-evaluate values canonicalization in DictArray::into_canonical Oct 23, 2024
@lwwmanning
Copy link
Member Author

lwwmanning commented Oct 23, 2024

After #1082 and #1121, the values canonicalization will only really apply to primitive types. Worth testing if we can revert #1032 entirely, possibly alongside some work to speed up BitPackedArray::take (see #1039 )

The benefit of #1032 was by far the greatest on string-heavy datasets (where it called into VarBin::take, which is very expensive, but is obviated by German strings).

Especially on DictArrays with FSST-encoded values, this was kind of pathological prior to #1032:

  • we would first do a VarBin take on the encoded FSST strings (copying each encoded string for each instance of its code)
  • we would then FSST decompress the now-much-larger pile of compressed strings (full of duplicates)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant