In #1032 we always canonicalize the dictionary, since that is optimal under the assumption that `codes.len() >= values.len()`. This is always true in the absence of shared dictionaries.

After we address #252, we may want to take a more nuanced approach. In particular, #1038 may be a better approach.
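As a minimal sketch of the assumption above (a hypothetical simplified model, not the Vortex API): canonicalizing a dictionary amounts to a `take` of the values by the codes, so when `codes.len() >= values.len()` it is cheapest to canonicalize the (smaller) values once and then do a single gather over the codes.

```rust
/// Hypothetical simplified dict canonicalization: one O(codes.len())
/// gather over already-canonical values. With shared dictionaries
/// (#252), codes.len() could be smaller than values.len(), and this
/// ordering would no longer be clearly optimal.
fn canonicalize_dict(values: &[i64], codes: &[usize]) -> Vec<i64> {
    codes.iter().map(|&c| values[c]).collect()
}

fn main() {
    let values = vec![10, 20, 30];
    let codes = vec![0, 2, 2, 1, 0];
    assert_eq!(canonicalize_dict(&values, &codes), vec![10, 30, 30, 20, 10]);
    println!("ok");
}
```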
lwwmanning changed the title from "re-evaluate #1032" to "Re-evaluate values canonicalization in DictArray::into_canonical" on Oct 23, 2024
After #1082 and #1121, values canonicalization will only really apply to primitive types. Worth testing whether we can revert #1032 entirely, possibly alongside some work to speed up BitPackedArray::take (see #1039).
The benefit of #1032 was by far the greatest on string-heavy datasets (where it called into VarBin::take, which is very expensive, but is obviated by German strings).
Especially on DictArrays with FSST-encoded values, this was kind of pathological prior to #1032:
1. we would first do a VarBin take on the encoded FSST strings (copying each encoded string for each instance of its code)
2. we would then FSST-decompress the now-much-larger pile of compressed strings (full of duplicates)
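To make the pathology concrete, here is a hedged back-of-the-envelope model (hypothetical cost functions, not Vortex code) that counts decompression work, modeled as bytes decompressed, under each ordering. Take-then-decompress scales with `codes.len()`; decompress-then-take (the post-#1032 ordering) scales with `values.len()` only.

```rust
/// take-then-decompress: each compressed value is copied once per code,
/// then the inflated, duplicate-filled pile is decompressed.
fn work_take_then_decompress(compressed_lens: &[usize], codes: &[usize]) -> usize {
    codes.iter().map(|&c| compressed_lens[c]).sum()
}

/// decompress-then-take: each unique value is decompressed exactly once,
/// and the subsequent gather does no decompression at all.
fn work_decompress_then_take(compressed_lens: &[usize]) -> usize {
    compressed_lens.iter().sum()
}

fn main() {
    // 3 unique compressed strings, 1000 codes referencing them.
    let lens = vec![40, 25, 60];
    let codes: Vec<usize> = (0..1000).map(|i| i % 3).collect();
    let before = work_take_then_decompress(&lens, &codes);
    let after = work_decompress_then_take(&lens);
    assert!(after < before);
    println!("{} vs {}", before, after);
}
```

The gap grows linearly with the ratio `codes.len() / values.len()`, which is why the win was largest on string-heavy, highly repetitive data.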