In #1032 we always canonicalize the dictionary, since that is optimal under the assumption that `codes.len() >= values.len()`. This is always true in the absence of shared dictionaries.

After we address #252, we may want to take a more nuanced approach. In particular, #1038 may be a better approach.
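As a minimal sketch of the assumption above (a hypothetical simplified model, not the Vortex API): canonicalizing a dictionary amounts to a `take` of the values by the codes, so when `codes.len() >= values.len()` it is cheapest to canonicalize the (smaller) values once and then do a single gather over the codes.

```rust
/// Hypothetical simplified dict canonicalization: one O(codes.len())
/// gather over already-canonical values. With shared dictionaries
/// (#252), codes.len() could be smaller than values.len(), and this
/// ordering would no longer be clearly optimal.
fn canonicalize_dict(values: &[i64], codes: &[usize]) -> Vec<i64> {
    codes.iter().map(|&c| values[c]).collect()
}

fn main() {
    let values = vec![10, 20, 30];
    let codes = vec![0, 2, 2, 1, 0];
    assert_eq!(canonicalize_dict(&values, &codes), vec![10, 30, 30, 20, 10]);
    println!("ok");
}
```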
lwwmanning changed the title from "re-evaluate #1032" to "Re-evaluate values canonicalization in DictArray::into_canonical" on Oct 23, 2024
After #1082 and #1121, values canonicalization will only really apply to primitive types. Worth testing whether we can revert #1032 entirely, possibly alongside some work to speed up BitPackedArray::take (see #1039).
The benefit of #1032 was by far the greatest on string-heavy datasets (where it called into VarBin::take, which is very expensive, but is obviated by German strings).
Especially on DictArrays with FSST-encoded values, this was kind of pathological prior to #1032:
1. we would first do a VarBin take on the encoded FSST strings (copying each encoded string for each instance of its code)
2. we would then FSST-decompress the now-much-larger pile of compressed strings (full of duplicates)
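To make the pathology concrete, here is a hedged back-of-the-envelope model (hypothetical cost functions, not Vortex code) that counts decompression work, modeled as bytes decompressed, under each ordering. Take-then-decompress scales with `codes.len()`; decompress-then-take (the post-#1032 ordering) scales with `values.len()` only.

```rust
/// take-then-decompress: each compressed value is copied once per code,
/// then the inflated, duplicate-filled pile is decompressed.
fn work_take_then_decompress(compressed_lens: &[usize], codes: &[usize]) -> usize {
    codes.iter().map(|&c| compressed_lens[c]).sum()
}

/// decompress-then-take: each unique value is decompressed exactly once,
/// and the subsequent gather does no decompression at all.
fn work_decompress_then_take(compressed_lens: &[usize]) -> usize {
    compressed_lens.iter().sum()
}

fn main() {
    // 3 unique compressed strings, 1000 codes referencing them.
    let lens = vec![40, 25, 60];
    let codes: Vec<usize> = (0..1000).map(|i| i % 3).collect();
    let before = work_take_then_decompress(&lens, &codes);
    let after = work_decompress_then_take(&lens);
    assert!(after < before);
    println!("{} vs {}", before, after);
}
```

The gap grows linearly with the ratio `codes.len() / values.len()`, which is why the win was largest on string-heavy, highly repetitive data.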