
Support pyarrow large_list #7019

Merged: 78 commits merged into main on Aug 12, 2024
Conversation

albertvillanova (Member)
Allow Polars round trip by supporting pyarrow large list.

Fix #6834, fix #6984.

Supersede and close #4800, close #6835, close #6986.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova albertvillanova marked this pull request as ready for review July 8, 2024 15:33
dakotamurdock commented Jul 17, 2024

@albertvillanova really happy to see this fix.

Have you tried saving a dataset to disk after this change? I tried your fix in a build from source, and while I can now successfully create a dataset object from a Polars DataFrame containing a large list, I get the following error when saving the resulting dataset to disk:

File "/Users/x/VSCodeProjects/HuggingFace/hf.py", line 9, in <module>
    dataset.save_to_disk("data/test.hf")
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 1591, in save_to_disk
    for kwargs in kwargs_per_job:
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 1568, in <genexpr>
    "shard": self.shard(num_shards=num_shards, index=shard_idx, contiguous=True),
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 4757, in shard
    return self.select(
           ^^^^^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 3892, in select
    return self._select_contiguous(start, length, new_fingerprint=new_fingerprint)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 3955, in _select_contiguous
    return Dataset(
           ^^^^^^^^
  File "/Users/x/VSCodeProjects/HuggingFace/datasets/src/datasets/arrow_dataset.py", line 731, in __init__
    raise ValueError(
ValueError: External features info don't match the dataset:
Got
{'0': Value(dtype='int64', id=None), '1': Value(dtype='int64', id=None), '2': Value(dtype='int64', id=None), '3': Value(dtype='int64', id=None), '4': Value(dtype='int64', id=None), '5': Value(dtype='int64', id=None), '6': Value(dtype='int64', id=None), '7': Value(dtype='int64', id=None), '8': Value(dtype='int64', id=None), '9': Value(dtype='int64', id=None), '10': Value(dtype='int64', id=None), '11': Value(dtype='int64', id=None), '12': Value(dtype='int64', id=None), '13': Value(dtype='int64', id=None), '14': Value(dtype='int64', id=None), '15': Value(dtype='int64', id=None), '16': Value(dtype='int64', id=None), '17': Value(dtype='int64', id=None), '18': Value(dtype='int64', id=None), '19': Value(dtype='int64', id=None), 'A': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=False, id=None), 'B': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=False, id=None), 'C': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=False, id=None), 'D': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=False, id=None), '__index_level_0__': Value(dtype='int64', id=None)}
with type
struct<0: int64, 1: int64, 2: int64, 3: int64, 4: int64, 5: int64, 6: int64, 7: int64, 8: int64, 9: int64, 10: int64, 11: int64, 12: int64, 13: int64, 14: int64, 15: int64, 16: int64, 17: int64, 18: int64, 19: int64, A: list<item: int64>, B: list<item: int64>, C: list<item: int64>, D: list<item: int64>, __index_level_0__: int64>

but expected something like
{'0': Value(dtype='int64', id=None), '1': Value(dtype='int64', id=None), '2': Value(dtype='int64', id=None), '3': Value(dtype='int64', id=None), '4': Value(dtype='int64', id=None), '5': Value(dtype='int64', id=None), '6': Value(dtype='int64', id=None), '7': Value(dtype='int64', id=None), '8': Value(dtype='int64', id=None), '9': Value(dtype='int64', id=None), '10': Value(dtype='int64', id=None), '11': Value(dtype='int64', id=None), '12': Value(dtype='int64', id=None), '13': Value(dtype='int64', id=None), '14': Value(dtype='int64', id=None), '15': Value(dtype='int64', id=None), '16': Value(dtype='int64', id=None), '17': Value(dtype='int64', id=None), '18': Value(dtype='int64', id=None), '19': Value(dtype='int64', id=None), 'A': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=True, id=None), 'B': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=True, id=None), 'C': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=True, id=None), 'D': Sequence(feature=Value(dtype='int64', id=None), length=-1, large=True, id=None), '__index_level_0__': Value(dtype='int64', id=None)}
with type
struct<0: int64, 1: int64, 2: int64, 3: int64, 4: int64, 5: int64, 6: int64, 7: int64, 8: int64, 9: int64, 10: int64, 11: int64, 12: int64, 13: int64, 14: int64, 15: int64, 16: int64, 17: int64, 18: int64, 19: int64, A: large_list<item: int64>, B: large_list<item: int64>, C: large_list<item: int64>, D: large_list<item: int64>, __index_level_0__: int64>

The code to reproduce is in the two separate scripts below.

creating test data:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100000, size=(100000, 20)))
featureVector = np.random.randint(0, 100000, size=(100000, 1000)).tolist()

df['A'] = featureVector
df['B'] = featureVector
df['C'] = featureVector
df['D'] = featureVector

df.to_parquet('data/train_data.parquet', engine='pyarrow')

loading data, converting to an HF dataset, and attempting to save to disk:

import datasets
import polars as pl

df = pl.read_parquet('data/train_data.parquet')

dataset = datasets.Dataset.from_polars(df)

dataset.save_to_disk("data/test.hf")

If this isn't the appropriate place to put this, let me know. Since it isn't merged yet I didn't think raising an issue was appropriate.

albertvillanova (Member Author) commented Jul 18, 2024

Thanks for your useful review comments, @dakotamurdock.

I am investigating that issue to fix it in this PR.

albertvillanova (Member Author)

There are many feature functions, and most of them are not properly covered by tests.

I am adding tests and fixing these feature-functions.

albertvillanova (Member Author)

I think this PR is ready for review, @huggingface/datasets.

lhoestq (Member) left a comment
Cool, LGTM! I only left minor suggestions.

Comment on lines -1991 to +2002

-            feature = {
-                name: Sequence(subfeature, length=feature.length) for name, subfeature in feature.feature.items()
-            }
+            sequence_kwargs = vars(feature).copy()
+            feature = sequence_kwargs.pop("feature")
+            feature = {name: Sequence(subfeature, **sequence_kwargs) for name, subfeature in feature.items()}
Member:

those changes are not necessary but I'm fine with keeping them

Member Author:

Yes, I made them when implementing Sequence.large and decided to keep them for robustness in case we add some other attribute to Sequence in the future.
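The refactor under discussion is a generic Python pattern: copy all of a feature's attributes with `vars()` so that any constructor argument added later is forwarded automatically. A stdlib-only sketch, with a hypothetical `Seq` dataclass standing in for `Sequence`:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Seq:  # hypothetical stand-in for datasets.Sequence
    feature: Any
    length: int = -1

outer = Seq(feature={"a": "int64", "b": "float32"}, length=-1)

# Copy every attribute, pop the one being unwrapped, forward the rest.
# If Seq grows a new attribute, it propagates without touching this code.
kwargs = vars(outer).copy()
inner = kwargs.pop("feature")
rebuilt = {name: Seq(sub, **kwargs) for name, sub in inner.items()}

assert rebuilt["a"] == Seq("int64", length=-1)
```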

Comment on lines 2008 to 2009
elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type):
# feature must be either [subfeature] or Sequence(subfeature)
Member:

Suggested change
-        elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type):
-            # feature must be either [subfeature] or Sequence(subfeature)
+        elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type):
+            # feature must be either [subfeature] or LargeList(subfeature) or Sequence(subfeature)

Comment on lines 2012 to 2013
if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type:
# Both array and feature have equal: list type and values (within the list) types
Member:

maybe simpler?

Suggested change
-        if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type:
-            # Both array and feature have equal: list type and values (within the list) types
+        if pa.types.is_list(array.type) and casted_array_values.type == array.values.type:
+            # Both array and feature have equal: list type and values (within the list) types

Comment on lines 2021 to 2023
if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type:
# Both array and feature have equal: list type and values (within the list) types
return array
Member:

same

Suggested change
-        if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type:
-            # Both array and feature have equal: list type and values (within the list) types
-            return array
+        if pa.types.is_large_list(array.type) and casted_array_values.type == array.values.type:
+            # Both array and feature have equal: large list type and values (within the list) types
+            return array

Comment on lines 2064 to 2068
if (
type(array.type) is type(get_nested_type(feature))
and casted_array_values.type == array.values.type
):
# Both array and feature have equal: list type and values (within the list) types
Member:

same

Suggested change
-        if (
-            type(array.type) is type(get_nested_type(feature))
-            and casted_array_values.type == array.values.type
-        ):
-            # Both array and feature have equal: list type and values (within the list) types
+        if pa.types.is_list(array.type) and casted_array_values.type == array.values.type:
+            # Both array and feature have equal: list type and values (within the list) types

@@ -2128,6 +2154,11 @@ def embed_array_storage(array: pa.Array, feature: "FeatureType"):
return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature[0]))
if isinstance(feature, Sequence) and feature.length == -1:
return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature.feature))
elif pa.types.is_large_list(array.type):
# feature must be either LargeList(subfeature)
Member:

Suggested change
-            # feature must be either LargeList(subfeature)
+            # feature must be LargeList(subfeature)

@albertvillanova albertvillanova merged commit 0cf0be8 into main Aug 12, 2024
15 checks passed
@albertvillanova albertvillanova deleted the fix-6834-6984 branch August 12, 2024 14:43
Benchmarks

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.005640 / 0.011353 (-0.005713) 0.003926 / 0.011008 (-0.007083) 0.063103 / 0.038508 (0.024595) 0.032088 / 0.023109 (0.008979) 0.238615 / 0.275898 (-0.037283) 0.268379 / 0.323480 (-0.055101) 0.003146 / 0.007986 (-0.004840) 0.002813 / 0.004328 (-0.001516) 0.049681 / 0.004250 (0.045431) 0.044577 / 0.037052 (0.007525) 0.249782 / 0.258489 (-0.008708) 0.282548 / 0.293841 (-0.011293) 0.029986 / 0.128546 (-0.098560) 0.012474 / 0.075646 (-0.063172) 0.203347 / 0.419271 (-0.215925) 0.035950 / 0.043533 (-0.007583) 0.243410 / 0.255139 (-0.011729) 0.267056 / 0.283200 (-0.016143) 0.022086 / 0.141683 (-0.119597) 1.145513 / 1.452155 (-0.306641) 1.207583 / 1.492716 (-0.285133)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.095584 / 0.018006 (0.077578) 0.304264 / 0.000490 (0.303774) 0.000215 / 0.000200 (0.000015) 0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.019460 / 0.037411 (-0.017952) 0.062268 / 0.014526 (0.047742) 0.074943 / 0.176557 (-0.101613) 0.121657 / 0.737135 (-0.615478) 0.075930 / 0.296338 (-0.220408)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.288975 / 0.215209 (0.073766) 2.869610 / 2.077655 (0.791955) 1.491057 / 1.504120 (-0.013063) 1.384160 / 1.541195 (-0.157035) 1.380977 / 1.468490 (-0.087513) 0.723181 / 4.584777 (-3.861596) 2.397960 / 3.745712 (-1.347752) 2.899919 / 5.269862 (-2.369942) 1.878714 / 4.565676 (-2.686962) 0.078162 / 0.424275 (-0.346113) 0.005115 / 0.007607 (-0.002493) 0.337599 / 0.226044 (0.111555) 3.367450 / 2.268929 (1.098522) 1.823745 / 55.444624 (-53.620880) 1.540528 / 6.876477 (-5.335949) 1.546146 / 2.142072 (-0.595927) 0.796927 / 4.805227 (-4.008300) 0.134389 / 6.500664 (-6.366275) 0.042298 / 0.075469 (-0.033172)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 0.959687 / 1.841788 (-0.882101) 11.505269 / 8.074308 (3.430961) 9.631551 / 10.191392 (-0.559841) 0.142301 / 0.680424 (-0.538123) 0.013912 / 0.534201 (-0.520289) 0.314940 / 0.579283 (-0.264343) 0.263134 / 0.434364 (-0.171229) 0.352966 / 0.540337 (-0.187372) 0.440421 / 1.386936 (-0.946515)
PyArrow==latest

Benchmark: benchmark_array_xd.json

metric read_batch_formatted_as_numpy after write_array2d read_batch_formatted_as_numpy after write_flattened_sequence read_batch_formatted_as_numpy after write_nested_sequence read_batch_unformated after write_array2d read_batch_unformated after write_flattened_sequence read_batch_unformated after write_nested_sequence read_col_formatted_as_numpy after write_array2d read_col_formatted_as_numpy after write_flattened_sequence read_col_formatted_as_numpy after write_nested_sequence read_col_unformated after write_array2d read_col_unformated after write_flattened_sequence read_col_unformated after write_nested_sequence read_formatted_as_numpy after write_array2d read_formatted_as_numpy after write_flattened_sequence read_formatted_as_numpy after write_nested_sequence read_unformated after write_array2d read_unformated after write_flattened_sequence read_unformated after write_nested_sequence write_array2d write_flattened_sequence write_nested_sequence
new / old (diff) 0.005878 / 0.011353 (-0.005475) 0.003866 / 0.011008 (-0.007142) 0.051347 / 0.038508 (0.012839) 0.032662 / 0.023109 (0.009553) 0.270701 / 0.275898 (-0.005197) 0.345277 / 0.323480 (0.021797) 0.004485 / 0.007986 (-0.003501) 0.002782 / 0.004328 (-0.001546) 0.048302 / 0.004250 (0.044051) 0.040355 / 0.037052 (0.003303) 0.285196 / 0.258489 (0.026707) 0.320339 / 0.293841 (0.026499) 0.032937 / 0.128546 (-0.095610) 0.012298 / 0.075646 (-0.063348) 0.061579 / 0.419271 (-0.357692) 0.034129 / 0.043533 (-0.009403) 0.265985 / 0.255139 (0.010846) 0.302066 / 0.283200 (0.018867) 0.018812 / 0.141683 (-0.122871) 1.175705 / 1.452155 (-0.276450) 1.197207 / 1.492716 (-0.295510)

Benchmark: benchmark_getitem_100B.json

metric get_batch_of_1024_random_rows get_batch_of_1024_rows get_first_row get_last_row
new / old (diff) 0.096076 / 0.018006 (0.078070) 0.312793 / 0.000490 (0.312303) 0.000228 / 0.000200 (0.000028) 0.000053 / 0.000054 (-0.000002)

Benchmark: benchmark_indices_mapping.json

metric select shard shuffle sort train_test_split
new / old (diff) 0.022858 / 0.037411 (-0.014553) 0.077160 / 0.014526 (0.062634) 0.089742 / 0.176557 (-0.086815) 0.130929 / 0.737135 (-0.606207) 0.093431 / 0.296338 (-0.202907)

Benchmark: benchmark_iterating.json

metric read 5000 read 50000 read_batch 50000 10 read_batch 50000 100 read_batch 50000 1000 read_formatted numpy 5000 read_formatted pandas 5000 read_formatted tensorflow 5000 read_formatted torch 5000 read_formatted_batch numpy 5000 10 read_formatted_batch numpy 5000 1000 shuffled read 5000 shuffled read 50000 shuffled read_batch 50000 10 shuffled read_batch 50000 100 shuffled read_batch 50000 1000 shuffled read_formatted numpy 5000 shuffled read_formatted_batch numpy 5000 10 shuffled read_formatted_batch numpy 5000 1000
new / old (diff) 0.298884 / 0.215209 (0.083675) 2.961050 / 2.077655 (0.883395) 1.620694 / 1.504120 (0.116574) 1.499331 / 1.541195 (-0.041863) 1.513118 / 1.468490 (0.044628) 0.734738 / 4.584777 (-3.850039) 0.972978 / 3.745712 (-2.772734) 2.928172 / 5.269862 (-2.341690) 1.903667 / 4.565676 (-2.662010) 0.079207 / 0.424275 (-0.345068) 0.005803 / 0.007607 (-0.001804) 0.350144 / 0.226044 (0.124099) 3.519456 / 2.268929 (1.250528) 1.983809 / 55.444624 (-53.460815) 1.690527 / 6.876477 (-5.185950) 1.739301 / 2.142072 (-0.402772) 0.802045 / 4.805227 (-4.003182) 0.133041 / 6.500664 (-6.367623) 0.042112 / 0.075469 (-0.033357)

Benchmark: benchmark_map_filter.json

metric filter map fast-tokenizer batched map identity map identity batched map no-op batched map no-op batched numpy map no-op batched pandas map no-op batched pytorch map no-op batched tensorflow
new / old (diff) 1.030056 / 1.841788 (-0.811731) 12.077692 / 8.074308 (4.003384) 9.988253 / 10.191392 (-0.203139) 0.142745 / 0.680424 (-0.537679) 0.015842 / 0.534201 (-0.518359) 0.299055 / 0.579283 (-0.280228) 0.123788 / 0.434364 (-0.310576) 0.352782 / 0.540337 (-0.187555) 0.451140 / 1.386936 (-0.935796)

albertvillanova added a commit that referenced this pull request Aug 13, 2024
* Test polars round trip

* Test Features.from_arrow_schema

* Add large attribute to Sequence

* Update get_nested_type to support pa.large_list

* Update generate_from_arrow_type to support pa.LargeListType

* Fix typo

* Rename test

* Add require_polars to test

* Test from_polars large_list

* Update test array_cast with large list

* Support large list in array_cast

* Test cast_array_to_feature for large list

* Support large list in cast_array_to_feature

* Fix support large list in cast_array_to_feature

* Test save_to_disk with a dataset from polars with large_list

* Test Features.reorder_fields_as with large Sequence

* Fix Features.reorder_fields_as by using all Sequence params

* Test save_to/load_from disk round trip with large_list dataset

* Test DatasetInfo.from_dict with large Sequence

* Test Features to/from dict round trip with large Sequence

* Fix features generate_from_dict by using all Sequence params

* Remove debug comments

* Test cast_array_to_feature with struct array

* Fix cast_array_to_feature for struct array

* Test cast_array_to_feature from/to the same Sequence feature dtype

* Fix cast_array_to_feature for the same Sequence feature dtype

* Add more tests for dataset with large Sequence

* Remove Sequence.large

* Remove Sequence.large from tests

* Add LargeList to tests

* Replace tests with Sequence.large with LargeList

* Replace Sequence.large with LargeList in test_dataset_info_from_dict

* Implement LargeList

* Test features to_yaml_list with LargeList

* Support LargeList in Features._to_yaml_list

* Test Features.from_dict with LargeList

* Support LargeList in Features.from_dict

* Test Features from_yaml_list with LargeList

* Support LargeList in Features._from_yaml_list

* Test get_nested_type with scalar/list features

* Support LargeList in get_nested_type

* Test generate_from_arrow_type with primitive/nested data types

* Support LargeList in generate_from_arrow_type

* Remove Sequence of dict from test cast_array_to_feature

* Support LargeList in cast_array_to_feature

* Test Features.encode_example

* Test encode_nested_example with list types

* Support LargeList in encode_nested_example

* Test check_non_null_non_empty_recursive with list types

* Support LargeList in check_non_null_non_empty_recursive

* Test require_decoding with list types

* Support LargeList in require_decoding

* Test decode_nested_example with list types

* Support LargeList in decode_nested_example

* Test generate_from_dict with list types

* Test Features.from_dict with list types

* Test _visit with list types

* Support LargeList in _visit

* Test require_storage_cast with list types

* Support LargeList in require_storage_cast

* Refactor test_require_storage_cast_with_list_types

* Test require_storage_embed with list types

* Support LargeList in require_storage_embed

* Fix test_features_reorder_fields_as

* Test Features.reorder_fields_as with list types

* Test Features.reorder_fields_as with dict within list types

* Support LargeList in Features.reorder_fields_as

* Test Features.flatten with list types

* Test embed_array_storage with list types

* Support LargeList in embed_array_storage

* Delete unused tf_utils.is_numeric_feature

* Add LargeList docstring

* Add LargeList to main classes docs

* Address requested changes
albertvillanova added a commit that referenced this pull request Aug 13, 2024 (same commit message as above)
albertvillanova added a commit that referenced this pull request Aug 14, 2024 (same commit message as above)
Successfully merging this pull request may close these issues.

Convert polars DataFrame back to datasets largelisttype not supported (.from_polars())
5 participants