-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support pyarrow large_list #7019
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@albertvillanova really happy to see this fix. Have you attempted to save a dataset to disk after this? I attempted to utilize your fix in a build from source, and while I can now successfully get a dataset object from a polars df containing a large list, I am getting the following error when attempting to save the resulting dataset to disk:
code to reproduce is actually 2 separate scripts below. creating test data:
loading data, converting to HF dataset, attempting to save to disk
If this isn't the appropriate place to put this, let me know. Since it isn't merged yet I didn't think raising an issue was appropriate. |
Thanks for your useful review comments, @dakotamurdock. I am investigating that issue to fix it in this PR. |
There are many feature-functions and most of them are not properly covered by tests. I am adding tests and fixing these feature-functions. |
I think this PR is ready for review, @huggingface/datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool LGTM ! I only left minor suggestions
feature = { | ||
name: Sequence(subfeature, length=feature.length) for name, subfeature in feature.feature.items() | ||
} | ||
sequence_kwargs = vars(feature).copy() | ||
feature = sequence_kwargs.pop("feature") | ||
feature = {name: Sequence(subfeature, **sequence_kwargs) for name, subfeature in feature.items()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those changes are not necessary but I'm fine with keeping them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I made them when implementing Sequence.large and decided to keep them for robustness in case we add some other attribute to Sequence in the future.
src/datasets/table.py
Outdated
elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type): | ||
# feature must be either [subfeature] or Sequence(subfeature) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type): | |
# feature must be either [subfeature] or Sequence(subfeature) | |
elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type): | |
# feature must be either [subfeature] or LargeList(subfeature) or Sequence(subfeature) |
src/datasets/table.py
Outdated
if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type: | ||
# Both array and feature have equal: list type and values (within the list) types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe simpler ?
if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type: | |
# Both array and feature have equal: list type and values (within the list) types | |
if pa.types.is_list(array.type) and casted_array_values.type == array.values.type: | |
# Both array and feature have equal: list type and values (within the list) types |
src/datasets/table.py
Outdated
if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type: | ||
# Both array and feature have equal: list type and values (within the list) types | ||
return array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
if type(array.type) is type(get_nested_type(feature)) and casted_array_values.type == array.values.type: | |
# Both array and feature have equal: list type and values (within the list) types | |
return array | |
if pa.types.is_large_list(array.type) and casted_array_values.type == array.values.type: | |
# Both array and feature have equal: large list type and values (within the list) types | |
return array |
src/datasets/table.py
Outdated
if ( | ||
type(array.type) is type(get_nested_type(feature)) | ||
and casted_array_values.type == array.values.type | ||
): | ||
# Both array and feature have equal: list type and values (within the list) types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
if ( | |
type(array.type) is type(get_nested_type(feature)) | |
and casted_array_values.type == array.values.type | |
): | |
# Both array and feature have equal: list type and values (within the list) types | |
if pa.types.is_list(array.type) and casted_array_values.type == array.values.type: | |
# Both array and feature have equal: list type and values (within the list) types |
src/datasets/table.py
Outdated
@@ -2128,6 +2154,11 @@ def embed_array_storage(array: pa.Array, feature: "FeatureType"): | |||
return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature[0])) | |||
if isinstance(feature, Sequence) and feature.length == -1: | |||
return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature.feature)) | |||
elif pa.types.is_large_list(array.type): | |||
# feature must be either LargeList(subfeature) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# feature must be either LargeList(subfeature) | |
# feature must be LargeList(subfeature) |
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
* Test polars round trip * Test Features.from_arrow_schema * Add large attribute to Sequence * Update get_nested_type to support pa.large_list * Update generate_from_arrow_type to support pa.LargeListType * Fix typo * Rename test * Add require_polars to test * Test from_polars large_list * Update test array_cast with large list * Support large list in array_cast * Test cast_array_to_feature for large list * Support large list in cast_array_to_feature * Fix support large list in cast_array_to_feature * Test save_to_disk with a dataset from polars with large_list * Test Features.reorder_fields_as with large Sequence * Fix Features.reorder_fields_as by using all Sequence params * Test save_to/load_from disk round trip with large_list dataset * Test DatasetInfo.from_dict with large Sequence * Test Features to/from dict round trip with large Sequence * Fix features generate_from_dict by using all Sequence params * Remove debug comments * Test cast_array_to_feature with struct array * Fix cast_array_to_feature for struct array * Test cast_array_to_feature from/to the same Sequence feature dtype * Fix cast_array_to_feature for the same Sequence feature dtype * Add more tests for dataset with large Sequence * Remove Sequence.large * Remove Sequence.large from tests * Add LargeList to tests * Replace tests with Sequence.large with LargeList * Replace Sequence.large with LargeList in test_dataset_info_from_dict * Implement LargeList * Test features to_yaml_list with LargeList * Support LargeList in Features._to_yaml_list * Test Features.from_dict with LargeList * Support LargeList in Features.from_dict * Test Features from_yaml_list with LargeList * Support LargeList in Features._from_yaml_list * Test get_nested_type with scalar/list features * Support LargeList in get_nested_type * Test generate_from_arrow_type with primitive/nested data types * Support LargeList in generate_from_arrow_type * Remove Sequence of dict from test cast_array_to_feature * Support LargeList in cast_array_to_feature * Test Features.encode_example * Test encode_nested_example with list types * Support LargeList in encode_nested_example * Test check_non_null_non_empty_recursive with list types * Support LargeList in check_non_null_non_empty_recursive * Test require_decoding with list types * Support LargeList in require_decoding * Test decode_nested_example with list types * Support LargeList in decode_nested_example * Test generate_from_dict with list types * Test Features.from_dict with list types * Test _visit with list types * Support LargeList in _visit * Test require_storage_cast with list types * Support LargeList in require_storage_cast * Refactor test_require_storage_cast_with_list_types * Test require_storage_embed with list types * Support LargeList in require_storage_embed * Fix test_features_reorder_fields_as * Test Features.reorder_fields_as with list types * Test Features.reorder_fields_as with dict within list types * Support LargeList in Features.reorder_fields_as * Test Features.flatten with list types * Test embed_array_storage with list types * Support LargeList in embed_array_storage * Delete unused tf_utils.is_numeric_feature * Add LargeList docstring * Add LargeList to main classes docs * Address requested changes
* Test polars round trip * Test Features.from_arrow_schema * Add large attribute to Sequence * Update get_nested_type to support pa.large_list * Update generate_from_arrow_type to support pa.LargeListType * Fix typo * Rename test * Add require_polars to test * Test from_polars large_list * Update test array_cast with large list * Support large list in array_cast * Test cast_array_to_feature for large list * Support large list in cast_array_to_feature * Fix support large list in cast_array_to_feature * Test save_to_disk with a dataset from polars with large_list * Test Features.reorder_fields_as with large Sequence * Fix Features.reorder_fields_as by using all Sequence params * Test save_to/load_from disk round trip with large_list dataset * Test DatasetInfo.from_dict with large Sequence * Test Features to/from dict round trip with large Sequence * Fix features generate_from_dict by using all Sequence params * Remove debug comments * Test cast_array_to_feature with struct array * Fix cast_array_to_feature for struct array * Test cast_array_to_feature from/to the same Sequence feature dtype * Fix cast_array_to_feature for the same Sequence feature dtype * Add more tests for dataset with large Sequence * Remove Sequence.large * Remove Sequence.large from tests * Add LargeList to tests * Replace tests with Sequence.large with LargeList * Replace Sequence.large with LargeList in test_dataset_info_from_dict * Implement LargeList * Test features to_yaml_list with LargeList * Support LargeList in Features._to_yaml_list * Test Features.from_dict with LargeList * Support LargeList in Features.from_dict * Test Features from_yaml_list with LargeList * Support LargeList in Features._from_yaml_list * Test get_nested_type with scalar/list features * Support LargeList in get_nested_type * Test generate_from_arrow_type with primitive/nested data types * Support LargeList in generate_from_arrow_type * Remove Sequence of dict from test cast_array_to_feature * Support LargeList in cast_array_to_feature * Test Features.encode_example * Test encode_nested_example with list types * Support LargeList in encode_nested_example * Test check_non_null_non_empty_recursive with list types * Support LargeList in check_non_null_non_empty_recursive * Test require_decoding with list types * Support LargeList in require_decoding * Test decode_nested_example with list types * Support LargeList in decode_nested_example * Test generate_from_dict with list types * Test Features.from_dict with list types * Test _visit with list types * Support LargeList in _visit * Test require_storage_cast with list types * Support LargeList in require_storage_cast * Refactor test_require_storage_cast_with_list_types * Test require_storage_embed with list types * Support LargeList in require_storage_embed * Fix test_features_reorder_fields_as * Test Features.reorder_fields_as with list types * Test Features.reorder_fields_as with dict within list types * Support LargeList in Features.reorder_fields_as * Test Features.flatten with list types * Test embed_array_storage with list types * Support LargeList in embed_array_storage * Delete unused tf_utils.is_numeric_feature * Add LargeList docstring * Add LargeList to main classes docs * Address requested changes
* Test polars round trip * Test Features.from_arrow_schema * Add large attribute to Sequence * Update get_nested_type to support pa.large_list * Update generate_from_arrow_type to support pa.LargeListType * Fix typo * Rename test * Add require_polars to test * Test from_polars large_list * Update test array_cast with large list * Support large list in array_cast * Test cast_array_to_feature for large list * Support large list in cast_array_to_feature * Fix support large list in cast_array_to_feature * Test save_to_disk with a dataset from polars with large_list * Test Features.reorder_fields_as with large Sequence * Fix Features.reorder_fields_as by using all Sequence params * Test save_to/load_from disk round trip with large_list dataset * Test DatasetInfo.from_dict with large Sequence * Test Features to/from dict round trip with large Sequence * Fix features generate_from_dict by using all Sequence params * Remove debug comments * Test cast_array_to_feature with struct array * Fix cast_array_to_feature for struct array * Test cast_array_to_feature from/to the same Sequence feature dtype * Fix cast_array_to_feature for the same Sequence feature dtype * Add more tests for dataset with large Sequence * Remove Sequence.large * Remove Sequence.large from tests * Add LargeList to tests * Replace tests with Sequence.large with LargeList * Replace Sequence.large with LargeList in test_dataset_info_from_dict * Implement LargeList * Test features to_yaml_list with LargeList * Support LargeList in Features._to_yaml_list * Test Features.from_dict with LargeList * Support LargeList in Features.from_dict * Test Features from_yaml_list with LargeList * Support LargeList in Features._from_yaml_list * Test get_nested_type with scalar/list features * Support LargeList in get_nested_type * Test generate_from_arrow_type with primitive/nested data types * Support LargeList in generate_from_arrow_type * Remove Sequence of dict from test cast_array_to_feature * Support LargeList in cast_array_to_feature * Test Features.encode_example * Test encode_nested_example with list types * Support LargeList in encode_nested_example * Test check_non_null_non_empty_recursive with list types * Support LargeList in check_non_null_non_empty_recursive * Test require_decoding with list types * Support LargeList in require_decoding * Test decode_nested_example with list types * Support LargeList in decode_nested_example * Test generate_from_dict with list types * Test Features.from_dict with list types * Test _visit with list types * Support LargeList in _visit * Test require_storage_cast with list types * Support LargeList in require_storage_cast * Refactor test_require_storage_cast_with_list_types * Test require_storage_embed with list types * Support LargeList in require_storage_embed * Fix test_features_reorder_fields_as * Test Features.reorder_fields_as with list types * Test Features.reorder_fields_as with dict within list types * Support LargeList in Features.reorder_fields_as * Test Features.flatten with list types * Test embed_array_storage with list types * Support LargeList in embed_array_storage * Delete unused tf_utils.is_numeric_feature * Add LargeList docstring * Add LargeList to main classes docs * Address requested changes
Allow Polars round trip by supporting pyarrow large list.
Fix #6834, fix #6984.
Supersede and close #4800, close #6835, close #6986.