Support pyarrow large_list #7019

Merged
merged 78 commits into from
Aug 12, 2024
0545de4
Test polars round trip
albertvillanova Jul 2, 2024
4f23eb0
Test Features.from_arrow_schema
albertvillanova Jul 2, 2024
c870450
Add large attribute to Sequence
albertvillanova Jul 2, 2024
427f117
Update get_nested_type to support pa.large_list
albertvillanova Jul 2, 2024
69f3548
Update generate_from_arrow_type to support pa.LargeListType
albertvillanova Jul 2, 2024
9fdec4d
Fix typo
albertvillanova Jul 2, 2024
84f3014
Rename test
albertvillanova Jul 2, 2024
9ea8eaf
Merge remote-tracking branch 'upstream/main' into fix-6834-6984
albertvillanova Jul 2, 2024
df13687
Add require_polars to test
albertvillanova Jul 3, 2024
9bc5182
Test from_polars large_list
albertvillanova Jul 3, 2024
6345fdc
Merge remote-tracking branch 'upstream/main' into fix-6834-6984
albertvillanova Jul 3, 2024
0d997cd
Update test array_cast with large list
albertvillanova Jul 8, 2024
d1bd580
Support large list in array_cast
albertvillanova Jul 8, 2024
87bd7e3
Test cast_array_to_feature for large list
albertvillanova Jul 8, 2024
a772762
Support large list in cast_array_to_feature
albertvillanova Jul 8, 2024
882d363
Merge remote-tracking branch 'upstream/main' into fix-6834-6984
albertvillanova Jul 8, 2024
78b3a8f
Fix support large list in cast_array_to_feature
albertvillanova Jul 8, 2024
300a5a9
Test save_to_disk with a dataset from polars with large_list
albertvillanova Jul 18, 2024
cd0901c
Test Features.reorder_fields_as with large Sequence
albertvillanova Jul 18, 2024
a2c7bd0
Fix Features.reorder_fields_as by using all Sequence params
albertvillanova Jul 18, 2024
d0e114c
Test save_to/load_from disk round trip with large_list dataset
albertvillanova Jul 18, 2024
1f9f594
Test DatasetInfo.from_dict with large Sequence
albertvillanova Jul 18, 2024
a4eb288
Test Features to/from dict round trip with large Sequence
albertvillanova Jul 18, 2024
9020ccf
Fix features generate_from_dict by using all Sequence params
albertvillanova Jul 18, 2024
057d184
Remove debug comments
albertvillanova Jul 18, 2024
8f3b02c
Test cast_array_to_feature with struct array
albertvillanova Jul 19, 2024
f6e528f
Fix cast_array_to_feature for struct array
albertvillanova Jul 19, 2024
89d4366
Test cast_array_to_feature from/to the same Sequence feature dtype
albertvillanova Jul 19, 2024
eaf4c64
Fix cast_array_to_feature for the same Sequence feature dtype
albertvillanova Jul 19, 2024
1f28c5f
Add more tests for dataset with large Sequence
albertvillanova Jul 19, 2024
6f3604c
Merge branch 'main' into fix-6834-6984
albertvillanova Jul 30, 2024
33a1a55
Remove Sequence.large
albertvillanova Jul 31, 2024
6e6e9b7
Remove Sequence.large from tests
albertvillanova Jul 31, 2024
bfa8fae
Add LargeList to tests
albertvillanova Aug 2, 2024
8215a61
Replace tests with Sequence.large with LargeList
albertvillanova Aug 2, 2024
152d6dd
Replace Sequence.large with LargeList in test_dataset_info_from_dict
albertvillanova Aug 2, 2024
632d1ea
Implement LargeList
albertvillanova Aug 2, 2024
1f247bc
Test features to_yaml_list with LargeList
albertvillanova Aug 2, 2024
f08f216
Support LargeList in Features._to_yaml_list
albertvillanova Aug 2, 2024
a79e337
Test Features.from_dict with LargeList
albertvillanova Aug 2, 2024
b76aaa0
Support LargeList in Features.from_dict
albertvillanova Aug 2, 2024
a677143
Test Features from_yaml_list with LargeList
albertvillanova Aug 2, 2024
31d22dd
Support LargeList in Features._from_yaml_list
albertvillanova Aug 2, 2024
79772a6
Test get_nested_type with scalar/list features
albertvillanova Aug 2, 2024
af22e52
Support LargeList in get_nested_type
albertvillanova Aug 2, 2024
0611fdc
Test generate_from_arrow_type with primitive/nested data types
albertvillanova Aug 2, 2024
a1eff5c
Support LargeList in generate_from_arrow_type
albertvillanova Aug 2, 2024
e72d8fe
Remove Sequence of dict from test cast_array_to_feature
albertvillanova Aug 2, 2024
bf646ac
Support LargeList in cast_array_to_feature
albertvillanova Aug 2, 2024
78a9a78
Test Features.encode_example
albertvillanova Aug 5, 2024
968364c
Test encode_nested_example with list types
albertvillanova Aug 5, 2024
60465af
Support LargeList in encode_nested_example
albertvillanova Aug 5, 2024
77aa27f
Test check_non_null_non_empty_recursive with list types
albertvillanova Aug 5, 2024
19e9deb
Support LargeList in check_non_null_non_empty_recursive
albertvillanova Aug 5, 2024
b27a8a1
Test require_decoding with list types
albertvillanova Aug 5, 2024
9ec883b
Support LargeList in require_decoding
albertvillanova Aug 5, 2024
ab8724b
Test decode_nested_example with list types
albertvillanova Aug 5, 2024
30ba3bc
Support LargeList in decode_nested_example
albertvillanova Aug 5, 2024
b1a3db7
Test generate_from_dict with list types
albertvillanova Aug 6, 2024
b2a5789
Test Features.from_dict with list types
albertvillanova Aug 6, 2024
7c39b51
Test _visit with list types
albertvillanova Aug 6, 2024
48d143c
Support LargeList in _visit
albertvillanova Aug 6, 2024
3968181
Test require_storage_cast with list types
albertvillanova Aug 6, 2024
8e94ca0
Support LargeList in require_storage_cast
albertvillanova Aug 6, 2024
40622e5
Refactor test_require_storage_cast_with_list_types
albertvillanova Aug 6, 2024
1dea864
Test require_storage_embed with list types
albertvillanova Aug 6, 2024
c3bacba
Support LargeList in require_storage_embed
albertvillanova Aug 6, 2024
c055ff3
Fix test_features_reorder_fields_as
albertvillanova Aug 6, 2024
823a049
Test Features.reorder_fields_as with list types
albertvillanova Aug 6, 2024
45326a9
Test Features.reorder_fields_as with dict within list types
albertvillanova Aug 6, 2024
9acf8d9
Support LargeList in Features.reorder_fields_as
albertvillanova Aug 6, 2024
5c8646b
Test Features.flatten with list types
albertvillanova Aug 6, 2024
f11c56d
Test embed_array_storage with list types
albertvillanova Aug 6, 2024
27d0f94
Support LargeList in embed_array_storage
albertvillanova Aug 6, 2024
4821c24
Delete unused tf_utils.is_numeric_feature
albertvillanova Aug 6, 2024
41f6068
Add LargeList docstring
albertvillanova Aug 7, 2024
bb6baf5
Add LargeList to main classes docs
albertvillanova Aug 7, 2024
431694f
Address requested changes
albertvillanova Aug 12, 2024
6 changes: 4 additions & 2 deletions docs/source/package_reference/main_classes.mdx
@@ -211,11 +211,13 @@ Dictionary with split names as keys ('train', 'test' for example), and `Iterable
 
 [[autodoc]] datasets.Features
 
-[[autodoc]] datasets.Sequence
+[[autodoc]] datasets.Value
 
 [[autodoc]] datasets.ClassLabel
 
-[[autodoc]] datasets.Value
+[[autodoc]] datasets.LargeList
+
+[[autodoc]] datasets.Sequence
 
 [[autodoc]] datasets.Translation
 
3 changes: 2 additions & 1 deletion src/datasets/features/__init__.py
@@ -6,13 +6,14 @@
     "Array5D",
     "ClassLabel",
     "Features",
+    "LargeList",
     "Sequence",
     "Value",
     "Image",
     "Translation",
     "TranslationVariableLanguages",
 ]
 from .audio import Audio
-from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, Sequence, Value
+from .features import Array2D, Array3D, Array4D, Array5D, ClassLabel, Features, LargeList, Sequence, Value
 from .image import Image
 from .translation import Translation, TranslationVariableLanguages
170 changes: 114 additions & 56 deletions src/datasets/features/features.py

Large diffs are not rendered by default.

46 changes: 37 additions & 9 deletions src/datasets/table.py
@@ -1884,7 +1884,7 @@ def array_cast(
                 return array
             arrays = [_c(array.field(field.name), field.type) for field in pa_type]
             return pa.StructArray.from_arrays(arrays, fields=list(pa_type), mask=array.is_null())
-    elif pa.types.is_list(array.type):
+    elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type):
         if pa.types.is_fixed_size_list(pa_type):
             if _are_list_values_of_length(array, pa_type.list_size):
                 if array.null_count > 0:
@@ -1911,6 +1911,10 @@ def array_cast(
             # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
             array_offsets = _combine_list_array_offsets_with_mask(array)
             return pa.ListArray.from_arrays(array_offsets, _c(array.values, pa_type.value_type))
+        elif pa.types.is_large_list(pa_type):
+            # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
+            array_offsets = _combine_list_array_offsets_with_mask(array)
+            return pa.LargeListArray.from_arrays(array_offsets, _c(array.values, pa_type.value_type))
     elif pa.types.is_fixed_size_list(array.type):
         if pa.types.is_fixed_size_list(pa_type):
             if pa_type.list_size == array.type.list_size:
@@ -1923,6 +1927,11 @@ def array_cast(
         elif pa.types.is_list(pa_type):
             array_offsets = (np.arange(len(array) + 1) + array.offset) * array.type.list_size
             return pa.ListArray.from_arrays(array_offsets, _c(array.values, pa_type.value_type), mask=array.is_null())
+        elif pa.types.is_large_list(pa_type):
+            array_offsets = (np.arange(len(array) + 1) + array.offset) * array.type.list_size
+            return pa.LargeListArray.from_arrays(
+                array_offsets, _c(array.values, pa_type.value_type), mask=array.is_null()
+            )
     else:
         if pa.types.is_string(pa_type):
             if not allow_primitive_to_str and pa.types.is_primitive(array.type):
@@ -1972,7 +1981,7 @@ def cast_array_to_feature(
     Returns:
         array (`pyarrow.Array`): the casted array
     """
-    from .features.features import Sequence, get_nested_type
+    from .features.features import LargeList, Sequence, get_nested_type
 
     _c = partial(
         cast_array_to_feature,
@@ -1988,24 +1997,34 @@ def cast_array_to_feature(
     elif pa.types.is_struct(array.type):
         # feature must be a dict or Sequence(subfeatures_dict)
         if isinstance(feature, Sequence) and isinstance(feature.feature, dict):
-            feature = {
-                name: Sequence(subfeature, length=feature.length) for name, subfeature in feature.feature.items()
-            }
+            sequence_kwargs = vars(feature).copy()
+            feature = sequence_kwargs.pop("feature")
+            feature = {name: Sequence(subfeature, **sequence_kwargs) for name, subfeature in feature.items()}
Comment on lines -1991 to +2002

Member

Those changes are not necessary, but I'm fine with keeping them.

Member Author

Yes, I made them when implementing Sequence.large and decided to keep them for robustness in case we add some other attribute to Sequence in the future.
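
The robustness argument in this reply can be illustrated with a small standalone sketch. It assumes a hypothetical dataclass `Seq` standing in for a Sequence-like feature (it is not the `datasets` class) and a hypothetical helper `distribute`. Copying `vars(...)` and popping `feature` forwards every remaining attribute, so an attribute added later would be carried over without touching this code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Seq:
    """Hypothetical stand-in for a Sequence-like feature (not the datasets class)."""

    feature: Any
    length: int = -1
    id: Optional[str] = None


def distribute(seq: Seq) -> Dict[str, Seq]:
    """Turn Seq({name: sub}) into {name: Seq(sub)}, forwarding all other attributes."""
    kwargs = vars(seq).copy()  # copy so the original object is left unchanged
    subfeatures = kwargs.pop("feature")  # everything left in kwargs is forwarded as-is
    return {name: Seq(sub, **kwargs) for name, sub in subfeatures.items()}


original = Seq({"a": "int32", "b": "string"}, length=3)
result = distribute(original)
```

With the previous hard-coded `length=feature.length` form, only `length` would have been forwarded. In this sketch `id` and any future field are forwarded as well.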

         if isinstance(feature, dict) and {field.name for field in array.type} == set(feature):
             if array.type.num_fields == 0:
                 return array
             arrays = [_c(array.field(name), subfeature) for name, subfeature in feature.items()]
             return pa.StructArray.from_arrays(arrays, names=list(feature), mask=array.is_null())
-    elif pa.types.is_list(array.type):
-        # feature must be either [subfeature] or Sequence(subfeature)
+    elif pa.types.is_list(array.type) or pa.types.is_large_list(array.type):
+        # feature must be either [subfeature] or LargeList(subfeature) or Sequence(subfeature)
         if isinstance(feature, list):
             casted_array_values = _c(array.values, feature[0])
-            if casted_array_values.type == array.values.type:
+            if pa.types.is_list(array.type) and casted_array_values.type == array.values.type:
+                # Both array and feature have equal list type and values (within the list) type
                 return array
             else:
                 # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
                 array_offsets = _combine_list_array_offsets_with_mask(array)
                 return pa.ListArray.from_arrays(array_offsets, casted_array_values)
+        elif isinstance(feature, LargeList):
+            casted_array_values = _c(array.values, feature.dtype)
+            if pa.types.is_large_list(array.type) and casted_array_values.type == array.values.type:
+                # Both array and feature have equal large_list type and values (within the list) type
+                return array
+            else:
+                # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
+                array_offsets = _combine_list_array_offsets_with_mask(array)
+                return pa.LargeListArray.from_arrays(array_offsets, casted_array_values)
         elif isinstance(feature, Sequence):
             if feature.length > -1:
                 if _are_list_values_of_length(array, feature.length):
@@ -2042,7 +2061,8 @@ def cast_array_to_feature(
                         return pa.FixedSizeListArray.from_arrays(_c(array_values, feature.feature), feature.length)
             else:
                 casted_array_values = _c(array.values, feature.feature)
-                if casted_array_values.type == array.values.type:
+                if pa.types.is_list(array.type) and casted_array_values.type == array.values.type:
+                    # Both array and feature have equal list type and values (within the list) type
                     return array
                 else:
                     # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
@@ -2053,6 +2073,9 @@ def cast_array_to_feature(
         if isinstance(feature, list):
             array_offsets = (np.arange(len(array) + 1) + array.offset) * array.type.list_size
             return pa.ListArray.from_arrays(array_offsets, _c(array.values, feature[0]), mask=array.is_null())
+        elif isinstance(feature, LargeList):
+            array_offsets = (np.arange(len(array) + 1) + array.offset) * array.type.list_size
+            return pa.LargeListArray.from_arrays(array_offsets, _c(array.values, feature.dtype), mask=array.is_null())
         elif isinstance(feature, Sequence):
             if feature.length > -1:
                 if feature.length == array.type.list_size:
@@ -2128,6 +2151,11 @@ def embed_array_storage(array: pa.Array, feature: "FeatureType"):
             return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature[0]))
         if isinstance(feature, Sequence) and feature.length == -1:
             return pa.ListArray.from_arrays(array_offsets, _e(array.values, feature.feature))
+    elif pa.types.is_large_list(array.type):
+        # feature must be LargeList(subfeature)
+        # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
+        array_offsets = _combine_list_array_offsets_with_mask(array)
+        return pa.LargeListArray.from_arrays(array_offsets, _e(array.values, feature.dtype))
     elif pa.types.is_fixed_size_list(array.type):
         # feature must be Sequence(subfeature)
         if isinstance(feature, Sequence) and feature.length > -1:
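
The fixed-size-list branches in `array_cast` and `cast_array_to_feature` compute list offsets arithmetically as `(np.arange(len(array) + 1) + array.offset) * array.type.list_size`. A minimal pure-Python sketch of that arithmetic follows. The helper name is hypothetical and plain integers stand in for the Arrow array's length, offset, and list size.

```python
from typing import List


def fixed_size_list_offsets(length: int, offset: int, list_size: int) -> List[int]:
    """Offsets for `length` fixed-size lists whose view starts at list index `offset`.

    Mirrors (np.arange(len(array) + 1) + array.offset) * array.type.list_size
    from the diff; the function name is hypothetical.
    """
    return [(i + offset) * list_size for i in range(length + 1)]


# Three lists of size 2, where the array is a slice starting at the second list.
offsets = fixed_size_list_offsets(length=3, offset=1, list_size=2)
print(offsets)  # [2, 4, 6, 8]
```

List `i` then covers `values[offsets[i]:offsets[i + 1]]`. The same offsets are passed to either `pa.ListArray.from_arrays` or `pa.LargeListArray.from_arrays` in the diff. The two array types differ in the integer width used to store the offsets.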
18 changes: 0 additions & 18 deletions src/datasets/utils/tf_utils.py
@@ -67,24 +67,6 @@ def is_numeric_pa_type(pa_type):
     return pa.types.is_integer(pa_type) or pa.types.is_floating(pa_type) or pa.types.is_decimal(pa_type)
 
 
-def is_numeric_feature(feature):
-    from .. import ClassLabel, Sequence, Value
-    from ..features.features import _ArrayXD
-
-    if isinstance(feature, Sequence):
-        return is_numeric_feature(feature.feature)
-    elif isinstance(feature, list):
-        return is_numeric_feature(feature[0])
-    elif isinstance(feature, _ArrayXD):
-        return is_numeric_pa_type(feature().storage_dtype)
-    elif isinstance(feature, Value):
-        return is_numeric_pa_type(feature())
-    elif isinstance(feature, ClassLabel):
-        return True
-    else:
-        return False
-
-
 def np_get_batch(
     indices, dataset, cols_to_retain, collate_fn, collate_fn_args, columns_to_np_types, return_dict=False
 ):