Consider using "Sequence" instead of "List" #5354

tranhd95 · 2022-12-12T15:39:45Z

Feature request

Hi, please consider using Sequence type annotation instead of List in function arguments such as in Dataset.from_parquet(). It leads to type checking errors, see below.

How to reproduce

list_of_filenames = ["foo.parquet", "bar.parquet"]
ds = Dataset.from_parquet(list_of_filenames)

Expected mypy output:

Success: no issues found

Actual mypy output:

test.py:19: error: Argument 1 to "from_parquet" of "Dataset" has incompatible type "List[str]"; expected "Union[Union[str, bytes, PathLike[Any]], List[Union[str, bytes, PathLike[Any]]]]"  [arg-type]
test.py:19: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
test.py:19: note: Consider using "Sequence" instead, which is covariant

Env: mypy 0.991, Python 3.10.0, datasets 2.7.1

The text was updated successfully, but these errors were encountered:

mariosasko · 2022-12-13T17:18:51Z

Hi! Linking a comment to provide more info on the issue: https://stackoverflow.com/a/39458225. This means we should replace all (most of) the occurrences of List with Sequence in function signatures.

@tranhd95 Would you be interested in submitting a PR?

dantema · 2022-12-16T19:14:28Z

Hi all! I tried to reproduce this issue and didn't work for me. Also in your example i noticed that the variables have different names: list_of_filenames and list_of_files, could this be related to that?

#I found random data in parquet format:
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet"
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"

#Then i try reproduce
list_of_files = ["userdata1.parquet", "userdata2.parquet"]
ds = Dataset.from_parquet(list_of_files)

My output:

WARNING:datasets.builder:Using custom data configuration default-e287d097dc54e046
Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%
1/1 [00:00<00:00, 40.38it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.43it/s]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.

P.S. This is my first experience with open source. So do not judge strictly if I do not understand something)

tranhd95 · 2022-12-16T20:46:19Z

@dantema There is indeed a typo in variable names. Nevertheless, I'm sorry if I was not clear but the output is from mypy type checker. You can run the code snippet without issues. The problem is with the type checking.

tranhd95 · 2022-12-16T21:02:43Z

However, I found out that the type annotation is actually misleading. The from_parquet method should also accept list of PathLike objects which includes os.PathLike. But if I would ran the code snippet below, an exception is thrown.

Code

from pathlib import Path

list_of_filenames = [Path("foo.parquet"), Path("bar.parquet")]
ds = Dataset.from_parquet(list_of_filenames)

Output

[/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in from_parquet(path_or_paths, split, features, cache_dir, keep_in_memory, columns, **kwargs)
   1071         from .io.parquet import ParquetDatasetReader
   1072 
-> 1073         return ParquetDatasetReader(
   1074             path_or_paths,
   1075             split=split,

[/usr/local/lib/python3.8/dist-packages/datasets/io/parquet.py](https://localhost:8080/#) in __init__(self, path_or_paths, split, features, cache_dir, keep_in_memory, streaming, **kwargs)
     35         path_or_paths = path_or_paths if isinstance(path_or_paths, dict) else {self.split: path_or_paths}
     36         hash = _PACKAGED_DATASETS_MODULES["parquet"][1]
---> 37         self.builder = Parquet(
     38             cache_dir=cache_dir,
     39             data_files=path_or_paths,

[/usr/local/lib/python3.8/dist-packages/datasets/builder.py](https://localhost:8080/#) in __init__(self, cache_dir, config_name, hash, base_path, info, features, use_auth_token, repo_id, data_files, data_dir, name, **config_kwargs)
    298 
    299         if data_files is not None and not isinstance(data_files, DataFilesDict):
--> 300             data_files = DataFilesDict.from_local_or_remote(
    301                 sanitize_patterns(data_files), base_path=base_path, use_auth_token=use_auth_token
    302             )

[/usr/local/lib/python3.8/dist-packages/datasets/data_files.py](https://localhost:8080/#) in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    794         for key, patterns_for_key in patterns.items():
    795             out[key] = (
--> 796                 DataFilesList.from_local_or_remote(
    797                     patterns_for_key,
    798                     base_path=base_path,

[/usr/local/lib/python3.8/dist-packages/datasets/data_files.py](https://localhost:8080/#) in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    762     ) -> "DataFilesList":
    763         base_path = base_path if base_path is not None else str(Path().resolve())
--> 764         data_files = resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    765         origin_metadata = _get_origin_metadata_locally_or_by_urls(data_files, use_auth_token=use_auth_token)
    766         return cls(data_files, origin_metadata)

[/usr/local/lib/python3.8/dist-packages/datasets/data_files.py](https://localhost:8080/#) in resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    357     data_files = []
    358     for pattern in patterns:
--> 359         if is_remote_url(pattern):
    360             data_files.append(Url(pattern))
    361         else:

[/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py](https://localhost:8080/#) in is_remote_url(url_or_filename)
     62 
     63 def is_remote_url(url_or_filename: str) -> bool:
---> 64     parsed = urlparse(url_or_filename)
     65     return parsed.scheme in ("http", "https", "s3", "gs", "hdfs", "ftp")
     66 

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in urlparse(url, scheme, allow_fragments)
    373     Note that we don't break the components up in smaller bits
    374     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 375     url, scheme, _coerce_result = _coerce_args(url, scheme)
    376     splitresult = urlsplit(url, scheme, allow_fragments)
    377     scheme, netloc, url, query, fragment = splitresult

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in _coerce_args(*args)
    125     if str_input:
    126         return args + (_noop,)
--> 127     return _decode_args(args) + (_encode_result,)
    128 
    129 # Result objects are more helpful than simple tuples

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in _decode_args(args, encoding, errors)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                        errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112 
    113 def _coerce_args(*args):

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in <genexpr>(.0)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                        errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112 
    113 def _coerce_args(*args):

AttributeError: 'PosixPath' object has no attribute 'decode'

@mariosasko Should I create a new issue?

avinashsai · 2023-02-01T05:14:14Z

@mariosasko I would like to take this issue up.

mariosasko · 2023-02-01T13:55:13Z

@avinashsai Hi, I've assigned you the issue.

@tranhd95 Yes, feel free to report this in a new issue.

Unknown3141592 · 2023-07-26T16:25:50Z

@avinashsai Are you still working on this? If not I would like to give it a try.

UNCANNY69 · 2023-11-20T06:21:28Z

@mariosasko I would like to take this issue up!

ojuschugh1 · 2024-01-20T19:57:16Z

Hi @tranhd95 @mariosasko ,I hope you all are doing well.

I am interested in this issue, is this still open and unresolved ?

Thanks and Regards

hritikranjan1 · 2024-10-05T14:38:43Z

@mariosasko I would like to take this issue up.

tranhd95 added the enhancement New feature or request label Dec 12, 2022

mariosasko added the good first issue Good for newcomers label Dec 15, 2022

mariosasko assigned avinashsai Feb 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider using "Sequence" instead of "List" #5354

Consider using "Sequence" instead of "List" #5354

tranhd95 commented Dec 12, 2022 •

edited

Loading

mariosasko commented Dec 13, 2022

dantema commented Dec 16, 2022

tranhd95 commented Dec 16, 2022

tranhd95 commented Dec 16, 2022

avinashsai commented Feb 1, 2023

mariosasko commented Feb 1, 2023

Unknown3141592 commented Jul 26, 2023

UNCANNY69 commented Nov 20, 2023

ojuschugh1 commented Jan 20, 2024

hritikranjan1 commented Oct 5, 2024

Consider using "Sequence" instead of "List" #5354

Consider using "Sequence" instead of "List" #5354

Comments

tranhd95 commented Dec 12, 2022 • edited Loading

Feature request

mariosasko commented Dec 13, 2022

dantema commented Dec 16, 2022

tranhd95 commented Dec 16, 2022

tranhd95 commented Dec 16, 2022

avinashsai commented Feb 1, 2023

mariosasko commented Feb 1, 2023

Unknown3141592 commented Jul 26, 2023

UNCANNY69 commented Nov 20, 2023

ojuschugh1 commented Jan 20, 2024

hritikranjan1 commented Oct 5, 2024

tranhd95 commented Dec 12, 2022 •

edited

Loading