Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using "Sequence" instead of "List" #5354

Open
tranhd95 opened this issue Dec 12, 2022 · 10 comments
Open

Consider using "Sequence" instead of "List" #5354

tranhd95 opened this issue Dec 12, 2022 · 10 comments
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@tranhd95
Copy link

tranhd95 commented Dec 12, 2022

Feature request

Hi, please consider using Sequence type annotation instead of List in function arguments such as in Dataset.from_parquet(). It leads to type checking errors, see below.

How to reproduce

list_of_filenames = ["foo.parquet", "bar.parquet"]
ds = Dataset.from_parquet(list_of_filenames)

Expected mypy output:

Success: no issues found

Actual mypy output:

test.py:19: error: Argument 1 to "from_parquet" of "Dataset" has incompatible type "List[str]"; expected "Union[Union[str, bytes, PathLike[Any]], List[Union[str, bytes, PathLike[Any]]]]"  [arg-type]
test.py:19: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
test.py:19: note: Consider using "Sequence" instead, which is covariant

Env: mypy 0.991, Python 3.10.0, datasets 2.7.1

@tranhd95 tranhd95 added the enhancement New feature or request label Dec 12, 2022
@mariosasko
Copy link
Collaborator

Hi! Linking a comment to provide more info on the issue: https://stackoverflow.com/a/39458225. This means we should replace all (most of) the occurrences of List with Sequence in function signatures.

@tranhd95 Would you be interested in submitting a PR?

@mariosasko mariosasko added the good first issue Good for newcomers label Dec 15, 2022
@dantema
Copy link

dantema commented Dec 16, 2022

Hi all! I tried to reproduce this issue and didn't work for me. Also in your example i noticed that the variables have different names: list_of_filenames and list_of_files, could this be related to that?

#I found random data in parquet format:
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet"
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"

#Then i try reproduce
list_of_files = ["userdata1.parquet", "userdata2.parquet"]
ds = Dataset.from_parquet(list_of_files)

My output:

WARNING:datasets.builder:Using custom data configuration default-e287d097dc54e046
Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100%
1/1 [00:00<00:00, 40.38it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 23.43it/s]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.

P.S. This is my first experience with open source. So do not judge strictly if I do not understand something)

@tranhd95
Copy link
Author

@dantema There is indeed a typo in variable names. Nevertheless, I'm sorry if I was not clear but the output is from mypy type checker. You can run the code snippet without issues. The problem is with the type checking.

@tranhd95
Copy link
Author

However, I found out that the type annotation is actually misleading. The from_parquet method should also accept list of PathLike objects which includes os.PathLike. But if I would ran the code snippet below, an exception is thrown.

Code

from pathlib import Path

list_of_filenames = [Path("foo.parquet"), Path("bar.parquet")]
ds = Dataset.from_parquet(list_of_filenames)

Output

[/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in from_parquet(path_or_paths, split, features, cache_dir, keep_in_memory, columns, **kwargs)
   1071         from .io.parquet import ParquetDatasetReader
   1072 
-> 1073         return ParquetDatasetReader(
   1074             path_or_paths,
   1075             split=split,

[/usr/local/lib/python3.8/dist-packages/datasets/io/parquet.py](https://localhost:8080/#) in __init__(self, path_or_paths, split, features, cache_dir, keep_in_memory, streaming, **kwargs)
     35         path_or_paths = path_or_paths if isinstance(path_or_paths, dict) else {self.split: path_or_paths}
     36         hash = _PACKAGED_DATASETS_MODULES["parquet"][1]
---> 37         self.builder = Parquet(
     38             cache_dir=cache_dir,
     39             data_files=path_or_paths,

[/usr/local/lib/python3.8/dist-packages/datasets/builder.py](https://localhost:8080/#) in __init__(self, cache_dir, config_name, hash, base_path, info, features, use_auth_token, repo_id, data_files, data_dir, name, **config_kwargs)
    298 
    299         if data_files is not None and not isinstance(data_files, DataFilesDict):
--> 300             data_files = DataFilesDict.from_local_or_remote(
    301                 sanitize_patterns(data_files), base_path=base_path, use_auth_token=use_auth_token
    302             )

[/usr/local/lib/python3.8/dist-packages/datasets/data_files.py](https://localhost:8080/#) in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    794         for key, patterns_for_key in patterns.items():
    795             out[key] = (
--> 796                 DataFilesList.from_local_or_remote(
    797                     patterns_for_key,
    798                     base_path=base_path,

[/usr/local/lib/python3.8/dist-packages/datasets/data_files.py](https://localhost:8080/#) in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    762     ) -> "DataFilesList":
    763         base_path = base_path if base_path is not None else str(Path().resolve())
--> 764         data_files = resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    765         origin_metadata = _get_origin_metadata_locally_or_by_urls(data_files, use_auth_token=use_auth_token)
    766         return cls(data_files, origin_metadata)

[/usr/local/lib/python3.8/dist-packages/datasets/data_files.py](https://localhost:8080/#) in resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    357     data_files = []
    358     for pattern in patterns:
--> 359         if is_remote_url(pattern):
    360             data_files.append(Url(pattern))
    361         else:

[/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py](https://localhost:8080/#) in is_remote_url(url_or_filename)
     62 
     63 def is_remote_url(url_or_filename: str) -> bool:
---> 64     parsed = urlparse(url_or_filename)
     65     return parsed.scheme in ("http", "https", "s3", "gs", "hdfs", "ftp")
     66 

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in urlparse(url, scheme, allow_fragments)
    373     Note that we don't break the components up in smaller bits
    374     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 375     url, scheme, _coerce_result = _coerce_args(url, scheme)
    376     splitresult = urlsplit(url, scheme, allow_fragments)
    377     scheme, netloc, url, query, fragment = splitresult

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in _coerce_args(*args)
    125     if str_input:
    126         return args + (_noop,)
--> 127     return _decode_args(args) + (_encode_result,)
    128 
    129 # Result objects are more helpful than simple tuples

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in _decode_args(args, encoding, errors)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                        errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112 
    113 def _coerce_args(*args):

[/usr/lib/python3.8/urllib/parse.py](https://localhost:8080/#) in <genexpr>(.0)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                        errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112 
    113 def _coerce_args(*args):

AttributeError: 'PosixPath' object has no attribute 'decode'

@mariosasko Should I create a new issue?

@avinashsai
Copy link
Contributor

@mariosasko I would like to take this issue up.

@mariosasko
Copy link
Collaborator

@avinashsai Hi, I've assigned you the issue.

@tranhd95 Yes, feel free to report this in a new issue.

@Unknown3141592
Copy link
Contributor

@avinashsai Are you still working on this? If not I would like to give it a try.

@UNCANNY69
Copy link

@mariosasko I would like to take this issue up!

@ojuschugh1
Copy link

Hi @tranhd95 @mariosasko ,I hope you all are doing well.

I am interested in this issue, is this still open and unresolved ?

Thanks and Regards

@hritikranjan1
Copy link

@mariosasko I would like to take this issue up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

8 participants