Consider using "Sequence" instead of "List" #5354
Comments
Hi! Linking a comment to provide more info on the issue: https://stackoverflow.com/a/39458225. This means we should replace all (most of) the occurrences of `List` with `Sequence` in function signatures. @tranhd95 Would you be interested in submitting a PR?
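(For context, a minimal sketch of the variance point the linked answer makes; the two helper functions below are illustrative, not part of the `datasets` API.)

```python
from pathlib import Path
from typing import List, Sequence, Union

PathType = Union[str, Path]

def takes_list(paths: List[PathType]) -> None: ...       # List is invariant
def takes_sequence(paths: Sequence[PathType]) -> None: ...  # Sequence is covariant (read-only)

files: List[str] = ["a.parquet", "b.parquet"]

takes_list(files)      # mypy error: List[str] is not a List[Union[str, Path]]
takes_sequence(files)  # accepted: Sequence[str] is a Sequence[Union[str, Path]]
```

Because a `List` parameter could be mutated by the callee (e.g. by appending a `Path`), mypy refuses to pass a `List[str]` into it; `Sequence` promises read-only access, so the narrower element type is fine.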
Hi all! I tried to reproduce this issue and it didn't work for me. Also, in your example I noticed that the variables have different names:

```python
# I found random data in parquet format:
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata1.parquet"
!wget "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"

# Then I try to reproduce:
from datasets import Dataset

list_of_files = ["userdata1.parquet", "userdata2.parquet"]
ds = Dataset.from_parquet(list_of_files)
```

My output:

```
WARNING:datasets.builder:Using custom data configuration default-e287d097dc54e046
Downloading and preparing dataset parquet/default to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data files: 100% 1/1 [00:00<00:00, 40.38it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 23.43it/s]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/default-e287d097dc54e046/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
```

P.S. This is my first experience with open source, so please don't judge too strictly if I've misunderstood something.
@dantema There is indeed a typo in the variable names. Nevertheless, I'm sorry if I was not clear, but the output in my report is from mypy (a static type checker), not from running the code.
However, I found out that the type annotation is actually misleading: it suggests that `Path` objects are accepted, but passing them fails at runtime.

Code

```python
from pathlib import Path
from datasets import Dataset

list_of_filenames = [Path("foo.parquet"), Path("bar.parquet")]
ds = Dataset.from_parquet(list_of_filenames)
```

Output

```
/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py in from_parquet(path_or_paths, split, features, cache_dir, keep_in_memory, columns, **kwargs)
   1071     from .io.parquet import ParquetDatasetReader
   1072
-> 1073     return ParquetDatasetReader(
   1074         path_or_paths,
   1075         split=split,

/usr/local/lib/python3.8/dist-packages/datasets/io/parquet.py in __init__(self, path_or_paths, split, features, cache_dir, keep_in_memory, streaming, **kwargs)
     35     path_or_paths = path_or_paths if isinstance(path_or_paths, dict) else {self.split: path_or_paths}
     36     hash = _PACKAGED_DATASETS_MODULES["parquet"][1]
---> 37     self.builder = Parquet(
     38         cache_dir=cache_dir,
     39         data_files=path_or_paths,

/usr/local/lib/python3.8/dist-packages/datasets/builder.py in __init__(self, cache_dir, config_name, hash, base_path, info, features, use_auth_token, repo_id, data_files, data_dir, name, **config_kwargs)
    298
    299     if data_files is not None and not isinstance(data_files, DataFilesDict):
--> 300         data_files = DataFilesDict.from_local_or_remote(
    301             sanitize_patterns(data_files), base_path=base_path, use_auth_token=use_auth_token
    302         )

/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    794     for key, patterns_for_key in patterns.items():
    795         out[key] = (
--> 796             DataFilesList.from_local_or_remote(
    797                 patterns_for_key,
    798                 base_path=base_path,

/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in from_local_or_remote(cls, patterns, base_path, allowed_extensions, use_auth_token)
    762     ) -> "DataFilesList":
    763         base_path = base_path if base_path is not None else str(Path().resolve())
--> 764         data_files = resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    765         origin_metadata = _get_origin_metadata_locally_or_by_urls(data_files, use_auth_token=use_auth_token)
    766         return cls(data_files, origin_metadata)

/usr/local/lib/python3.8/dist-packages/datasets/data_files.py in resolve_patterns_locally_or_by_urls(base_path, patterns, allowed_extensions)
    357     data_files = []
    358     for pattern in patterns:
--> 359         if is_remote_url(pattern):
    360             data_files.append(Url(pattern))
    361         else:

/usr/local/lib/python3.8/dist-packages/datasets/utils/file_utils.py in is_remote_url(url_or_filename)
     62
     63 def is_remote_url(url_or_filename: str) -> bool:
---> 64     parsed = urlparse(url_or_filename)
     65     return parsed.scheme in ("http", "https", "s3", "gs", "hdfs", "ftp")
     66

/usr/lib/python3.8/urllib/parse.py in urlparse(url, scheme, allow_fragments)
    373     Note that we don't break the components up in smaller bits
    374     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 375     url, scheme, _coerce_result = _coerce_args(url, scheme)
    376     splitresult = urlsplit(url, scheme, allow_fragments)
    377     scheme, netloc, url, query, fragment = splitresult

/usr/lib/python3.8/urllib/parse.py in _coerce_args(*args)
    125     if str_input:
    126         return args + (_noop,)
--> 127     return _decode_args(args) + (_encode_result,)
    128
    129 # Result objects are more helpful than simple tuples

/usr/lib/python3.8/urllib/parse.py in _decode_args(args, encoding, errors)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                  errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112
    113 def _coerce_args(*args):

/usr/lib/python3.8/urllib/parse.py in <genexpr>(.0)
    109 def _decode_args(args, encoding=_implicit_encoding,
    110                  errors=_implicit_errors):
--> 111     return tuple(x.decode(encoding, errors) if x else '' for x in args)
    112
    113 def _coerce_args(*args):

AttributeError: 'PosixPath' object has no attribute 'decode'
```

@mariosasko Should I create a new issue?
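For reference, a minimal sketch (my assumption, not the library's actual fix) of how `is_remote_url` could coerce `os.PathLike` inputs to `str` before calling `urlparse`, which would avoid the `AttributeError` above:

```python
import os
from pathlib import Path
from typing import Union
from urllib.parse import urlparse


def is_remote_url(url_or_filename: Union[str, "os.PathLike[str]"]) -> bool:
    # os.fspath() turns Path objects into plain strings (and leaves
    # strings untouched), so urlparse never sees a PosixPath.
    parsed = urlparse(os.fspath(url_or_filename))
    return parsed.scheme in ("http", "https", "s3", "gs", "hdfs", "ftp")


assert is_remote_url("https://example.com/data.parquet")
assert not is_remote_url(Path("foo.parquet"))
```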
@mariosasko I would like to take this issue up.
@avinashsai Hi, I've assigned you the issue. @tranhd95 Yes, feel free to report this in a new issue.
@avinashsai Are you still working on this? If not, I would like to give it a try.
@mariosasko I would like to take this issue up!
Hi @tranhd95, @mariosasko, I hope you are all doing well. I am interested in this issue; is it still open and unresolved? Thanks and regards.
@mariosasko I would like to take this issue up.
Feature request
Hi, please consider using the `Sequence` type annotation instead of `List` in function arguments such as in `Dataset.from_parquet()`. The `List` annotation leads to type checking errors, see below.

How to reproduce
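For instance, a call of roughly this shape is enough to surface the mismatch when the file is checked with mypy (the parquet file names are just placeholders):

```python
from datasets import Dataset

# A plain list of string paths. mypy rejects it against the current
# List-based annotation of Dataset.from_parquet, since List is
# invariant in its element type.
list_of_filenames = ["foo.parquet", "bar.parquet"]
ds = Dataset.from_parquet(list_of_filenames)
```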
Expected mypy output:
Actual mypy output:
Env: mypy 0.991, Python 3.10.0, datasets 2.7.1