
importing a dataset via an URL #1

Open · dkorenci opened this issue Aug 2, 2024 · 4 comments
Labels: help wanted (Extra attention is needed)

Comments

@dkorenci

dkorenci commented Aug 2, 2024

Hi, using v1.5.4, I can't get data import via URL to work using the Service.import_data_url method.
The error returned is a generic server error without any specific details, and the logs contain no information about it.
The error occurs for a range of dataset sizes: 100, 1,000, and 10,000 texts.
I'm confident the setup and the invocation of the method are correct, since I have an operational
single-project meganno instance that I can work with through both the notebook and the Python modules:
import from a dataframe works (though not for large datasets), and the imported data can be viewed and annotated.
Any suggestion would be most helpful. Thanks!

@rafaellichen added the help wanted label Aug 2, 2024
@horseno
Contributor

horseno commented Aug 2, 2024

Thanks for your interest and for reporting the issue with detailed context!
You should be able to import with an additional parameter, file_type (currently only 'csv' is supported).
Example: <service>.import_data_url(url=url, file_type="csv", column_mapping=<column_mapping>)

One extra precaution, just in case: the URL needs to point to a directly downloadable CSV file. If you host your table in a Google spreadsheet, use File -> Share -> Publish to web. While generating the link, make sure to select the CSV format and make the link publicly accessible.
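For reference, a full call might look like the sketch below. The URL and column_mapping values are placeholders rather than a real project setup, so adapt them to your own data and check the docs for the exact mapping keys.

# sketch only: the URL and column names below are placeholders
url = "https://docs.google.com/spreadsheets/d/<sheet_id>/pub?output=csv"
column_mapping = {"id": "id", "content": "text"}  # adjust keys/values to your schema and CSV columns
service = ...  # your connected Service instance (the <service> object above)
service.import_data_url(url=url, file_type="csv", column_mapping=column_mapping)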

Thanks for catching this bug, we already fixed the instructions in the Colab notebook, and tested with import of 13k rows. We'll revise the default value and improve the error messaging in the next release.

Let us know if you have further questions

@dkorenci
Author

dkorenci commented Aug 8, 2024

Thank you for the reply and the information.
I've tried again with the correct file_type value.

However, I wanted to automate corpus creation, so I was running the code via the Python interpreter, not in the notebook.
The schema creation was successful. The data import was attempted using a file:// URL pointing to a CSV file.
The API container was configured to access the file (via a shared folder), and a container-local URL was used.
Due to the lack of an error message, I cannot say where exactly this operation goes wrong.

The idea was to automatically download a Hugging Face dataset, export it
as CSV, and load it into meganno, without needing to do everything by hand every time.
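For context, this is roughly what I'm trying to script (the dataset name, split, and output path below are just placeholders):

from datasets import load_dataset

# download a Hugging Face dataset and export one split as a CSV file
# that meganno could then import (names and paths are placeholders)
ds = load_dataset("imdb", split="train")
ds.to_pandas().to_csv("/shared/imdb_train.csv", index=False)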

I will try to do the import via the notebook.
But I guess that, in any of the above cases, I need to host the generated CSV file on a local or a remote web server?

From the end user's perspective, the simplest option would be to make the dataframe import method work
for larger datasets, if that is technically possible.

Thanks again.

@horseno
Contributor

horseno commented Aug 9, 2024

Thanks for the information.
First of all, the URL download option is meant to simplify the procedure rather than make it more complex: it avoids the need to download the data yourself when a downloadable CSV is already available. In the case of a Hugging Face dataset, if it is available in CSV format, try Files and versions -> Copy download link.

If you have already downloaded the data and loaded it into a dataframe, or you need to pre-process the data, it's probably better to load from your local dataframe. We set the size limit to avoid passing a huge object in a single REST call, but large dataframes can be processed through batching. We'll consider adding that in the next release. In the meantime, a
simple client-side workaround is to create your own batches:

# use e.g. batch_size=1000
def batch_process_dataframe(df, batch_size):
    # number of batches, rounding up to cover a partial final batch
    num_batches = len(df) // batch_size + (1 if len(df) % batch_size != 0 else 0)

    for i in range(num_batches):
        start = i * batch_size
        end = start + batch_size
        chunk = df.iloc[start:end]
        # import this chunk; `demo` is the connected service/project handle and
        # `column_mapping` is the same mapping used for a regular dataframe import
        demo.import_data_df(chunk, column_mapping)
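
For example, with batch_size=1000 you would call batch_process_dataframe(df, batch_size=1000) on the same dataframe and column_mapping that already work for your smaller imports.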

@dkorenci
Author

OK, thank you, I'll try it.
