
importing a dataset via an URL #1

Open · dkorenci opened this issue Aug 2, 2024 · 4 comments
Labels: help wanted (Extra attention is needed)

Comments

@dkorenci

dkorenci commented Aug 2, 2024

Hi, using v1.5.4, I can't get data import via URL to work using the Service.import_data_url method.
The error returned is a generic server error without any specific details, and the logs contain no information about it.
The error occurs for a range of dataset sizes: 100, 1,000, and 10,000 texts.
I'm confident the setup and the invocation of the method are correct, since I have an operational
single-project meganno instance that I can work with through both the notebook and the Python modules:
import from a dataframe works (though not for large datasets), and the imported data can be viewed and annotated.
Any suggestion would be most helpful. Thanks!

@rafaellichen added the help wanted label Aug 2, 2024
@horseno
Contributor

horseno commented Aug 2, 2024

Thanks for your interest and for reporting the issue with detailed context!
You should be able to import with an additional parameter, file_type (currently only 'csv' is supported).
Example: <service>.import_data_url(url=url, file_type="csv", column_mapping=<column_mapping>)

One extra precaution, just in case: the URL needs to point to a directly downloadable CSV file. If you host your table in a Google spreadsheet, use File -> Share -> Publish to web. While generating the link, make sure to select the CSV format and make the link publicly accessible.
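For reference, a full call might look like the sketch below. The URL and column_mapping values are placeholders rather than a real project setup, so adapt them to your own data and check the docs for the exact mapping keys.

# sketch only: the URL and column names below are placeholders
url = "https://docs.google.com/spreadsheets/d/<sheet_id>/pub?output=csv"
column_mapping = {"id": "id", "content": "text"}  # adjust keys/values to your schema and CSV columns
service = ...  # your connected Service instance (the <service> object above)
service.import_data_url(url=url, file_type="csv", column_mapping=column_mapping)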

Thanks for catching this bug, we already fixed the instructions in the Colab notebook, and tested with import of 13k rows. We'll revise the default value and improve the error messaging in the next release.

Let us know if you have further questions

@dkorenci
Author

dkorenci commented Aug 8, 2024

Thank you for the reply and the information.
I've tried again with the correct file_type value.

However, I wanted to automate corpus creation, so I was running the code via the Python interpreter, not in the notebook.
The schema creation was successful. The data import was attempted using a file:// URL pointing to a CSV file.
The API container was configured to access the file (via a shared folder), and a container-local URL was used.
Due to the lack of an error message, I cannot say where exactly this operation goes wrong.

The idea was to automatically download a Hugging Face dataset, export it
as CSV, and load it into meganno, without needing to do everything by hand every time.
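For context, this is roughly what I'm trying to script (the dataset name, split, and output path below are just placeholders):

from datasets import load_dataset

# download a Hugging Face dataset and export one split as a CSV file
# that meganno could then import (names and paths are placeholders)
ds = load_dataset("imdb", split="train")
ds.to_pandas().to_csv("/shared/imdb_train.csv", index=False)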

I will try to do the import via the notebook.
But I guess that, in any of the above cases, I need to host the generated CSV file on a local or a remote web server?

From the end user's perspective, the simplest option would be to make the dataframe import method work
for larger datasets, if that is technically possible.

Thanks again.

@horseno
Contributor

horseno commented Aug 9, 2024

Thanks for the information.
First of all, the URL download option is meant to simplify the procedure rather than make it more complex: it avoids the need to download the data yourself when a downloadable CSV is already available. In the case of a Hugging Face dataset, if it is available in CSV format, try Files and versions -> Copy download link.

If you have already downloaded the data and loaded it into a dataframe, or you need to pre-process the data, it's probably better to load from your local dataframe. We set the size limit to avoid passing a huge object in a single REST call, but large dataframes can be processed through batching. We'll consider adding that in the next release. In the meantime, a
simple client-side workaround is to create your own batches:

# use e.g. batch_size=1000
def batch_process_dataframe(df, batch_size):
    # number of batches, rounding up to cover a partial final batch
    num_batches = len(df) // batch_size + (1 if len(df) % batch_size != 0 else 0)

    for i in range(num_batches):
        start = i * batch_size
        end = start + batch_size
        chunk = df.iloc[start:end]
        # import this chunk; `demo` is the connected service/project handle and
        # `column_mapping` is the same mapping used for a regular dataframe import
        demo.import_data_df(chunk, column_mapping)
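
For example, with batch_size=1000 you would call batch_process_dataframe(df, batch_size=1000) on the same dataframe and column_mapping that already work for your smaller imports.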

@dkorenci
Author

OK, thank you, I'll try it.
