Can you provide the data with unprocessed code for the 28563 examples? #5

Open
Jun-jie-Huang opened this issue Mar 31, 2022 · 1 comment

Comments

@Jun-jie-Huang

Hi, thanks for your wonderful dataset and repo.

I'm developing a model for code summarization and want to use notebookCDG as one of my tasks, but I couldn't find the original dataset.

I followed the instructions in the README. I downloaded the data from the link given there ("the fully processed data (as a pkl file) can be downloaded [HERE](https://ibm.biz/Bdfpk6)") and also checked the data in the Hugging Face dataset. I could only find the data with the processed code strings in the code.seq file, where the punctuation marks have been removed.

So could you provide the data with the original code, without any tokenization or punctuation removal, for search? I'd appreciate it if you could release the dataset!

Thanks,
Junjie

@xuyeliu
Owner

xuyeliu commented Nov 18, 2022

The README has already been updated with the raw notebooks and raw pairs.
