
Citing this resource #414

Open
yuvalkirstain opened this issue Apr 26, 2022 · 4 comments

@yuvalkirstain

Hello,
We are using this resource to filter pretraining data for our current project, and we would love to know if and how it should be cited.
Thanks :)

@ggdupont
Member

Hi Yuval,
There is no paper describing this repository (yet... it's in progress).
In the meantime, you can refer to the official website presenting the project (https://bigscience.huggingface.co/) or to the workshop that will be held at ACL very soon: https://bigscience.huggingface.co/acl-2022
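If it helps, a BibTeX entry for the website could look roughly like the one below. This is just an informal placeholder (the entry key, year, and access date are my assumptions), not an official citation:

```bibtex
@misc{bigscience_workshop,
  title        = {BigScience},
  author       = {{BigScience Workshop}},
  howpublished = {\url{https://bigscience.huggingface.co/}},
  year         = {2022},
  note         = {Official project website, accessed April 2022}
}
```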

@ggdupont
Member

BTW, we are working on tidying up the tools provided in the project and would be interested to know more about how you used them. Any chance to get extra details on your project?

@yuvalkirstain
Author

yuvalkirstain commented Apr 27, 2022

Sure! We are pretraining transformer encoder-decoder models on large corpora (the Pile, Wikipedia, and RealNews), and we used the modification and filtering tools to clean up the data (English only, not multilingual).
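For a concrete picture, the English-only pass is conceptually similar to the rough sketch below (not the actual code from this repo; the ASCII-ratio check is a toy stand-in for the real language-identification filtering):

```python
# Rough illustration of an English-only filtering pass over a pretraining
# corpus, using the Hugging Face `datasets` library. The ASCII-ratio check
# is a toy stand-in for a proper language-identification model.
from datasets import load_dataset

def looks_like_english(example):
    # Toy heuristic: keep documents whose characters are mostly ASCII.
    text = example["text"]
    if not text:
        return False
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    return ascii_ratio > 0.9

# Example on an English Wikipedia dump; the same filter applies to any
# dataset with a "text" column.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki_english = wiki.filter(looks_like_english)
```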

@albertvillanova
Member

@yuvalkirstain, we are planning to get a Zenodo DOI for this GitHub repository.
