Croissant vocabulary for crawled datasets #762

Open
3 tasks
wumpus opened this issue Nov 3, 2024 · 1 comment

wumpus commented Nov 3, 2024

Related to #738, I would like to create any new controlled language needed to describe a crawled dataset.

I propose:

  • I will write up a single Common Crawl crawl in Croissant
  • I'll use existing language, and keep a list of things that appear to need new language
  • At that point, an actual Croissant expert should go back and forth with me.

I have other interested users -- the ARDC (Alliance for Responsible Data Collection) would like to mandate a machine-readable metadata format for its users. This will serve a role similar to Croissant-RAI.

@benjelloun
Contributor

Can some or all of these crawls be thought of as different versions of the same dataset? If so, Croissant has support for representing versions, so you could model them that way. However, there is no mechanism currently available to enumerate all existing versions of a dataset.
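
One way to read the suggestion above, as a minimal sketch rather than a vetted Croissant description: each crawl becomes a record with the same `name` and a different `version`. The crawl IDs, the dataset name `example-web-crawl`, and the mapping from crawl ID to a version string are assumptions made up for illustration, and the `@context` is trimmed to a few prefixes rather than the full context the Croissant spec expects. Since nothing in Croissant enumerates versions, the list of crawls is kept outside the metadata.

```python
import json

# Trimmed JSON-LD context; a real Croissant file carries a longer one.
CONTEXT = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/",
}


def crawl_metadata(crawl_id: str, version: str) -> dict:
    """Build one dataset record per crawl; only description and version differ."""
    return {
        "@context": CONTEXT,
        "@type": "sc:Dataset",
        "dct:conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": "example-web-crawl",  # hypothetical dataset name
        "description": f"Web crawl {crawl_id}",
        "version": version,
    }


# Hypothetical mapping from crawl ID to a version string.
# This list lives outside Croissant, which cannot enumerate versions itself.
crawls = {
    "CC-MAIN-2024-42": "2024.42.0",
    "CC-MAIN-2024-46": "2024.46.0",
}

records = [crawl_metadata(crawl_id, version) for crawl_id, version in crawls.items()]
print(json.dumps(records[0], indent=2))
```

Whether a crawl ID should map onto a version string like this, or needs new vocabulary of its own, is the kind of question the issue is asking.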


2 participants