Feature: training dataset maintenance #49

Open
2 tasks done
josh-chamberlain opened this issue Mar 12, 2024 · 5 comments

Comments

@josh-chamberlain
Contributor

josh-chamberlain commented Mar 12, 2024

Context

Now that we've done it a few times, let's be systematic about how we update the base training dataset in Hugging Face.

Requirements

Docs

  • update this repo's readme with reference to the script and action
@maxachis
Collaborator

  • check for new URLs in our database which aren't already in the training-urls dataset via the API

Where will these new raw URLs be hosted? At the moment, my common_crawler PR #45 simply stores new URLs in the repository, which obviously isn't a sustainable long-term option.

@josh-chamberlain
Contributor Author

@maxachis this is a good point. In general, I think we should use Hugging Face datasets.

I added detail to this issue: #40

josh-chamberlain changed the title from "Feature: hugging face dataset maintenance" to "Feature: training dataset maintenance" on Mar 13, 2024
@maxachis
Collaborator

@josh-chamberlain To make sure I fully understand the workflow:

  1. Pull URLs from database
  2. Pull URLs from training-urls dataset
  3. Get all URLs from 1 which are not in 2
  4. Run HTML tag collector on results from 3
  5. Insert these results into LabelStudio (need confirmation especially on this step)
  6. Take results from LabelStudio
  7. Merge with URLs from training-urls dataset pulled in 2. Update the last_updated property of all new entries (or of the entire dataset?)
  8. Put the results of 7 into training-urls dataset
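For illustration, steps 1 to 3 above amount to a set difference between the two URL lists. The following is only a rough sketch: the function names `normalize` and `new_urls` are hypothetical, and the normalization rule (trimming whitespace and a trailing slash) is an assumption, not something decided in this issue.

```python
def normalize(url):
    """Hypothetical normalization: trim whitespace and one trailing slash.

    This rule is an assumption for the sketch, not a project decision.
    """
    return url.strip().rstrip("/")


def new_urls(database_urls, training_urls):
    """Return database URLs that are not yet in the training set.

    Order follows database_urls; duplicates within database_urls are dropped.
    """
    seen = {normalize(u) for u in training_urls}
    result = []
    for url in database_urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result
```

Something like `new_urls(db_urls, training_urls)` would then give the candidates for step 4, wherever the two lists end up being pulled from.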

Additionally, when we are talking about training-urls dataset, does this dataset currently exist? In PDAP's Hugging Face, I do not currently see a dataset named training-urls:

image

@josh-chamberlain
Contributor Author

josh-chamberlain commented Mar 26, 2024

@maxachis

5. We don't need to insert into LabelStudio. To be clear, we are checking LabelStudio for newly labeled URLs which aren't already in our training data.

training-urls doesn't currently exist; this issue covers creating the dataset plus a strategy for managing batches of URLs within it.
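As a rough sketch of the merge in steps 7 and 8 of the workflow, newly labeled rows could be appended to the existing training rows like this. The row shape (dicts with `url` and `label` keys) and the function name `merge_labeled` are hypothetical, and stamping `last_updated` only on new entries is an assumption, since whether to update the whole dataset is still an open question in this thread.

```python
from datetime import date


def merge_labeled(training_rows, labeled_rows, today=None):
    """Append labeled rows whose URL is not already in the training rows.

    Assumption: only newly added rows get a fresh last_updated value;
    existing rows are left untouched.
    """
    today = today or date.today().isoformat()
    known = {row["url"] for row in training_rows}
    merged = list(training_rows)
    for row in labeled_rows:
        if row["url"] in known:
            continue  # already in the training data, skip it
        known.add(row["url"])
        merged.append({**row, "last_updated": today})
    return merged
```

The merged list would then be what gets written back to the training dataset, using whatever upload mechanism the action ends up with.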

@josh-chamberlain
Contributor Author

josh-chamberlain commented May 21, 2024

I updated the readme for this repo and tweaked this issue slightly. I think using Hugging Face as a database for unlabeled URLs is not needed. We can track batches by ID in GitHub, but we don't need to put them in Hugging Face before they're labeled. Hopefully this is much simpler.

Labels
None yet
Projects
Status: Reference
2 participants