Feature: training dataset maintenance #49

josh-chamberlain · 2024-03-12T18:28:41Z

Context

Now that we've done it a few times, let's be systematic about how we update the base training dataset in Hugging Face.

Requirements

test annotation workflow (1 batch) #68
establish training-urls as the canonical dataset in Hugging Face (HF), which we will maintain through Pull Requests. We will likely remove the other datasets.
- let's only add URLs when we get them labeled in Label Studio
- let's keep the batch ID in Hugging Face, so we can easily remove a batch from the training data without complex lookups
manual workflow: generate batches #88
automated workflow: update training data from label studio #89
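
To illustrate why keeping the batch ID on each record helps, here is a minimal sketch in Python. It assumes each record in training-urls is a mapping with `url`, `label`, and `batch_id` fields; those field names, the sample values, and the helper `remove_batch` are hypothetical and not taken from the actual dataset.

```python
from typing import Dict, List

Record = Dict[str, str]


def remove_batch(records: List[Record], batch_id: str) -> List[Record]:
    """Return a new list without any row that belongs to the given batch."""
    return [row for row in records if row["batch_id"] != batch_id]


# Hypothetical sample rows; the real schema may differ.
training_urls = [
    {"url": "https://example.com/a", "label": "Relevant", "batch_id": "batch-001"},
    {"url": "https://example.com/b", "label": "Irrelevant", "batch_id": "batch-002"},
    {"url": "https://example.com/c", "label": "Relevant", "batch_id": "batch-002"},
]

# Dropping a whole batch is a single filter on one column, with no URL lookups.
remaining = remove_batch(training_urls, "batch-002")
print([row["url"] for row in remaining])
```

If the data is loaded with the Hugging Face `datasets` library, the same idea could probably be expressed with `Dataset.filter`, which keeps the rows for which a predicate returns true.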

Docs

update this repo's README with a reference to the script and action


maxachis · 2024-03-13T11:38:52Z

check for new URLs in our database which aren't already in the training-urls dataset via the API

Where will these new raw URLs be hosted? At the moment, my common_crawler PR #45 simply stores new URLs in the repository, which isn't a sustainable long-term option.

josh-chamberlain · 2024-03-13T17:05:08Z

@maxachis this is a good point. In general, I think we should use Hugging Face datasets.

I added detail to this issue: #40

maxachis · 2024-03-21T15:59:09Z

@josh-chamberlain To make sure I fully understand the workflow:

1. Pull URLs from database
2. Pull URLs from training-urls dataset
3. Get all URLs from 1 which are not in 2
4. Run HTML tag collector on results from 3
5. Insert these results into LabelStudio (need confirmation especially on this step)
6. Take results from LabelStudio
7. Merge with URLs from training-urls dataset pulled in 2. Update the last_updated property of all new entries (or of the entire dataset?)
8. Put the results of 7 into training-urls dataset
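
The comparison and merge steps above (3 and 7) can be sketched in Python as follows. This is only an illustration under assumptions: URLs are compared as exact strings, `last_updated` is stamped on new entries only (whether to stamp the whole dataset is the open question in step 7), and the function names, field names, and sample data are hypothetical. Reading from the database, the HTML tag collector, and Label Studio are replaced with in-memory lists.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional


def find_new_urls(database_urls: Iterable[str], training_urls: Iterable[str]) -> List[str]:
    """Step 3: URLs in the database that are not yet in training-urls."""
    known = set(training_urls)
    seen = set()
    new_urls = []
    for url in database_urls:
        # Skip URLs already in the training data and duplicates in the input.
        if url not in known and url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls


def merge_labeled(
    existing: List[Dict[str, str]],
    labeled: List[Dict[str, str]],
    now: Optional[datetime] = None,
) -> List[Dict[str, str]]:
    """Step 7: append newly labeled rows, stamping last_updated on new rows only."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    known = {row["url"] for row in existing}
    merged = [dict(row) for row in existing]  # copy so the input is not mutated
    for row in labeled:
        if row["url"] in known:
            continue  # already in the training data; leave it untouched
        merged.append({**row, "last_updated": stamp})
        known.add(row["url"])
    return merged


# Hypothetical sample data standing in for the database and the dataset.
database_urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
existing = [
    {"url": "https://example.com/a", "label": "Relevant",
     "last_updated": "2024-03-01T00:00:00+00:00"},
]

candidates = find_new_urls(database_urls, [row["url"] for row in existing])
# Stand-in for labeling: in practice the labels would come from Label Studio.
labeled = [{"url": url, "label": "Relevant"} for url in candidates]
merged = merge_labeled(existing, labeled, now=datetime(2024, 3, 21, tzinfo=timezone.utc))
print(candidates)
print(len(merged))
```

Passing `now` explicitly keeps the timestamp reproducible in tests; omitting it falls back to the current UTC time.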

Additionally, when we are talking about training-urls dataset, does this dataset currently exist? In PDAP's Hugging Face, I do not currently see a dataset named training-urls.

josh-chamberlain · 2024-03-26T20:01:46Z

@maxachis

~~5. we don't need to insert into LabelStudio—to be clear, we are checking LabelStudio for newly labeled URLs which aren't already in our training data.~~

training-urls doesn't currently exist; creating the dataset and a strategy for managing batches of URLs within it is part of this issue.

josh-chamberlain · 2024-05-21T13:45:14Z

I updated the README for this repo and tweaked this issue slightly. I think using Hugging Face as a database for unlabeled URLs is not needed. We can track batches by ID in GitHub, but we don't need to put them in Hugging Face before they're labeled. Hopefully this is much simpler.

josh-chamberlain mentioned this issue Mar 13, 2024

use NLP model to generate name and description for data sources #43

Open

josh-chamberlain changed the title ~~Feature: hugging face dataset maintenance~~ Feature: training dataset maintenance Mar 13, 2024

This was referenced May 22, 2024

Add keyword extraction logic and rename source text collector directory #58

Draft

Set up active learning with label studio #51

Closed
