Deduplicating OnDisk Corpus #2434

Open
domenukk opened this issue Jul 23, 2024 · 0 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@domenukk
Member

Instead of having multiple files for the same content, the corpus should behave as follows:

  • On add: create a file where filename == hash(contents).
  • Alongside it, create a second, hidden file with the same name prefixed with a dot. This file should contain a simple counter and should only be accessed with exclusive access (flock/file lock).
  • When we add a testcase whose file already exists, increment the counter.
  • When we remove a testcase from the corpus, decrement the counter.
  • If the counter reaches 0, remove the testcase file before dropping the lock on the shadow/counter file.
  • When a testcase's contents change: don't rewrite the file in place; instead, remove it and create a new file using the algorithm above.
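
The steps above could look roughly like the following. This is only a minimal sketch to illustrate the locking order, not LibAFL code: it is written in Python for brevity, the function names (`add`, `remove`, `replace`, `_update`) are hypothetical, SHA-256 is an assumed choice of hash, and `fcntl.flock` limits it to Unix-like systems.

```python
import fcntl
import hashlib
import os
from pathlib import Path


def _paths(corpus_dir: Path, data: bytes) -> tuple[Path, Path]:
    """Return (testcase file, hidden counter file) for the given contents."""
    name = hashlib.sha256(data).hexdigest()  # assumed hash; the issue only says hash(contents)
    return corpus_dir / name, corpus_dir / f".{name}"


def _update(corpus_dir: Path, data: bytes, delta: int) -> int:
    """Change the reference count by `delta` while holding an exclusive lock."""
    testcase, counter = _paths(corpus_dir, data)
    fd = os.open(counter, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until we own the counter file
        raw = f.read().strip()
        count = max((int(raw) if raw else 0) + delta, 0)
        if count > 0 and not testcase.exists():
            testcase.write_bytes(data)  # first reference: create the testcase file
        elif count == 0:
            # Last reference gone: remove the testcase file while still locked.
            testcase.unlink(missing_ok=True)
        f.seek(0)
        f.truncate()
        f.write(str(count))
        f.flush()
        return count  # closing the file drops the lock


def add(corpus_dir: Path, data: bytes) -> int:
    return _update(corpus_dir, data, +1)


def remove(corpus_dir: Path, data: bytes) -> int:
    return _update(corpus_dir, data, -1)


def replace(corpus_dir: Path, old: bytes, new: bytes) -> int:
    """Contents changed: drop the old entry, then add the new one."""
    remove(corpus_dir, old)
    return add(corpus_dir, new)
```

One design choice in this sketch: the counter file is left on disk with a value of 0 rather than being deleted. Deleting it would let another process that is already waiting on the lock end up holding a lock on an unlinked file, so cleaning up counter files would need extra care.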

For reading, this shouldn't have much overhead, and it'll get rid of duplicate files with the same content! Great success :)

@domenukk domenukk added enhancement New feature or request help wanted Extra attention is needed labels Jul 23, 2024