Deduplicating OnDisk Corpus #2434

domenukk · 2024-07-23T08:16:58Z

Instead of having multiple files for the same content, the corpus should behave as follows:

On add: create a file whose filename == hash(contents).
Also create a second, hidden file with the same name prefixed with a dot. This file holds a simple reference counter and must only be accessed with exclusive access (flock/file lock).
When we add a testcase whose file already exists, increment the counter.
When we remove a testcase from the corpus, decrement the counter.
If the counter reaches 0, remove the testcase file before dropping the lock on the shadow/counter file.
When a testcase's contents change, don't modify the file in place: remove the old entry and create a new one using the algorithm above.
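
The steps above can be sketched roughly as follows. This is only an illustration in Python, not LibAFL code: the function names (`add`, `remove`, `replace`), the choice of SHA-256 as the hash, and the plain-text counter format are all assumptions made for the example. It relies on `fcntl.flock`, so it is Unix-only. One deliberate simplification: the counter file is left in place with a value of 0 rather than deleted, which avoids a race where another process has already opened the counter file and is waiting for the lock.

```python
import fcntl
import hashlib
import os
from pathlib import Path


def _paths(corpus_dir, contents):
    # Assumption: SHA-256 hex digest as the filename; any stable hash would do.
    name = hashlib.sha256(contents).hexdigest()
    return Path(corpus_dir) / name, Path(corpus_dir) / f".{name}"


def _read_count(fd):
    raw = os.pread(fd, 32, 0).strip()
    return int(raw) if raw else 0


def _write_count(fd, count):
    os.ftruncate(fd, 0)
    os.pwrite(fd, str(count).encode(), 0)


def add(corpus_dir, contents):
    """Add a testcase; returns the new reference count."""
    data_path, counter_path = _paths(corpus_dir, contents)
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # exclusive access to the counter
        count = _read_count(fd)
        if count == 0:
            # First reference: the testcase file does not exist yet.
            data_path.write_bytes(contents)
        _write_count(fd, count + 1)
        return count + 1
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def remove(corpus_dir, contents):
    """Drop one reference; returns the remaining reference count."""
    data_path, counter_path = _paths(corpus_dir, contents)
    fd = os.open(counter_path, os.O_RDWR)  # raises if never added
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        count = _read_count(fd)
        if count == 0:
            raise KeyError("testcase is not in the corpus")
        count -= 1
        if count == 0:
            # Remove the testcase file before the lock is dropped.
            data_path.unlink()
        _write_count(fd, count)
        return count
    finally:
        os.close(fd)


def replace(corpus_dir, old_contents, new_contents):
    """Contents changed: remove the old entry, then add the new one."""
    remove(corpus_dir, old_contents)
    return add(corpus_dir, new_contents)
```

With this sketch, adding the same bytes twice leaves a single testcase file on disk with a counter of 2, and the file only disappears after the second `remove`.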

For reading, this shouldn't have much overhead, and it'll get rid of duplicate files with the same content! Great success :)

domenukk added the "enhancement" and "help wanted" labels on Jul 23, 2024
