Feature request: Exclude auto-generated files #47

Open
iuliaturc opened this issue Oct 2, 2024 · 2 comments
Labels
🔥 difficulty: 2 (Quite easy) · good first issue (Good for newcomers) · hacktoberfest (Open to Hacktoberfest contributions)

Comments

@iuliaturc
Contributor

Some repos have hundreds of auto-generated files, with limited utility in actually understanding the repo.

We'll have to implement some heuristics to detect such files (based on filename and content), and offer an option to exclude them from indexing. When they dominate the repo, they can really damage retrieval quality.

@mihail911
Contributor

Which file types are most problematic, @iuliaturc?

This can be implemented by just having a set of default exclusions that we augment with the exclusion-files parameter, correct?
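A rough sketch of that idea, assuming exclusions are kept as plain `directive:value` strings. The default entries and the `merge_exclusions` helper are hypothetical and only illustrate the merge; they are not taken from the repo:

```python
# Hypothetical built-in defaults; the real list would need to be agreed on.
DEFAULT_EXCLUSIONS = frozenset({
    "dir:node_modules",
    "file:package-lock.json",
    "ext:.lock",
})


def merge_exclusions(user_lines=()):
    """Union the built-in defaults with the user-supplied exclusion lines."""
    merged = set(DEFAULT_EXCLUSIONS)
    # Blank lines carry no rule, so skip them.
    merged.update(line.strip() for line in user_lines if line.strip())
    return merged
```

With this shape, a user-supplied exclusion file can only add rules; turning a default off would need a separate mechanism.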

@iuliaturc added the good first issue and hacktoberfest labels on Oct 4, 2024
@iuliaturc
Contributor Author

To provide more context for new contributors:

When indexing the codebase, we allow the user to specify "inclusion" and "exclusion" files. A sample exclusion file is sample-exclude.txt. Each line starts with one of these directives: ext for extensions, dir for directories and file for files. For instance, ext:.png instructs the indexing script to not include .png files in the vector database. The method that filters files based on inclusion/exclusion arguments is _should_include.
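To make the current exact-match behavior concrete, here is a minimal sketch of how such an exclusion file could be parsed and applied. It is not the actual `_should_include` implementation: `parse_exclusions` and `is_excluded` are hypothetical names, and it assumes that `file` matches the base name and `dir` matches any directory component of the path.

```python
import os


def parse_exclusions(lines):
    """Parse 'ext:', 'dir:' and 'file:' lines into a dict of sets."""
    rules = {"ext": set(), "dir": set(), "file": set()}
    for raw in lines:
        directive, sep, value = raw.strip().partition(":")
        value = value.strip()
        # Ignore blank lines, lines without a colon, and unknown directives.
        if sep and value and directive in rules:
            rules[directive].add(value)
    return rules


def is_excluded(path, rules):
    """Return True if `path` hits any exact-match exclusion rule."""
    parts = path.replace("\\", "/").split("/")
    filename = parts[-1]
    _, ext = os.path.splitext(filename)
    if ext in rules["ext"] or filename in rules["file"]:
        return True
    # Assumption: a 'dir' rule applies to any directory component of the path.
    return any(part in rules["dir"] for part in parts[:-1])
```

For example, parsing the single line `ext:.png` and calling `is_excluded("assets/logo.png", rules)` would report the file as excluded.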

To exclude auto-generated files, we could do either or both of the following:

  1. Allow patterns for these directives (not just exact string matching). For instance, file:*_auto*.* would exclude a file like scraper_auto_generated.py.
  2. Allow exclusions based on file content. We could add a content directive. For instance, content:THIS FILE WAS AUTOGENERATED would exclude any file that contains this string.
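Both options could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses shell-style wildcards from Python's standard `fnmatch` module for option 1 and a plain substring check for option 2. The function names and the `max_chars` read limit are hypothetical, not part of the existing code.

```python
import fnmatch
import os


def matches_file_pattern(path, pattern):
    """Option 1: match the file's base name against a shell-style pattern."""
    return fnmatch.fnmatchcase(os.path.basename(path), pattern)


def contains_marker(path, marker, max_chars=1_000_000):
    """Option 2: check whether the start of the file contains `marker`."""
    try:
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            # Assumption: cap how much is read so huge files stay cheap to check.
            return marker in f.read(max_chars)
    except OSError:
        # Unreadable files are treated as not matching.
        return False


def should_exclude(path, file_patterns=(), content_markers=()):
    """Exclude a file if any name pattern or any content marker matches."""
    if any(matches_file_pattern(path, p) for p in file_patterns):
        return True
    return any(contains_marker(path, m) for m in content_markers)
```

With `file_patterns=["*_auto*.*"]`, a file named `scraper_auto_generated.py` is excluded by name alone. The content check costs one read per file, so running it only after the cheaper name-based rules keeps indexing fast.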
