Repo Integrity Mismatch #236

Open
henrikplate opened this issue May 11, 2023 · 2 comments
Labels
question Further information is requested

Comments

@henrikplate

This check closely resembles what we attempted a few years ago: comparing the (Python) files in a PyPI package with the corresponding files in the source code repo. More specifically, we tried to identify the individual lines that differed and checked whether they contained suspicious Python calls.

However, my takeaway from our experiments was that packages and repos differ in many places, which makes such checks very noisy.

From the paper LastPyMile: identifying the discrepancy between sources and packages: "Figure 5 shows that 65% of artifacts and 22% of files present in PyPI have changes with respect to the source code repository."

Would it be possible to share your feedback on the check's precision?

Cheers, Henrik

PS: You can find the PDF also on Google Scholar.

@christophetd
Contributor

Hello, thanks for the great question!

We did find the check noisy at first, which is why we only take into account more opinionated use-cases:

@vdeturckheim was the original implementer, in case he wants to give more context. Overall, we acknowledge that this check is a heuristic and by no means perfect, but your feedback and thoughts are welcome!

@christophetd added the "question" label on May 11, 2023
@henrikplate
Author

It would make sense to combine the checks you already have to further reduce noise. For example, you could run your semgrep rules (maybe even more relaxed ones) only on those files that differ between package and repo. Using the line number info from the semgrep results, you could then keep only those findings that concern code existing only in the package.
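
A minimal sketch of that filtering step, under stated assumptions: findings are dicts carrying `path`, `start.line` and `end.line` (the shape of entries in semgrep's JSON `results` list, as far as I know), and file contents are already loaded into two dicts keyed by relative path. The helper names `package_only_lines` and `filter_findings` are hypothetical and not part of this project.

```python
import difflib


def package_only_lines(repo_text: str, package_text: str) -> set[int]:
    """Return 1-based numbers of package lines with no counterpart in the repo."""
    repo_lines = repo_text.splitlines()
    package_lines = package_text.splitlines()
    matcher = difflib.SequenceMatcher(a=repo_lines, b=package_lines, autojunk=False)
    only_in_package: set[int] = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark package lines that are new or changed.
        if tag in ("insert", "replace"):
            only_in_package.update(range(j1 + 1, j2 + 1))
    return only_in_package


def filter_findings(findings, repo_files, package_files):
    """Keep findings whose line span touches code that exists only in the package.

    findings: list of dicts with "path", "start": {"line"}, "end": {"line"}
              (assumed shape, see the note above).
    repo_files / package_files: dicts mapping relative path -> file content.
    """
    kept = []
    for finding in findings:
        path = finding["path"]
        package_text = package_files.get(path)
        if package_text is None:
            continue  # finding refers to a file we did not load
        # A file missing from the repo is treated as entirely package-only.
        repo_text = repo_files.get(path, "")
        added = package_only_lines(repo_text, package_text)
        span = range(finding["start"]["line"], finding["end"]["line"] + 1)
        if any(line in added for line in span):
            kept.append(finding)
    return kept
```

A line-based diff is only one way to decide what "exists only in the package"; reformatting or line-ending changes during packaging would still show up as differences, so some normalization before diffing may be needed.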


2 participants