Large files in git #681

Closed
MartinThoma opened this issue Mar 25, 2022 · 8 comments

@MartinThoma
Collaborator

I've noticed some astonishingly large files in this repository (in the git history):

81ebba90f411  1,4MiB buildJvm/build_53_King_kotlin/build/install/build_53_King_kotlin/lib/kotlin-stdlib-1.6.0.jar -- added via fe4219dc5a847f919d35b965789026ffdc3b40b3
eeb85bc8ea11  4,4MiB 53_King/kotlin/king.jar -- added via fe4219dc5a847f919d35b965789026ffdc3b40b3
7c624e47ff0f   25MiB 89_Tic-Tac-Toe/python/TicTacToe_exe/TicTacToe.exe -- added via 3efe6e3ae260800beae7c28ed095960939344581
65c61eb509d6   34MiB 39 Golf/csharp/compiled/linux_x86/golf -- added via 0dbb491ff9b2b22744cb7b5ddf6e6241938b70f0
5222483d2f8c   66MiB 39 Golf/csharp/compiled/windows_x86/golf.exe -- added via 0dbb491ff9b2b22744cb7b5ddf6e6241938b70f0

All of them have since been deleted from the working tree, but they are still in the git history, which makes this repository far bigger than necessary. I only noticed because GitHub showed me a warning that confused me, especially the part about the executable in 89_Tic-Tac-Toe/python. I initially thought I had done something wrong.
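
For anyone who wants to reproduce a listing like the one above, here is a minimal sketch. It is only an assumption that the original listing was produced in a similar way. The function name list_largest_blobs is made up for illustration, it prints raw byte counts instead of MiB, and the 12-character cut assumes 40-character SHA-1 object names.

```shell
# Hypothetical helper: print the N largest blobs reachable from any ref
# as "<short-sha> <size-in-bytes> <path>", largest last.
list_largest_blobs() {
    count="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |   # keep blobs only, drop commits and trees
        sort -k2,2n |            # sort numerically by size
        tail -n "$count" |
        cut -c 1-12,41-          # shorten the object name to 12 characters
}
```

Run it from inside the repository, e.g. `list_largest_blobs 5`. Because it walks all objects reachable from any ref, files that were deleted later still show up.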

Re-write history

We could remove those from the git history: https://stackoverflow.com/a/2158271/562769
However, I have to admit that rewriting git history always feels scary to me.

Prevent it in future

I would like to do two things:

  1. Add the pre-commit hook check-added-large-files
  2. Check in CI whether a large file was added (the simplest way would be to run pre-commit in CI, but I would like to avoid that; I still need to check how to do this)

Does this sound like a good idea to you?

What would be a good maximum file size?
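
One way the CI check from item 2 could work without running pre-commit is sketched below. This is not an agreed implementation: the function name check_added_file_sizes, its arguments, and the default limit of 1000000 bytes are assumptions for illustration.

```shell
# Hypothetical CI helper: report files added between two commits whose
# blob size exceeds a byte limit. Exits non-zero if any file is too large.
check_added_file_sizes() {
    base="$1"
    tip="$2"
    max_bytes="${3:-1000000}"
    # --diff-filter=A restricts the diff to newly added paths;
    # core.quotePath=false keeps non-ASCII paths unescaped.
    git -c core.quotePath=false diff --name-only --diff-filter=A "$base" "$tip" | (
        status=0
        while IFS= read -r path; do
            size=$(git cat-file -s "$tip:$path")
            if [ "$size" -gt "$max_bytes" ]; then
                echo "$path: $size bytes exceeds the limit of $max_bytes"
                status=1
            fi
        done
        exit "$status"
    )
}
```

In CI this could be called as `check_added_file_sizes origin/main HEAD 1000000`. Note that it only compares the two end points, so a file that is added and removed again within the same range would not be caught.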

@coding-horror
Owner

coding-horror commented Mar 25, 2022 via email

@MartinThoma
Collaborator Author

MartinThoma commented Mar 26, 2022

Git History Rewrite

I propose to do the following:

  1. Clarify if we need the buildJvm directory (discussion) and maybe adjust what gets deleted from the git history
  2. Merge all current branches - or ask the people to copy their changes
  3. Delete all local branches except the one currently checked out: git branch | grep -v '^\*' | xargs git branch -D
  4. Create backup: git bundle create backup.bundle --all
  5. Rewrite the history (see below - I recommend the automatic way with this list of files to delete)
  6. Clean up git repo: git gc --aggressive --prune=now
  7. Force-push!!!! Now everybody needs to fork / clone again! ⚠️
  8. Ask contributors to (1) delete their fork / repository (2) fork again (3) clone again

Maybe we should also disable force-pushes on main in the GitHub settings after that?
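
Since the backup in step 4 is the only safety net, it may be worth verifying it right after creating it. A minimal sketch, where the function name backup_repo is made up for illustration:

```shell
# Hypothetical helper: bundle every ref of the current repository into one
# file and check that the bundle is valid before any history is rewritten.
backup_repo() {
    bundle="$1"
    git bundle create "$bundle" --all &&
        git bundle verify "$bundle"
}
```

If something goes wrong, the old state can be restored with `git clone backup.bundle restored-repo`. Running `git count-objects -vH` before the rewrite and again after the `git gc` step shows how much space was actually reclaimed.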

History Rewrite Option 1: Manual

git filter-branch --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch '39 Golf/csharp/compiled/windows_x86/golf.exe'" \
  --tag-name-filter cat -- --all

git filter-branch -f --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch '39 Golf/csharp/compiled/linux_x86/golf'" \
  --tag-name-filter cat -- --all

git filter-branch -f --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch '89_Tic-Tac-Toe/python/TicTacToe_exe/TicTacToe.exe'" \
  --tag-name-filter cat -- --all

git filter-branch -f --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch '53_King/kotlin/king.jar'" \
  --tag-name-filter cat -- --all

git filter-branch -f --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch 'buildJvm/build_53_King_kotlin/build/install/build_53_King_kotlin/lib/kotlin-stdlib-1.6.0.jar'" \
  --tag-name-filter cat -- --all

git filter-branch -f --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch '08 Batnum/vbnet/.vs/batnum/DesignTimeBuild/.dtbcache.v2'" \
  --tag-name-filter cat -- --all

git filter-branch -f --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch '89_Tic-Tac-Toe/python/TicTacToe_exe/assets/tie.png'" \
  --tag-name-filter cat -- --all
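
The separate passes above could probably be combined, because git rm accepts several paths at once. The following is only a sketch with two of the paths, wrapped in a made-up function name (purge_golf_binaries). It also leaves out -d /dev/shm/scratch, so the temporary directory falls back to the default location.

```shell
# Hypothetical single-pass variant: remove several paths from every commit
# in one filter-branch run instead of one run per path.
purge_golf_binaries() {
    git filter-branch -f --prune-empty \
        --index-filter "git rm --cached -f -q --ignore-unmatch \
            '39 Golf/csharp/compiled/windows_x86/golf.exe' \
            '39 Golf/csharp/compiled/linux_x86/golf'" \
        --tag-name-filter cat -- --all
}
```

Each pass walks the complete history, so fewer passes should mean less waiting. filter-branch keeps the old refs under refs/original until they are deleted, which means the space is only freed after those refs are removed and git gc has run.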

History Rewrite Option 2: Automatic

Delete from the history every file that no longer exists in the current tree. I'm uncertain what happens with files that were moved!

# Install git-filter-repo; several options: https://github.com/newren/git-filter-repo/blob/main/INSTALL.md
pip install git-filter-repo

# Run it
git filter-repo --analyze

# Create a list of all files that should get deleted
tail -n +3 .git/filter-repo/analysis/path-deleted-sizes.txt \
    | tr -s ' ' \
    | cut -d ' ' -f 5- \
    > .git/filter-repo/analysis/path-deleted.txt

# Before you do this, you can check if there are things you want to keep
git filter-repo --invert-paths --paths-from-file .git/filter-repo/analysis/path-deleted.txt --force

rm -rf .git/filter-repo

Alternatively, you could use this path-deleted.txt. I've removed all lines that match this pattern:

(\.pl|\.py|\.cs|\.java|\.md|\.bas|\.rb|\.gitignore|\.rs|\.js|\.vb|\.sln|\.txt|\.html|\.csproj|\.vbproj|\.kt)$
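
A filter like that can be applied with grep. This is a sketch of how such a list could be trimmed, not necessarily how the linked file was produced; the function name keep_non_source_paths is made up.

```shell
# Hypothetical helper: read paths on stdin and drop every path that ends in
# one of the source/text extensions, so those files stay in the history.
keep_non_source_paths() {
    grep -Ev '(\.pl|\.py|\.cs|\.java|\.md|\.bas|\.rb|\.gitignore|\.rs|\.js|\.vb|\.sln|\.txt|\.html|\.csproj|\.vbproj|\.kt)$'
}
```

For example, `keep_non_source_paths < path-deleted.txt > path-deleted-filtered.txt` would leave only binaries and other non-source files in the deletion list (the output file name is just an example).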

@MartinThoma
Collaborator Author

MartinThoma commented Mar 26, 2022

The effect of this:

$ ls -lh
  74M  backup.bundle
 3,4M  repo-after-history-rewrite-and-clean-manual-deletion.bundle
 3,1M  repo-after-history-rewrite-and-clean-automatic-deletion.bundle

When you now execute this snippet you can see:

2e07bb068f29   41KiB 01_Acey_Ducey/rust/target/debug/incremental/rust-8frg64vi8djd/s-g737sgtzl9-gc3nmb-ydny6jjnqtbz/dep-graph.bin
6557ca9f5546   42KiB 00_Alternate_Languages/88_3-D_Tic-Tac-Toe/csharp/Qubic.cs
595db6616bd6   43KiB 88_3-D_Tic-Tac-Toe/csharp/Qubic.cs
a01944e199e2   43KiB 88_3-D_Tic-Tac-Toe/csharp/Qubic.cs
513651d6a5e4   44KiB 88_3-D_Tic-Tac-Toe/csharp/Qubic.cs
9a4573c62152   48KiB 84_Super_Star_Trek/javascript/superstartrek.mjs
b8ecb9343b12   49KiB 84_Super_Star_Trek/java/SuperStarTrekGame.java
213b2a0b8d38   49KiB 84_Super_Star_Trek/java/SuperStarTrekGame.java
7454180f2ae8   58KiB buildJvm/gradle/wrapper/gradle-wrapper.jar
ea71f1cf7903   58KiB 00_Alternate_Languages/01_Acey_Ducey/elm/package-lock.json
6fe43fa79007  115KiB 75_Roulette/perl/roulette-test.t
d2946a198344  125KiB 00_Alternate_Languages/01_Acey_Ducey/elm/docs/app.js
ccf0500fd14b  196KiB buildJvm/build_53_King_kotlin/build/install/build_53_King_kotlin/lib/kotlin-stdlib-common-1.6.0.jar

If we don't need the buildJvm folder, we could reduce the allowed file size from 1 MB to 200 KB.

@coding-horror
Owner

I think we have definitely decided we don't want any build or IDE specific files in the repo, so proceed accordingly there...

@MartinThoma
Collaborator Author

If I am the one to do it, I would give @AlaaSarhan until Monday to finish #404. If possible, I would like to avoid having open PRs during the rewrite.

Should I do it? Does it sound good to wait until #404 is merged or until Monday (whichever is earlier)?

@coding-horror
Owner

Sure! Sounds good to me.

MartinThoma added a commit that referenced this issue Mar 28, 2022
@MartinThoma
Collaborator Author

The history rewrite is done. Please make a fresh clone of the current repository.

@coding-horror
Owner

OK! I cloned the repo from scratch and it looks good to me! Thank you!
