Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment: use Tweag's self-hosted runners #2037

Merged
merged 2 commits into from
Oct 9, 2024
Merged

Conversation

yannham
Copy link
Member

@yannham yannham commented Sep 12, 2024

Try to use our self hosted runners, which should be beefier, don't need to install Nix again and again, and should be able to re-use the local store as a cache mechanism as well.

Copy link
Contributor

github-actions bot commented Sep 12, 2024

🐰 Bencher Report

Branch2037/merge
Testbedubuntu-latest

⚠️ WARNING: The following Measure does not have a Threshold. Without a Threshold, no Alerts will ever be generated!

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds CLI flag.

Click to view all benchmark results
BenchmarkLatencynanoseconds (ns)
fibonacci 10📈 view plot
⚠️ NO THRESHOLD
485,970.00
pidigits 100📈 view plot
⚠️ NO THRESHOLD
3,209,400.00
product 30📈 view plot
⚠️ NO THRESHOLD
833,890.00
scalar 10📈 view plot
⚠️ NO THRESHOLD
1,524,100.00
sum 30📈 view plot
⚠️ NO THRESHOLD
827,870.00
🐰 View full continuous benchmarking report in Bencher

@yannham
Copy link
Member Author

yannham commented Sep 13, 2024

Thanks for your precious help @YorikSar . It seems the macOS workflow is failing because of some security stuff. We have several possibilities, I guess, from here:

  1. Revert back to use GH hosted runners (but would be sad)
  2. Revert the macOS runner to be a GH runner (but then the whole approach is moot, because we have to wait for the slowest workflow to finish)
  3. Disable Cachix action on the self-hosted runners, and have separate runners (say, every upgrade, or every week) that build and upload artifacts to Cachix from master (honestly it might be even dubious to upload store paths built from PRs to Cachix)
  4. Disabe Cachix action but only on the macOS runner. Either not upload macOS artifacts at all, or upload them in a separate workflow as in 3.
  5. Dig deeper to try to fix the Cachix action issue

@YorikSar
Copy link
Member

I think we still need to investigate results on our Linux runner because even there the job seems to spend 24m when a random different run on master finished in 3m.

As for macOS, I think we could bring this up to Cachix maintainers. Calling out to security and failing in rather simple hosted runner would probably not be something expected.

@yannham
Copy link
Member Author

yannham commented Sep 13, 2024

I think we still need to investigate results on our Linux runner because even there the job seems to spend 24m when a random different run on master finished in 3m.

I think the continuous integration for merge queue (the one you linked) is different - I don't know what happens exactly but some caching comes into play. If you take the CI on a random PR, it's usually rather around 20+ minutes so 24 minutes don't stand out, I mean, it does because it should still be quite faster on our own runner. To be honest we seem to recompile a lot of stuff on each run, and we've not really been able to entirely understand if Cachix is being hit every time, if crane doesn't really play well with something else, causing from scratch recompilation, but it does look like we're doing quite a lot of work on each new PR. My plan is to take a deeper look at that, but moving the runners to self hosted has been the first move.

As for macOS, I think we could bring this up to Cachix maintainers. Calling out to security and failing in rather simple hosted runner would probably not be something expected.

Fair. I can take care of looking for a similar issue, and if not, opening one.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Issue opened, and Domen suggest to bump Nixpkgs: cachix/cachix-action#190 (comment) (@YorikSar). I guess this means the nixpkgs used by the build machines, as cachix will use that one, not the input of the Nickel flake.

@YorikSar
Copy link
Member

@yannham it’s used from this flake’s inputs:

nix build --print-out-paths --inputs-from . nixpkgs#cachix.bin

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Oh, my bad.

@YorikSar
Copy link
Member

@yannham Looks like jobs are not getting triggered. Maybe because of the conflicts?

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Ok, Cachix seems to work fine now, but we get a meson error when trying to build nix from the latest unstable.

@YorikSar
Copy link
Member

From logs from macOS:

error: builder for '/nix/store/qfjwky2wmfjhnn15nk5lq28spgncdci0-nix-store-2.25.0pre.drv' failed with exit code 1;
       last 10 log lines:
       >                            ^^^^^^^^^^
       >   File "/nix/store/9pj4rzx5pbynkkxq1srzwjhywmcfxws3-python3-3.12.5/lib/python3.12/pathlib.py", line 875, in is_dir
       >     return S_ISDIR(self.stat().st_mode)
       >                    ^^^^^^^^^^^
       >   File "/nix/store/9pj4rzx5pbynkkxq1srzwjhywmcfxws3-python3-3.12.5/lib/python3.12/pathlib.py", line 840, in stat
       >     return os.stat(self, follow_symlinks=follow_symlinks)
       >            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       > PermissionError: [Errno 1] Operation not permitted: '/nix/var'
       >
       > ERROR: Unhandled python OSError. This is probably not a Meson bug, but an issue with your build environment.
       For full logs, run 'nix log /nix/store/qfjwky2wmfjhnn15nk5lq28spgncdci0-nix-store-2.25.0pre.drv'.
error: 1 dependencies of derivation '/nix/store/crpmjh1glrbknz2dlkd0p8m6a8c784iq-nickel-lang-clippy-1.8.0.drv' failed to build

It seems this derivation wants to access something outside its sandbox... Although there is also this line at the top that might be related:

warning: Ignoring setting 'impure-env' because experimental feature 'configurable-impure-env' is not enabled

What does this nix-store derivation do and where does it come from?..

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

There is an experimental feature to evaluate a Nix expression in Nickel. It's been written before the Nix C API, so it binds directly to the Nix C++ API. Before we just used the default package of the github:nixos/nix flake and everything worked well. I guess some refactoring has been done in between, because after updating the flake.lock Nickel (well pkg-config to be precise) couldn't find the libraries nix-expr, nix-store, etc. any more.

I saw that the nixos/nix flake is exporting additional outputs, that match precisely the lib that we are using, so I added those to the buildInputs of the corresponding derivation and it built fine on my system again.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Maybe one solution would be to try to migrate that to use the Nix C API (we only really need to evaluate expressions, which should be quite straightforward), but I initially didn't want to go this route for this PR that has nothing to do with that, so I tried to get it working with minimal tweaking first.

@YorikSar
Copy link
Member

I would assume something was merged related to using meson in nixos/nix repo. Our macOS runner has sandbox enabled in Nix which is disabled by default, which is probably why this issue didn't come up before on hosted runners.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Right, it seems at the beginning of this summer, they started to switch to meson. I guess we're up for an issue.

flake.lock Outdated
"owner": "nixos",
"repo": "nix",
"rev": "0ab9369572f64b1ab70a8db29f79ae730ff31ab6",
"rev": "2c42a9dbaa805f4f29561d9a1c10b41dfe98dcfa",
"type": "github"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yannham I see you've updated all inputs here. Maybe rollback nixos/nix input to the previous value to see if this helps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we can do that of course since Nix isn't taken from Nixpkgs. Good catch, I'll try that

@YorikSar
Copy link
Member

Another option would be to revert changes to flake.lock and just add a new input for nixpkgs from which we'll install cachix. We could also install it from cachix repo directly instead. I think it should exclude issues that have risen because of this update.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

On the other hand we haven't been updating the nixpkgs input for some time, and I think it's good that we do - although I can cherry-pick the changes into a separate PR. But the thing is that now the error seems related to the extra Nix input that should have been reverted to a previous version. And I can't get it to not do the checks for some reason.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

Ah, but the inputs of nix-input follows nixpkgs, so that is still a difference with master. Apparently they fixed like a few hours ago the meson build on Darwin, so let me try to update even more nix, and if that doesn't work, maybe your approach is the most sensible one.

@yannham
Copy link
Member Author

yannham commented Sep 16, 2024

John ask if it's possible that build02 identifies as something else than macos: NixOS/nix#2503 (comment). I have no idea.

@YorikSar
Copy link
Member

Even though Linux job failed to build Nix this time, it looks like macOS one did something for 6 minutes, so we can assume it didn't break on Nix build. I think we should make that weird step output logs always with something like trap. By the way, I see NixOS/nix#8949 mentioned there, but from the discussion there it seems like piping into cat should work just as well as into file. Maybe we should we do that instead? Another option would be to run tail -f logs & before starting the build.

@yannham
Copy link
Member Author

yannham commented Sep 24, 2024

It's really annoying - I'm basically trying many different tags but either Boehm or Nix checks fail locally (so not even on MacOS). It's surprising that something as central as Nix is not reproducible to build on a pretty standard config (Debian unstable with Nix as a package manager).

@yannham
Copy link
Member Author

yannham commented Sep 24, 2024

Oh, I think this is because we follow the nixpkgs inputs from the main flake. It seems the Boehmgc update 8.2.4 -> 8.2.6 made a Nix-specific patch obsolete. The version of Nix I'm trying to build are anterior to this patch removal but I'm using latest Nixpkgs, I guess.

@yannham
Copy link
Member Author

yannham commented Sep 26, 2024

Oh, interesting. I think that explains the test failure.

@YorikSar
Copy link
Member

Does it? How so?

@yannham
Copy link
Member Author

yannham commented Sep 26, 2024

I just took a quick look, but it seems the test is supposed to error out with the same error message as us on Darwin because MacOS do unicode normalization. I think the test tries to create two files with byte-different names but representing the same "unicode strings", if that means anything. MacOS just thinks they're the same file so it doesn't allow it (file already exists). For some reason the tests expect that Linux does not, and so should accept to create both files as separate. However, it seems that this assumption is false with ZFS+normalization option (which implies utf8only option).

@yannham
Copy link
Member Author

yannham commented Sep 26, 2024

The test was added 3 weeks ago.

@YorikSar
Copy link
Member

@yannham Yeah, makes sense. I could disable normalisation for /nix on the server. If it would remove some difference with the upstream, it might be really needed for development.

@yannham
Copy link
Member Author

yannham commented Sep 27, 2024

Ahah! it seems to build Nix correctly (there is a clippy error now), using my fork of Nix. Let me see if I can get everything green, and then upstream the patch.

@YorikSar
Copy link
Member

YorikSar commented Sep 27, 2024

You did it! Congrats! :)
Now let’s hope we manage to fix macOS builder. But in the mean time you can switch at least the Linux build.

@yannham yannham marked this pull request as ready for review September 30, 2024 05:43
flake.nix Outdated Show resolved Hide resolved
Copy link
Member

@YorikSar YorikSar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems to be much more than just switching to the self-hosted runners. Maybe it would be worth splitting out other changes and landing them separately?

.github/workflows/continuous-integration.yml Show resolved Hide resolved
@yannham
Copy link
Member Author

yannham commented Sep 30, 2024

This PR seems to be much more than just switching to the self-hosted runners. Maybe it would be worth splitting out other changes and landing them separately?

I crawled my way until it worked, but now that it is (more or less), it's fair. I'll split the update flake lock and fix breakages in another PR.

Copy link
Member

@YorikSar YorikSar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yannham I've finished restoring our macOS builder, and re-run check on this PR. It complained that find and xargs were not found, so I took a liberty of attempting to fix this in a separate commit here and rebasing on top of master.

@YorikSar
Copy link
Member

YorikSar commented Oct 9, 2024

Ah, I forgot to remove file when switched command line to find's {}. The error from nickel typecheck was very, very weird though https://github.com/tweag/nickel/actions/runs/11253749213/job/31289680441 (it ran nickel typecheck file a/b/c.ncl e/f/g.ncl ...):

error: unbound identifier `import'Nickel`
  ┌─ <generated main>:1:27
  │
1 │ ((((((((((((((((((((((((((import'Nickel "file")
  │                           ^^^^^^^^^^^^^ this identifier is unbound

It is not installed by default on our self-hosted runners.
@yannham
Copy link
Member Author

yannham commented Oct 9, 2024

Thanks for all the help @YorikSar ! We'll see how it goes for new PRs, but I have good hope that this will speed up the CI.

@yannham yannham merged commit 35c98d1 into master Oct 9, 2024
5 checks passed
@yannham yannham deleted the ci/self-hosted-runners branch October 9, 2024 12:33
YorikSar added a commit that referenced this pull request Oct 9, 2024
By default GitHub lists all arguments in brackets after job name.
This gets unwieldy when we include runner labels in
#2037. For example, Linux job is now
"build-and-test (self-hosted, Linux, X64, stable)".
Shorten it to "build-and-test (linux, stable)" instead.
YorikSar added a commit that referenced this pull request Oct 9, 2024
By default GitHub lists all arguments in brackets after job name.
This gets unwieldy when we include runner labels in
#2037. For example, Linux job is now
"build-and-test (self-hosted, Linux, X64, stable)".
Shorten it to "build-and-test (linux)" instead.
Also, remove rust_channel as it wasn't used.
github-merge-queue bot pushed a commit that referenced this pull request Oct 9, 2024
By default GitHub lists all arguments in brackets after job name.
This gets unwieldy when we include runner labels in
#2037. For example, Linux job is now
"build-and-test (self-hosted, Linux, X64, stable)".
Shorten it to "build-and-test (linux)" instead.
Also, remove rust_channel as it wasn't used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants