LinuxEmulation: Implement support for seccomp #3628

Sonicadvance1 · 2024-05-13T22:14:10Z

Seccomp is a relatively complex feature that was added to Linux back in
2005, and was further extended in 2012 (Linux 3.5) to support BPF-based
protections. Once seccomp is enabled it can no longer be disabled, but
additional protections can be placed on top of existing seccomp filters.
Additionally, seccomp filters are inherited by child processes, which
ensures the process tree can't escape from the secure computing
environment through child processes.

The basis of this feature is a shim that lives between userspace and the
kernel at the syscall entrypoint.
In "strict" mode, seccomp only allows read, write, exit, and {rt_,}sigreturn to function; exit_group is not permitted.
When in "filter" mode, a BPF filter is run at the syscall entrypoint and
returns an action that says whether the syscall should be allowed. Multiple
filters can be installed in this mode, all of which get executed. The
most restrictive result is the action that is taken at the end.
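
As a rough illustration of the "most restrictive result wins" rule, here is a minimal sketch. The constants carry the values from the kernel's `<linux/seccomp.h>`; the helper name `CombineFilterResults` is hypothetical and is not taken from this PR. It assumes the kernel's ordering, where the action bits are compared as signed values and the lowest one wins.

```cpp
#include <cstdint>
#include <vector>

// Action values as defined in the kernel UAPI header <linux/seccomp.h>.
constexpr uint32_t RET_KILL_PROCESS = 0x80000000U;
constexpr uint32_t RET_KILL_THREAD  = 0x00000000U;
constexpr uint32_t RET_TRAP         = 0x00030000U;
constexpr uint32_t RET_ERRNO        = 0x00050000U;
constexpr uint32_t RET_ALLOW        = 0x7fff0000U;
constexpr uint32_t RET_ACTION_FULL  = 0xffff0000U;

// Hypothetical helper: given the return value of every installed filter,
// pick the most restrictive one. The action bits are compared as signed
// integers, so KILL_PROCESS (sign bit set) ranks lowest and always wins.
uint32_t CombineFilterResults(const std::vector<uint32_t>& Results) {
  uint32_t Final = RET_ALLOW;
  for (uint32_t Cur : Results) {
    const auto CurAction = static_cast<int32_t>(Cur & RET_ACTION_FULL);
    const auto FinalAction = static_cast<int32_t>(Final & RET_ACTION_FULL);
    if (CurAction < FinalAction) {
      Final = Cur; // Keep the data bits (e.g. the errno value) of the winner.
    }
  }
  return Final;
}
```

With this ordering, one filter returning an errno action overrides any number of filters that return allow.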

There are some significant limitations in filter mode that must be
adhered to, which make executing this code inside of kernel space a
non-issue and effectively limit how much CPU time is spent in the filters.
These filters are free to do basically anything with the
provided data, but they can't contain any loops.
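
The no-loops restriction can be checked up front, because jump offsets in classic BPF are unsigned and so can only point forward. The following is a sketch of such a check, not the validator from this PR: `Insn` is a local mirror of the kernel's `struct sock_filter`, and `JumpsAreValid` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Local mirror of the kernel's `struct sock_filter` layout.
struct Insn {
  uint16_t code;
  uint8_t jt;
  uint8_t jf;
  uint32_t k;
};

// Encoding constants with the same values as the BPF_* macros in <linux/filter.h>.
constexpr uint16_t CLASS_MASK = 0x07;
constexpr uint16_t CLASS_JMP = 0x05;
constexpr uint16_t CLASS_RET = 0x06;
constexpr uint16_t OP_MASK = 0xf0;
constexpr uint16_t OP_JA = 0x00;

// Hypothetical check: every jump target must stay inside the program, and the
// program must end in a return. Offsets are unsigned, so a target can never
// be at or before the jump itself, which rules out loops.
bool JumpsAreValid(const std::vector<Insn>& Prog) {
  if (Prog.empty() || (Prog.back().code & CLASS_MASK) != CLASS_RET) {
    return false;
  }
  const uint64_t NumInst = Prog.size();
  for (uint64_t IP = 0; IP < NumInst; ++IP) {
    const Insn& I = Prog[IP];
    if ((I.code & CLASS_MASK) != CLASS_JMP) {
      continue;
    }
    if ((I.code & OP_MASK) == OP_JA) {
      // Unconditional jump: the target is relative to the next instruction.
      if (IP + I.k + 1 >= NumInst) return false;
    } else {
      // Conditional jump: both the taken and not-taken targets must be valid.
      if (IP + I.jt + 1 >= NumInst || IP + I.jf + 1 >= NumInst) return false;
    }
  }
  return true;
}
```

Doing the arithmetic in 64 bits avoids wrap-around when `k` is close to the 32-bit limit.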

FEX needs to implement seccomp because there are multiple applications
using the feature, the primary one being Chromium, which some games embed
without disabling the sandbox. WINE also uses seccomp to catch
games that make raw Windows system calls. Apparently Red Dead Redemption
is one of the games that requires this.

While FEX implements seccomp, it is not yet all-encompassing, which is
one of the reasons why it isn't enabled by default and requires a config
option.

**seccomp_unotify is not implemented**
This is a relatively new feature for seccomp which lets the seccomp
filter signal an FD for multiple things. Luckily Chromium and WINE don't
use this. This will be tricky to implement under FEX since it
requires ioctl trapping and some other behaviour.

**ptrace isn't supported**
One feature of seccomp is that it can raise ptrace events. Since FEX
doesn't support ptrace at all, this isn't handled. Again Chromium and
WINE don't use this.

**kill-thread not quite correct**
This isn't directly related to seccomp but more about how we do thread
shutdown in FEX. This will require some more changes around thread state
tracking before fully supporting this. Chromium and WINE don't use this.
kill-process also falls under this.

Features that are supported:

- Strict mode and seccomp-bpf mode
- All BPF instructions that seccomp-bpf understands
- Inheriting seccomp through execve
  - This means we serialize and deserialize the calling thread's
    seccomp filters
  - An execve that escapes FEX will also escape seccomp. Not much we
    can do about it
- TSync - Allowing post-mortem seccomp insertion which allows threads to
  synchronize seccomp filters after the fact
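
To make the execve inheritance item more concrete, here is a minimal round-trip sketch of serializing a thread's filters into a flat buffer and reading them back. The wire format (a filter count, then a length-prefixed instruction array per filter) and the names `Serialize` and `Deserialize` are assumptions for illustration only; this PR's actual format may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Local mirror of the kernel's `struct sock_filter` (8 bytes, no padding).
struct Insn {
  uint16_t code;
  uint8_t jt;
  uint8_t jf;
  uint32_t k;
};
using Filter = std::vector<Insn>;

// Hypothetical format: [u32 filter count], then per filter [u32 insn count][raw insns].
std::vector<uint8_t> Serialize(const std::vector<Filter>& Filters) {
  std::vector<uint8_t> Out;
  auto Append = [&Out](const void* Data, size_t Size) {
    const auto* Bytes = static_cast<const uint8_t*>(Data);
    Out.insert(Out.end(), Bytes, Bytes + Size);
  };
  const auto Count = static_cast<uint32_t>(Filters.size());
  Append(&Count, sizeof(Count));
  for (const Filter& F : Filters) {
    const auto Len = static_cast<uint32_t>(F.size());
    Append(&Len, sizeof(Len));
    Append(F.data(), Len * sizeof(Insn));
  }
  return Out;
}

// Returns std::nullopt if the buffer is truncated.
std::optional<std::vector<Filter>> Deserialize(const std::vector<uint8_t>& In) {
  size_t Offset = 0;
  auto Read = [&In, &Offset](void* Dst, size_t Size) {
    if (Size == 0) return true;
    if (In.size() - Offset < Size) return false;
    std::memcpy(Dst, In.data() + Offset, Size);
    Offset += Size;
    return true;
  };
  uint32_t Count = 0;
  if (!Read(&Count, sizeof(Count))) return std::nullopt;
  std::vector<Filter> Filters;
  for (uint32_t i = 0; i < Count; ++i) {
    uint32_t Len = 0;
    if (!Read(&Len, sizeof(Len))) return std::nullopt;
    // Check the length against the remaining bytes before allocating.
    if ((In.size() - Offset) / sizeof(Insn) < Len) return std::nullopt;
    Filter F(Len);
    if (!Read(F.data(), Len * sizeof(Insn))) return std::nullopt;
    Filters.push_back(std::move(F));
  }
  return Filters;
}
```

The buffer could then be handed to the new process image through whatever channel the emulator already uses across execve; that part is left out here.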

Features that are not supported:

- Different arch qualifiers depending on syscall entrypoint
  - Just like our syscall handler, we are hardcoded to the arch that the
    application starts with
- user_notif
- ptrace
- Runtime code cache invalidation when seccomp is installed
  - Currently we must ensure all syscalls go through the frontend
    syscall handler
  - Runtime invalidation of code cache with inline syscalls will get
    fixed in the future.
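
On the arch qualifier item: a seccomp-bpf filter reads a `seccomp_data` record whose `arch` field holds an `AUDIT_ARCH_*` constant, and guest filters typically compare it against the x86 value. The sketch below shows why the emulator has to present the guest's architecture rather than the AArch64 host's. The struct mirrors the kernel layout and the constants carry the values from `<linux/audit.h>`; the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Values from <linux/audit.h>: ELF machine number OR'd with the
// 64-bit and little-endian flag bits.
constexpr uint32_t ARCH_X86_64  = 0xC000003EU;
constexpr uint32_t ARCH_I386    = 0x40000003U;
constexpr uint32_t ARCH_AARCH64 = 0xC00000B7U;

// Local mirror of the kernel's `struct seccomp_data`, the only input
// a seccomp-bpf filter is able to read.
struct SeccompData {
  int32_t nr;
  uint32_t arch;
  uint64_t instruction_pointer;
  uint64_t args[6];
};
static_assert(offsetof(SeccompData, arch) == 4, "arch follows the syscall number");

// Hypothetical helper: build the record a guest filter should see.
// The arch is fixed by the guest's bitness, never by the host.
SeccompData MakeGuestSeccompData(bool Is64BitGuest, int32_t Syscall, uint64_t GuestIP,
                                 const uint64_t (&Args)[6]) {
  SeccompData Data {};
  Data.nr = Syscall;
  Data.arch = Is64BitGuest ? ARCH_X86_64 : ARCH_I386;
  Data.instruction_pointer = GuestIP;
  for (size_t i = 0; i < 6; ++i) {
    Data.args[i] = Args[i];
  }
  return Data;
}

// What a typical x86-64 guest filter prologue checks before anything else.
bool GuestArchCheckPasses(const SeccompData& Data) {
  return Data.arch == ARCH_X86_64;
}
```

A filter forwarded unchanged to an AArch64 host kernel would see `ARCH_AARCH64` in that field and fail this check, as well as see host syscall numbers.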

This currently isn't enabled by default because of the minor feature
problems that haven't been resolved. Currently the Linux Kernel's test
application works for the features that FEX supports, and WINE's usage
can be handled by FEX. Chromium's sandbox doesn't yet work with this PR,
but it only fails due to features unrelated to seccomp.

Having this open for merging now so we can work to resolve the remaining
issues without this bitrotting.

Source/Tools/LinuxEmulation/LinuxSyscalls/SeccompEmulator.cpp

alyssarosenzweig · 2024-05-15T18:21:09Z

Is 64-bit bpf interesting for FEX? Seems potentially better to do now than later given how much churn it would be. (Unless it's never going to be supported for some good reason)

Sonicadvance1 · 2024-05-15T18:28:36Z

> Is 64-bit bpf interesting for FEX? Seems potentially better to do now than later given how much churn it would be. (Unless it's never going to be supported for some good reason)

I haven't seen any usage of BPF outside of seccomp currently. Not even sure if the bpf syscall is even typically accessible outside of root. Also haven't checked if we can just forward the syscall, I don't know if they expose an architecture or anything inside the VM. So it's just kind of eeh.

alyssarosenzweig · 2024-05-15T19:58:25Z

> Also haven't checked if we can just forward the syscall

Didn't we talk about this in the meeting last week? I thought there was some reason FEX had to do these shenanigans instead of simply forwarding like @neobrain hoped...? I'd rather not merge 1kloc of new JIT code if we could just... not 😅

Sonicadvance1 · 2024-05-15T20:00:55Z

> > Also haven't checked if we can just forward the syscall
>
> Didn't we talk about this in the meeting last week? I thought there was some reason FEX had to do these shenanigans instead of simply forwarding like @neobrain hoped...? I'd rather not merge 1kloc of new JIT code if we could just... not 😅

Sorry, "bpf" the syscall is different from seccomp. Seccomp uses a highly restricted form of BPF which exposes an architecture constant which means we can't forward it.
"bpf" might not have that same architecture specific data and /might/ be able to get forwarded, unsure and needs investigation. Ideally that syscall is still restricted in userspace and can continue to get ignored.

Sonicadvance1 · 2024-05-16T23:20:09Z

jetson-orin-1 is still on Clang 12.x which doesn't support the C++20 feature of lambda expressions in unevaluated operands. Clang 13.x added support for this. Time to raise the minspec to Clang-13.

neobrain · 2024-05-17T08:38:39Z

> Sorry, "bpf" the syscall is different from seccomp. Seccomp uses a highly restricted form of BPF which exposes an architecture constant which means we can't forward it.
> "bpf" might not have that same architecture specific data and /might/ be able to get forwarded, unsure and needs investigation. Ideally that syscall is still restricted in userspace and can continue to get ignored.

I'm also interested in picking up this conversation again. Currently the PR description is very sparse on details, but I hope some background on this change can be provided once the WIP label is dropped to enable a sensible review.

Sonicadvance1 · 2024-08-07T17:52:11Z

Poking around at Chrome's sandbox with this PR

Seems to work successfully
Chrome hits two bugs when running the sandbox, both unrelated to this PR:
- CLONE_NEWNET is currently unsupported in FEX, as it loses connection to FEXServer. I have a WIP branch locally to support this.
- FEX hits some EDEADLK futex locks for some reason, potentially due to the forking Chrome does. With some massaging of our various mutexes, I was able to still get Chrome rendering with the sandbox enabled.

alyssarosenzweig · 2024-09-02T13:34:37Z

Source/Tools/LinuxEmulation/LinuxSyscalls/Seccomp/BPFEmitter.cpp

+template<bool CalculateSize>
+uint64_t BPFEmitter::HandleLoad(uint32_t BPFIP, const sock_filter* Inst) {
+  [[maybe_unused]] size_t OpSize {};
+  switch (BPF_SIZE(Inst->code)) {


This switch would probably be better as:

if (BPF_SIZE() != W) { RET_ERR(EINVAL); }

Used the VALIDATE define

alyssarosenzweig · 2024-09-02T13:37:54Z

Could we get a macro:

#define VALIDATE(cond) if (!(cond)) RETURN_ERROR(-EINVAL)

Then porting the file to this would let us write straightline code instead which should be a lot shorter and imho simpler. e.g.:

+  case BPF_JA: {
+    // Only BPF_K supported on JA.
+    VALIDATE(SrcType != BPF_X);
+
+    // BPF IP register is effectively only 32-bit. Treat k constant like a signed integer.
+    // This allows it to jump anywhere in the program.
+    // But! Loops are EXPLICITLY disallowed inside of BPF programs.
+    // This is to prevent DOS style attacks through BPF programs.
+    uint64_t Target = BPFIP + Inst->k + 1;
+
+    // Can't jump past the end.
+    VALIDATE (Target < NumInst);
+
+    fextl::unordered_map<uint32_t, ARMEmitter::ForwardLabel>::iterator TargetLabel {};
+
+    if constexpr (!CalculateSize) {
+      TargetLabel = JumpLabels.try_emplace(Target, ARMEmitter::ForwardLabel {}).first;
+    }
+
+    EMIT_INST(b(&TargetLabel->second));
+    break;
+  }

Could even take a string as a second arg to match our logman asserts.
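
A minimal sketch of what such a two-argument macro could look like. This is an assumption, not the define that ended up in the PR: it returns `-EINVAL` directly instead of going through `RETURN_ERROR`, the message only documents the call site, and the `do { } while (0)` wrapper is added so the macro behaves as a single statement inside unbraced `if`/`else`. `CheckScratchIndex` is a hypothetical caller.

```cpp
#include <cerrno>
#include <cstdint>

// Hypothetical variant of the suggested macro. `msg` is never evaluated or
// logged; it only states at the call site what the condition protects.
#define VALIDATE(cond, msg) \
  do {                      \
    if (!(cond)) {          \
      return -EINVAL;       \
    }                       \
  } while (0)

// Classic BPF has 16 scratch memory words (BPF_MEMWORDS in <linux/filter.h>).
constexpr uint32_t ScratchWords = 16;

// Hypothetical caller: reject a scratch index that is out of range.
int64_t CheckScratchIndex(uint32_t k) {
  VALIDATE(k < ScratchWords, "scratch index must be smaller than the scratch space size");
  return 0;
}
```

Written this way, the condition reads as the requirement that must hold, and the string says the same thing in words.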

Seccomp emulator uses lambda expressions in an unevaluated operand, which was only added in clang-13

Sonicadvance1 · 2024-09-02T14:49:34Z

Added a VALIDATE define. No need for logging because these are non-fatal errors, they are ensuring that seccomp-bpf applications don't try to do something invalid.

alyssarosenzweig · 2024-09-02T20:27:17Z

Those switches should probably all have a default: EINVAL on them

alyssarosenzweig · 2024-09-02T20:29:01Z

> Added a VALIDATE define. No need for logging because these are non-fatal errors, they are ensuring that seccomp-bpf applications don't try to do something invalid.

The problem is that

+  // Larger than scratch space size.
+  VALIDATE(Inst->k < 16);

is very confusing. In some ways the comment makes this more confusing, since we validate the opposite. Clearer would be

+  // Must be smaller than scratch space size.
+  VALIDATE(Inst->k < 16);

or

+  VALIDATE(Inst->k < 16, "Larger than scratch space size");

alyssarosenzweig · 2024-09-02T20:29:21Z

(That goes for all the validation I think)

alyssarosenzweig · 2024-09-02T20:44:14Z

Source/Tools/LinuxEmulation/LinuxSyscalls/Seccomp/Dumper.cpp

+    case BPF_LDX: Parse_Class_LD(i, Inst); break;
+    case BPF_ST:
+    case BPF_STX: Parse_Class_ST(i, Inst); break;
+    case BPF_ALU: break;


Why are we dumping load/store/jumps but not alu?

True, I didn't finish that off. Added the two handlers.
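
For reference, a sketch of what dumping the ALU class could look like. This is not the handler that was added in the PR: `DumpALU` and the output syntax are hypothetical, while the opcode encoding follows the `BPF_*` macros from `<linux/filter.h>`.

```cpp
#include <cstdint>
#include <string>

// Local mirror of the kernel's `struct sock_filter` layout.
struct Insn {
  uint16_t code;
  uint8_t jt;
  uint8_t jf;
  uint32_t k;
};

// Hypothetical dumper for BPF_ALU instructions.
// Encoding: class in bits 0-2 (ALU = 0x04), source in bit 3 (K = 0, X = 0x08),
// operation in bits 4-7 (ADD = 0x00 ... XOR = 0xa0).
std::string DumpALU(const Insn& I) {
  static const char* const Names[] = {
    "add", "sub", "mul", "div", "or", "and", "lsh", "rsh", "neg", "mod", "xor",
  };
  const uint16_t Op = (I.code & 0xf0) >> 4;
  if ((I.code & 0x07) != 0x04 || Op > 10) {
    return "<invalid>";
  }
  if (Op == 8) {
    // BPF_NEG takes no source operand.
    return "neg A";
  }
  const bool UsesX = (I.code & 0x08) != 0;
  return std::string(Names[Op]) + " A, " + (UsesX ? std::string("X") : "#" + std::to_string(I.k));
}
```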

alyssarosenzweig · 2024-09-02T21:00:36Z

All the non-JIT stuff seems reasonable, though I'm obviously less familiar

Sonicadvance1 · 2024-09-02T21:09:35Z

> > Added a VALIDATE define. No need for logging because these are non-fatal errors, they are ensuring that seccomp-bpf applications don't try to do something invalid.
>
> The problem is that
> +  // Larger than scratch space size.
> +  VALIDATE(Inst->k < 16);
> is very confusing. In some ways the comment makes this more confusing. since we validate the opposite. Clearer would be
> +  // Must be smaller than scratch space size.
> +  VALIDATE(Inst->k < 16);
> or
> +  VALIDATE(Inst->k < 16, "Larger than scratch space size");

Updated the comments around the VALIDATES

Sonicadvance1 · 2024-09-02T21:10:07Z

> Those switches should probably all have a default: EINVAL on them

default: RETURN_ERROR(-EINVAL); added to the switches that were missing it.

Sonicadvance1 force-pushed the seccomp branch 2 times, most recently from 0bc116c to 4c5c0fe · May 13, 2024 22:29

alyssarosenzweig reviewed May 15, 2024

Sonicadvance1 force-pushed the seccomp branch 12 times, most recently from 2828a1f to f0d6725 · May 16, 2024 23:11

Sonicadvance1 force-pushed the seccomp branch 2 times, most recently from 05bb35d to ee69ed6 · May 16, 2024 23:42

Sonicadvance1 force-pushed the seccomp branch 7 times, most recently from f217521 to aca37b9 · May 17, 2024 22:21

Sonicadvance1 force-pushed the seccomp branch from c2b69a1 to 79a0b16 · August 1, 2024 22:27

Sonicadvance1 force-pushed the seccomp branch 6 times, most recently from 40000d7 to 03b25d4 · August 21, 2024 23:20

Sonicadvance1 force-pushed the seccomp branch from 03b25d4 to 7e63d11 · August 23, 2024 23:17