Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LinuxEmulation: Implement support for seccomp #3628

Merged
merged 2 commits into from
Sep 3, 2024

Conversation

Sonicadvance1
Copy link
Member

@Sonicadvance1 Sonicadvance1 commented May 13, 2024

Seccomp is a relatively complex feature that was added to Linux back in
2005, and was further extended in 2013 to support BPF based protections.
Once seccomp is enabled, you can no longer disable seccomp but
additional protections can be placed on top of existing seccomp filters.
Additionally seccomp filters are inherited in child processes, which
ensures the process tree can't escape from the secure computing
environment through child processes.

The basis of this feature is a shim that lives between userspace and the
kernel at the syscall entrypoint.
In "strict" mode, seccomp only allows read, write, exit, exit_group, and {rt_,}sigreturn to function.
When in "filter" mode, a BPF filter is run on syscall entrypoint and
returns state about if the syscall should be allowed or not. Multiple
filters can be installed in this mode, all of which get executed. The
result that is the most restricted is the action that occurs at the end.

There are some significant limitations in filter mode that must be
adhered to which makes executing this code inside of kernel space a
non-issue and effectively limits how much cpu time is spent in the filters.
Although these filters are free to do basically anything with the
provided data, just can't do any loops.

FEX needs to implement seccomp because there are multiple applications
using the feature, the primary one being Chromium which some games embed
without disabling the sandbox. WINE also uses seccomp for capturing
games that do raw Windows system calls. Apparently Red Dead Redemption
is one of the games that requires this.

While FEX implements seccomp, it is not yet all encompassing, which is
one of the reasons why it isn't enabled by default and requires a config
option.

seccomp_unotify is not implemented
This is a relatively new feature for seccomp which lets the seccomp
filter signal an FD for multiple things. Luckily Chromium and WINE don't
use this. This will be tricky to implement under FEX since it
requires ioctl trapping and some other behaviour

ptrace isn't supported
One feature of seccomp is that it can raise ptrace events. Since FEX
doesn't support ptrace at all, this isn't handled. Again Chromium and
WINE don't use this.

kill-thread not quite correct
This isn't directly related to seccomp but more about how we do thread
shutdown in FEX. This will require some more changes around thread state
tracking before fully supporting this. Chromium and WINE don't use this.
kill-process also falls under this

Features that are supported:

  • Strict mode and seccomp-bpf mode supported
  • All BFP instructions that seccomp-bpf understands
  • Inheriting seccomp through execve
    • This means we serialize and deserialize the calling thread's
      seccomp filters
    • An execve that escapes FEX will also escape seccomp. Not much we
      can do about it
  • TSync - Allowing post-mortem seccomp insertion which allows threads to
    synchronize seccomp filters after the fact

Features that are not supported:

  • Different arch qualifiers depending on syscall entrypoint
    • Just like our syscall handler, we are hardcoded to the arch that the
      application starts with
  • user_notif
  • ptrace
  • Runtime code cache invalidation when seccomp is installed
    • Currently we must ensure all syscalls go through the frontend
      syscall handler
    • Runtime invalidation of code cache with inline syscalls will get
      fixed in the future.

This currently isn't enabled by default because of the minor feature
problems that haven't been resolved. Currently the Linux Kernel's test
application works for the features that FEX supports, and WINE's usage
can be handled by FEX. Chromium's sandbox doesn't yet work with this PR,
but it only fails due to features unrelated to seccomp.

Having this open for merging now so we can work to resolve the remaining
issues without this bitrotting.

@Sonicadvance1 Sonicadvance1 force-pushed the seccomp branch 2 times, most recently from 0bc116c to 4c5c0fe Compare May 13, 2024 22:29
@alyssarosenzweig
Copy link
Collaborator

alyssarosenzweig commented May 15, 2024

Is 64-bit bpf interesting for FEX? Seems potentially better to do now than later given how much churn it would be. (Unless it's never going to be supported for some good reason)

@Sonicadvance1
Copy link
Member Author

Is 64-bit bpf interesting for FEX? Seems potentially better to do now than later given how much churn it would be. (Unless it's never going to be supported for some good reason)

I haven't seen any usage of BPF outside of seccomp currently. Not even sure if the bpf syscall is even typically accessible outside of root. Also haven't checked if we can just forward the syscall, I don't know if they expose an architecture or anything inside the VM. So it's just kind of eeh.

@alyssarosenzweig
Copy link
Collaborator

Also haven't checked if we can just forward the syscall

Didn't we talk about this in the meeting last week? I thought there was some reason FEX had to do these shenanigans instead of simply forwarding like @neobrain hoped...? I'd rather not merge 1kloc of new JIT code if we could just... not 😅

@Sonicadvance1
Copy link
Member Author

Also haven't checked if we can just forward the syscall

Didn't we talk about this in the meeting last week? I thought there was some reason FEX had to do these shenanigans instead of simply forwarding like @neobrain hoped...? I'd rather not merge 1kloc of new JIT code if we could just... not 😅

Sorry, "bpf" the syscall is different from seccomp. Seccomp uses a highly restricted form of BPF which exposes an architecture constant which means we can't forward it.
"bpf" might not have that same architecture specific data and /might/ be able to get forwarded, unsure and needs investigation. Ideally that syscall is still restricted in userspace and can continue to get ignored.

@Sonicadvance1 Sonicadvance1 force-pushed the seccomp branch 12 times, most recently from 2828a1f to f0d6725 Compare May 16, 2024 23:11
@Sonicadvance1
Copy link
Member Author

jetson-orin-1 is still on Clang 12.x which doesn't support the C++20 feature of lambda expressions in unevaluated operands. Clang 13.x added support for this. Time to raise the minspec to Clang-13.

@Sonicadvance1 Sonicadvance1 force-pushed the seccomp branch 2 times, most recently from 05bb35d to ee69ed6 Compare May 16, 2024 23:42
@neobrain
Copy link
Member

Sorry, "bpf" the syscall is different from seccomp. Seccomp uses a highly restricted form of BPF which exposes an architecture constant which means we can't forward it.
"bpf" might not have that same architecture specific data and /might/ be able to get forwarded, unsure and needs investigation. Ideally that syscall is still restricted in userspace and can continue to get ignored.

I'm also interested in picking up this conversation again. Currently the PR description is very sparse on details, but I hope some background on this change can be provided once the WIP label is dropped to enable a sensible review.

@Sonicadvance1 Sonicadvance1 force-pushed the seccomp branch 7 times, most recently from f217521 to aca37b9 Compare May 17, 2024 22:21
@Sonicadvance1
Copy link
Member Author

Poking around at Chrome's sandbox with this PR

  • Seems to work successfully
  • Chrome hits two unrelated bugs when running the sandbox that are unrelated to this PR
    • CLONE_NEWNET is currently unsupported in FEX, as it loses connection to FEXServer. I have a WIP branch locally to support this
    • FEX hits some EDEADLK futex locks for some reason, potentially due to forking Chrome does. With some massaging of our various mutexes, I was able to still get Chrome rendering with the sandbox enabled.

template<bool CalculateSize>
uint64_t BPFEmitter::HandleLoad(uint32_t BPFIP, const sock_filter* Inst) {
[[maybe_unused]] size_t OpSize {};
switch (BPF_SIZE(Inst->code)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch would be probably better as:

if (BPF_SIZE() != W) {
   RET_ERR(EINVAL);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used the VALIDATE define

@alyssarosenzweig
Copy link
Collaborator

Could we get a macro:

#define VALIDATE(cond) if (!(cond)) RETURN_ERROR(-EINVAL)

Then porting the file to this would let us write straightline code instead which should be a lot shorter and imho simpler. e.g.:

+  case BPF_JA: {
+    // Only BPF_K supported on JA.
+    VALIDATE((SrcType != BPF_X);
+
+    // BPF IP register is effectively only 32-bit. Treat k constant like a signed integer.
+    // This allows it to jump anywhere in the program.
+    // But! Loops are EXPLICITLY disallowed inside of BPF programs.
+    // This is to prevent DOS style attacks through BPF programs.
+    uint64_t Target = BPFIP + Inst->k + 1;
+
+    // Can't jump past the end.
+    VALIDATE (Target < NumInst);
+
+    fextl::unordered_map<uint32_t, ARMEmitter::ForwardLabel>::iterator TargetLabel {};
+
+    if constexpr (!CalculateSize) {
+      TargetLabel = JumpLabels.try_emplace(Target, ARMEmitter::ForwardLabel {}).first;
+    }
+
+    EMIT_INST(b(&TargetLabel->second));
+    break;
+  }

Could even take a string as a second arg to match our logman asserts.

Seccomp emulator uses lambda expressions in an unevaluated operand,
which was only added in clang-13
@Sonicadvance1
Copy link
Member Author

Could we get a macro:

#define VALIDATE(cond) if (!(cond)) RETURN_ERROR(-EINVAL)

Then porting the file to this would let us write straightline code instead which should be a lot shorter and imho simpler. e.g.:

+  case BPF_JA: {
+    // Only BPF_K supported on JA.
+    VALIDATE((SrcType != BPF_X);
+
+    // BPF IP register is effectively only 32-bit. Treat k constant like a signed integer.
+    // This allows it to jump anywhere in the program.
+    // But! Loops are EXPLICITLY disallowed inside of BPF programs.
+    // This is to prevent DOS style attacks through BPF programs.
+    uint64_t Target = BPFIP + Inst->k + 1;
+
+    // Can't jump past the end.
+    VALIDATE (Target < NumInst);
+
+    fextl::unordered_map<uint32_t, ARMEmitter::ForwardLabel>::iterator TargetLabel {};
+
+    if constexpr (!CalculateSize) {
+      TargetLabel = JumpLabels.try_emplace(Target, ARMEmitter::ForwardLabel {}).first;
+    }
+
+    EMIT_INST(b(&TargetLabel->second));
+    break;
+  }

Could even take a string as a second arg to match our logman asserts.

Added a VALIDATE define. No need for logging because these are non-fatal errors, they are ensuring that seccomp-bpf applications don't try to do something invalid.

@alyssarosenzweig
Copy link
Collaborator

Those switches should probably all have a default: EINVAL on them

@alyssarosenzweig
Copy link
Collaborator

Added a VALIDATE define. No need for logging because these are non-fatal errors, they are ensuring that seccomp-bpf applications don't try to do something invalid.

The problem is that

+  // Larger than scratch space size.
+  VALIDATE(Inst->k < 16);

is very confusing. In some ways the comment makes this more confusing. since we validate the opposite. Clearer would be

+  // Must be smaller than scratch space size.
+  VALIDATE(Inst->k < 16);

or

+  VALIDATE(Inst->k < 16, "Larger than scratch space size");

@alyssarosenzweig
Copy link
Collaborator

(That goes for all the validation I think)

case BPF_LDX: Parse_Class_LD(i, Inst); break;
case BPF_ST:
case BPF_STX: Parse_Class_ST(i, Inst); break;
case BPF_ALU: break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we dumping load/store/jumps but not alu?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I didn't finish that off. Added the two handlers.

@alyssarosenzweig
Copy link
Collaborator

All the non-JIT stuff seems reasonable, though I'm obviously less familiar

Seccomp is a relatively complex feature that was added to Linux back in
2005, and was further extended in 2013 to support BPF based protections.
Once seccomp is enabled, you can no longer disable seccomp but
additional protections can be placed on top of existing seccomp filters.
Additionally seccomp filters are inherited in child processes, which
ensures the process tree can't escape from the secure computing
environment through child processes.

The basis of this feature is a shim that lives between userspace and the
kernel at the syscall entrypoint.
In "strict" mode, seccomp only allows read, write, exit, exit_group, and {rt_,}sigreturn to function.
When in "filter" mode, a BPF filter is run on syscall entrypoint and
returns state about if the syscall should be allowed or not. Multiple
filters can be installed in this mode, all of which get executed. The
result that is the most restricted is the action that occurs at the end.

There are some significant limitations in filter mode that must be
adhered to which makes executing this code inside of kernel space a
non-issue and effectively limits how much cpu time is spent in the filters.
Although these filters are free to do basically anything with the
provided data, just can't do any loops.

FEX needs to implement seccomp because there are multiple applications
using the feature, the primary one being Chromium which some games embed
without disabling the sandbox. WINE also uses seccomp for capturing
games that do raw Windows system calls. Apparently Red Dead Redemption
is one of the games that requires this.

While FEX implements seccomp, it is not yet all encompassing, which is
one of the reasons why it isn't enabled by default and requires a config
option.

**seccomp_unotify is not implemented**
This is a relatively new feature for seccomp which lets the seccomp
filter signal an FD for multiple things. Luckily Chromium and WINE don't
use this. This will be tricky to implement under FEX since it
requires ioctl trapping and some other behaviour

**ptrace isn't supported**
One feature of seccomp is that it can raise ptrace events. Since FEX
doesn't support ptrace at all, this isn't handled. Again Chromium and
WINE don't use this.

**kill-thread not quite correct**
This isn't directly related to seccomp but more about how we do thread
shutdown in FEX. This will require some more changes around thread state
tracking before fully supporting this. Chromium and WINE don't use this.
kill-process also falls under this

Features that are supported:
- Strict mode and seccomp-bpf mode supported
- All BFP instructions that seccomp-bpf understands
- Inheriting seccomp through execve
   - This means we serialize and deserialize the calling thread's
     seccomp filters
   - An execve that escapes FEX will also escape seccomp. Not much we
     can do about it
- TSync - Allowing post-mortem seccomp insertion which allows threads to
  synchronize seccomp filters after the fact

Features that are not supported:
- Different arch qualifiers depending on syscall entrypoint
  - Just like our syscall handler, we are hardcoded to the arch that the
    application starts with
- user_notif
- ptrace
- Runtime code cache invalidation when seccomp is installed
  - Currently we must ensure all syscalls go through the frontend
    syscall handler
  - Runtime invalidation of code cache with inline syscalls will get
    fixed in the future.

This currently isn't enabled by default because of the minor feature
problems that haven't been resolved. Currently the Linux Kernel's test
application works for the features that FEX supports, and WINE's usage
can be handled by FEX. Chromium's sandbox doesn't yet work with this PR,
but it only fails due to features unrelated to seccomp.

Having this open for merging now so we can work to resolve the remaining
issues without this bitrotting.
@Sonicadvance1
Copy link
Member Author

Added a VALIDATE define. No need for logging because these are non-fatal errors, they are ensuring that seccomp-bpf applications don't try to do something invalid.

The problem is that

+  // Larger than scratch space size.
+  VALIDATE(Inst->k < 16);

is very confusing. In some ways the comment makes this more confusing. since we validate the opposite. Clearer would be

+  // Must be smaller than scratch space size.
+  VALIDATE(Inst->k < 16);

or

+  VALIDATE(Inst->k < 16, "Larger than scratch space size");

Updated the comments around the VALIDATES

@Sonicadvance1
Copy link
Member Author

Those switches should probably all have a default: EINVAL on them

default:RETURN_ERROR(-EINVAL); added to the switches that missed them.

@alyssarosenzweig alyssarosenzweig merged commit b368223 into FEX-Emu:main Sep 3, 2024
12 checks passed
@Sonicadvance1 Sonicadvance1 deleted the seccomp branch September 3, 2024 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants