-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LinuxEmulation: Implement support for seccomp #3628
Conversation
0bc116c
to
4c5c0fe
Compare
Is 64-bit bpf interesting for FEX? Seems potentially better to do now than later given how much churn it would be. (Unless it's never going to be supported for some good reason) |
I haven't seen any usage of BPF outside of seccomp currently. Not even sure if the bpf syscall is even typically accessible outside of root. Also haven't checked if we can just forward the syscall, I don't know if they expose an architecture or anything inside the VM. So it's just kind of eeh. |
Didn't we talk about this in the meeting last week? I thought there was some reason FEX had to do these shenanigans instead of simply forwarding like @neobrain hoped...? I'd rather not merge 1kloc of new JIT code if we could just... not 😅 |
Sorry, "bpf" the syscall is different from seccomp. Seccomp uses a highly restricted form of BPF which exposes an architecture constant which means we can't forward it. |
2828a1f
to
f0d6725
Compare
jetson-orin-1 is still on Clang 12.x which doesn't support the C++20 feature of lambda expressions in unevaluated operands. Clang 13.x added support for this. Time to raise the minspec to Clang-13. |
05bb35d
to
ee69ed6
Compare
I'm also interested in picking up this conversation again. Currently the PR description is very sparse on details, but I hope some background on this change can be provided once the WIP label is dropped to enable a sensible review. |
f217521
to
aca37b9
Compare
Poking around at Chrome's sandbox with this PR
|
40000d7
to
03b25d4
Compare
template<bool CalculateSize> | ||
uint64_t BPFEmitter::HandleLoad(uint32_t BPFIP, const sock_filter* Inst) { | ||
[[maybe_unused]] size_t OpSize {}; | ||
switch (BPF_SIZE(Inst->code)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch would be probably better as:
if (BPF_SIZE() != W) {
RET_ERR(EINVAL);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used the VALIDATE
define
Could we get a macro:
Then porting the file to this would let us write straightline code instead which should be a lot shorter and imho simpler. e.g.:
Could even take a string as a second arg to match our logman asserts. |
Seccomp emulator uses lambda expressions in an unevaluated operand, which was only added in clang-13
7e63d11
to
5d05efd
Compare
Added a |
5d05efd
to
a5e908e
Compare
Those switches should probably all have a default: EINVAL on them |
The problem is that
is very confusing. In some ways the comment makes this more confusing. since we validate the opposite. Clearer would be
or
|
(That goes for all the validation I think) |
Source/Tools/LinuxEmulation/LinuxSyscalls/Seccomp/BPFEmitter.cpp
Outdated
Show resolved
Hide resolved
Source/Tools/LinuxEmulation/LinuxSyscalls/Seccomp/BPFEmitter.cpp
Outdated
Show resolved
Hide resolved
case BPF_LDX: Parse_Class_LD(i, Inst); break; | ||
case BPF_ST: | ||
case BPF_STX: Parse_Class_ST(i, Inst); break; | ||
case BPF_ALU: break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we dumping load/store/jumps but not alu?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I didn't finish that off. Added the two handlers.
All the non-JIT stuff seems reasonable, though I'm obviously less familiar |
Seccomp is a relatively complex feature that was added to Linux back in 2005, and was further extended in 2013 to support BPF based protections. Once seccomp is enabled, you can no longer disable seccomp but additional protections can be placed on top of existing seccomp filters. Additionally seccomp filters are inherited in child processes, which ensures the process tree can't escape from the secure computing environment through child processes. The basis of this feature is a shim that lives between userspace and the kernel at the syscall entrypoint. In "strict" mode, seccomp only allows read, write, exit, exit_group, and {rt_,}sigreturn to function. When in "filter" mode, a BPF filter is run on syscall entrypoint and returns state about if the syscall should be allowed or not. Multiple filters can be installed in this mode, all of which get executed. The result that is the most restricted is the action that occurs at the end. There are some significant limitations in filter mode that must be adhered to which makes executing this code inside of kernel space a non-issue and effectively limits how much cpu time is spent in the filters. Although these filters are free to do basically anything with the provided data, just can't do any loops. FEX needs to implement seccomp because there are multiple applications using the feature, the primary one being Chromium which some games embed without disabling the sandbox. WINE also uses seccomp for capturing games that do raw Windows system calls. Apparently Red Dead Redemption is one of the games that requires this. While FEX implements seccomp, it is not yet all encompassing, which is one of the reasons why it isn't enabled by default and requires a config option. **seccomp_unotify is not implemented** This is a relatively new feature for seccomp which lets the seccomp filter signal an FD for multiple things. Luckily Chromium and WINE don't use this. This will be tricky to implement under FEX since it requires ioctl trapping and some other behaviour **ptrace isn't supported** One feature of seccomp is that it can raise ptrace events. Since FEX doesn't support ptrace at all, this isn't handled. Again Chromium and WINE don't use this. **kill-thread not quite correct** This isn't directly related to seccomp but more about how we do thread shutdown in FEX. This will require some more changes around thread state tracking before fully supporting this. Chromium and WINE don't use this. kill-process also falls under this Features that are supported: - Strict mode and seccomp-bpf mode supported - All BFP instructions that seccomp-bpf understands - Inheriting seccomp through execve - This means we serialize and deserialize the calling thread's seccomp filters - An execve that escapes FEX will also escape seccomp. Not much we can do about it - TSync - Allowing post-mortem seccomp insertion which allows threads to synchronize seccomp filters after the fact Features that are not supported: - Different arch qualifiers depending on syscall entrypoint - Just like our syscall handler, we are hardcoded to the arch that the application starts with - user_notif - ptrace - Runtime code cache invalidation when seccomp is installed - Currently we must ensure all syscalls go through the frontend syscall handler - Runtime invalidation of code cache with inline syscalls will get fixed in the future. This currently isn't enabled by default because of the minor feature problems that haven't been resolved. Currently the Linux Kernel's test application works for the features that FEX supports, and WINE's usage can be handled by FEX. Chromium's sandbox doesn't yet work with this PR, but it only fails due to features unrelated to seccomp. Having this open for merging now so we can work to resolve the remaining issues without this bitrotting.
a5e908e
to
ac32876
Compare
Updated the comments around the VALIDATES |
|
Seccomp is a relatively complex feature that was added to Linux back in
2005, and was further extended in 2013 to support BPF based protections.
Once seccomp is enabled, you can no longer disable seccomp but
additional protections can be placed on top of existing seccomp filters.
Additionally seccomp filters are inherited in child processes, which
ensures the process tree can't escape from the secure computing
environment through child processes.
The basis of this feature is a shim that lives between userspace and the
kernel at the syscall entrypoint.
In "strict" mode, seccomp only allows read, write, exit, exit_group, and {rt_,}sigreturn to function.
When in "filter" mode, a BPF filter is run on syscall entrypoint and
returns state about if the syscall should be allowed or not. Multiple
filters can be installed in this mode, all of which get executed. The
result that is the most restricted is the action that occurs at the end.
There are some significant limitations in filter mode that must be
adhered to which makes executing this code inside of kernel space a
non-issue and effectively limits how much cpu time is spent in the filters.
Although these filters are free to do basically anything with the
provided data, just can't do any loops.
FEX needs to implement seccomp because there are multiple applications
using the feature, the primary one being Chromium which some games embed
without disabling the sandbox. WINE also uses seccomp for capturing
games that do raw Windows system calls. Apparently Red Dead Redemption
is one of the games that requires this.
While FEX implements seccomp, it is not yet all encompassing, which is
one of the reasons why it isn't enabled by default and requires a config
option.
seccomp_unotify is not implemented
This is a relatively new feature for seccomp which lets the seccomp
filter signal an FD for multiple things. Luckily Chromium and WINE don't
use this. This will be tricky to implement under FEX since it
requires ioctl trapping and some other behaviour
ptrace isn't supported
One feature of seccomp is that it can raise ptrace events. Since FEX
doesn't support ptrace at all, this isn't handled. Again Chromium and
WINE don't use this.
kill-thread not quite correct
This isn't directly related to seccomp but more about how we do thread
shutdown in FEX. This will require some more changes around thread state
tracking before fully supporting this. Chromium and WINE don't use this.
kill-process also falls under this
Features that are supported:
seccomp filters
can do about it
synchronize seccomp filters after the fact
Features that are not supported:
application starts with
syscall handler
fixed in the future.
This currently isn't enabled by default because of the minor feature
problems that haven't been resolved. Currently the Linux Kernel's test
application works for the features that FEX supports, and WINE's usage
can be handled by FEX. Chromium's sandbox doesn't yet work with this PR,
but it only fails due to features unrelated to seccomp.
Having this open for merging now so we can work to resolve the remaining
issues without this bitrotting.