LinuxEmulation: Implement support for seccomp
Seccomp is a relatively complex feature that was added to Linux back in
2005, and was further extended in 2012 (Linux 3.5) to support BPF based protections.
Once seccomp is enabled it can no longer be disabled, but
additional filters can be stacked on top of existing seccomp filters.
Additionally seccomp filters are inherited in child processes, which
ensures the process tree can't escape from the secure computing
environment through child processes.

The basis of this feature is a shim that lives between userspace and the
kernel at the syscall entrypoint.
In "strict" mode, seccomp only allows read, write, exit, and {rt_,}sigreturn to function; exit_group is not on the kernel's allow list.
When in "filter" mode, a BPF filter is run at the syscall entrypoint and
returns whether the syscall should be allowed or not. Multiple
filters can be installed in this mode, all of which get executed. The
most restrictive result among them is the action that is taken.
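
To illustrate how the "most restrictive result wins" rule can be evaluated, here is a minimal sketch. The constant values mirror `<linux/seccomp.h>` but are redefined locally so the block is self-contained; `ActionOnly` and `CombineFilterResults` are hypothetical names for illustration, not FEX functions. The sketch relies on the kernel's convention that treating the action bits as a signed value makes a more restrictive action numerically smaller.

```cpp
#include <cstdint>
#include <initializer_list>

// Values mirror <linux/seccomp.h>; redefined here so the sketch stands alone.
constexpr uint32_t RET_KILL_PROCESS = 0x80000000U;
constexpr uint32_t RET_KILL_THREAD  = 0x00000000U;
constexpr uint32_t RET_TRAP         = 0x00030000U;
constexpr uint32_t RET_ERRNO        = 0x00050000U;
constexpr uint32_t RET_ALLOW        = 0x7fff0000U;
constexpr uint32_t RET_ACTION_FULL  = 0xffff0000U;

// Keep only the action bits and view them as signed:
// KILL_PROCESS becomes negative, ALLOW is the largest positive action.
constexpr int32_t ActionOnly(uint32_t Ret) {
  return static_cast<int32_t>(Ret & RET_ACTION_FULL);
}

// Hypothetical helper: every filter's return value is considered and the
// one with the smallest signed action (the most restrictive) is kept.
inline uint32_t CombineFilterResults(std::initializer_list<uint32_t> Results) {
  uint32_t Final = RET_ALLOW;
  for (uint32_t Cur : Results) {
    if (ActionOnly(Cur) < ActionOnly(Final)) {
      Final = Cur;
    }
  }
  return Final;
}
```

With this ordering, a single filter returning an errno overrides any number of filters returning allow, and a kill-process result overrides everything else.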

There are some significant limitations in filter mode that must be
adhered to. These make executing this code inside of kernel space a
non-issue and effectively bound how much CPU time is spent in the filters.
The filters are free to do basically anything with the
provided data, they just can't do any loops.
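
The no-loops property falls out of the classic BPF encoding: jump offsets are unsigned, so control can only move forward and a program of N instructions finishes in at most N steps. Below is a minimal interpreter sketch covering only four opcodes. The struct layouts and opcode values mirror `struct sock_filter`, `struct seccomp_data`, and `<linux/bpf_common.h>`, but are redefined locally; `RunFilter` and `ExampleFilter` are hypothetical names and this is not FEX's emitter.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Local mirrors of the kernel's struct sock_filter and struct seccomp_data.
struct Insn { uint16_t code; uint8_t jt; uint8_t jf; uint32_t k; };
struct SeccompData { int32_t nr; uint32_t arch; uint64_t instruction_pointer; uint64_t args[6]; };

constexpr uint16_t OP_LD_W_ABS  = 0x20; // BPF_LD | BPF_W | BPF_ABS
constexpr uint16_t OP_JMP_JA    = 0x05; // BPF_JMP | BPF_JA
constexpr uint16_t OP_JMP_JEQ_K = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
constexpr uint16_t OP_RET_K     = 0x06; // BPF_RET | BPF_K

constexpr uint32_t RET_KILL_THREAD = 0x00000000U;
constexpr uint32_t RET_ERRNO       = 0x00050000U;
constexpr uint32_t RET_ALLOW       = 0x7fff0000U;
constexpr uint32_t ARCH_X86_64     = 0xC000003EU; // AUDIT_ARCH_X86_64

// Runs a filter over one syscall description. PC strictly increases on every
// step, so the loop executes at most Prog.size() iterations.
inline uint32_t RunFilter(const std::vector<Insn>& Prog, const SeccompData& Data) {
  uint32_t A = 0; // accumulator
  for (size_t PC = 0; PC < Prog.size();) {
    const Insn& I = Prog[PC];
    switch (I.code) {
    case OP_LD_W_ABS:
      // Only aligned 32-bit loads inside the data block are accepted.
      if (I.k % 4 != 0 || I.k > sizeof(Data) - 4) return RET_KILL_THREAD;
      std::memcpy(&A, reinterpret_cast<const char*>(&Data) + I.k, 4);
      PC += 1;
      break;
    case OP_JMP_JA:
      if (I.k >= Prog.size()) return RET_KILL_THREAD; // jump out of bounds
      PC += 1 + static_cast<size_t>(I.k);
      break;
    case OP_JMP_JEQ_K:
      PC += 1 + static_cast<size_t>(A == I.k ? I.jt : I.jf);
      break;
    case OP_RET_K:
      return I.k;
    default:
      return RET_KILL_THREAD; // opcode not handled by this sketch
    }
  }
  return RET_KILL_THREAD; // fell off the end without a RET
}

// Example filter: kill on a foreign arch, allow syscall 1, EPERM otherwise.
inline std::vector<Insn> ExampleFilter() {
  return {
    {OP_LD_W_ABS, 0, 0, static_cast<uint32_t>(offsetof(SeccompData, arch))},
    {OP_JMP_JEQ_K, 0, 4, ARCH_X86_64},  // mismatch skips to the kill at index 6
    {OP_LD_W_ABS, 0, 0, static_cast<uint32_t>(offsetof(SeccompData, nr))},
    {OP_JMP_JEQ_K, 0, 1, 1},            // nr == 1 falls through to allow
    {OP_RET_K, 0, 0, RET_ALLOW},
    {OP_RET_K, 0, 0, RET_ERRNO | 1},    // errno 1 is EPERM
    {OP_RET_K, 0, 0, RET_KILL_THREAD},
  };
}
```

A real validator would reject out-of-bounds jumps and programs that do not end in a return at install time rather than at run time; the sketch folds those checks into execution to stay short.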

FEX needs to implement seccomp because there are multiple applications
using the feature, the primary one being Chromium which some games embed
without disabling the sandbox. WINE also uses seccomp for capturing
games that do raw Windows system calls. Apparently Red Dead Redemption
is one of the games that requires this.

While FEX implements seccomp, it is not yet all encompassing, which is
one of the reasons why it isn't enabled by default and requires a config
option.

**seccomp_unotify is not implemented**
This is a relatively new feature for seccomp which lets a filter hand a
syscall off to a supervisor through a notification FD. Luckily Chromium and
WINE don't use this. This will be tricky to implement under FEX since it
requires ioctl trapping and some other behaviour.

**ptrace isn't supported**
One feature of seccomp is that it can raise ptrace events. Since FEX
doesn't support ptrace at all, this isn't handled. Again Chromium and
WINE don't use this.

**kill-thread not quite correct**
This isn't directly related to seccomp but more about how we do thread
shutdown in FEX. This will require some more changes around thread state
tracking before it can be fully supported. Chromium and WINE don't use this.
kill-process also falls under this.

Features that are supported:
- Strict mode and seccomp-bpf mode supported
- All BPF instructions that seccomp-bpf understands
- Inheriting seccomp through execve
   - This means we serialize and deserialize the calling thread's
     seccomp filters
   - An execve that escapes FEX will also escape seccomp. Not much we
     can do about it
- TSync - Allowing post-mortem seccomp insertion which allows threads to
  synchronize seccomp filters after the fact

Features that are not supported:
- Different arch qualifiers depending on syscall entrypoint
  - Just like our syscall handler, we are hardcoded to the arch that the
    application starts with
- user_notif
- ptrace
- Runtime code cache invalidation when seccomp is installed
  - Currently we must ensure all syscalls go through the frontend
    syscall handler
  - Runtime invalidation of code cache with inline syscalls will get
    fixed in the future.
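
To picture the arch limitation in the list above: the `arch` value a filter sees is picked once, from how the application started, and then reused for every syscall regardless of entrypoint. The sketch below is hypothetical and is not FEX's code; `SeccompDataBuilder` is an invented name, and the struct and constants mirror `struct seccomp_data` and the `AUDIT_ARCH_*` values from the kernel headers.

```cpp
#include <cstdint>

constexpr uint32_t ARCH_X86_64 = 0xC000003EU; // AUDIT_ARCH_X86_64
constexpr uint32_t ARCH_I386   = 0x40000003U; // AUDIT_ARCH_I386

// Local mirror of the kernel's struct seccomp_data.
struct SeccompData {
  int32_t nr;
  uint32_t arch;
  uint64_t instruction_pointer;
  uint64_t args[6];
};

// Hypothetical builder: the arch is fixed at construction time, so a 32-bit
// style entrypoint used by a 64-bit application would still be reported as
// x86-64 to the filters.
class SeccompDataBuilder {
public:
  explicit SeccompDataBuilder(bool Is64BitApp)
    : Arch {Is64BitApp ? ARCH_X86_64 : ARCH_I386} {}

  SeccompData Build(int32_t Nr, uint64_t IP, const uint64_t (&Args)[6]) const {
    SeccompData Data {};
    Data.nr = Nr;
    Data.arch = Arch; // never re-derived from the syscall entrypoint
    Data.instruction_pointer = IP;
    for (int i = 0; i < 6; ++i) {
      Data.args[i] = Args[i];
    }
    return Data;
  }

private:
  const uint32_t Arch;
};
```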

This currently isn't enabled by default because of the minor feature
problems that haven't been resolved. Currently the Linux Kernel's test
application works for the features that FEX supports, and WINE's usage
can be handled by FEX. Chromium's sandbox doesn't yet work with this PR,
but it only fails due to features unrelated to seccomp.

Having this open for merging now so we can work to resolve the remaining
issues without this bitrotting.
Sonicadvance1 committed Sep 2, 2024
1 parent fc677ea commit ac32876
Showing 18 changed files with 1,642 additions and 26 deletions.
7 changes: 7 additions & 0 deletions FEXCore/Source/Interface/Config/Config.json.in
@@ -517,6 +517,13 @@
"Desc": [
"Override for a FEXServer socket path. Only useful for chroots."
]
},
"NeedsSeccomp": {
"Type": "bool",
"Default": "false",
"Desc": [
"Disables inline syscalls in order to support seccomp handling"
]
}
}
},
3 changes: 3 additions & 0 deletions Source/Tools/FEXLoader/FEXLoader.cpp
@@ -271,6 +271,7 @@ int main(int argc, char** argv, char** const envp) {

ExecutedWithFD = getauxval(AT_EXECFD) != 0;
int FEXFD {StealFEXFDFromEnv("FEX_EXECVEFD")};
int FEXSeccompFD {StealFEXFDFromEnv("FEX_SECCOMPFD")};

LogMan::Throw::InstallHandler(AssertHandler);
LogMan::Msg::InstallHandler(MsgHandler);
@@ -517,6 +518,8 @@ int main(int argc, char** argv, char** const envp) {
CTX->AppendThunkDefinitions(FEX::VDSO::GetVDSOThunkDefinitions());
SignalDelegation->SetVDSOSigReturn();

SyscallHandler->DeserializeSeccompFD(ParentThread, FEXSeccompFD);

FEXCore::Context::ExitReason ShutdownReason = FEXCore::Context::ExitReason::EXIT_SHUTDOWN;

// There might already be an exit handler, leave it installed
3 changes: 3 additions & 0 deletions Source/Tools/LinuxEmulation/CMakeLists.txt
@@ -8,6 +8,9 @@ set (SRCS
LinuxSyscalls/FileManagement.cpp
LinuxSyscalls/LinuxAllocator.cpp
LinuxSyscalls/NetStream.cpp
LinuxSyscalls/Seccomp/SeccompEmulator.cpp
LinuxSyscalls/Seccomp/BPFEmitter.cpp
LinuxSyscalls/Seccomp/Dumper.cpp
LinuxSyscalls/SignalDelegator.cpp
LinuxSyscalls/Syscalls.cpp
LinuxSyscalls/SyscallsSMCTracking.cpp