LinuxEmulation: Implement support for seccomp #3628

Merged 2 commits on Sep 3, 2024

Commits on Sep 2, 2024

  1. CMake: Update minimum clang version to 13

    Seccomp emulator uses lambda expressions in an unevaluated operand,
    which was only added in clang-13
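
    For readers unfamiliar with the language feature named above, here is a
    minimal sketch of a lambda written directly inside decltype(), which is
    an unevaluated operand. It is not taken from FEX: CountingDeleter,
    CountedInt, and ReleaseOnce are hypothetical names, and the #else branch
    only shows the named functor an older compiler would need instead.

```cpp
#include <cassert>
#include <memory>

// Counts deleter invocations so the effect is observable.
static int g_release_count = 0;

#if __cplusplus >= 202002L
// C++20: the lambda appears directly inside decltype(), an unevaluated operand.
using CountingDeleter = decltype([](int* p) {
  ++g_release_count;
  delete p;
});
#else
// Pre-C++20 fallback: a named functor has to be spelled out instead.
struct CountingDeleter {
  void operator()(int* p) const {
    ++g_release_count;
    delete p;
  }
};
#endif

using CountedInt = std::unique_ptr<int, CountingDeleter>;

// Allocates a value, lets it go out of scope, and returns how many
// times the deleter ran during this call.
int ReleaseOnce() {
  const int before = g_release_count;
  {
    CountedInt value(new int(42));
    assert(*value == 42);
  }
  return g_release_count - before;
}
```

    Calling ReleaseOnce() returns 1 on either branch; the C++20 form just
    avoids declaring a separate type for a one-off callable.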
    Sonicadvance1 committed Sep 2, 2024
    fc677ea View commit details
  2. LinuxEmulation: Implement support for seccomp

    Seccomp is a relatively complex feature that was added to Linux back in
    2005, and was extended in 2012 (Linux 3.5) to support BPF-based
    protections. Once seccomp is enabled it can no longer be disabled, but
    additional protections can be placed on top of existing seccomp filters.
    Additionally, seccomp filters are inherited by child processes, which
    ensures the process tree can't escape from the secure computing
    environment through child processes.
    
    The basis of this feature is a shim that lives between userspace and the
    kernel at the syscall entrypoint.
    In "strict" mode, seccomp only allows read, write, exit, and
    {rt_,}sigreturn to function; exit_group is not permitted.
    In "filter" mode, a BPF filter runs at syscall entry and returns
    whether the syscall should be allowed. Multiple filters can be
    installed in this mode, and all of them are executed. The most
    restrictive result determines the action that is taken.
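
    The "most restricted result wins" rule can be sketched as below. This
    is an illustration, not FEX's code: the constants repeat the
    SECCOMP_RET_* values from <linux/seccomp.h> so the block is
    self-contained, and ActionRank and MostRestrictive are hypothetical
    helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Values mirror SECCOMP_RET_* in <linux/seccomp.h>.
constexpr uint32_t RET_KILL_PROCESS = 0x80000000u;
constexpr uint32_t RET_KILL_THREAD  = 0x00000000u;
constexpr uint32_t RET_TRAP         = 0x00030000u;
constexpr uint32_t RET_ERRNO        = 0x00050000u;
constexpr uint32_t RET_ALLOW        = 0x7fff0000u;
constexpr uint32_t RET_ACTION_FULL  = 0xffff0000u;

// Treats the action bits as a signed value so that KILL_PROCESS
// (top bit set) ranks below every other action.
inline int32_t ActionRank(uint32_t result) {
  return static_cast<int32_t>(result & RET_ACTION_FULL);
}

// Returns the most restrictive result of all filters that ran.
// When two results share an action, the one seen first is kept.
uint32_t MostRestrictive(const std::vector<uint32_t>& results) {
  assert(!results.empty());
  uint32_t chosen = results.front();
  for (uint32_t r : results) {
    if (ActionRank(r) < ActionRank(chosen)) {
      chosen = r;
    }
  }
  return chosen;
}
```

    For example, if one filter returns RET_ALLOW and another returns
    RET_ERRNO with an errno value in the low 16 bits, the syscall fails
    with that errno.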
    
    Filter mode imposes some significant limitations, which make executing
    this code inside kernel space a non-issue and effectively bound how
    much CPU time is spent in the filters. The filters are free to do
    basically anything with the provided data, but they can't contain
    any loops.
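
    To show why the no-loops rule bounds execution time, here is a
    deliberately tiny classic-BPF interpreter. It is a sketch under stated
    assumptions, not FEX's emulator: it handles only three opcodes, the
    Insn and SyscallData layouts mirror struct sock_filter and struct
    seccomp_data, and RunFilter and DenyOneSyscall are hypothetical names.
    Malformed programs fail closed at run time here, whereas the kernel
    rejects them when the filter is installed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Insn { uint16_t code; uint8_t jt; uint8_t jf; uint32_t k; };
struct SyscallData { int32_t nr; uint32_t arch; uint64_t ip; uint64_t args[6]; };

constexpr uint16_t OP_LD_W_ABS = 0x20;  // BPF_LD  | BPF_W   | BPF_ABS
constexpr uint16_t OP_JEQ_K    = 0x15;  // BPF_JMP | BPF_JEQ | BPF_K
constexpr uint16_t OP_RET_K    = 0x06;  // BPF_RET | BPF_K

constexpr uint32_t RET_KILL_THREAD = 0x00000000u;
constexpr uint32_t RET_ERRNO       = 0x00050000u;
constexpr uint32_t RET_ALLOW       = 0x7fff0000u;

uint32_t RunFilter(const std::vector<Insn>& prog, const SyscallData& data) {
  uint32_t acc = 0;
  // pc only ever increases, so this runs at most prog.size() iterations.
  for (size_t pc = 0; pc < prog.size(); ++pc) {
    const Insn& in = prog[pc];
    switch (in.code) {
      case OP_LD_W_ABS:
        // Loads must be 4-byte aligned and stay inside the syscall data.
        if (in.k % 4 != 0 || in.k > sizeof(SyscallData) - 4) {
          return RET_KILL_THREAD;
        }
        std::memcpy(&acc, reinterpret_cast<const char*>(&data) + in.k, 4);
        break;
      case OP_JEQ_K:
        // Jump offsets are unsigned, so control can only move forward.
        pc += (acc == in.k) ? in.jt : in.jf;
        break;
      case OP_RET_K:
        return in.k;
      default:
        return RET_KILL_THREAD;  // Unknown opcode: fail closed.
    }
  }
  return RET_KILL_THREAD;  // Ran off the end without a return.
}

// Builds a filter that fails syscall `nr` with errno 1 (EPERM) and
// allows everything else.
std::vector<Insn> DenyOneSyscall(uint32_t nr) {
  return {
    Insn{OP_LD_W_ABS, 0, 0, 0},            // acc = data.nr (offset 0)
    Insn{OP_JEQ_K, 0, 1, nr},              // match: fall through, else skip one
    Insn{OP_RET_K, 0, 0, RET_ERRNO | 1u},
    Insn{OP_RET_K, 0, 0, RET_ALLOW},
  };
}
```

    Because every jump offset is an unsigned forward distance, a program of
    N instructions executes at most N steps, which is what keeps the time
    spent in filters bounded.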
    
    FEX needs to implement seccomp because there are multiple applications
    using the feature, the primary one being Chromium which some games embed
    without disabling the sandbox. WINE also uses seccomp for capturing
    games that do raw Windows system calls. Apparently Red Dead Redemption
    is one of the games that requires this.
    
    While FEX implements seccomp, it is not yet all encompassing, which is
    one of the reasons why it isn't enabled by default and requires a config
    option.
    
    **seccomp_unotify is not implemented**
    This is a relatively new seccomp feature that lets a filter hand
    syscall decisions to a userspace supervisor through a notification FD.
    Luckily Chromium and WINE don't use this. It will be tricky to
    implement under FEX since it requires ioctl trapping and some other
    behaviour.
    
    **ptrace isn't supported**
    One feature of seccomp is that it can raise ptrace events. Since FEX
    doesn't support ptrace at all, this isn't handled. Again Chromium and
    WINE don't use this.
    
    **kill-thread not quite correct**
    This isn't directly related to seccomp but is about how we do thread
    shutdown in FEX. Fully supporting it will require more changes around
    thread state tracking. Chromium and WINE don't use this.
    kill-process also falls under this.
    
    Features that are supported:
    - Strict mode and seccomp-bpf mode supported
    - All BPF instructions that seccomp-bpf understands
    - Inheriting seccomp through execve
       - This means we serialize and deserialize the calling thread's
         seccomp filters
       - An execve that escapes FEX will also escape seccomp. Not much we
         can do about it
    - TSync - Allows a filter to be installed after threads already exist
      and synchronized across all threads of the process
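
    The serialize/deserialize step for execve inheritance could look
    roughly like the sketch below. The wire layout (a filter count, then a
    length-prefixed run of 8-byte instructions per filter) is an assumption
    made for illustration; the commit message does not describe FEX's
    actual format or how the bytes reach the new process. Insn, Filter,
    Serialize, and Deserialize are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

struct Insn { uint16_t code; uint8_t jt; uint8_t jf; uint32_t k; };
static_assert(sizeof(Insn) == 8, "same size as struct sock_filter");
using Filter = std::vector<Insn>;

// Appends a 32-bit value in native byte order.
static void PutU32(std::vector<uint8_t>& out, uint32_t v) {
  const size_t at = out.size();
  out.resize(at + sizeof(v));
  std::memcpy(out.data() + at, &v, sizeof(v));
}

std::vector<uint8_t> Serialize(const std::vector<Filter>& filters) {
  std::vector<uint8_t> out;
  PutU32(out, static_cast<uint32_t>(filters.size()));
  for (const Filter& f : filters) {
    PutU32(out, static_cast<uint32_t>(f.size()));
    const size_t at = out.size();
    const size_t bytes = f.size() * sizeof(Insn);
    out.resize(at + bytes);
    if (bytes != 0) {
      std::memcpy(out.data() + at, f.data(), bytes);
    }
  }
  return out;
}

// Returns false if the buffer is truncated or has trailing bytes.
bool Deserialize(const std::vector<uint8_t>& in, std::vector<Filter>& filters) {
  size_t pos = 0;
  auto get_u32 = [&](uint32_t& v) {
    if (in.size() - pos < sizeof(v)) return false;
    std::memcpy(&v, in.data() + pos, sizeof(v));
    pos += sizeof(v);
    return true;
  };
  uint32_t count = 0;
  if (!get_u32(count)) return false;
  filters.clear();
  for (uint32_t i = 0; i < count; ++i) {
    uint32_t len = 0;
    if (!get_u32(len)) return false;
    const size_t bytes = static_cast<size_t>(len) * sizeof(Insn);
    // Check the remaining size before allocating anything.
    if (in.size() - pos < bytes) return false;
    Filter f(len);
    if (bytes != 0) {
      std::memcpy(f.data(), in.data() + pos, bytes);
    }
    pos += bytes;
    filters.push_back(std::move(f));
  }
  return pos == in.size();
}
```

    The order of filters is preserved by the round trip, which matters
    because every inherited filter still has to run in the new process.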
    
    Features that are not supported:
    - Different arch qualifiers depending on syscall entrypoint
      - Just like our syscall handler, we are hardcoded to the arch that the
        application starts with
    - user_notif
    - ptrace
    - Runtime code cache invalidation when seccomp is installed
      - Currently we must ensure all syscalls go through the frontend
        syscall handler
      - Runtime invalidation of code cache with inline syscalls will get
        fixed in the future.
    
    This currently isn't enabled by default because of the minor feature
    problems that haven't been resolved. Currently the Linux Kernel's test
    application works for the features that FEX supports, and WINE's usage
    can be handled by FEX. Chromium's sandbox doesn't yet work with this PR,
    but it only fails due to features unrelated to seccomp.
    
    Opening this for merging now so we can work to resolve the remaining
    issues without this bitrotting.
    Sonicadvance1 committed Sep 2, 2024
    ac32876 View commit details