Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable hardware stack zeroing. #111

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Enable hardware stack zeroing. #111

wants to merge 1 commit into from

Conversation

davidchisnall
Copy link
Collaborator

This replaces the software stack zeroing path with offload to a hardware engine, when available. The hardware zeroing pipeline is controlled by the ZTOP SCR (currently assumed to be number 27). This zeroes from the top of the capability in ZTOP to the bottom.

We assume that short-lived cross-compartment calls or returns from deeply nested calls will result in ranges where each is a subset of another, so we try to coalesce. We could probably do better aggregating overlapping but not subset ranges.

beq \top, \scratch, 2f
// If the current zeroing range is a subset of range that we're about to
// zero, don't bother waiting, just zero from the start.
ctestsubset \top, c\base, c\scratch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the only use of ctestsubset in the RTOS. I think at one point we were thinking of removing the instruction in hopes of simplifying the (micro)architecture?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do it with checks on the top and bottom, since we already have checks that CSP is sane (correct permissions, correctly aligned top and bottom) on entry into the compartment switcher. I'm not sure if we could do that without adding another temporary register, which would require some small tweaks to the users of this macro.

sdk/core/switcher/entry.S Outdated Show resolved Hide resolved
sdk/core/switcher/entry.S Show resolved Hide resolved
sdk/core/switcher/entry.S Outdated Show resolved Hide resolved
sdk/core/switcher/entry.S Outdated Show resolved Hide resolved
@nwf-msr
Copy link
Contributor

nwf-msr commented Sep 27, 2023

This runs the compartment-call-benchmark test, generating

#board     stack size      full call       call    return
ibex.json 0x100   186     127     59
ibex.json 0x200   186     127     59
ibex.json 0x400   188     129     59
ibex.json 0x800   188     129     59
ibex.json 0x1000  188     129     59

We'll see if it survives the full test harness!

@nwf-msr
Copy link
Contributor

nwf-msr commented Sep 27, 2023

It does survive the full test harness, huzzah! Reported cycle numbers for 452d404 + this PR, without and with the zeroizer support enabled:

Test Without With
Static sealing 449592 449425
Crash recovery 7438235 7438011
Compartment calls 1378192 1378099
Stack exhaustion in the switcher 2129788 2162198
Thread pool 867080 865635
Global constructors 85663 85522
Queue 6053459 6052851
Futex 2943148 2943848
Locks 5001834 5001183
Multiwaiter 4623924 4623760
Allocator 7858154 7688631
All tests 40007035 39868625

// Derive the capability to the range that should be zeroed.
// This is stored in base, leaving top as a second scratch register to use
csub \top, c\top, c\base
csetboundsexact c\base, c\base, \top
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per discussions out of band, this should probably be

Suggested change
csetboundsexact c\base, c\base, \top
csetbounds c\base, c\base, \top
cincoffset c\base, c\base, \top
  • We need to set the address to the top of the region to be zeroed, since the state machine counts down.
  • Using inexact bounds allows us to tolerate large (> 2KiB) stacks without risk, and...
    • Because the stack is guaranteed to be a superset representable region, we'll never try to grow beyond it
    • Because the state machine counts down from the (precise) address, it won't zeroize parts of the stack in use
    • It might zero a little bit past the HWM, but that's fine.

EXPORT_ASSEMBLY_OFFSET(TrustedStack, mshwmb, (17 * 8) + 4)
#ifdef CHERIOT_HAS_ZTOP
EXPORT_ASSEMBLY_OFFSET(TrustedStack, ztop, 16 * 8)
# define TSTACK_HAS_ZTOP 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely minor, but the differing "types" of CHERIOT_HAS_ZTOP and TSTACK_HAS_ZTOP is bugging me. Maybe TSTACK_ZTOP_WORDS or such?

This replaces the software stack zeroing path with offload to a hardware
engine, when available.  The hardware zeroing pipeline is controlled by
the ZTOP SCR (currently assumed to be number 27).  This zeroes from the
top of the capability in ZTOP to the bottom.

We assume that short-lived cross-compartment calls or returns from
deeply nested calls will result in ranges where each is a subset of
another, so we try to coalesce.

If we're about to zero a region that is a superset of the current
zeroing region, we skip it, otherwise we wait for the previous zeroing
region to finish.  This is a fairly simple approach but was chosen
because adding a lot of instructions in this path can rapidly offset the
benefits.

0.3% speedup on Sonata.
2% speedup on the Ibex SAFE simulator.

The difference here is largely due to the UART at 115,200 b/s being
*very* slow in comparison to the CPU and so the test suite is mostly
waiting to write debug messages.  The simulator UART can write one
character per cycle.  The allocator portion of the test suite (which
does a lot of work per message) is around 1% faster.

The test suite crashes on the Arty A7, which may be a bug in the A7
version but may be a bug in this code and so we shouldn't merge it yet.
@davidchisnall
Copy link
Collaborator Author

Test failure is interesting. The SHWM is reporting a lot more stack usage in this mode, but it shouldn't. There may be a hardware bug in the SHWM's interaction with the zeroing engine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants