Enable hardware stack zeroing. #111

davidchisnall · 2023-09-25T11:18:11Z

This replaces the software stack zeroing path with offload to a hardware engine, when available. The hardware zeroing pipeline is controlled by the ZTOP SCR (currently assumed to be number 27). This zeroes from the top of the capability in ZTOP to the bottom.

We assume that short-lived cross-compartment calls or returns from deeply nested calls will result in ranges where each is a subset of another, so we try to coalesce. We could probably do better by also aggregating ranges that overlap but are not subsets.

nwf-msr · 2023-09-25T12:33:45Z

sdk/core/switcher/entry.S

+	beq                \top, \scratch, 2f
+	// If the current zeroing range is a subset of range that we're about to
+	// zero, don't bother waiting, just zero from the start.
+	ctestsubset        \top, c\base, c\scratch


This might be the only use of ctestsubset in the RTOS. I think at one point we were thinking of removing the instruction in hopes of simplifying the (micro)architecture?

We could do it with checks on the top and bottom, since we already have checks that CSP is sane (correct permissions, correctly aligned top and bottom) on entry into the compartment switcher. I'm not sure if we could do that without adding another temporary register, which would require some small tweaks to the users of this macro.
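
A minimal sketch of what "checks on the top and bottom" could look like, written as a C model rather than as switcher assembly. The function name `range_is_subset` and the use of plain integers for the bounds are assumptions for illustration only. As far as I understand the CHERI instruction, ctestsubset also compares tags and permissions, which this model ignores on the assumption that CSP has already been validated on entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: each range is the half-open interval [base, top).
 * Returns true if the inner range lies entirely within the outer range,
 * using only comparisons on the two ends.
 */
static bool range_is_subset(uint32_t innerBase,
                            uint32_t innerTop,
                            uint32_t outerBase,
                            uint32_t outerTop)
{
	/* Both ranges are assumed to be well formed. */
	assert(innerBase <= innerTop);
	assert(outerBase <= outerTop);
	return (innerBase >= outerBase) && (innerTop <= outerTop);
}
```

In assembly this would need both bounds of both capabilities live at once, which is where the extra temporary register mentioned above would probably come from.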

nwf-msr · 2023-09-27T02:58:43Z

This runs the compartment-call-benchmark test, generating:

#board     stack size  full call  call  return
ibex.json  0x100       186        127   59
ibex.json  0x200       186        127   59
ibex.json  0x400       188        129   59
ibex.json  0x800       188        129   59
ibex.json  0x1000      188        129   59

We'll see if it survives the full test harness!

nwf-msr · 2023-09-27T18:40:16Z

It does survive the full test harness, huzzah! Reported cycle numbers for 452d404 + this PR, without and with the zeroizer support enabled:

Test	Without	With
Static sealing	449592	449425
Crash recovery	7438235	7438011
Compartment calls	1378192	1378099
Stack exhaustion in the switcher	2129788	2162198
Thread pool	867080	865635
Global constructors	85663	85522
Queue	6053459	6052851
Futex	2943148	2943848
Locks	5001834	5001183
Multiwaiter	4623924	4623760
Allocator	7858154	7688631
All tests	40007035	39868625
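
To put the cycle counts above in relative terms, here is a small sketch of the arithmetic. The helper name `percent_change` is made up for illustration; the inputs are the numbers from the table.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Relative change in cycle count, in percent.
 * Negative values mean fewer cycles with the zeroizer enabled.
 */
static double percent_change(uint64_t without, uint64_t with)
{
	assert(without != 0);
	return 100.0 * ((double)with - (double)without) / (double)without;
}
```

With the table's values, `percent_change(40007035, 39868625)` comes out at roughly -0.35 for all tests, `percent_change(7858154, 7688631)` at roughly -2.16 for the allocator, and `percent_change(2129788, 2162198)` at roughly +1.52 for stack exhaustion in the switcher, which is the largest slowdown in the table.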

nwf-msr · 2023-09-28T17:18:33Z

sdk/core/switcher/entry.S

+	// Derive the capability to the range that should be zeroed.
+	// This is stored in base, leaving top as a second scratch register to use
+	csub               \top, c\top, c\base
+	csetboundsexact    c\base, c\base, \top


As per discussions out of band, this should probably be

Suggested change

-	csetboundsexact    c\base, c\base, \top
+	csetbounds         c\base, c\base, \top
+	cincoffset         c\base, c\base, \top

We need to set the address to the top of the region to be zeroed, since the state machine counts down.

Using inexact bounds allows us to tolerate large (> 2 KiB) stacks without risk, and:

Because the stack is guaranteed to be a representable superset of the region, we'll never try to grow beyond it.

Because the state machine counts down from the (precise) address, it won't zeroize parts of the stack in use.

It might zero a little bit past the HWM, but that's fine.
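
A rough C model of the argument above, to show why rounded bounds are harmless when the countdown starts at the precise address. `ZeroCap` and `zero_down` are hypothetical names, and the byte-at-a-time loop that stops at the base is an assumption about the engine, not a description of the real state machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the capability written to ZTOP. */
typedef struct
{
	size_t base;    /* Lower bound, possibly rounded down by inexact bounds. */
	size_t top;     /* Upper bound, possibly rounded up by inexact bounds. */
	size_t address; /* Cursor: the precise point where the countdown starts. */
} ZeroCap;

/*
 * Zero memory[base, address), counting down from the address.
 * The upper bound is never used as a starting point, so anything in
 * [address, top) is left untouched.
 */
static void zero_down(uint8_t *memory, ZeroCap cap)
{
	assert(cap.base <= cap.address && cap.address <= cap.top);
	for (size_t a = cap.address; a > cap.base; a--)
	{
		memory[a - 1] = 0;
	}
}
```

For example, if the bounds end up rounded to [8, 48) and the address is 32, bytes 8 to 31 are cleared and bytes 32 to 47, which stand in for the part of the stack still in use, are not.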

nwf-msr · 2023-09-28T18:12:50Z

sdk/core/switcher/trusted-stack-assembly.h

-EXPORT_ASSEMBLY_OFFSET(TrustedStack, mshwmb, (17 * 8) + 4)
+#ifdef CHERIOT_HAS_ZTOP
+EXPORT_ASSEMBLY_OFFSET(TrustedStack, ztop, 16 * 8)
+#	define TSTACK_HAS_ZTOP 1


This is extremely minor, but the differing "types" of CHERIOT_HAS_ZTOP and TSTACK_HAS_ZTOP are bugging me. Maybe TSTACK_ZTOP_WORDS or such?

This replaces the software stack zeroing path with offload to a hardware engine, when available. The hardware zeroing pipeline is controlled by the ZTOP SCR (currently assumed to be number 27). This zeroes from the top of the capability in ZTOP to the bottom.

We assume that short-lived cross-compartment calls or returns from deeply nested calls will result in ranges where each is a subset of another, so we try to coalesce. If we're about to zero a region that is a superset of the current zeroing region, we skip the wait and start zeroing immediately; otherwise we wait for the previous zeroing region to finish. This is a fairly simple approach but was chosen because adding a lot of instructions in this path can rapidly offset the benefits.

0.3% speedup on Sonata. 2% speedup on the Ibex SAFE simulator. The difference here is largely due to the UART at 115,200 b/s being *very* slow in comparison to the CPU, so the test suite is mostly waiting to write debug messages. The simulator UART can write one character per cycle. The allocator portion of the test suite (which does a lot of work per message) is around 1% faster.

The test suite crashes on the Arty A7. This may be a bug in the A7 version or a bug in this code, so we shouldn't merge it yet.
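
The coalescing rule described above can be sketched as a small decision function. This is a C model under stated assumptions, not the switcher's assembly: `ZeroRange`, `ZeroAction`, and `choose_action` are hypothetical names, and the `engineBusy` flag stands in for however the hardware reports that a previous zeroing request is still running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a zeroing request: the half-open range [base, top). */
typedef struct
{
	uint32_t base;
	uint32_t top;
} ZeroRange;

typedef enum
{
	StartImmediately, /* Overwrite ZTOP without waiting. */
	WaitThenStart     /* Let the in-flight range finish first. */
} ZeroAction;

/*
 * If the range currently being zeroed is a subset of the next one, the
 * next request covers it entirely, so there is no need to wait.
 */
static ZeroAction choose_action(bool engineBusy, ZeroRange inFlight, ZeroRange next)
{
	assert(next.base <= next.top);
	if (!engineBusy)
	{
		return StartImmediately;
	}
	bool inFlightIsSubset =
	  (next.base <= inFlight.base) && (inFlight.top <= next.top);
	return inFlightIsSubset ? StartImmediately : WaitThenStart;
}
```

Ranges that overlap without one containing the other fall into the waiting case here, which matches the simple policy chosen to keep the instruction count on this path low.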

davidchisnall · 2024-06-07T10:43:27Z

Test failure is interesting. The SHWM is reporting a lot more stack usage in this mode, but it shouldn't. There may be a hardware bug in the SHWM's interaction with the zeroing engine.

davidchisnall requested review from nwf-msr and kliuMsft September 25, 2023 11:18

nwf-msr reviewed Sep 25, 2023

davidchisnall force-pushed the stkclr branch from 1f5aefd to 8f22cdc Compare September 26, 2023 08:44

nwf-msr reviewed Sep 27, 2023

davidchisnall force-pushed the stkclr branch from 8f22cdc to cfcc489 Compare September 28, 2023 10:25

nwf-msr reviewed Sep 28, 2023

rmn30 mentioned this pull request Sep 29, 2023

Add a variant of compartment call benchmark that actually uses some s… #63

Merged

davidchisnall force-pushed the stkclr branch 2 times, most recently from 3b4b670 to c1d69b5 Compare January 26, 2024 16:16

davidchisnall force-pushed the stkclr branch from c1d69b5 to 55cfc3e Compare June 3, 2024 15:09

davidchisnall force-pushed the stkclr branch from 55cfc3e to e4710f6 Compare June 3, 2024 15:20
