[ec] Use windowed algorithm for base scalar mult on NIST P-curves #191
Conversation
Thanks for your PR. I'm positive about having this integrated into mirage-crypto. There are a couple of questions, though:

Audit, code review

What is the threat model in your mind? A remote attacker? Local side-channel attacks (memory caches)?
At least for me this will need some time, esp. with arrays and reference cells being introduced.

Costs (pre-computation, memory)

Since mirage-crypto-ec is used all over the place, for long-running services as well as short-lived applications, would you mind measuring this pre-computation phase?
Could you measure the memory costs as well? As mentioned above, mirage-crypto-ec is not only used in long-running services. Would it be worth considering a lazily evaluated pre-computation of the tables, to decrease memory usage and pre-computation time, esp. in applications where e.g. P-384 or P-224 are not used? Do you have any thoughts about that?

Benchmark

Is the pre-computation phase part of your benchmarks (i.e. did you start the timer at program execution start, or once initialization was complete)? I found a patch that I developed quite some time ago which may be useful, hannesm@c512a0a, which adds ECDSA (sign & verify) to bench/speed.ml. Also, do you have a baseline? Either Go's EC implementation or OpenSSL would be great to compare against your numbers on your CPU.
Thanks for the comments!

Audit, code review

The threat model here is local side-channel attacks, such as FLUSH+RELOAD (used here or here).
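For cache-based attacks like FLUSH+RELOAD, the usual countermeasure is to make every table lookup touch all entries and combine them with arithmetic masks, rather than index with the secret. A minimal sketch of that access pattern (illustrative Python only; the actual guarantees have to come from the compiled C/OCaml, since Python itself gives no timing guarantees):

```python
def ct_select(table, secret_index):
    """Return table[secret_index] while reading every entry.

    The mask is computed without branching on the secret: (diff - 1) >> 64
    is -1 (all ones) iff diff == 0, and 0 otherwise, assuming indices fit
    in 64 bits. Only the matching entry survives the AND.
    """
    result = 0
    for i, entry in enumerate(table):
        diff = i ^ secret_index
        mask = (diff - 1) >> 64
        result |= entry & mask
    return result
```

The important property is that the sequence of memory accesses is identical for every secret index, so a cache-probing attacker learns nothing from which table lines were touched.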
I wonder if it would be more efficient to first rewrite it in C, see how that performs, and review that instead (since it's slightly easier to review the timing security without the OCaml abstractions). In any case, I plan to do that in the coming days, so maybe you want to hold off on review until I come back with results.

Costs (pre-computation, memory)
I consistently get the following pre-computation times on my CPU (on the
Which is consistent with the size of the pre-computation for each curve (see below).
The current algorithm stores
All in all roughly half a megabyte if we precompute everything.
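For context, the half-megabyte figure can be sanity-checked with a back-of-envelope estimate. The parameters below (window width 4, affine x/y coordinates padded to 64-bit limbs, one table per window) are guesses for illustration, not necessarily the PR's actual layout:

```python
import math

def table_bytes(bits, window=4):
    # Assumed layout: each coordinate padded to whole 64-bit limbs,
    # points stored as affine (x, y), one table per scalar window,
    # 2**window entries per table (multiples 0..2**window - 1).
    limb_bytes = 8 * math.ceil(bits / 64)
    point_bytes = 2 * limb_bytes
    windows = math.ceil(bits / window)
    entries = 2 ** window
    return windows * entries * point_bytes

curves = {"P-224": 224, "P-256": 256, "P-384": 384, "P-521": 521}
total = sum(table_bytes(b) for b in curves.values())
print(total)  # on the order of half a megabyte for all four curves
```

With these assumed parameters the total comes out around 560 KB, consistent with the "roughly half a megabyte" above.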
That's a good idea! I'll add a commit to that effect, with maybe a new API call to force pre-computation (per curve) for applications that want predictable/consistent performance during other operations.

Benchmark
The pre-computation was indeed part of the timer, but quite hidden by the number of iterations. Here's what it looks like with a varying number of iterations, lazily pre-computing P256 only:
So even on a single sign, we're still (slightly) above the old algorithm even on the
That's great! Here's what I get with your patch (with
I'm much in favor of including it in the main branch
I have a similar discrepancy: 57347 sign/s with
Thanks a lot for your answers.
Have you by chance developed or used a tool to figure out whether this code is vulnerable? Of course, reading code (and/or assembly) leads to a certain amount of assurance. But then, you'd need to do that for each architecture, OCaml compiler, and C compiler that you're interested in. (I remember Eqaf received a lot of assembly reading, and then a new OCaml compiler just did something differently.) To move forward:
I did not (and I'm not confident that if I did, my efforts would be representative of a "good" attack), so I'm indeed relying on reading code for now. I have some assurance that it is constant-time as expected, since the operations that differ depending on the input are extremely localized (the table selection), and always access every value of the tables in the same order. I've not yet looked at the assembly output for the relevant operations.
Great! Let's take the opportunity to include ECDH and X25519 as well, yes
I think it would be the easiest path, yes. I'll open a new PR addressing the comments, and let's do a new round of review
Yes! By far the most expensive operation is the scalar multiplication (83% of the time spent), with inversion (already in C) taking up much of the rest. Within the basic primitives, addition is the costliest, so any algorithm that reduces the number of additions performs very well. I did try to rewrite some of the functions in #109 in C for comparison (
I'm working on that right now.
FYI, a quick rewrite in C of the algorithm (+ the precomputation) yields a performance of about 8500 sign/s for P-256 with
Nice! Only a factor of 7 left ;) The bytes EC commit has been rebased by @palainp: palainp@559a176 -- maybe a good starting point.
I think that specific commit should be ok (at least the tests are green) :)
I got to run some benchmarks on the Go crypto library. It yielded 48000 signs/s, but I found out that for common CPU architectures it "cheats" and uses a custom ASM implementation for P256, completely ignoring the generic code. Forcing the use of the generic implementation yields the numbers below, for reference:
@Firobe thanks for putting these numbers into context. The question is about the aim here. On the one side, I'm not keen to adopt huge amounts of assembly code into the repository to get more performance. On the other side, we should be honest about it: the amount of time spent on this library is less than e.g. on OpenSSL, and its performance won't be on par with it. Maybe we should have a paragraph about that in the README, together with how the benchmarks can be (re)produced.

Gathering your numbers from above (on P256 sign operations per second), on your computer (architecture? OS?):
other implementations:
The numbers are right (for Go crypto the table, which is from their official benchmark, lists nanoseconds per operation, so 76744 ns per sign is about 13000 sign/s).

In this PR I was careful to select an algorithm that's not too complex to implement, review, and maintain. I wholeheartedly agree that the review and maintenance effort of bringing in more complex algorithms (like wNAF) or assembly while maintaining security is too much, and we should be honest about the tradeoff.
Yes, highly appreciated. I suspect you meant that's not too complex. Thanks for the clarification of go. And thanks for your work on this, I'm pretty sure we'll have this merged and released rather sooner than later. |
Using a sliding window method with pre-computed values of multiples of the generator point, we obtain far more efficient performance for the special case where G = P in the scalar multiplication kP. By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm leaks no more information about its inputs than the current Montgomery ladder.
For performance. This implies the need to get generator points from C as well. The pre-computed tables are stored in static memory, and computed lazily.
Now that #146 is merged, I rebased this PR, and pushed the rewrite in C. A few considerations:
New benchmarks on my machine:
Thanks for your rebase.
Hmm, so what about the approach we use for curve25519_tables.h (taken from BoringSSL, where "make_curve25519_tables.py" is provided)? If we use pre-computed tables and have the memory pre-allocated, why not separate the table computation from the runtime? The tables won't change over time. We could write a small OCaml program using the Mirage_crypto_ec API which emits the table(s) as C code. This way we wouldn't need to carry the generator into C, and would not need to mess around with lazy initialization. WDYT? In my opinion, these tables would then be part of the repository and release -- and maybe there are rules in the
I'm in favor of that, the only drawback being that this increases executable size (vs. uninitialized memory). This makes the OCaml API the source of truth. And indeed, the tables won't change enough (or at all) to justify re-generating them as part of the build phase.
Should we expose just enough primitives in the public API to allow table generation in a separate program, or rather just an opaque function computing the tables internally and returning them? 🤔
I think a single function is sufficient. Exposing the internals would likely mean exposing lots of types and internal stuff that may change one day -- for no good purpose.
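A small generator that emits the pre-computed table as C source, in the spirit of the make_curve25519_tables.py approach mentioned above, could be sketched like this (the array name, element type, and formatting are placeholders, not mirage-crypto's actual format):

```python
def emit_c_table(name, values, per_line=4):
    """Render a list of 64-bit words as a static const C array."""
    lines = [f"static const uint64_t {name}[{len(values)}] = {{"]
    for i in range(0, len(values), per_line):
        chunk = ", ".join(f"0x{v:016x}" for v in values[i:i + per_line])
        lines.append(f"    {chunk},")
    lines.append("};")
    return "\n".join(lines)

# In the real generator, the values would come from the Mirage_crypto_ec
# API (the multiples of the generator point); here they're dummies.
print(emit_c_table("p256_base_table", [1, 2, 3, 4, 5]))
```

The emitted file can then be checked into the repository and `#include`d from the stubs, which is exactly what keeps the OCaml side the source of truth without any runtime initialization.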
I just pushed a new commit for hard-coded tables, which should be ready for review! I also addressed @palainp's comment (and replaced tabs in _stubs.c files with spaces, if that's OK).
It seems I had wrongly assumed it was easy to obtain the 32-bit tables from the 64-bit ones (i.e. the field element representation is the same in memory). It seems true for P256 and P384, but not P224/P521, e.g. for P224, G_x is represented in 32-bit as
while in 64-bit:
So, I have to either figure out how/if it's possible to correctly use the 32-bit interface from a 64-bit host, or include the table generation as part of the compilation process (so only the table for the host architecture is generated, but that implies potentially separating part of the code into its own library (or conditional compilation shenanigans) to avoid the circular dependency). Or, keep the code but generate the 32-bit table from a 32-bit machine, and accept that the table is not re-generable from a 64-bit host.

Edit: or serialize to octets instead and de-serialize at startup, but that's hardly different from computing the whole table at startup, and I'm not even sure the serialization is compatible between 32- and 64-bit.
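One plausible explanation for the P-224/P-521 mismatch (an assumption on my part, not something stated in the thread): if the field backend keeps elements in Montgomery form with R = 2^(limb count × word size), then for P-256 and P-384 both word sizes give the same R (2^256 and 2^384 respectively), while for P-224 a 64-bit host uses R = 2^(4×64) = 2^256 but a 32-bit host uses R = 2^(7×32) = 2^224, so the stored limbs necessarily differ. This can be checked with the public P-224 constants:

```python
# NIST P-224 prime and generator x-coordinate (public domain parameters).
p224 = 2**224 - 2**96 + 1
gx = 0xb70e0cbd6bb4bf7f321390b94a03c1d356c21122343280d6115c1d21

# Hypothetical Montgomery forms under the two word sizes (assumption:
# R = 2^(limbs * word size), i.e. 4x64-bit vs 7x32-bit limbs).
mont_64 = gx * 2**256 % p224
mont_32 = gx * 2**224 % p224
print(mont_64 != mont_32)  # the in-memory representations disagree

# For P-256, both word sizes yield the same R, so the tables can be shared:
print(2**(4 * 64) == 2**(8 * 32))
```

If this is indeed the cause, converting a 64-bit table to a 32-bit one would require a Montgomery-domain conversion per element, not just limb splitting.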
From my point of view, it would be best to be able to generate the tables on a 64-bit system. On the other hand, since they'll be generated once, and we'll only need additional tables when we add new curves, it is fine to generate them on a 32-bit system as well. [Also, it's not clear to me how many moons will pass until we remove the 32-bit support here. So far I haven't seen many users, but quite some maintenance burden.] Thanks for your work; as said, if you can pre-compute and add the 32-bit tables to make CI pass, that'd be neat.
Alright, thank you for your thoughts. I think we're good now. Both tables are in the tree, and
Thanks @Firobe, I can confirm a big speed improvement (5x to 9x, comparing current head vs. this PR)!
Thanks for your work. I added two comments; apart from that, this is ready to be merged.
This partly reverts commit 28abf53.
Thank you for your review! Your comments should be addressed. If they are, feel free to squash/merge :)
Thanks a lot! I'll merge when CI is happy (or only shows temporary failures).
CHANGES:

* mirage-crypto, mirage-crypto-rng{,lwt,mirage}: support CL.EXE compiler (mirage/mirage-crypto#137 @jonahbeckford) - mirage-crypto-pk not yet due to gmp dependency, mirage-crypto-ec doesn't pass testsuite
* mirage-crypto-ec: use simpler square root for ed25519 - saving 3 multiplications and 2 squarings, details https://mailarchive.ietf.org/arch/msg/cfrg/qlKpMBqxXZYmDpXXIx6LO3Oznv4/ (mirage/mirage-crypto#196 @hannesm)
* mirage-crypto-ec: use sliding window method with pre-computed values of multiples of the generator point for NIST curves, speedup around 4x for P-256 sign (mirage/mirage-crypto#191 @Firobe, review @palainp @hannesm)
* mirage-crypto-ec: documentation: warn about power timing analysis on `k` in Dsa.sign (mirage/mirage-crypto#195 @hannesm, as proposed by @edwintorok)
* mirage-crypto-ec: replace internal Cstruct.t by string (speedup up to 2.5x) (mirage/mirage-crypto#146 @dinosaure @hannesm @reynir, review @Firobe @palainp @hannesm @reynir)
* bench/speed: add EC (ECDSA & EdDSA generate/sign/verify, ECDH secret/share) operations (mirage/mirage-crypto#192 @hannesm)
* mirage-crypto-rng: use rdtime instead of rdcycle on RISC-V (rdcycle is privileged since Linux kernel 6.6) (mirage/mirage-crypto#194 @AdrianBunk, review by @edwintorok)
* mirage-crypto-rng: support Loongarch (mirage/mirage-crypto#190 @fangyaling, review @loongson-zn)
* mirage-crypto-rng: support NetBSD (mirage/mirage-crypto#189 @drchrispinnock)
* mirage-crypto-rng: allocate less in Fortuna when feeding (mirage/mirage-crypto#188 @hannesm, reported by @palainp)
* mirage-crypto-ec: avoid mirage-crypto-pk and asn1-combinators test dependency (instead, craft our own asn.1 decoder -- mirage/mirage-crypto#200 @hannesm)

### Performance differences between v0.11.2 and v0.11.3 and OpenSSL

The overall result is promising: the P-256 sign operation improved 9.4 times, but is still 4.9 times slower than OpenSSL. 
Numbers in operations per second (apart from speedup, which is a factor v0.11.3 / v0.11.2), gathered on an Intel i7-5600U CPU 2.60GHz using FreeBSD 14.0, OCaml 4.14.1, and OpenSSL 3.0.12.

#### P224

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 1160 | 20609 | 17.8 | |
| sign | 931 | 8169 | 8.8 | 21319 |
| verify | 328 | 1606 | 4.9 | 10719 |
| dh-sec | 1011 | 12595 | 12.5 | |
| dh-kex | 992 | 2021 | 2.0 | 16691 |

#### P256

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 990 | 19365 | 19.6 | |
| sign | 792 | 7436 | 9.4 | 36182 |
| verify | 303 | 1488 | 4.9 | 13383 |
| dh-sec | 875 | 11508 | 13.2 | |
| dh-kex | 895 | 1861 | 2.1 | 17742 |

#### P384

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 474 | 6703 | 14.1 | |
| sign | 349 | 3061 | 8.8 | 900 |
| verify | 147 | 544 | 3.7 | 1062 |
| dh-sec | 378 | 4405 | 11.7 | |
| dh-kex | 433 | 673 | 1.6 | 973 |

#### P521

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 185 | 1996 | 10.8 | |
| sign | 137 | 438 | 3.2 | 2737 |
| verify | 66 | 211 | 3.2 | 1354 |
| dh-sec | 180 | 1535 | 8.5 | |
| dh-kex | 201 | 268 | 1.3 | 2207 |

#### 25519

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 23271 | 22345 | 1.0 | |
| sign | 11228 | 10985 | 1.0 | 21794 |
| verify | 8149 | 8029 | 1.0 | 7729 |
| dh-sec | 14075 | 13968 | 1.0 | |
| dh-kex | 13487 | 14079 | 1.0 | 24824 |
… P-curves (mirage#191)

* [ec] Use windowed algorithm for base scalar mult

Using a sliding window method with pre-computed values of multiples of the generator point, we obtain far more efficient performance for the special case where G = P in the scalar multiplication kP. By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm leaks no more information about its inputs than the current Montgomery ladder.

* [ec] Rewrite scalar_mult_base in C

For performance. This implies the need to get generator points from C as well. The pre-computed tables are stored in static memory, and computed lazily.

* Generate pre-tables AOT and hardcode them
* Separate 64/32 tables
* Add 32-bit tables
Using a sliding window method with pre-computed values of multiples of the generator point, we can obtain far more efficient performance for the special case where G = P in the scalar multiplication kP.
By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm should leak no more information about its inputs than the current Montgomery ladder.
This change comes at the cost of an algorithm that is slightly less simple than the current one, but I'd say it is still relatively easy to audit, in particular for constant-time behavior. Notably, this is not a NAF-based algorithm. The other cost is the pre-computation phase, which should be negligible for long-running services, and similar to the old algorithm for one-shot computations. It also needs more memory to store the pre-computed values.
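To make the algorithm's shape concrete, here is a sketch of a fixed-window variant (a simplification of the sliding-window method) over P-256 in affine coordinates. It is illustrative only: this is not the PR's OCaml/C code, and plain Python gives no constant-time guarantees (the table indexing below is secret-dependent, unlike the safe selection described in the PR).

```python
# Fixed-window base-point multiplication with pre-computed tables on P-256.
# tables[i][j] holds j * 2^(W*i) * G, so each window of the scalar costs one
# table lookup and the main loop contains only point additions.

P = 2**256 - 2**224 + 2**192 + 2**96 - 1   # NIST P-256 prime
A = P - 3                                   # curve coefficient a = -3
GX = 0x6b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c296
GY = 0x4fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5

def add(p1, p2):
    """Affine point addition; None is the point at infinity."""
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

W = 4  # window width (an illustrative choice)

def precompute(g, bits=256):
    tables, base = [], g
    for _ in range((bits + W - 1) // W):
        row = [None]
        for _ in range(2**W - 1):
            row.append(add(row[-1], base))   # row[j] = j * base
        for _ in range(W):
            base = add(base, base)           # base *= 2^W for next window
        tables.append(row)
    return tables

TABLES = precompute((GX, GY))

def scalar_mult_base(k):
    """k * G using only additions in the main loop."""
    acc = None
    for i, table in enumerate(TABLES):
        acc = add(acc, table[(k >> (W * i)) & (2**W - 1)])
    return acc

def scalar_mult_naive(k, p):
    """Plain double-and-add, used only to cross-check the result."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, p)
        p = add(p, p)
        k >>= 1
    return acc
```

Since the tables depend only on G, the pre-computation is done once per curve, which is exactly what makes the base-point case so much cheaper than general kP.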
As mentioned in the code, the actual implementation is largely inspired by Go's own crypto library.
All tests seem to pass.
Benchmark
On a simple benchmark repeatedly signing messages and then generating keys, with 10_000 iterations each, on OCaml 4.14.1:
Cstruct

Current `main` branch:

On this branch:
Bytes

I tried @dinosaure's branch replacing internal `cstruct` by `bytes`, which turns out to greatly improve performance as well on this benchmark (and visibly relieves the GC).

On the PR's base branch:

A merge of this branch and the PR's:
Purely on P-256 signatures, the combination of Cstruct replacement and this new algorithm increases the performance by a factor of 6 :)
Further work for more performance