[ec] Use windowed algorithm for base scalar mult on NIST P-curves #191
Conversation
Thanks for your PR. I'm positive about having this integrated into mirage-crypto. There are a couple of questions, though:

Audit, code review

What is the threat model in your mind? A remote attacker? Local side-channel attacks (memory caches)?
At least for me this will need some time, esp. with arrays and reference cells being introduced.

Costs (pre-computation, memory)

Since mirage-crypto-ec is used all over the place, for long-running services as well as short-lived applications, would you mind measuring this pre-computation phase?
Could you measure the memory costs as well? As mentioned above, mirage-crypto-ec is not only used in long-running services. Would it be worth considering a lazily evaluated pre-computation of the tables, to decrease memory usage and pre-computation time, esp. in applications where e.g. P-384 or P-224 are not used? Do you have any thoughts about that?

Benchmark

Is the pre-computation phase part of your benchmarks (i.e. did you start the timer at program execution start, or once initialization was complete)? I found a patch that I developed quite some time ago which may be useful, hannesm@c512a0a, which adds ECDSA (sign & verify) to bench/speed.ml. Also, do you have a baseline? Either Go's EC implementation or OpenSSL would be great to compare against your numbers on your CPU.
Thanks for the comments!

Audit, code review

The threat model here is local side-channel attacks, such as FLUSH+RELOAD (used here or here).
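For cache-based attacks like FLUSH+RELOAD, the usual countermeasure is to make every table lookup touch all entries and combine them with arithmetic masks, rather than index with the secret. A minimal sketch of that access pattern (illustrative Python only; the actual guarantees have to come from the compiled C/OCaml, since Python itself gives no timing guarantees):

```python
def ct_select(table, secret_index):
    """Return table[secret_index] while reading every entry.

    The mask is computed without branching on the secret: (diff - 1) >> 64
    is -1 (all ones) iff diff == 0, and 0 otherwise, assuming indices fit
    in 64 bits. Only the matching entry survives the AND.
    """
    result = 0
    for i, entry in enumerate(table):
        diff = i ^ secret_index
        mask = (diff - 1) >> 64
        result |= entry & mask
    return result
```

The important property is that the sequence of memory accesses is identical for every secret index, so a cache-probing attacker learns nothing from which table lines were touched.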
I wonder if it would be more efficient to first rewrite it in C, see how that performs, and review that instead (since it's slightly easier to review the timing security without the OCaml abstractions). In any case, I plan to do that in the coming days, so maybe you want to hold off on review until I come back with results.

Costs (pre-computation, memory)
I consistently get the following pre-computation times on my CPU (on the
Which is consistent with the size of the pre-computation for each curve (see below).
The current algorithm stores
All in all roughly half a megabyte if we precompute everything.
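For context, the half-megabyte figure can be sanity-checked with a back-of-envelope estimate. The parameters below (window width 4, affine x/y coordinates padded to 64-bit limbs, one table per window) are guesses for illustration, not necessarily the PR's actual layout:

```python
import math

def table_bytes(bits, window=4):
    # Assumed layout: each coordinate padded to whole 64-bit limbs,
    # points stored as affine (x, y), one table per scalar window,
    # 2**window entries per table (multiples 0..2**window - 1).
    limb_bytes = 8 * math.ceil(bits / 64)
    point_bytes = 2 * limb_bytes
    windows = math.ceil(bits / window)
    entries = 2 ** window
    return windows * entries * point_bytes

curves = {"P-224": 224, "P-256": 256, "P-384": 384, "P-521": 521}
total = sum(table_bytes(b) for b in curves.values())
print(total)  # on the order of half a megabyte for all four curves
```

With these assumed parameters the total comes out around 560 KB, consistent with the "roughly half a megabyte" above.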
That's a good idea! I'll add a commit to that effect, with maybe a new API call to force pre-computation (per curve) for applications that want predictable/consistent performance during other operations.

Benchmark
The pre-computation was indeed part of the timer, but quite hidden by the number of iterations. Here's what it looks like with a varying number of iterations, lazily pre-computing P256 only:
So even on a single sign, we're still (slightly) above the old algorithm even on the
That's great! Here's what I get with your patch (with
I'm much in favor of including it in the main branch
I have a similar discrepancy: 57347 sign/s with
Thanks a lot for your answers.
Have you by chance developed or used a tool to figure out whether this code is vulnerable? Of course, reading code (and/or assembly) leads to a certain amount of assurance. But then, you'd need to do that for each architecture, OCaml compiler, and C compiler that you're interested in. (I remember Eqaf received a lot of assembly reading, and then a new OCaml compiler just did something differently.) To move forward:
I did not (and I'm not confident that if I did, my efforts would be representative of a "good" attack), so I'm indeed relying on reading code for now. I have some assurance that it is constant-time as expected, since the operations that differ depending on the input are extremely localized (the table selection), and always access every value of the tables in the same order. I've not yet looked at the assembly output for the relevant operations.
Great! Let's take the opportunity to include ECDH and X25519 as well, yes
I think it would be the easiest path, yes. I'll open a new PR addressing the comments, and let's do a new round of review
Yes! By far the most expensive operation is the scalar multiplication (83% of the time spent), with inversion (already in C) taking up much of the rest. Within the basic primitives, addition is the costliest, so any algorithm that reduces the number of additions performs very well. I did try to rewrite some of the functions in #109 in C for comparison (
I'm working on that right now.
FYI, a quick rewrite in C of the algorithm (+ the precomputation) yields a performance of about 8500 sign/s for P-256 with
Nice! Only a factor of 7 left ;) The bytes EC commit has been rebased by @palainp: palainp@559a176 -- maybe a good starting point.
I think that specific commit should be ok (at least the tests are green) :)
I got to run some benchmarks on the Go crypto library. It yielded 48000 signs/s, but I found out that for common CPU architectures it "cheats" and uses a custom ASM implementation for P256, completely ignoring the generic code. Forcing the use of the generic implementation yields the numbers below, for reference:
@Firobe thanks for putting these numbers into context. The question is about the aim here. On the one side, I'm not keen to adopt huge amounts of assembly code into the repository to get more performance. On the other side, we should be honest about it: the amount of time spent on this library is less than e.g. on OpenSSL, and its performance won't be on par with it. Maybe we should have a paragraph about that in the README, together with how the benchmarks can be (re)produced.

Gathering your numbers from above (on P256 sign operations per second), on your computer (architecture? OS?):
other implementations:
The numbers are right (for Go crypto the table, which is from their official benchmark, lists nanoseconds per operation, so 76744 ns per sign is about 13000 sign/s).

In this PR I was careful to select an algorithm that's not too complex to implement, review, and maintain. I wholeheartedly agree that the review and maintenance effort of bringing in more complex algorithms (like wNAF) or assembly while maintaining security is too much, and we should be honest about the tradeoff.
Yes, highly appreciated. I suspect you meant that's not too complex. Thanks for the clarification of go. And thanks for your work on this, I'm pretty sure we'll have this merged and released rather sooner than later. |
Using a sliding window method with pre-computed values of multiples of the generator point, we obtain far more efficient performance for the special case where G = P in the scalar multiplication kP. By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm leaks no more information about its inputs than the current Montgomery ladder.
For performance. This implies the need to get generator points from C as well. The pre-computed tables are stored in static memory, and computed lazily.
Now that #146 is merged, I rebased this PR, and pushed the rewrite in C. A few considerations:
New benchmarks on my machine:
Thanks for your rebase.
Hmm, so what about the approach we use for curve25519_tables.h (taken from BoringSSL, where "make_curve25519_tables.py" is provided)? If we use pre-computed tables and have the memory pre-allocated, why not separate the table computation from the runtime? The tables won't change over time. We could write a small OCaml program using the Mirage_crypto_ec API which emits the table(s) as C code. This way we wouldn't need to carry the generator into C, and would not need to mess around with lazy initialization. WDYT? In my opinion, these tables would then be part of the repository and release -- and maybe there are rules in the
I'm in favor of that, the only drawback being that this increases executable size (vs. uninitialized memory). This makes the OCaml API the source of truth. And indeed, the tables won't change enough (or at all) to justify re-generating them as part of the build phase.
Should we expose just enough primitives in the public API to allow table generation in a separate program, or rather just an opaque function computing the tables internally and returning them? 🤔
I think a single function is sufficient. Exposing the internals would likely mean exposing lots of types and internal stuff that may change one day -- for no good purpose.
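A small generator that emits the pre-computed table as C source, in the spirit of the make_curve25519_tables.py approach mentioned above, could be sketched like this (the array name, element type, and formatting are placeholders, not mirage-crypto's actual format):

```python
def emit_c_table(name, values, per_line=4):
    """Render a list of 64-bit words as a static const C array."""
    lines = [f"static const uint64_t {name}[{len(values)}] = {{"]
    for i in range(0, len(values), per_line):
        chunk = ", ".join(f"0x{v:016x}" for v in values[i:i + per_line])
        lines.append(f"    {chunk},")
    lines.append("};")
    return "\n".join(lines)

# In the real generator, the values would come from the Mirage_crypto_ec
# API (the multiples of the generator point); here they're dummies.
print(emit_c_table("p256_base_table", [1, 2, 3, 4, 5]))
```

The emitted file can then be checked into the repository and `#include`d from the stubs, which is exactly what keeps the OCaml side the source of truth without any runtime initialization.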
I just pushed a new commit for hard-coded tables, which should be ready for review! I also addressed @palainp's comment (and replaced tabs in _stubs.c files with spaces, if that's OK).
It seems I had wrongly assumed it was easy to obtain the 32-bit tables from the 64-bit ones (i.e. the field element representation is the same in memory). It seems true for P256 and P384, but not P224/P521, e.g. for P224, G_x is represented in 32-bit as
while in 64-bit:
So, I have to either figure out how/if it's possible to correctly use the 32-bit interface from a 64-bit host, or include the table generation as part of the compilation process (so only the table for the host architecture is generated, but that implies potentially separating part of the code into its own library (or conditional compilation shenanigans) to avoid the circular dependency). Or, keep the code but generate the 32-bit table from a 32-bit machine, and accept that the table is not re-generable from a 64-bit host.

Edit: or serialize to octets instead and de-serialize at startup, but that's hardly different from computing the whole table at startup, and I'm not even sure the serialization is compatible between 32- and 64-bit.
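One plausible explanation for the P-224/P-521 mismatch (an assumption on my part, not something stated in the thread): if the field backend keeps elements in Montgomery form with R = 2^(limb count × word size), then for P-256 and P-384 both word sizes give the same R (2^256 and 2^384 respectively), while for P-224 a 64-bit host uses R = 2^(4×64) = 2^256 but a 32-bit host uses R = 2^(7×32) = 2^224, so the stored limbs necessarily differ. This can be checked with the public P-224 constants:

```python
# NIST P-224 prime and generator x-coordinate (public domain parameters).
p224 = 2**224 - 2**96 + 1
gx = 0xb70e0cbd6bb4bf7f321390b94a03c1d356c21122343280d6115c1d21

# Hypothetical Montgomery forms under the two word sizes (assumption:
# R = 2^(limbs * word size), i.e. 4x64-bit vs 7x32-bit limbs).
mont_64 = gx * 2**256 % p224
mont_32 = gx * 2**224 % p224
print(mont_64 != mont_32)  # the in-memory representations disagree

# For P-256, both word sizes yield the same R, so the tables can be shared:
print(2**(4 * 64) == 2**(8 * 32))
```

If this is indeed the cause, converting a 64-bit table to a 32-bit one would require a Montgomery-domain conversion per element, not just limb splitting.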
From my point of view, it would be best to be able to generate the tables on a 64-bit system. On the other hand, since they'll be generated once, and we'll only need additional tables when we add new curves, it is fine to generate them on a 32-bit system as well. [Also, it's not clear to me how many moons will pass until we remove the 32-bit support here. So far I haven't seen many users, but quite some maintenance burden.] Thanks for your work; as said, if you can pre-compute and add the 32-bit tables to make CI pass, that'd be neat.
Alright, thank you for your thoughts. I think we're good now. Both tables are in the tree, and
Thanks @Firobe, I can confirm a big speed improvement (5x to 9x, comparing current head vs. this PR)!
Thanks for your work. I added two comments; apart from that, this is ready to be merged.
This partly reverts commit 28abf53.
Thank you for your review! Your comments should be addressed. If they are, feel free to squash/merge :)
Thanks a lot! I'll merge when CI is happy (or only shows temporary failures).
CHANGES:

* mirage-crypto, mirage-crypto-rng{,lwt,mirage}: support CL.EXE compiler (mirage/mirage-crypto#137 @jonahbeckford) - mirage-crypto-pk not yet due to gmp dependency, mirage-crypto-ec doesn't pass testsuite
* mirage-crypto-ec: use simpler square root for ed25519 - saving 3 multiplications and 2 squarings, details https://mailarchive.ietf.org/arch/msg/cfrg/qlKpMBqxXZYmDpXXIx6LO3Oznv4/ (mirage/mirage-crypto#196 @hannesm)
* mirage-crypto-ec: use sliding window method with pre-computed values of multiples of the generator point for NIST curves, speedup around 4x for P-256 sign (mirage/mirage-crypto#191 @Firobe, review @palainp @hannesm)
* mirage-crypto-ec: documentation: warn about power timing analysis on `k` in Dsa.sign (mirage/mirage-crypto#195 @hannesm, as proposed by @edwintorok)
* mirage-crypto-ec: replace internal Cstruct.t by string (speedup up to 2.5x) (mirage/mirage-crypto#146 @dinosaure @hannesm @reynir, review @Firobe @palainp @hannesm @reynir)
* bench/speed: add EC (ECDSA & EdDSA generate/sign/verify, ECDH secret/share) operations (mirage/mirage-crypto#192 @hannesm)
* mirage-crypto-rng: use rdtime instead of rdcycle on RISC-V (rdcycle is privileged since Linux kernel 6.6) (mirage/mirage-crypto#194 @AdrianBunk, review by @edwintorok)
* mirage-crypto-rng: support Loongarch (mirage/mirage-crypto#190 @fangyaling, review @loongson-zn)
* mirage-crypto-rng: support NetBSD (mirage/mirage-crypto#189 @drchrispinnock)
* mirage-crypto-rng: allocate less in Fortuna when feeding (mirage/mirage-crypto#188 @hannesm, reported by @palainp)
* mirage-crypto-ec: avoid mirage-crypto-pk and asn1-combinators test dependency (instead, craft our own asn.1 decoder -- mirage/mirage-crypto#200 @hannesm)

### Performance differences between v0.11.2 and v0.11.3 and OpenSSL

The overall result is promising: the P-256 sign operation improved 9.4 times, but is still 4.9 times slower than OpenSSL. 
Numbers in operations per second (apart from speedup, which is a factor v0.11.3 / v0.11.2), gathered on an Intel i7-5600U CPU 2.60GHz using FreeBSD 14.0, OCaml 4.14.1, and OpenSSL 3.0.12.

#### P224

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 1160 | 20609 | 17.8 | |
| sign | 931 | 8169 | 8.8 | 21319 |
| verify | 328 | 1606 | 4.9 | 10719 |
| dh-sec | 1011 | 12595 | 12.5 | |
| dh-kex | 992 | 2021 | 2.0 | 16691 |

#### P256

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 990 | 19365 | 19.6 | |
| sign | 792 | 7436 | 9.4 | 36182 |
| verify | 303 | 1488 | 4.9 | 13383 |
| dh-sec | 875 | 11508 | 13.2 | |
| dh-kex | 895 | 1861 | 2.1 | 17742 |

#### P384

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 474 | 6703 | 14.1 | |
| sign | 349 | 3061 | 8.8 | 900 |
| verify | 147 | 544 | 3.7 | 1062 |
| dh-sec | 378 | 4405 | 11.7 | |
| dh-kex | 433 | 673 | 1.6 | 973 |

#### P521

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 185 | 1996 | 10.8 | |
| sign | 137 | 438 | 3.2 | 2737 |
| verify | 66 | 211 | 3.2 | 1354 |
| dh-sec | 180 | 1535 | 8.5 | |
| dh-kex | 201 | 268 | 1.3 | 2207 |

#### 25519

| op | v0.11.2 | v0.11.3 | speedup | OpenSSL |
|--------|---------|---------|---------|---------|
| gen | 23271 | 22345 | 1.0 | |
| sign | 11228 | 10985 | 1.0 | 21794 |
| verify | 8149 | 8029 | 1.0 | 7729 |
| dh-sec | 14075 | 13968 | 1.0 | |
| dh-kex | 13487 | 14079 | 1.0 | 24824 |
… P-curves (mirage#191)

* [ec] Use windowed algorithm for base scalar mult

Using a sliding window method with pre-computed values of multiples of the generator point, we obtain far more efficient performance for the special case where G = P in the scalar multiplication kP. By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm leaks no more information about its inputs than the current Montgomery ladder.

* [ec] Rewrite scalar_mult_base in C

For performance. This implies the need to get generator points from C as well. The pre-computed tables are stored in static memory, and computed lazily.

* Generate pre-tables AOT and hardcode them
* Separate 64/32 tables
* Add 32-bit tables
Using a sliding window method with pre-computed values of multiples of the generator point, we can obtain far more efficient performance for the special case where G = P in the scalar multiplication kP.
By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm should leak no more information about its inputs than the current Montgomery ladder.
This change comes at the cost of an algorithm that is slightly less simple than the current one, but I'd say it is still relatively easy to audit, in particular for constant-time behavior. Notably, this is not a NAF-based algorithm. The other cost is the pre-computation phase, which should be negligible for long-running services, and similar to the old algorithm for one-shot computations. It also needs more memory to store the pre-computed values.
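To make the algorithm's shape concrete, here is a sketch of a fixed-window variant (a simplification of the sliding-window method) over P-256 in affine coordinates. It is illustrative only: this is not the PR's OCaml/C code, and plain Python gives no constant-time guarantees (the table indexing below is secret-dependent, unlike the safe selection described in the PR).

```python
# Fixed-window base-point multiplication with pre-computed tables on P-256.
# tables[i][j] holds j * 2^(W*i) * G, so each window of the scalar costs one
# table lookup and the main loop contains only point additions.

P = 2**256 - 2**224 + 2**192 + 2**96 - 1   # NIST P-256 prime
A = P - 3                                   # curve coefficient a = -3
GX = 0x6b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c296
GY = 0x4fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5

def add(p1, p2):
    """Affine point addition; None is the point at infinity."""
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P) % P
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

W = 4  # window width (an illustrative choice)

def precompute(g, bits=256):
    tables, base = [], g
    for _ in range((bits + W - 1) // W):
        row = [None]
        for _ in range(2**W - 1):
            row.append(add(row[-1], base))   # row[j] = j * base
        for _ in range(W):
            base = add(base, base)           # base *= 2^W for next window
        tables.append(row)
    return tables

TABLES = precompute((GX, GY))

def scalar_mult_base(k):
    """k * G using only additions in the main loop."""
    acc = None
    for i, table in enumerate(TABLES):
        acc = add(acc, table[(k >> (W * i)) & (2**W - 1)])
    return acc

def scalar_mult_naive(k, p):
    """Plain double-and-add, used only to cross-check the result."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, p)
        p = add(p, p)
        k >>= 1
    return acc
```

Since the tables depend only on G, the pre-computation is done once per curve, which is exactly what makes the base-point case so much cheaper than general kP.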
As mentioned in the code, the actual implementation is largely inspired by Go's own crypto library.
All tests seem to pass.
Benchmark
On a simple benchmark repeatedly signing messages and then generating keys, with 10_000 iterations each, on OCaml 4.14.1:
Cstruct

Current `main` branch:

On this branch:
Bytes

I tried @dinosaure's branch replacing internal `cstruct` by `bytes`, which turns out to greatly improve performance as well on this benchmark (and visibly relieves the GC).

On the PR's base branch:

A merge of this branch and the PR's:
Purely on P-256 signatures, the combination of Cstruct replacement and this new algorithm increases the performance by a factor of 6 :)
Further work for more performance