Parallelize compute_chunks_operands #294
Comments
Looks like the |
Hey @sragss I'm open to taking on this issue. |
Please! Feel free to ask questions here. |
@githubsands any update on this? |
Is this issue still open? |
Hi @sragss, I'm having difficulty reproducing the 1.8% (I'm only getting 0.09%). Right now I run
and then get the stats in Perfetto using SQL
So this only shows 0.09%. I'm running on an 8-core M1 with a cycle count of 242. So I had a couple of questions:
|
242 cycle count is too small to get an idea of relative asymptotic performance. Can you try a bigger example – maybe in the 128k-512k range? Also thought sha2-chain was around 3M cycles, did you modify it to fit in a smaller amount of RAM? |
Sorry, silly mistake. When I looked up how to find the cycle count, the thing it actually pointed me to was the battery cycle count 🤦 I didn't realize you were referring to trace length - 3,632,556 in my case. I ran it on the original sha2-chain. But re: the questions above, what are the steps for reproducing your benchmarks? Or possibly do you have any ideas why I might be getting a much lower %? |
Hi, can I take on this issue? |
For a 64-core machine at a cycle count of ~16M, Jolt spends ~1.8% of its time in a segment called compute_chunks_operands.
This segment allocates and computes C different chunks for each instruction. For example, for the EQ instruction we split the input operands X, Y into 4 chunks of 8 bits each (WORD_SIZE / C bits per chunk). We can then compute EQ over each chunk individually.

Idea for acceleration: Split chunks_x and chunks_y into mutable slices. Iterate over each and compute the values in parallel, writing to the slice indexes directly.

It may be helpful to review the tracing strategy for performance testing.
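
As a rough illustration of the idea above, here is a minimal sketch, not Jolt's actual code. It assumes 32-bit operands, C = 4, a flat row-major layout for chunks_x and chunks_y, and most-significant-chunk-first ordering. The names chunk_operand, compute_chunks_operands_par and eq_chunked are hypothetical. It uses scoped threads from the standard library so that it stays self-contained; in a codebase that already depends on rayon, parallel mutable chunk iterators would be the more likely fit.

```rust
use std::thread;

// Assumed parameters for illustration: 32-bit words split into C = 4 chunks.
const WORD_SIZE: usize = 32;
const C: usize = 4;
const BITS_PER_CHUNK: usize = WORD_SIZE / C; // 8 bits per chunk

/// Split one operand into C chunks, most significant chunk first (assumed order).
fn chunk_operand(x: u32) -> [u8; C] {
    let mut out = [0u8; C];
    for (i, slot) in out.iter_mut().enumerate() {
        let shift = (C - 1 - i) * BITS_PER_CHUNK;
        *slot = (x >> shift) as u8; // the cast keeps the low 8 bits
    }
    out
}

/// EQ over whole operands, computed chunk by chunk.
fn eq_chunked(x: u32, y: u32) -> bool {
    let (cx, cy) = (chunk_operand(x), chunk_operand(y));
    cx.iter().zip(cy.iter()).all(|(a, b)| a == b)
}

/// Fill chunks_x and chunks_y in parallel. Each output holds ops.len() * C
/// entries, with the C chunks of instruction k at indexes k*C..(k+1)*C.
fn compute_chunks_operands_par(
    ops: &[(u32, u32)],
    num_threads: usize,
) -> (Vec<u8>, Vec<u8>) {
    let mut chunks_x = vec![0u8; ops.len() * C];
    let mut chunks_y = vec![0u8; ops.len() * C];

    // Instructions handled per thread (rounded up, never zero).
    let threads = num_threads.max(1);
    let per_thread = ((ops.len() + threads - 1) / threads).max(1);

    thread::scope(|s| {
        // Disjoint mutable slices: each thread owns its own output range.
        let xs = chunks_x.chunks_mut(per_thread * C);
        let ys = chunks_y.chunks_mut(per_thread * C);
        for ((ops_part, x_part), y_part) in ops.chunks(per_thread).zip(xs).zip(ys) {
            s.spawn(move || {
                for (k, &(x, y)) in ops_part.iter().enumerate() {
                    x_part[k * C..(k + 1) * C].copy_from_slice(&chunk_operand(x));
                    y_part[k * C..(k + 1) * C].copy_from_slice(&chunk_operand(y));
                }
            });
        }
    });

    (chunks_x, chunks_y)
}
```

Because the output buffers are split into non-overlapping mutable slices before any thread starts, no locking is needed and the borrow checker can verify that writes never alias.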