Support SPECIES_128 #41

Open
wants to merge 3 commits into
base: main

Conversation


@Squiry Squiry commented Feb 29, 2024

Should help with #9. Performance is still fairly low, though (about half of what jsoniter shows).

@piotrrzysko
Member

Thanks for the contribution!

I'm a bit busy right now, working on a feature for the parser that I'll hopefully finish this month, so I can't promise when I'll be able to look at your PR, but I'll definitely do so. I believe that the most important thing is to make sure that this change doesn't affect the most common cases (256-bit and 512-bit registers).

@piotrrzysko
Member

I've run the benchmarks on a machine with a Neoverse-N1 CPU:

Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  Model name:             Neoverse-N1
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r3p1
    BogoMIPS:             243.75
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid 

and the results are indeed unsatisfactory:

Benchmark                                                                              Mode  Cnt    Score   Error  Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson                   thrpt    5  436.897 ± 1.512  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson                    thrpt    5  380.908 ± 0.816  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson                   thrpt    5  197.846 ± 0.894  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded             thrpt    5  199.902 ± 0.545  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson        thrpt    5  626.115 ± 1.175  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson         thrpt    5  463.471 ± 0.881  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala  thrpt    5  871.302 ± 4.688  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson        thrpt    5  213.725 ± 0.452  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded  thrpt    5  216.995 ± 0.329  ops/s

I'd like to understand where the disparity between 256/512-bit and 128-bit vectors comes from (see the results for Intel CPUs in the README). I don't have the capacity to investigate this right now. Would you like to look into it, or would you prefer that I come back to it when I have time?

@Squiry
Author

Squiry commented May 2, 2024

> I'd like to understand where the disparity between 256/512-bit and 128-bit vectors comes from

The way I've implemented this feature for 128-bit vectors is not the same as the arm64 implementation in the original repo. They take a slightly different approach there, but I don't think we need that level of detail here anyway.

@piotrrzysko
Member

I think your code looks good. By the disparity between 256/512-bit and 128-bit vectors I meant the difference in performance. As you can see in the README, for the (SchemaBased)ParseAndSelectBenchmark simdjson-java is typically 3-4 times faster than other libraries. However, based on the results I shared in my previous comment, with 128-bit vectors the performance doesn't even match that of other libraries. I'm curious about the root cause of this difference. Could it simply be due to narrower registers? Or perhaps there's something else we're missing?
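
To put a rough number on the "narrower registers" hypothesis, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that the parser scans input in fixed 64-byte blocks; that block size and the names `LoadsPerBlock`, `BLOCK_BYTES`, and `loadsPerBlock` are hypothetical and are not taken from this PR.

```java
public class LoadsPerBlock {
    // Hypothetical block size in bytes; an assumption for illustration only.
    static final int BLOCK_BYTES = 64;

    // Number of full-width vector loads needed to cover one block.
    static int loadsPerBlock(int registerBits) {
        if (registerBits <= 0 || registerBits % 8 != 0) {
            throw new IllegalArgumentException("register width must be a positive multiple of 8 bits");
        }
        int registerBytes = registerBits / 8;
        if (BLOCK_BYTES % registerBytes != 0) {
            throw new IllegalArgumentException("register width must evenly divide the block size");
        }
        return BLOCK_BYTES / registerBytes;
    }

    public static void main(String[] args) {
        for (int bits : new int[] {128, 256, 512}) {
            System.out.println(bits + "-bit: " + loadsPerBlock(bits) + " load(s) per block");
        }
    }
}
```

Under that assumption, 128-bit registers need 4 loads per block, against 2 for 256-bit and 1 for 512-bit. Register width alone would then explain at most a 2-4x increase in vector operations, so it may not fully account for falling behind the non-SIMD libraries.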

@Squiry
Author

Squiry commented May 5, 2024

That's interesting. My MacBook with an M1 Max gives me a different result:

Benchmark                                                                              Mode  Cnt     Score    Error  Units
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson        thrpt    5  1874.904 ±  8.548  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson         thrpt    5  1044.073 ± 39.591  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala  thrpt    5  2153.209 ± 22.102  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson        thrpt    5  1120.909 ± 16.372  ops/s
SchemaBasedParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded  thrpt    5  1131.995 ± 42.193  ops/s

It's still bad, but not nearly that bad.
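
For a like-for-like comparison of the two machines, the simdjson score can be divided by the jsoniter_scala score from the same run. The sketch below does that for the SchemaBasedParseAndSelectBenchmark numbers posted in this thread; the class and method names are hypothetical helpers, not part of the project.

```java
public class RelativeThroughput {
    // Ratio of a candidate's throughput to a baseline's throughput (both in ops/s).
    static double ratio(double candidateOpsPerSec, double baselineOpsPerSec) {
        if (baselineOpsPerSec <= 0) {
            throw new IllegalArgumentException("baseline throughput must be positive");
        }
        return candidateOpsPerSec / baselineOpsPerSec;
    }

    public static void main(String[] args) {
        // Scores copied from the benchmark tables in this conversation.
        double neoverseN1 = ratio(213.725, 871.302);   // simdjson vs jsoniter_scala, Neoverse-N1
        double m1Max = ratio(1120.909, 2153.209);      // simdjson vs jsoniter_scala, M1 Max
        System.out.println(String.format("Neoverse-N1: %.2f", neoverseN1));
        System.out.println(String.format("M1 Max:      %.2f", m1Max));
    }
}
```

By this measure simdjson reaches roughly 0.25 of jsoniter_scala's throughput on the Neoverse-N1 and roughly 0.52 on the M1 Max, so the gap is about twice as large on the Neoverse-N1.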
