add SimdJsonParser2 base on bitindex #60

Open
wants to merge 1 commit into
base: main

@heykirby heykirby commented Oct 4, 2024

issue: #59

@heykirby heykirby force-pushed the feature_simdjson2 branch 5 times, most recently from 5c92d47 to 3139b2c on October 7, 2024 09:53
@heykirby
Author

heykirby commented Oct 19, 2024

@arouel Thanks very much, I have fixed the code based on your suggestions.
When the paths to extract are known in advance, SimdJsonParserWithFixPath provides better performance and supports compacting map and list values into strings. It can quickly skip paths that do not need to be parsed, and it avoids creating a JSON node instance for every node in the document.

Benchmark results for reference:
Environment: Species[byte, 32, S_256_BIT]

Result "org.simdjson.AParseAndSelectFixPathBenchMark.parseMultiValuesForFixPaths_Jackson":
693.528 ±(99.9%) 18.073 ops/s [Average]
(min, avg, max) = (687.806, 693.528, 699.113), stdev = 4.694
CI (99.9%): [675.455, 711.601] (assumes normal distribution)

Result "org.simdjson.ParseAndSelectFixPathBenchMark.parseMultiValuesForFixPaths_SimdJson":
2258.495 ±(99.9%) 41.596 ops/s [Average]
(min, avg, max) = (2242.400, 2258.495, 2269.942), stdev = 10.802
CI (99.9%): [2216.899, 2300.091] (assumes normal distribution)

Result "org.simdjson.ParseAndSelectFixPathBenchMark.parseMultiValuesForFixPaths_SimdJsonParserWithFixPath":
4075.984 ±(99.9%) 104.804 ops/s [Average]
(min, avg, max) = (4029.568, 4075.984, 4100.273), stdev = 27.217
CI (99.9%): [3971.180, 4180.789] (assumes normal distribution)
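
For a quick read of the three averages above, the throughput ratios work out to roughly 3.3x (SimdJson over Jackson), 5.9x (SimdJsonParserWithFixPath over Jackson), and 1.8x (SimdJsonParserWithFixPath over SimdJson). A minimal sketch of that arithmetic follows; the class and method names are illustrative only and are not part of the PR:

```java
public class BenchmarkRatios {

    /** Ratio of two throughput figures, both in ops/s (hypothetical helper). */
    public static double speedup(double candidateOpsPerSec, double baselineOpsPerSec) {
        return candidateOpsPerSec / baselineOpsPerSec;
    }

    public static void main(String[] args) {
        // Average ops/s copied from the JMH results quoted in this comment.
        double jackson = 693.528;
        double simdJson = 2258.495;
        double fixPath = 4075.984;

        System.out.printf("SimdJson vs Jackson: %.2fx%n", speedup(simdJson, jackson));
        System.out.printf("FixPath vs Jackson:  %.2fx%n", speedup(fixPath, jackson));
        System.out.printf("FixPath vs SimdJson: %.2fx%n", speedup(fixPath, simdJson));
    }
}
```

These ratios compare the reported averages only; the confidence intervals quoted above still apply to each individual figure.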

@piotrrzysko
Member

How is this different from On-Demand parsing available in the c++ simdjson version?

I introduced a form of on-demand parsing in #51 (see: org.simdjson.OnDemandJsonIterator). The API requires specifying a target class to which the JSON will be parsed. However, it should be relatively easy to extend this to support a DOM-like API (JsonValue, JsonIterator, etc.), which I believe is more intuitive than introducing syntax for accessing fields and then returning an array of strings with the corresponding values.

@arouel

arouel commented Oct 21, 2024

@piotrrzysko I agree with you, a DOM-like API (JsonValue, JsonIterator, etc.) would be very helpful in use cases where only specific parts of the JSON are conditionally relevant, so that a mapping to an object would cause allocation that you want to avoid.

Can you guide us a bit, so that we can prepare a PR?

@arouel arouel left a comment

@heykirby I just want to share some thoughts and questions:

With some minor API changes in simdjson-java, could we keep SimdJsonParserWithFixPath in another codebase, or could it live in a contribution module, since it is tailored to a very specific use case?

Wouldn't a record JsonNode be sufficient instead of using Lombok?

src/main/java/org/simdjson/SimdJsonParser.java (outdated, resolved)
@heykirby
Author

heykirby commented Oct 22, 2024

> How is this different from On-Demand parsing available in the c++ simdjson version?
>
> I introduced a form of on-demand parsing in #51 (see: org.simdjson.OnDemandJsonIterator). The API requires specifying a target class to which the JSON will be parsed. However, it should be relatively easy to extend this to support a DOM-like API (JsonValue, JsonIterator, etc.), which I believe is more intuitive than introducing syntax for accessing fields and then returning an array of strings with the corresponding values.

@piotrrzysko Hello, I think the idea of avoiding the construction of unused JSON nodes is similar to On-Demand parsing.
There may be some performance advantages when extracting multiple values from deeply nested JSON.
For example:

{
	"statuses": [{
			"text": "@aym0566x \n\n名前:前田あゆみ\n第一印象:なんか怖っ!\n今の印象:とりあえずキモい。噛み合わない\n好きなところ:ぶすでキモいとこ😋✨✨\n思い出:んーーー、ありすぎ😊❤️\nLINE交換できる?:あぁ……ごめん✋\nトプ画をみて:照れますがな😘✨\n一言:お前は一生もんのダチ💖",
			"user": {
				"name": "AYUMI",
				"screen_name": "ayuu0123",
				"followers_count": 262,
				"friends_count": 252
			},
			"retweet_count": 0,
			"favorite_count": 0
		},
		{
			"text": "RT @KATANA77: えっそれは・・・(一同) http://t.co/PkCJAcSuYK",
			"user": {
				"name": "RT&ファボ魔のむっつんさっm",
				"screen_name": "yuttari1998",
				"followers_count": 95,
				"friends_count": 158
			},
			"retweet_count": 82,
			"favorite_count": 42
		}
	],
	"search_metadata": {
		"count": 100
	}
}
  1. In multi-level scenarios, avoid repeated construction of parent nodes.
    Suppose we want the value for ['$.statuses.[0].user.name'].
    With on-demand parsing, we first need to construct jsonNode_statuses, then jsonNode_statuses[0], then jsonNode_statuses[0]_user, and finally jsonNode_statuses[0]_user_name, although only the last node is actually needed.
    With SimdJsonParserWithFixPath, the parsing result array is constructed only once during initialization and can be reused every time JSON data is parsed.

  2. When parsing multiple fields at one level, scan the token list only once to fill in all fields.
    Suppose we want the values for ['$.statuses.[0].user.name', '$.statuses.[0].user.screen_name', '$.statuses.[0].user.followers_count', '$.statuses.[0].user.friends_count'].
    With on-demand parsing, we first need to travel to jsonNode_statuses[0]_user, and each time we parse a field under the user node we need to traverse the entire token list under user. In this case, the traversal is repeated four times.
    With SimdJsonParserWithFixPath, the token list only needs to be scanned once to fill all fields.

In some specific scenarios, especially big-data log cleaning, it could replace Hive's json_tuple function, which may be useful.
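
The difference described in point 2 can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: it does not use simdjson-java or the code in this PR, the names `FixedPathScanSketch`, `lookupEach`, and `FixedPathReader` are hypothetical, and the "token list" is simplified to an array of key/value pairs for the fields under `user`. It contrasts a per-field lookup that restarts from the first token for each requested key with a single scan that fills a result array allocated once and reused across documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class FixedPathScanSketch {

    /** Per-field lookup: every requested key restarts from the first token. */
    public static String[] lookupEach(String[][] tokens, String[] wanted, int[] visits) {
        String[] out = new String[wanted.length];
        for (int i = 0; i < wanted.length; i++) {
            for (String[] token : tokens) {
                visits[0]++; // count every token inspected
                if (token[0].equals(wanted[i])) {
                    out[i] = token[1];
                    break;
                }
            }
        }
        return out;
    }

    /** Single scan: each wanted key owns a slot in a result array built once. */
    public static final class FixedPathReader {
        private final Map<String, Integer> slotByKey = new HashMap<>();
        private final String[] result; // allocated once, reused for every document

        public FixedPathReader(String[] wanted) {
            for (int i = 0; i < wanted.length; i++) {
                slotByKey.put(wanted[i], i);
            }
            result = new String[wanted.length];
        }

        public String[] read(String[][] tokens, int[] visits) {
            Arrays.fill(result, null); // reset the slots instead of reallocating
            int remaining = slotByKey.size();
            for (String[] token : tokens) {
                visits[0]++; // count every token inspected
                Integer slot = slotByKey.get(token[0]);
                if (slot != null && result[slot] == null) {
                    result[slot] = token[1];
                    if (--remaining == 0) {
                        break; // every requested field is filled, stop early
                    }
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        // Fields under "user" from the first status in the example above.
        String[][] userTokens = {
            {"name", "AYUMI"},
            {"screen_name", "ayuu0123"},
            {"followers_count", "262"},
            {"friends_count", "252"},
        };
        String[] wanted = {"name", "screen_name", "followers_count", "friends_count"};

        int[] perFieldVisits = {0};
        int[] singleScanVisits = {0};
        String[] a = lookupEach(userTokens, wanted, perFieldVisits);
        String[] b = new FixedPathReader(wanted).read(userTokens, singleScanVisits);

        System.out.println(Arrays.toString(a) + " visits=" + perFieldVisits[0]);
        System.out.println(Arrays.toString(b) + " visits=" + singleScanVisits[0]);
    }
}
```

In this toy model the single scan inspects each token at most once, while the per-field lookup revisits earlier tokens for every additional key. How closely that mirrors the real on-demand iterator depends on its implementation, which this sketch does not attempt to reproduce.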

@heykirby
Author

heykirby commented Oct 22, 2024

> @heykirby I just want to share some thoughts and questions:
>
> With some minor API changes in simdjson-java, could we keep SimdJsonParserWithFixPath in another codebase, or could it live in a contribution module, since it is tailored to a very specific use case?
>
> Wouldn't a record JsonNode be sufficient instead of using Lombok?

@arouel Thanks, the unused imports have been removed.

3 participants