Byte strings aren't able to compare UTF32 strings #177

Open
ate47 opened this issue Nov 8, 2022 · 0 comments

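I've noticed that the comparison results differ when one string contains a character that needs a surrogate pair in UTF-16 (a supplementary code point), for example 𦳣 (U+26CE3), and the other contains a character that doesn't, for example U+F4D1: the byte string comparison and the Java String comparison don't return the same sign.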

Here is the code to reproduce it; I've built the strings from their code points.

String ss1 = new String(Character.toChars(0x26ce3)); // 𦳣 (U+26CE3, surrogate pair in UTF-16)
String ss2 = new String(Character.toChars(0xf4d1)); // U+F4D1 (private use area, single UTF-16 code unit)

CompactString b1 = new CompactString(ss1);
CompactString b2 = new CompactString(ss2);

assertEquals(ss1, b1.toString());
assertEquals(ss2, b2.toString());

// Clamp both results to [-1, 1] so that only the sign is compared
int cmpByte = Math.max(-1, Math.min(1, b1.compareTo(b2)));
int cmpStr = Math.max(-1, Math.min(1, b1.toString().compareTo(b2.toString())));

assertEquals(cmpStr, cmpByte);
// java.lang.AssertionError: 
// Expected :-1
// Actual   :1
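
The mismatch can be reproduced with the standard library alone. The sketch below assumes that CompactString compares the UTF-8 bytes of the string as unsigned values (as the issue title suggests, this is not checked against its source), and the class and method names are hypothetical. `String.compareTo` compares UTF-16 code units, so the high surrogate 0xD85B sorts before 0xF4D1, while in UTF-8 the lead byte 0xF0 sorts after 0xEF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8VsUtf16Order {
    // Sign of String.compareTo, which compares UTF-16 code units.
    static int utf16Order(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Sign of an unsigned lexicographic comparison of the UTF-8 bytes.
    // Assumption: this mirrors what a byte string comparison does.
    static int utf8Order(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        return Integer.signum(Arrays.compareUnsigned(x, y)); // Java 9+
    }

    public static void main(String[] args) {
        String s1 = new String(Character.toChars(0x26ce3)); // UTF-16: D85B DCE3, UTF-8: F0 A6 B3 A3
        String s2 = new String(Character.toChars(0xf4d1));  // UTF-16: F4D1,      UTF-8: EF 93 91
        System.out.println(utf16Order(s1, s2)); // -1
        System.out.println(utf8Order(s1, s2));  // 1
    }
}
```

The two orders only disagree when a supplementary code point is compared with a code point in the range U+E000 to U+FFFF, which is why most strings are unaffected.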

This causes a bug when generating an HDT from a section of Wikidata: hdtVerify then reports errors on the object entries.

> .\rdf2hdt.bat .\chunk.nt.gz test.hdt
...
File converted in: 2 min 30 sec 463 ms 185 us
Total Triples: 49996305
Different subjects: 1206364
Different predicates: 3655
Different objects: 9917883
Common Subject/Object:603515
HDT saved to file in: 1 sec 242 ms 73 us

> .\hdtVerify.bat .\test.hdt
Checking subject entries
Checking predicate entries
Checking object entries
ERRA: "????"@zh-hant / "??"@lzh
ERRB: "????"@zh-hant / "??"@lzh
ERRA: "????"@zh-hant / "???"@lzh
ERRB: "????"@zh-hant / "???"@lzh
ERRA: "???????"@zh-hant / "?????"@got
ERRB: "???????"@zh-hant / "?????"@got
Checking shared entries
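
One possible way to make the two sides agree, offered as a sketch and not as the project's actual fix, is to compare Java strings by code point instead of by UTF-16 code unit. For well-formed strings, code point order is the same as unsigned UTF-8 byte order. The class and method names below are hypothetical.

```java
public class CodePointOrder {
    // Compares two strings by Unicode code point rather than by UTF-16 code unit.
    // Assumption: both strings are well formed (no unpaired surrogates).
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by 1 or 2 chars depending on whether a surrogate pair was read.
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String s1 = new String(Character.toChars(0x26ce3));
        String s2 = new String(Character.toChars(0xf4d1));
        // Positive, like the UTF-8 byte comparison, unlike String.compareTo.
        System.out.println(Integer.signum(compareByCodePoint(s1, s2)));
    }
}
```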