rdf2hdt produces invalid UTF8 values? #270

Open
KonradHoeffner opened this issue Dec 14, 2022 · 1 comment


KonradHoeffner commented Dec 14, 2022

I'm working on an HDT library in Rust and use HDT files produced by hdt-cpp as test input.
Until now that worked fine, but I get an error with the subject http://dbpedia.org/resource/Özgür_Özata, which is the last subject in the attached Turtle file.
In a hex editor, "Özgür_Özata" in that file is represented as the bytes 0xc3,0x96,0x7a,0x67,0xc3,0xbc,0x72,0x5f,0xc3,0x96,0x7a,0x61,0x74,0x61.
These bytes are, for example, valid input for the Rust function std::str::from_utf8().
However, rdf2hdt changes the first byte from 0xc3 to 0x9d, which makes std::str::from_utf8() return an error, so calling unwrap() on the result panics.
I'm not a UTF-8 expert and can't say whether that is an alternative valid representation of an "Ö" character, but given that the function fails I assume it is not, so I want to submit this as a possible bug.
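
For reference, here is a minimal sketch of why 0x9d cannot start a UTF-8 sequence. It classifies a single byte by its high bits, following the standard UTF-8 byte ranges. The names `ByteKind` and `classify` are hypothetical helpers made up for this illustration; they are not part of hdt-cpp or the Rust standard library.

```rust
/// Role a single byte can play in a UTF-8 sequence, judged by its value alone.
#[derive(Debug, PartialEq)]
enum ByteKind {
    Ascii,        // 0xxxxxxx: a complete one-byte character
    Continuation, // 10xxxxxx: may only follow a lead byte
    Lead(usize),  // starts a sequence of the given total length
    Invalid,      // never appears in well-formed UTF-8
}

/// Hypothetical helper: classify one byte according to the UTF-8 byte ranges.
fn classify(b: u8) -> ByteKind {
    match b {
        0x00..=0x7f => ByteKind::Ascii,
        0x80..=0xbf => ByteKind::Continuation,
        0xc2..=0xdf => ByteKind::Lead(2),
        0xe0..=0xef => ByteKind::Lead(3),
        0xf0..=0xf4 => ByteKind::Lead(4),
        _ => ByteKind::Invalid, // 0xc0, 0xc1 and 0xf5..=0xff
    }
}

fn main() {
    // First byte in the Turtle file: a two-byte lead, followed by 0x96.
    println!("0xc3 -> {:?}", classify(0xc3));
    // First byte in the HDT output: a continuation byte with no lead before it.
    println!("0x9d -> {:?}", classify(0x9d));
}
```

Under this classification 0xc3 is a two-byte lead and 0x9d is a continuation byte, so a string starting with 0x9d is not well-formed UTF-8 and is not an alternative encoding of "Ö".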

I'm using a Docker image built from the newest commit of the develop branch.

persondata_en_10k.ttl.zip

docker build . -t hdt
docker run -it --entrypoint /bin/bash -v $PWD:/data hdt
cd data
rdf2hdt -f turtle persondata_en_10k.ttl persondata_en_10k.hdt
@KonradHoeffner

use std::str;

fn main() {
    let hex = [0xc3,0x96,0x7a,0x67,0xc3,0xbc,0x72,0x5f,0xc3,0x96,0x7a,0x61,0x74,0x61];
    let s = str::from_utf8(&hex[0..]).unwrap();
    println!("{s}");
    let hex = [0x9d,0x96,0x7a,0x67,0xc3,0xbc,0x72,0x5f,0xc3,0x96,0x7a,0x61,0x74,0x61];
    let s = str::from_utf8(&hex[0..]).unwrap();
    println!("{s}");
}
 Compiling hex v0.1.0 (/home/konrad/tmp/hex)
    Finished dev [unoptimized + debuginfo] target(s) in 0.10s
     Running `target/debug/hex`
Özgür_Özata
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }', src/main.rs:8:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
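
A variant that reports the offending byte instead of panicking may be handier when checking strings read from an HDT file. This is only a sketch: `describe` is a hypothetical helper name, while `Utf8Error::valid_up_to()` is the standard library method that returns the index of the first byte that failed validation.

```rust
use std::str;

/// Hypothetical helper: validate `bytes` as UTF-8 and describe the result
/// without panicking.
fn describe(bytes: &[u8]) -> String {
    match str::from_utf8(bytes) {
        Ok(s) => format!("valid: {s}"),
        Err(e) => {
            // valid_up_to() is the index of the first byte that is not part
            // of a valid sequence, so indexing with it stays in bounds.
            let at = e.valid_up_to();
            format!("invalid byte 0x{:02x} at offset {at}", bytes[at])
        }
    }
}

fn main() {
    // Bytes as found in the Turtle file.
    let good = [0xc3, 0x96, 0x7a, 0x67, 0xc3, 0xbc, 0x72, 0x5f, 0xc3, 0x96, 0x7a, 0x61, 0x74, 0x61];
    // Same bytes with the first one replaced by 0x9d, as in the HDT output.
    let bad = [0x9d, 0x96, 0x7a, 0x67, 0xc3, 0xbc, 0x72, 0x5f, 0xc3, 0x96, 0x7a, 0x61, 0x74, 0x61];
    println!("{}", describe(&good));
    println!("{}", describe(&bad));
}
```

For the second array this reports offset 0, which matches the `valid_up_to: 0` in the panic message above.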
