generate_compressor_model.py generates invalid .h files #27

Open
hansmaad opened this issue Feb 11, 2017 · 1 comment

Comments

@hansmaad

I'm using a fairly large input file (7 MB) containing ASCII characters only. The words look like this:

hninetrakierf hnisnoitknuf hrettuf hcsnedobSuf htsrUf g hbeilnetrag htsag...

With default options, the tool generates a model file in which every row of successor_ids_by_chr_id_and_chr_id ends in None:

static const int8_t successor_ids_by_chr_id_and_chr_id[32][32] = {
  {-1, 9, 3, 10, 14, -1, 4, 1, 6, 5, -1, 0, ... -1, -1, -1, -1, -1, None},
  {0, -1, 1, 7, 4, 6, 2, 3, 5, 8, 12, 10, 1..., -1, -1, -1, -1, -1, None},

If you don't have time to fix this, do you have any idea how I could repair the file?
If I replace None with -1, the compression is slightly better than with the default shoco_model.h file, but maybe I could do better?
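
One way to repair an already generated header is to rewrite the stray None tokens as -1, the value the other unused slots in the rows above already carry. This is only a minimal sketch, assuming the placeholder always appears as the bare word None; the function name is hypothetical and not part of shoco:

```python
import re


def replace_none_placeholders(header_text: str, filler: int = -1) -> str:
    """Replace bare `None` tokens in generated C source with a numeric filler.

    Hypothetical helper: it post-processes the generated text and does not
    touch the generator script itself.
    """
    # \b restricts the match to the standalone word, so identifiers that
    # merely contain "None" are left alone.
    return re.sub(r"\bNone\b", str(filler), header_text)
```

Reading the generated .h file, passing its contents through this function, and writing the result back should give a header that compiles.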

@EverydayApps

For anyone else who stumbles upon this issue: I just ran this script and got the same result.

I only have 30 unique characters in my data set, so I get two None entries (Python's null value) at the end of each of those inner arrays. My guess is that you have 31 unique characters in your data set.

Those values are never read, so anything between INT8_MIN and INT8_MAX (-128 and 127) should work fine. Deleting those slots should also work, because C zero-initializes any trailing elements that have no initializer. A null char would probably work too, or junk. However, the loop that reads those values short-circuits if the value is less than zero, so if for some reason they were read (they won't be), you would want them to be negative.

There's probably a bug in the Python script that makes it print Python's None for the unused slots instead of a valid int8_t value such as -1 or '\0'.

Count the number of elements inside chrs_by_chr_id[32]. For me, there's space for 32, but only 30 characters in the array. The last two are missing, and the compression loop breaks when it reaches the null value, so it never gets to the inner part that would read the corresponding slots. So the choice of filler value cannot improve compression any further.
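
To check this against your own data, you can count the distinct characters in the training text and compare that with the 32 slots in the table. A rough sketch, assuming whitespace only separates words and is not counted as a character (how the real script treats separators depends on its options); the function name is made up for illustration:

```python
def unused_slots(training_text: str, table_size: int = 32) -> int:
    """Return how many character-table slots stay empty for this text.

    Assumption: whitespace is a word separator and is not itself counted.
    """
    distinct = set("".join(training_text.split()))
    # Never report a negative number when the text has more distinct
    # characters than the table can hold.
    return max(0, table_size - len(distinct))
```

If this returns 2 for a data set with 30 distinct characters, that matches the two None entries per row described above.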

Still, the Python script could be fixed.
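
I have not checked where exactly the script produces None, so the following is not its actual code, just a sketch of what a fix could look like: wherever a row is formatted, pad it to the table width and map any None to -1 before printing. The names format_row, width, and filler are hypothetical:

```python
from typing import Optional, Sequence


def format_row(ids: Sequence[Optional[int]], width: int = 32, filler: int = -1) -> str:
    """Format one row of successor ids as a C initializer list.

    Hypothetical sketch: missing or None entries become `filler` so the
    emitted row is always valid C.
    """
    # Pad short rows with None up to the table width, then truncate nothing
    # beyond `width` entries is kept.
    padded = list(ids[:width]) + [None] * (width - len(ids))
    values = [filler if v is None else v for v in padded]
    return "{" + ", ".join(str(v) for v in values) + "}"
```

A negative filler matches the existing -1 entries and the early exit on negative values in the reading loop.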
