Encoding for C-struct syntax #22

Open
vallsv opened this issue Feb 15, 2021 · 2 comments

Comments

@vallsv
Contributor

vallsv commented Feb 15, 2021

Sorry to spam, but this makes me think a lot :-)

I have some proposals about the way encoding is handled with the C syntax.

struct person 
{ 
    char name[50];
    char single_char;
};
  • A single char should be decoded like an array. I think that is not the case right now.
  • I think there is no real need to support signed char/unsigned char. As I saw in the documentation ("Parse source code files"), it's probably not very good C programming style to specify the sign of a char. For that there is the byte type, which is an int8. I think char should only be used for characters.

I also saw that the library uses utf-8 as the default. That is probably not a good idea. The data could be latin1, or many other encodings. It's also possible that the encoding stays unknown until the structure is read.
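
To illustrate why the default matters, here is a minimal sketch in plain Python (not pycstruct code). The byte values and the `decode_field` helper are made up for the example; it only shows that the same 50-byte `name` field reads fine as latin1 but is not valid utf-8.

```python
# Hypothetical content of `char name[50]`: "José" stored as latin1, NUL padded.
raw = b"Jos\xe9".ljust(50, b"\x00")

def decode_field(data, encoding):
    """Decode a NUL-padded char array; return None if the bytes are invalid."""
    text = data.split(b"\x00", 1)[0]  # keep only the bytes before the first NUL
    try:
        return text.decode(encoding)
    except UnicodeDecodeError:
        return None

as_latin1 = decode_field(raw, "latin1")  # decodes to a 4-character string
as_utf8 = decode_field(raw, "utf-8")     # 0xE9 alone is not valid utf-8
```

When the encoding is not known in advance, keeping the raw bytes and decoding later avoids this kind of failure.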

Thinking about that, and following the way packing is handled, maybe this could be generalized to encoding, for example with a dedicated pragma.

That way the encoding could be switched in the middle of the description if needed.

#pragma encoding("utf-8")
struct person 
{ 
    char name[50];
#pragma encoding("raw")
    char single_char;
};

That said, I don't have such problems for now.
So these are just proposals.
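
A rough sketch of how such a pragma could be tracked, in plain Python. Everything here is hypothetical: the `#pragma encoding(...)` syntax is only the proposal above, and `encodings_per_member` is an invented helper, not part of pycstruct. It scans the description line by line and records which encoding is active when each `char` member is declared.

```python
import re

# Matches the proposed (hypothetical) pragma, e.g. #pragma encoding("utf-8")
PRAGMA = re.compile(r'#pragma\s+encoding\("([^"]+)"\)')
# Matches simple members such as `char name[50];` or `char single_char;`
MEMBER = re.compile(r'char\s+(\w+)\s*(?:\[\d+\])?\s*;')

def encodings_per_member(source, default="utf-8"):
    """Map each char member to the encoding active at its declaration."""
    current = default
    result = {}
    for line in source.splitlines():
        pragma = PRAGMA.search(line)
        if pragma:
            current = pragma.group(1)  # switch encoding from here on
            continue
        member = MEMBER.search(line)
        if member:
            result[member.group(1)] = current
    return result

source = '''
#pragma encoding("utf-8")
struct person
{
    char name[50];
#pragma encoding("raw")
    char single_char;
};
'''

mapping = encodings_per_member(source)
```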

@midstar
Owner

midstar commented Feb 20, 2021

Hi,

Interpreting unsigned/signed char as an array of numbers instead of a utf-8 string was introduced after a suggestion in the following issue:

#11

I think this is a nice "hack" to tell pycstruct what type of char array you would like to use. Thus I do think that pragmas in the source code are unnecessary.

I'm not sure what you mean by "a single char should be decoded like an array". A single char is decoded as an int8 and an "unsigned char" is decoded as a uint8. Note that there is no type called 'byte' in the standard C language. The standard type for a byte is 'char'.

I do agree that it would be good to also support older encoding schemes for legacy systems (note that utf-8 is more or less standard nowadays). To support this I suggest:

  • Add more "string types" to the StructDef add method, for example 'latin1' or whatever else is needed.
  • Add an argument to parse_file and parse_str where you specify the default encoding (char_array_encoding). My guess is that a given system uses the same encoding all over the place. It should also be possible to turn off char_array_encoding (=None) to tell the parser not to generate any strings at all, only arrays of signed or unsigned bytes.
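
The intended behaviour of the second point could look roughly like the sketch below. It uses only Python's standard `struct` module and is not pycstruct's implementation: `decode_char_array` is a hypothetical helper, and only the parameter name `char_array_encoding` comes from the suggestion above.

```python
import struct

def decode_char_array(raw, char_array_encoding="utf-8"):
    """Hypothetical helper: decode a C char array per the proposed setting."""
    if char_array_encoding is None:
        # String generation turned off: return the array as signed bytes (int8).
        return list(struct.unpack(f"{len(raw)}b", raw))
    # Otherwise treat the bytes as a NUL-terminated string in that encoding.
    return raw.split(b"\x00", 1)[0].decode(char_array_encoding)

raw = b"Ren\xe9\x00\x00"  # made-up 6-byte field holding latin1 text

as_text = decode_char_array(raw, "latin1")  # a 4-character string
as_numbers = decode_char_array(raw, None)   # six signed integers
```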

@vallsv
Contributor Author

vallsv commented Feb 20, 2021

Oops. I probably mixed some things together in my mind.

In fact I saw that a

char foo;

was deserialized as an int, while my C header documentation was expecting a character such as E.

That is why I think it makes sense to decode it, but I understand it's not so easy if there is no dedicated type for strings/values.

A set of extra configuration options, as you suggest, sounds like a good idea for handling a few cases, such as which encoding to use or how to handle a single char.
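
The two readings of a single char can be shown with Python's standard `struct` module (this is only an illustration, not pycstruct itself): the same byte can be unpacked as a number or as a one-character string.

```python
import struct

raw = b"E"  # one byte holding the C value 'E'

as_int = struct.unpack("b", raw)[0]    # read as a signed 8-bit integer
as_bytes = struct.unpack("c", raw)[0]  # read as a bytes object of length 1
as_text = as_bytes.decode("ascii")     # decode it to a one-character string
```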
