Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite regex and add extra check to variable names #74

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

sinnbeck
Copy link

Currently the regex in the Writer class does not follow the SPSS naming rules. This should ensure that the rules are actually followed.

https://www.ibm.com/docs/en/spss-statistics/28.0.0?topic=SSLVMB_28.0.0/spss/base/syn_variables_variable_names.htm

@sinnbeck
Copy link
Author

sinnbeck commented Feb 13, 2024

The regex ensures that the name never ends on an underscore as that is recommended. But this does mean that the random string generator is not allowed to add _ anymore

Variable names ending in underscores should be avoided, since such names may conflict with names of variables automatically created by commands and procedures.

I will gladly add that to this PR, but I just want to make sure this is wanted :)

@lestcape
Copy link
Collaborator

The function of the library is to create the sav files. The library does not define or dictate what variable names that file will have and therefore should not take actions against the names that the user decided to give it, except in the case that this names are impossible to be used.

I am also inclined towards a library that always looks for a way to generate the file, even if it has to modify the exact order that was given by the user. For example, could be something like that:

if (in_array($var->name, ['ALL', 'AND', 'BY', 'EQ', 'GE', 'GT', 'LE', 'LT', 'NE', 'NOT', 'OR', 'TO', 'WITH'])) {
    $pos = get_last_position_that_starts_with_prefix ($var->name)
    $var->name = $var->name."_".($pos);
}

However, it is not that it is wrong to throw an error in that case. Sometimes the variable names are important in the exact way the user defined it and then the user needs to know in time that this names could not be generated as he expects them. It's just that I think that in most cases, the user cares little about the names of the variables and what they are really interested in is having the data of that columns without matter about the name.

@sinnbeck
Copy link
Author

Hm yeah that might be the case. My own experience is working with survey data where the naming is very important as it comes from a survey system where the users define the names and therefor the names in the sav file needs to be an exact match.

But as I know to avoid these names, I don't see an issue with just adding a number to them. I will update the PR 👍

@sinnbeck
Copy link
Author

sinnbeck commented Feb 14, 2024

It now just adds _$n like requested. I also wrote some extra tests to ensure the regex works as expected.

@lestcape
Copy link
Collaborator

lestcape commented Feb 14, 2024

My own experience is working with survey data where the naming is very important as it comes from a survey system

I work making health surveys for the Mexican Government. That's why I understand your example, but I still see it as insufficient to throw errors.

Columns names can be important in a survey, because that names are linked to a question in the questionnaire. But what is in the game here is if you want the data of the survey with some incompatibilities in the names of the variables or you prefer not to have anything. I think you will always preferred something, that nothing. That's why I think the path you've taken now is the right one.

@lestcape
Copy link
Collaborator

It now just adds _$n like requested. I also wrote some extra tests to ensure the regex works as expected.

Just a fast review: What would happen if I now try to create a base with these two variable names: ('ALL', 'ALL_1')? I think you should take on account all the columns names, to dynamically detect the corresponding number instead of using a predefined list. That list have not on account what is occurring with other columns names that theoretically have not issues, but that will have conflict with the columns that you fixed his names.

@sinnbeck
Copy link
Author

sinnbeck commented Feb 15, 2024

Ah cool that you work in somewhat the same business as me. Our surveys are B2B :)

That is a good point. Perhaps for reserved names it would be better to make a name which is more likely to not be in use.

Something like ALL => ALL@safe (a hash could even be used to make it "safe")

Btw. It isn't allowed to have duplicate names so maybe adding a number isn't actually needed as the system already assumes you wont have duplicate names?

Something like

if (in_array($var->name, ['ALL', 'AND', 'BY', 'EQ', 'GE', 'GT', 'LE', 'LT', 'NE', 'NOT', 'OR', 'TO', 'WITH'])) {
    $var->name = $var->name.'_'. uniqid();
}

@lestcape
Copy link
Collaborator

Btw. It isn't allowed to have duplicate names so maybe adding a number isn't actually needed as the system already assumes you wont have duplicate names?

I don't understand, what is the system? The idea here is that the library not fail in the creation of the sav. The sav database can not handled duplicate names. If we provide a duplicate name we will fail then. We need to avoid then a duplicate name, because we don't want to fail.

Something like ALL => ALL@safe (a hash could even be used to make it "safe")

The counterexample here is trying to create a database with these variable names: ('ALL', 'ALL@safe')? Like 'ALL@safe' is not a reserved name it will not be checked. Then we will convert 'ALL' in 'ALL@safe', but we already have a variable name 'ALL@safe', so we create a duplicate name and this will then fail, but we don´t want it to fail.

@sinnbeck
Copy link
Author

sinnbeck commented Feb 15, 2024

Ah I think I misunderstood you. So the plan is to go over every single variable name and ensure that there are no duplicates and also at the same time ensure that the new name they are given doesn't exist later in the set. Same for reserved names.

Think I will need a loop before the current one to get every single name (or an array_map) extracted

I will see what can come up with

@lestcape
Copy link
Collaborator

lestcape commented Feb 15, 2024

Allowing the library user to create a database with a duplicate column name would be, I think, being too flexible, because a column name duplicated by the user sounds more like a user error that should be fixed than a limitation of the library. Not being able to write reserved words is a limitation of the library and therefore it is something with which we could be somewhat flexible. The user does not have to know about library limitations in advance. The one who is creating the questionnaire will not take on account the limitation of this library, but he has to take into account, not giving the same names to different things.

@sinnbeck
Copy link
Author

One could argue the limitation is in spss, not this library. If we just allowed those names, I assume it would be spss that would throw an exception or error

Anyways. If the only requirement is that it's unique, I would think this should work in most cases. It could be checked against existing names if needed

if (in_array($var->name, ['ALL', 'AND', 'BY', 'EQ', 'GE', 'GT', 'LE', 'LT', 'NE', 'NOT', 'OR', 'TO', 'WITH'])) {
    $var->name = uniqid($var->name);
} 

@lestcape
Copy link
Collaborator

lestcape commented Feb 15, 2024

Yes, it is a limitation of spss, and the library creates the spss, so it is also, implicitly, a limitation of the library and because the user uses the library is also by implication a limitation that is imposed on the user. In the other case, it is the user who is doing something wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants