Rewrite regex and add extra check to variable names #74

sinnbeck · 2024-02-13T14:14:46Z

Currently the regex in the Writer class does not follow the SPSS naming rules. This should ensure that the rules are actually followed.

https://www.ibm.com/docs/en/spss-statistics/28.0.0?topic=SSLVMB_28.0.0/spss/base/syn_variables_variable_names.htm

…aming rules

sinnbeck · 2024-02-13T14:35:26Z

The regex ensures that the name never ends on an underscore as that is recommended. But this does mean that the random string generator is not allowed to add _ anymore

Variable names ending in underscores should be avoided, since such names may conflict with names of variables automatically created by commands and procedures.

I will gladly add that to this PR, but I just want to make sure this is wanted :)

lestcape · 2024-02-13T22:22:32Z

The function of the library is to create the sav files. The library does not define or dictate what variable names that file will have and therefore should not take actions against the names that the user decided to give it, except in the case that this names are impossible to be used.

I am also inclined towards a library that always looks for a way to generate the file, even if it has to modify the exact order that was given by the user. For example, could be something like that:

if (in_array($var->name, ['ALL', 'AND', 'BY', 'EQ', 'GE', 'GT', 'LE', 'LT', 'NE', 'NOT', 'OR', 'TO', 'WITH'])) {
    $pos = get_last_position_that_starts_with_prefix ($var->name)
    $var->name = $var->name."_".($pos);
}

However, it is not that it is wrong to throw an error in that case. Sometimes the variable names are important in the exact way the user defined it and then the user needs to know in time that this names could not be generated as he expects them. It's just that I think that in most cases, the user cares little about the names of the variables and what they are really interested in is having the data of that columns without matter about the name.

sinnbeck · 2024-02-14T06:47:12Z

Hm yeah that might be the case. My own experience is working with survey data where the naming is very important as it comes from a survey system where the users define the names and therefor the names in the sav file needs to be an exact match.

But as I know to avoid these names, I don't see an issue with just adding a number to them. I will update the PR 👍

sinnbeck · 2024-02-14T07:16:08Z

It now just adds _$n like requested. I also wrote some extra tests to ensure the regex works as expected.

lestcape · 2024-02-14T22:02:38Z

My own experience is working with survey data where the naming is very important as it comes from a survey system

I work making health surveys for the Mexican Government. That's why I understand your example, but I still see it as insufficient to throw errors.

Columns names can be important in a survey, because that names are linked to a question in the questionnaire. But what is in the game here is if you want the data of the survey with some incompatibilities in the names of the variables or you prefer not to have anything. I think you will always preferred something, that nothing. That's why I think the path you've taken now is the right one.

lestcape · 2024-02-14T22:17:19Z

It now just adds _$n like requested. I also wrote some extra tests to ensure the regex works as expected.

Just a fast review: What would happen if I now try to create a base with these two variable names: ('ALL', 'ALL_1')? I think you should take on account all the columns names, to dynamically detect the corresponding number instead of using a predefined list. That list have not on account what is occurring with other columns names that theoretically have not issues, but that will have conflict with the columns that you fixed his names.

sinnbeck · 2024-02-15T07:04:25Z

Ah cool that you work in somewhat the same business as me. Our surveys are B2B :)

That is a good point. Perhaps for reserved names it would be better to make a name which is more likely to not be in use.

Something like ALL => ALL@safe (a hash could even be used to make it "safe")

Btw. It isn't allowed to have duplicate names so maybe adding a number isn't actually needed as the system already assumes you wont have duplicate names?

Something like

if (in_array($var->name, ['ALL', 'AND', 'BY', 'EQ', 'GE', 'GT', 'LE', 'LT', 'NE', 'NOT', 'OR', 'TO', 'WITH'])) {
    $var->name = $var->name.'_'. uniqid();
}

lestcape · 2024-02-15T16:21:02Z

Btw. It isn't allowed to have duplicate names so maybe adding a number isn't actually needed as the system already assumes you wont have duplicate names?

I don't understand, what is the system? The idea here is that the library not fail in the creation of the sav. The sav database can not handled duplicate names. If we provide a duplicate name we will fail then. We need to avoid then a duplicate name, because we don't want to fail.

Something like ALL => ALL@safe (a hash could even be used to make it "safe")

The counterexample here is trying to create a database with these variable names: ('ALL', 'ALL@safe')? Like 'ALL@safe' is not a reserved name it will not be checked. Then we will convert 'ALL' in 'ALL@safe', but we already have a variable name 'ALL@safe', so we create a duplicate name and this will then fail, but we don´t want it to fail.

sinnbeck · 2024-02-15T17:24:31Z

Ah I think I misunderstood you. So the plan is to go over every single variable name and ensure that there are no duplicates and also at the same time ensure that the new name they are given doesn't exist later in the set. Same for reserved names.

Think I will need a loop before the current one to get every single name (or an array_map) extracted

I will see what can come up with

lestcape · 2024-02-15T18:03:27Z

Allowing the library user to create a database with a duplicate column name would be, I think, being too flexible, because a column name duplicated by the user sounds more like a user error that should be fixed than a limitation of the library. Not being able to write reserved words is a limitation of the library and therefore it is something with which we could be somewhat flexible. The user does not have to know about library limitations in advance. The one who is creating the questionnaire will not take on account the limitation of this library, but he has to take into account, not giving the same names to different things.

sinnbeck · 2024-02-15T18:41:26Z

One could argue the limitation is in spss, not this library. If we just allowed those names, I assume it would be spss that would throw an exception or error

Anyways. If the only requirement is that it's unique, I would think this should work in most cases. It could be checked against existing names if needed

if (in_array($var->name, ['ALL', 'AND', 'BY', 'EQ', 'GE', 'GT', 'LE', 'LT', 'NE', 'NOT', 'OR', 'TO', 'WITH'])) {
    $var->name = uniqid($var->name);
}

lestcape · 2024-02-15T22:19:07Z

Yes, it is a limitation of spss, and the library creates the spss, so it is also, implicitly, a limitation of the library and because the user uses the library is also by implication a limitation that is imposed on the user. In the other case, it is the user who is doing something wrong.

rewrite regex and add extra check to ensure it follows current spss n…

668efe8

…aming rules

add tests

c2ebe1d

sinnbeck added 2 commits February 14, 2024 08:04

dont throw exception on reserved names

e9d395e

add tests for illegal characters in variable names

41af280

sinnbeck added 2 commits February 16, 2024 08:21

ensure unique names for reserved variable names

bc259f4

ensure variable names in tests ends with a letter

bfe88d7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite regex and add extra check to variable names #74

Rewrite regex and add extra check to variable names #74

sinnbeck commented Feb 13, 2024

sinnbeck commented Feb 13, 2024 •

edited

Loading

lestcape commented Feb 13, 2024

sinnbeck commented Feb 14, 2024

sinnbeck commented Feb 14, 2024 •

edited

Loading

lestcape commented Feb 14, 2024 •

edited

Loading

lestcape commented Feb 14, 2024

sinnbeck commented Feb 15, 2024 •

edited

Loading

lestcape commented Feb 15, 2024

sinnbeck commented Feb 15, 2024 •

edited

Loading

lestcape commented Feb 15, 2024 •

edited

Loading

sinnbeck commented Feb 15, 2024

lestcape commented Feb 15, 2024 •

edited

Loading

Rewrite regex and add extra check to variable names #74

Are you sure you want to change the base?

Rewrite regex and add extra check to variable names #74

Conversation

sinnbeck commented Feb 13, 2024

sinnbeck commented Feb 13, 2024 • edited Loading

lestcape commented Feb 13, 2024

sinnbeck commented Feb 14, 2024

sinnbeck commented Feb 14, 2024 • edited Loading

lestcape commented Feb 14, 2024 • edited Loading

lestcape commented Feb 14, 2024

sinnbeck commented Feb 15, 2024 • edited Loading

lestcape commented Feb 15, 2024

sinnbeck commented Feb 15, 2024 • edited Loading

lestcape commented Feb 15, 2024 • edited Loading

sinnbeck commented Feb 15, 2024

lestcape commented Feb 15, 2024 • edited Loading

sinnbeck commented Feb 13, 2024 •

edited

Loading

sinnbeck commented Feb 14, 2024 •

edited

Loading

lestcape commented Feb 14, 2024 •

edited

Loading

sinnbeck commented Feb 15, 2024 •

edited

Loading

sinnbeck commented Feb 15, 2024 •

edited

Loading

lestcape commented Feb 15, 2024 •

edited

Loading

lestcape commented Feb 15, 2024 •

edited

Loading