avoid processing vast empty areas of buggy / strange workbooks #612

lindsay-stevens · 2022-06-10T20:24:38Z

Somehow, spreadsheets sometimes report data for hundreds or thousands of rows / columns even though all of those cells are empty.
xls / xlsx processing was refactored to stop after 20 adjacent empty columns (when getting the headers) or 20 adjacent empty rows (when reading rows). That is to say, the practice of inserting a couple of empty rows or columns for readability / formatting purposes won't ruin the form. It'd only be a problem if users insert runs of 20 or more empty columns / rows within the form design area.
Added tests for xls / xlsx to show that bad input is processed as expected.
The change to not output empty rows seems to have broken the "flatxlsformtest" case, which needs further investigation. It seems to just be a question numbering issue and a different itext lang order.
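
To illustrate the stopping rule described above, here is a minimal sketch in plain Python. It is not the code from this PR: `rows_until_gap` and `is_empty_row` are hypothetical names, and the sheet is modelled as a list of lists rather than read through xlrd or openpyxl.

```python
from typing import Iterable, Iterator, List, Optional

Row = List[Optional[str]]


def is_empty_row(row: Row) -> bool:
    """A row counts as empty when every cell is None or whitespace."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def rows_until_gap(rows: Iterable[Row], max_adjacent_empty: int = 20) -> Iterator[Row]:
    """Yield rows until a run of `max_adjacent_empty` empty rows is seen."""
    adjacent_empty = 0
    for row in rows:
        # Count only adjacent empty rows; any data row resets the run.
        adjacent_empty = adjacent_empty + 1 if is_empty_row(row) else 0
        if adjacent_empty >= max_adjacent_empty:
            return
        yield row


# Three data rows with a 2-row formatting gap, followed by 60,000
# phantom empty rows of the kind a buggy workbook might report.
sheet = [
    ["text", "q1"],
    ["text", "q2"],
    [None, None],
    [None, None],
    ["note", "n1"],
] + [[None, None]] * 60000

kept = list(rows_until_gap(sheet))
data_rows = [row for row in kept if not is_empty_row(row)]
```

The small formatting gap survives, and the scan gives up shortly after the real data ends instead of walking all 60,000 phantom rows. This sketch leaves the last few empty rows in `kept`; removing them is a separate step.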

Closes #604
Might also address #611 but haven't checked yet.

Why is this the best possible solution? Were any other approaches considered?

It's not clear how / why these buggy spreadsheets come to be. From the associated tickets / forum threads it seems like re-saving with Excel may or may not work. It's possible there is a bug in xlrd or openpyxl to blame, or perhaps these libraries could handle these situations better, but in either case it may be hard to get an upstream fix considering the poor reproducibility. The most practical solution seemed to be to have pyxform only process the workbook sheets for as long as there seems to be data in the columns / rows.

What are the regression risks?

As mentioned above there's a broken test which seems at least in part to involve a mismatch from using the row number for an automatically generated question name. But if that's no good then we can still output the empty rows.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

It doesn't seem so at this stage.

Before submitting this PR, please make sure you have:

included test cases for core behavior and edge cases in tests
run nosetests and verified all tests pass
run black pyxform tests to format code
verified that any code or assets from external sources are properly credited in comments

lognaturel · 2022-06-13T20:27:15Z

Thanks, @lindsay-stevens! I think this is a reasonable approach. Did you also consider @yanokwa's suggestion at the bottom of #604 to reset the dimensions above a certain reported column count?

lindsay-stevens · 2022-06-14T22:07:18Z

@lognaturel thanks for taking a look! The source for calculate_dimensions iterates through rows until a single break in truthiness, and uses the largest column index from cells indexed in those rows. A cell object can represent formatting info only (no value data). In the example UCL file the re-calculated column dimension at column AMJ (1024th) is from a break in the grey background colour formatting which then extends on forever from column AMK. The relevant XLSForm data goes to column P (16th) only.

So if a user inserts an empty row or column for formatting, but doesn't colour or style it, and the worksheet dimensions are still saved incorrectly somehow, then the recalc might cut off some data. In other words, it'd result in an extra iteration of the worksheet data (the 2nd time is to read values), and in doing so might iterate too far or not far enough. The approach in this PR also guards against accidentally adding irrelevant data or styling way off at the edges of the worksheet.
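
A toy model of the point about formatting-only cells, for illustration only. This is not openpyxl's implementation: the `Cell` class and both helper functions are hypothetical, and the numbers mirror the example above (headers to column P, grey fill out to column AMJ).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Cell:
    value: Optional[str] = None
    fill: Optional[str] = None  # formatting only, e.g. a background colour


Cells = Dict[Tuple[int, int], Cell]  # keyed by (row, column), 1-based


def max_column_any_cell(cells: Cells) -> int:
    """Widest column that has a cell object at all (value or style)."""
    return max((col for _, col in cells), default=0)


def max_column_with_value(cells: Cells) -> int:
    """Widest column whose cell actually holds a value."""
    return max(
        (col for (_, col), cell in cells.items() if cell.value not in (None, "")),
        default=0,
    )


# Row 1: 16 real headers (A..P), then grey fill out to column 1024 (AMJ).
cells: Cells = {(1, col): Cell(value=f"header_{col}") for col in range(1, 17)}
cells.update({(1, col): Cell(fill="grey") for col in range(17, 1025)})

width_by_cells = max_column_any_cell(cells)
width_by_values = max_column_with_value(cells)
```

In this model a dimension derived from "which cells exist" reports 1024 columns, while the data ends at column 16, which is why a recalculated dimension alone may not be a reliable bound.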

Maybe not as much of an issue for XLS (max cols 256 x rows 65536) but there doesn't seem to be an equivalent method in xlrd.

yanokwa · 2022-06-14T22:21:16Z

It might be nice to add a warning when we do this? Just in case someone has some number of empty rows/columns?

pyxform/xls2json_backends.py

lognaturel · 2022-06-22T03:40:13Z

Thanks for the additional explanation, @lindsay-stevens, the approach sounds good. I don't feel a strong need for a warning and I imagine it'd be a fair amount of plumbing so doesn't seem worth it.

I'm not sure about the logic and have commented inline. Otherwise the rest of the implementation and the tests look good.

I verified whether it might address #611 and unfortunately it does not. I verified that it does address all 3 forms from the thread in #604. 👍

If at all possible, it would be great to get this released by the end of the week!

lindsay-stevens · 2022-06-23T00:03:04Z

Thanks for the review @lognaturel. I've fixed up the algo so this could be merged. I'll have a look at #611 and open a docs PR for the blank col/row processing behaviour.

lognaturel · 2022-06-23T04:09:23Z

pyxform/xls2json_backends.py

+        if is_empty(column_header):
+            # Preserve column order (will filter later)
+            column_header_list.append(None)
+            if last_col_empty:


Rows look good but this still doesn’t look quite right! I don’t think you need last_col_empty at all. You can always increment in this branch and set to 0 in the other.

lindsay-stevens · Jun 24, 2022

No worries, updated both cols/rows algos to just use adjacent_empty_* instead of flag variables.
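
A rough sketch of the counter-only approach suggested in this thread: increment in the empty branch, reset to 0 in the other, with no flag variable. The names `column_header_list` and `adjacent_empty_cols` follow the diff; `read_headers` and `is_empty` as written here are assumptions for illustration, and the increment-then-compare ordering is this sketch's choice, not necessarily the merged code's.

```python
from typing import List, Optional, Sequence


def is_empty(value: Optional[str]) -> bool:
    """Treat None and whitespace-only strings as empty."""
    return value is None or str(value).strip() == ""


def read_headers(
    raw_headers: Sequence[Optional[str]], max_adjacent_empty: int = 20
) -> List[Optional[str]]:
    column_header_list: List[Optional[str]] = []
    adjacent_empty_cols = 0
    for column_header in raw_headers:
        if is_empty(column_header):
            # Preserve column order (to be filtered later).
            column_header_list.append(None)
            adjacent_empty_cols += 1
            if adjacent_empty_cols >= max_adjacent_empty:
                break
        else:
            column_header_list.append(column_header)
            adjacent_empty_cols = 0  # a data column ends the run
    return column_header_list


# Four empty columns in total sit before "label", but never more than
# two in a row, so a limit of 3 adjacent empties does not cut it off.
raw = ["type", None, None, "name", None, None, "label", None, None, None, "ignored"]
headers = read_headers(raw, max_adjacent_empty=3)
```

Because the counter resets on every non-empty header, it measures adjacent rather than total empty columns, which is the distinction the flag variable was trying to capture.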

lognaturel · 2022-06-23T04:12:55Z

pyxform/xls2json_backends.py

+        # so that any warning messages that mention row numbers are accurate.
+        result_rows.append(row_dict)
+
+    if trim_trailing_empty_rows:


I don’t think you need trim_trailing_empty_rows either. Instead, it should always be appropriate to trim adjacent_empty_rows. That will also handle the case in which there are eg 2 empty columns before the end.

lindsay-stevens · Jun 24, 2022

Updated to always trim cols/rows, based on adjacent_empty_*.
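
One way the "always trim" idea could look, as a sketch rather than the merged implementation. `result_rows`, `row_dict` and `adjacent_empty_rows` are names from the diff; the `collect_rows` function and its exact control flow are assumptions.

```python
from typing import Dict, List


def collect_rows(
    raw_rows: List[Dict[str, str]], max_adjacent_empty: int = 20
) -> List[Dict[str, str]]:
    result_rows: List[Dict[str, str]] = []
    adjacent_empty_rows = 0
    for row_dict in raw_rows:
        # Keep empty rows for now so interior row positions stay accurate.
        result_rows.append(row_dict)
        if len(row_dict) == 0:
            adjacent_empty_rows += 1
            if adjacent_empty_rows >= max_adjacent_empty:
                break
        else:
            adjacent_empty_rows = 0
    # Whatever run of empty rows the loop ended on is trailing by
    # definition, so it can be trimmed without a separate flag.
    if adjacent_empty_rows:
        del result_rows[-adjacent_empty_rows:]
    return result_rows


rows = [{"type": "text"}, {}, {"type": "note"}, {}, {}]
trimmed = collect_rows(rows)
```

The interior empty row is kept while the two trailing ones are dropped, and the same trim applies whether the loop ended at the limit or simply ran out of rows.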

lognaturel

Thanks! There's a slightly confusing off-by-one-from-intent, I think, but it doesn't affect users so I'm fine to leave it and will let you decide whether it bothers you enough to patch it with a future PR now that I've mentioned it! 😄

lognaturel · 2022-06-24T20:06:25Z

pyxform/xls2json_backends.py

+            # Preserve column order (will filter later)
+            column_header_list.append(None)
+            # After a run of empty cols, assume we've reached the end of the data.
+            if max_adjacent_empty < adjacent_empty_cols:


OBOB! This will stop after 21 empty columns are identified; the increment should really be before the test. But since 20 is arbitrary anyway, I suppose it doesn't really matter.

lindsay-stevens · Jun 25, 2022

Well yes now it has been bothering me! 😄 Will add to a later PR. Thanks
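
For anyone curious how the cutoff shifts, here is a small hypothetical harness (not part of the PR). It feeds an endless run of empty columns through the comparison from the diff, `max_adjacent_empty < adjacent_empty_cols`, and reports how many columns are examined before the loop breaks. It assumes the increment sits after the test in the reviewed code, which the comment implies but the visible hunk does not show.

```python
def empty_columns_examined(
    max_adjacent_empty: int = 20,
    increment_first: bool = False,
    inclusive: bool = False,
) -> int:
    """Count the empty columns examined before the loop breaks."""
    adjacent_empty_cols = 0
    examined = 0
    while True:
        examined += 1
        if increment_first:
            adjacent_empty_cols += 1
        if inclusive:
            reached = max_adjacent_empty <= adjacent_empty_cols
        else:
            reached = max_adjacent_empty < adjacent_empty_cols
        if reached:
            return examined
        if not increment_first:
            adjacent_empty_cols += 1


test_then_increment = empty_columns_examined()
increment_then_test = empty_columns_examined(increment_first=True)
increment_then_inclusive = empty_columns_examined(increment_first=True, inclusive=True)
```

With a limit of 20, the three variants examine 22, 21 and 20 empty columns respectively, so both the position of the increment and the choice of `<` versus `<=` move the cutoff by one.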

lognaturel · 2022-06-24T20:07:09Z

pyxform/xls2json_backends.py

+
+        if 0 == len(row_dict):
+            # After a run of empty rows, assume we've reached the end of the data.
+            if max_adjacent_empty < adjacent_empty_rows:


Same as above, it will only stop after reaching 21 empty rows.

lognaturel reviewed Jun 21, 2022

pyxform/xls2json_backends.py

fix: row counting algo to count adjacent (not total) rows and simplify

529f402

lindsay-stevens marked this pull request as ready for review June 23, 2022 00:03

lognaturel reviewed Jun 23, 2022

dev: tidy rows/cols algo, always trim trailing empty rows/cols

6ba5138

lognaturel approved these changes Jun 24, 2022

lognaturel merged commit e5e2260 into XLSForm:master Jun 24, 2022

lindsay-stevens deleted the pyxform-604 branch June 25, 2022 04:07

lindsay-stevens mentioned this pull request Jun 25, 2022

Add note about data processing from pyxform/#612 XLSForm/xlsform.github.io#226

Merged

yanokwa mentioned this pull request Nov 3, 2022

Increase max_adjacent_empty to prevent form breakage #620

Closed
