Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid processing vast empty areas of buggy / strange workbooks #612

Merged
merged 3 commits into from
Jun 24, 2022

Conversation

lindsay-stevens
Copy link
Contributor

@lindsay-stevens lindsay-stevens commented Jun 10, 2022

  • somehow it happens that spreadsheets sometimes think they have data for 100's or 1000's of rows/columns even though all cells are empty.
  • xls / xlsx processing refactored to stop processing after 20 adjacent empty columns (when getting the headers) or 20 adjacent empty rows (when reading rows). That is to say, the practice of inserting a couple of empty rows or columns for readability / formatting purposes won't ruin the form. It'd only be a problem if users insert patches of 20 or more empty columns / rows amongst the form design area.
  • added tests for xls / xlsx to show bad input processed as expected.
  • the change to not output empty rows seems to have broken the "flatxlsformtest" case which needs further investigation - seems to just be a question numbering issue and different itext lang order.

Closes #604
Might also address #611 but haven't checked yet.

Why is this the best possible solution? Were any other approaches considered?

It's not clear how / why these buggy spreadsheets come to be. From the associated tickets / forum threads it seems like re-saving with Excel may or may not work. It's possible there is a bug in xlrd or openpyxl to blame, or perhaps these libraries could handle these situations better, but in either case it may be hard to get an upstream fix considering the poor reproduce-ability. The most practical solution seemed to be to have pyxform only process the workbook sheets for as long as there seems to be data in the columns / rows.

What are the regression risks?

As mentioned above there's a broken test which seems at least in part to involve a mismatch from using the row number for a automatically generated question name. But if that's no good then we can still output the empty rows.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

It doesn't seem so at this stage.

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run nosetests and verified all tests pass
  • run black pyxform tests to format code
  • verified that any code or assets from external sources are properly credited in comments

- somehow it happens that spreadsheets sometimes think they have data
  for 100's or 1000's of rows/columns even though all cells are empty.
- xls / xlsx processing refactored to stop processing after 20 adjacent
  empty columns (when getting the headers) or rows (when reading rows).
- add tests for xls / xlsx to show bad input processed as expected.
- the change to not output empty rows seems to have broken the
  "flatxlsformtest" case which needs further investigation - seems to
  just be a question numbering issue and different itext lang order.
@lognaturel
Copy link
Contributor

Thanks, @lindsay-stevens! I think this is a reasonable approach. Did you also consider @yanokwa's suggestion at the bottom of #604 to reset the dimensions above a certain reported column count?

@lindsay-stevens
Copy link
Contributor Author

@lognaturel thanks for taking a look! The source for calculate_dimensions iterates through rows until a single break in truthiness, and uses the largest column index from cells indexed in those rows. A cell object can represent formatting info only (no value data). In the example UCL file the re-calculated column dimension at column AMJ (1024th) is from a break in the grey background colour formatting which then extends on forever from column AMK. The relevant XLSForm data goes to column P (16th) only.

So if a user inserts an empty row or column for formatting, but doesn't colour or style it, and the worksheet dimensions are still saved incorrectly somehow, then the recalc might cut off some data. In other words, it'd result in an extra iteration of the worksheet data (2nd time is to read values), and in doing so might iterate too far or not enough. The approach in this PR also guards against accidentally adding irrelevant data or styling way off in at the edges of the worksheet.

Maybe not as much of an issue for XLS (max cols 256 x rows 65536) but there doesn't seem to be an equivalent method in xlrd.

@yanokwa
Copy link
Contributor

yanokwa commented Jun 14, 2022

It might be nice to add a warning when do this? Just in case someone has some number of empty rows/columns?

@lognaturel
Copy link
Contributor

Thanks for the additional explanation, @lindsay-stevens, the approach sounds good. I don't feel a strong need for a warning and I imagine it'd be a fair amount of plumbing so doesn't seem worth it.

I'm not sure about the logic and have commented inline. Otherwise the rest of the implementation and the tests look good.

I verified whether it might address #611 and unfortunately it does not. I verified that it does address all 3 forms from the thread in #604. 👍

If at all possible, it would be great to get this released by the end of the week!

@lindsay-stevens
Copy link
Contributor Author

Thanks for the review @lognaturel. I've fixed up the algo so this could be merged. I'll have a look at #611 and open a docs PR for the blank col/row processing behaviour.

@lindsay-stevens lindsay-stevens marked this pull request as ready for review June 23, 2022 00:03
if is_empty(column_header):
# Preserve column order (will filter later)
column_header_list.append(None)
if last_col_empty:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rows looks good but this still doesn’t look quite right! I don’t think you need last_col_empty at all. You can always increment in this branch and set to 0 in the other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, updated both cols/rows algos to just use adjacent_empty_* instead of flag variables.

# so that any warning messages that mention row numbers are accurate.
result_rows.append(row_dict)

if trim_trailing_empty_rows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think you need trim_trailing_empty_rows either. Instead, it should always be appropriate to trim adjacent_empty_rows. That will also handle the case in which there are eg 2 empty columns before the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to always trim cols/rows, based on adjacent_empty_*.

Copy link
Contributor

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! There's a slightly confusing off-by-one-from-intent, I think, but it doesn't affect users so I'm fine to leave it and will let you decide whether it bothers you enough to patch it with a future PR now that I've mentioned it! 😄

# Preserve column order (will filter later)
column_header_list.append(None)
# After a run of empty cols, assume we've reached the end of the data.
if max_adjacent_empty < adjacent_empty_cols:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OBOB! This will stop after 21 empty columns are identified, the increment should really be before the test. But since 20 is arbitrary anyway, I suppose it doesn't really matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well yes now it has been bothering me! 😄 Will add to a later PR. Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😁


if 0 == len(row_dict):
# After a run of empty rows, assume we've reached the end of the data.
if max_adjacent_empty < adjacent_empty_rows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, it will only stop after reaching 21 empty rows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Form conversion is slow on form with lots of blank rows and columns
3 participants