150: don't use regex to clean up XML following pretty-printing #681

lindsay-stevens · 2024-01-10T09:46:57Z

Closes #150

Default xml.minidom pretty-printer treats each node in a mixed-content element as a separate node. As a result it adds newlines and indents for each text node and output element, which is ugly and wrong. Then survey.py cleans that up with some regex.

This change copies and customises the default writexml implementation from minidom, and skips newline/indent for text nodes. It also includes conditional whitespace to match the previous processing, in case Collect or Enketo expected it to be exactly that way.

Added a test to specifically enumerate various output/text mixture permutations that appear in a variety of existing tests.

Why is this the best possible solution? Were any other approaches considered?

I think not outputting the undesirable whitespace in the first place is better than cleaning it up afterwards. Copying writexml adds a bit of complexity to the code base but I think the trade off vs the regex is worth it. Being able to do this without a new dependency is preferable. It may also allow more flexibility around users inputting whitespace in their question labels, without pyxform mangling it.

What are the regression risks?

While the added test should provide coverage, it's possible that some combination of text and references doesn't produce exactly the same white-spaced output as before, and assuming Collect or Enketo are sensitive to that it might change the output. But that risk seems pretty small. I'm not sure that the whitespace is actually significant in the first place.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

included test cases for core behavior and edge cases in tests
run nosetests and verified all tests pass
run black pyxform tests to format code
verified that any code or assets from external sources are properly credited in comments

lognaturel

Looks good! Pretty printing isn't done by default so it's probably not very common but good to get rid of some regex.

Feel free to merge after addressing the conflict.

- default xml.minidom pretty-printer treats each node in a mixed-content element as a separate node. As a result it adds newlines and indents for each text node and output element, which is ugly and wrong. Then survey.py cleans that up with some regex. - this change copies and customises the default writexml implementation from minidom, and skips newline/indent for text nodes. It also includes conditional whitespace to match the previous processing, in case Collect or Enketo expected it to be exactly that way. - added a test to specifically enumerate various output/text mixture permutations that appear in a variety of existing tests.

lindsay-stevens requested a review from lognaturel January 10, 2024 09:54

lognaturel approved these changes Feb 1, 2024

View reviewed changes

lindsay-stevens force-pushed the pyxform-150 branch from bbff410 to 16b1a25 Compare February 1, 2024 08:06

lindsay-stevens merged commit 8f75605 into XLSForm:master Feb 1, 2024
10 checks passed

lindsay-stevens deleted the pyxform-150 branch February 1, 2024 08:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

150: don't use regex to clean up XML following pretty-printing #681

150: don't use regex to clean up XML following pretty-printing #681

lindsay-stevens commented Jan 10, 2024

lognaturel left a comment

150: don't use regex to clean up XML following pretty-printing #681

150: don't use regex to clean up XML following pretty-printing #681

Conversation

lindsay-stevens commented Jan 10, 2024

Why is this the best possible solution? Were any other approaches considered?

What are the regression risks?

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

lognaturel left a comment

Choose a reason for hiding this comment