Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for xml expression to not parse arbitrary strings #679

Merged
merged 1 commit into from
Apr 8, 2024

Conversation

xanderbailey
Copy link
Contributor

Previously it was the case that we would parse the string of the column as the column expression argument for the expression. This leads to being able to execute arbitrary spark SQL if you parse the expression a string literal. It also means that string literals don't work with this expression and it doesn't work within an array transform expression either.

@srowen
Copy link
Collaborator

srowen commented Apr 6, 2024

This project is now incorporated into Apache Spark. Could you open the pull request there? if it's accepted, I will back-port it here just in case.

Can you give an example of what no longer works (and should not), and/or what didn't work before and does now?

@xanderbailey
Copy link
Contributor Author

Thanks for the quick response @srowen! Looks like it's being added to spark 4.0 here https://github.com/apache/spark/blame/29d077fbbd5464f64e0eeb495f7a955850915cc5/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L7146? Reading this path it seems like it isn't using the same constructor as this library so I think we should be good to just back port this here?

I can't think of an example of something that doesn't work now that would have worked before...

Cases that previously didn't work and do now:

from_xml(new Literal("<root><name>foo</name></root>"), schema) <- string literal is read by the SQL parser as sql which is clearly isn't.
functions.transform(new Column("array"), (element, _index) -> from_xml(element, schema)) <- the variable reference is removed here when you use the SQL parser so this throws when planning.

@srowen srowen merged commit c42d6bc into databricks:master Apr 8, 2024
3 checks passed
@xanderbailey
Copy link
Contributor Author

@srowen is there a way for us to tag a new version for this fix?

@srowen
Copy link
Collaborator

srowen commented Apr 10, 2024

Possibly, but does this block anything? it seems like the issues it avoids are things the caller can just not do, or do I misunderstand

@xanderbailey
Copy link
Contributor Author

I got around this by under the underlying spark expression rather than using functions so yes there's a work around but it's causing unexpected errors at the moment when using string literals and when used within an array transform.

@srowen srowen added this to the 0.18.0 milestone Apr 10, 2024
@srowen srowen added the bug label Apr 10, 2024
@srowen
Copy link
Collaborator

srowen commented Apr 10, 2024

I just released 0.18.0 with this change: https://github.com/databricks/spark-xml/releases/tag/v0.18.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants