Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow querying the bigframes dataframe with custom BigQuery SQL like pandasql #647

Open
NiloFreitas opened this issue Apr 30, 2024 · 2 comments
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@NiloFreitas
Copy link

Feature Request

Supposed I used bigframes and processed my data like this:
res_df = df.assign(pred=input_remote_function.apply(generate_text))

And got this output dataframe res_df:

uri pred
gs://dataproc-metastore-public-binaries/ads_ba... {"interpretation": " The primary message of th...
gs://dataproc-metastore-public-binaries/ads_ba... {"interpretation": " The primary message of th...
gs://dataproc-metastore-public-binaries/ads_ba... {"interpretation": " The primary message of th...

As you can see, the pred column is a JSON string.

Describe the solution you'd like

I would like to use BigQuery JSON Functions to explode this column into a column for each JSON attribute.

For example, I could write a custom BigQuery SQL code to run for this dataframe, like this:
res_df = df.sqldf("SELECT *, JSON_EXTRACT(pred, '$.interpretation') AS interpretation FROM df")

Getting as a result the res_df:

uri pred interpretation
gs://datap... {"interpretation": " The primary message of th... The primary messag...
gs://datap... {"interpretation": " The primary message of th... The primary messag...
gs://datap... {"interpretation": " The primary message of th... The primary messag...

Describe alternatives you've considered

As an alternative I need to use remote functions, like this one, to explode the column:

@bpd.remote_function(
    [str],
    str,
    bigquery_connection=CONNECTION_ID
)
def extract_json_attribute(input_content: str) -> str:
    
    pred = input_content.split("|||")[0]
    json_attribute = input_content.split("|||")[1]
    
    import json
    return json.loads(pred)[json_attribute]
input_remote_function_int = res_df['pred'] + '|||interpretation'
df = res_df.assign(interpretation=input_remote_function_int.apply(extract_json_attribute))

Which I cannot pass multiple parameter because of #646

Additional context

The pandasql allow us to query pandas DataFrames using SQL syntax.
Bigframes supporting, pushing down the query to BQ would be helpful.

https://pypi.org/project/pandasql/

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Apr 30, 2024
@GarrettWu
Copy link
Contributor

It is an interesting request. Maybe we can just inject the df table into the SQL statements? But one argument is if user wants to use SQL, why they don't just use SQL with BQ? @tswast @TrevorBergeron WDYT?

@tswast
Copy link
Collaborator

tswast commented May 3, 2024

I was just talking with @shanecglass and @aliciawilliams about the potential for a similar feature.

I was thinking of providing a SQL engine to our equivalent of https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eval.html, but mimicking the pandasql interface could be even more natural. Looks like it's MIT license (https://github.com/yhat/pandasql/blob/master/LICENSE.txt) so we could pull some of that into third_party if needed.

@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label May 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

4 participants