feat: add bigframes.ml.compose.SQLScalarColumnTransformer to create custom SQL-based transformations #955
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Thanks so much for the Pull Request and the detailed discussion @ferenc-hechler !
This is a fascinating approach, and not one that I've considered. I had considered embedding parameters in SQL, but not how one could recover them as you do here. I'll check in with some backend BQML folks to see if this comment embedding is something we could rely on.
Thanks so much for your work here and the code examples in the tests. I definitely want to provide something like this, so this is a great starting point. I'll do some design work this week to come up with some proposals for how we might hook into our compilation steps to be able to do something like this in a less SQL-forward way.
Thanks for your patience as I try to understand this change and your requirements.
Some questions for you:
- How important is it for the custom transformation to be deployed as part of the Pipeline to BQML? Another way to phrase this: does all preprocessing need to happen in the TRANSFORM clause?
- If (1) is important, then how important is it that one can recover the original Pipeline after training? That is to say, if it were possible to use the trained model for predictions without being able to inspect exactly what model / transformations were applied from Python, would that be acceptable?
Overall, I'm warming up to the idea of a SQL-based extension mechanism / helpers, but I am concerned about
- Offering such a bigframes.ml specific way to hook into how BigQuery DataFrames compiles its SQL.
- Trying to recover the original pipeline via special comments.
Long-term, I'd like for our extension mechanism to be provided by the existing scalar BigQuery DataFrames APIs, such as Series.case_when.
I envision something like the following:
class OneHotEncodingPlusOutliers(bigframes.ml.ScalarTransformerMixin):
    def __init__(self, column_name, num_values):
        self.column_name = column_name
        self.num_values = num_values

    def fit(self, X):
        # Find unique values in column_name.
        values = X[[self.column_name]].groupby(self.column_name, as_index=False).size()
        self.top_values_ = values.sort_values("size").head(self.num_values).to_list()

    def transform(self, X):
        # Note: This can only do scalar operations.
        original = X[self.column_name]
        X = X.drop(columns=[self.column_name])
        X = X.assign(**{
            f"{self.column_name}_{value}": original.eq(value) for value in self.top_values_
        })
        X = X.assign(**{f"{self.column_name}_others": ~original.isin(self.top_values_)})
        return X
This will take some time, though. Probably longer than the timelines we'd like. Compilation is changing, so extracting the SQL for these scalar transformations is a bit blocked by #827
Some alternatives to consider:
(a). Make bigframes better able to utilize TRANSFORM_ONLY models that have already been deployed to BQML "models" by making them compatible with read_gbq_model(). This would require relaxing some of the validations, as you've done in this PR. We're tracking this request internally in issue 365996054.
A downside of this approach is that the BQML TRANSFORM clause cannot contain a call to ML.PREDICT / ML.TRANSFORM. To allow these TRANSFORM_ONLY models in a Pipeline would require changes to our compilation and possibly our execution. All transforms that happen before (and including) one of these opaque read_gbq_model() TRANSFORM_ONLY models cannot be deployed with the final BQML model for the Pipeline. Instead, we'd have to embed these transformations in the SQL table expression representing the training data and/or execute several queries.
(b). If (a) is acceptable, then I would be open to some helpers in bigframes to make deploying custom TRANSFORM_ONLY models based on SQL easier.
(c). If (a/b) is not acceptable, perhaps there is some intermediary step we can take towards the long-term vision that would unblock you?
We discussed your questions internally and came to the following conclusions.
If all preprocessing could happen in one TRANSFORM clause, this would be the ideal case (top scenario). But we already recognized that our existing Transformer-Framework has no limitation requiring all transformations to happen in a single step. To answer your question: since we have to apply other transformations anyway, we do not need to bundle everything into one TRANSFORM clause.
That one was harder to answer. Let's formulate it this way: some kind of data shift/skew can be detected based on the transformer-model parameters.
This is a good example of what is not possible using plain ColumnTransformer transformations. But since we focus on a Transformer-Framework which creates the transformations based on a given configuration, this limitation is acceptable for us.
And in our Transformer-Framework this would be a config like:
This means the transformer framework creates one-hot-encoding columns if the cardinality is below 10, and uses a TargetMeanEncoder otherwise.
The parameters for the TargetMeanEncoder would then be derived in the ColumnTransformer.fit() call, e.g. {"book": 33.27, "record": 18.45, "magazin": 4.61, ...}. Finally, the SQL generated would be a CASE WHEN expression. So we are able to split the preprocessing and are not limited to one-column transformations.
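As a plain-Python sketch of the case-when generation described here (the helper name and default handling are illustrative, not bigframes API):

```python
# Hypothetical sketch: turning fitted TargetMeanEncoder parameters into a
# CASE WHEN expression. Function name and default value are illustrative.
def target_mean_case_when(column, means, default=0.0):
    """Build a CASE WHEN SQL snippet from a {category: mean} mapping."""
    branches = "\n".join(
        f"  WHEN {column} = '{value}' THEN {mean}" for value, mean in means.items()
    )
    return f"CASE\n{branches}\n  ELSE {default}\nEND"

sql = target_mean_case_when("category", {"book": 33.27, "record": 18.45, "magazin": 4.61})
print(sql)
```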
Does this mean that unknown transformations are not dropped but passed through as black-box transformers?
As already mentioned above, this is fine for us.
If we also get the possibility to store our fitted parameters somewhere, that would cover our needs.
It feels like playing a book adventure, and after each chapter you have to decide how to continue.
Just tested loading a ColumnTransformer containing CustomTransformers, adding it to a pipeline, and then training a Linear-Regression model including the transformation: that worked without any problems.
Loading the stored model also works fine:
Excellent!
This feature (using SQL comments to embed learned parameters) remains my main concern / hesitation with merging this PR. It's very cool, but I worry about committing bigframes to supporting this feature long-term. I propose we offer a subset of what you've written here with the following:

my_transformer = SQLScalarColumnTransformer(
    "CASE WHEN {0} IS NULL THEN -1 ELSE LENGTH({0}) END",
    columns=["colname"],
    dtype="int64",  # Omit this? Not needed with current bigframes.ml expressions, but required in bigframes.pandas.*.
)

With such a feature, you won't get the fully nested set of transformers when restoring a Pipeline, but perhaps you could still embed such comments in the SQL with the library of transformers you provide to your downstream users for internal transparency?
Love it! One suggestion, and then let's make sure the e2e tests pass before merging.
Thanks so much for your contribution and patience!
A couple of small fixes needed for the test suite.
Notebook failure
This will be included in the 1.20.0 release. Keep an eye on #1017. Thanks again for your help @ferenc-hechler!
@tswast thanks for all your support and reviews 🥇
Fixes #954 🦕
This PR adds the possibility to create custom transformers to be used in ColumnTransformer.
The base idea is to implement the _compile_to_sql() and _parse_from_sql() methods, which are called from ColumnTransformer.
Each CustomTransformer registers itself in a registry; in addition to _BQML_TRANSFROM_TYPE_MAPPING, which contains the ML. transformers, this registry is queried when transformers are created from reading in a model.
A very simple example is the Identity-Transformer. It does not change the value:
The implementation looks like this:
The class CustomTransformer is new and was added in this PR to bigframes.ml.compose, where ColumnTransformer is also implemented.
It provides some base functionality, like iterating over the column list, which can be reused by all inherited classes.
It also maintains the registry of custom transformers.
The class variable _CTID is used for identifying the transform and has to be unique across all registered custom transformers.

Another example replaces a string column with its length and uses a configurable default value for NULL values.
The SQL for implementing this transformation looks like this:
Here is the corresponding code.
The custom_compile_to_sql() implementation is straightforward.
In custom_parse_from_sql(), the configured parameters are reverse-engineered from the SQL code using a RegEx.
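The referenced listing is missing from this extract; a self-contained sketch of how such a length transformer could compile to SQL and recover its configuration via RegEx (function names and default value are hypothetical) follows:

```python
import re

def custom_compile_to_sql(column, default=-1):
    # Replace a string column with its length; NULLs get the configured default.
    return f"CASE WHEN {column} IS NULL THEN {default} ELSE LENGTH({column}) END AS len_{column}"

def custom_parse_from_sql(sql):
    # Reverse-engineer the column name and configured default from the SQL.
    match = re.search(r"CASE WHEN (\w+) IS NULL THEN (-?\d+) ELSE LENGTH\(\1\) END", sql)
    return match.group(1), int(match.group(2))

sql = custom_compile_to_sql("name", default=-2)
print(sql)                          # → CASE WHEN name IS NULL THEN -2 ELSE LENGTH(name) END AS len_name
print(custom_parse_from_sql(sql))   # → ('name', -2)
```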
But this might not always be possible or useful, so we added a feature to persist config parameters along with the SQL.
This is done by using a comment section in the SQL.
So, the comment
/*<custom-transformer-id>(<json-args>)*/
is stored in the SQL text of the model and can be used to retrieve this information when parsing the SQL. Okay, this is a hack, and it will not work if the model strips away the comments.
But at the moment it works, and if this changes in the future, another solution can be found to store this information in the model.
The CustomTransformer class implements the serialization of the config, so only the method get_persistent_config() has to be implemented returning a dict or list:
When parsing the SQL, this persistent_config is retrieved from the SQL and given as additional input to the parse method:
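The referenced listing is missing here. A minimal, standalone sketch of embedding and retrieving the persistent config comment (hypothetical helper names; assumes the JSON payload contains no ")*/" sequence):

```python
import json
import re

def embed_persistent_config(ctid, sql, config):
    # Store the config as /*<custom-transformer-id>(<json-args>)*/ inside the SQL.
    return f"/*{ctid}({json.dumps(config)})*/ {sql}"

def extract_persistent_config(sql):
    # Recover the transformer id and config dict/list from the embedded comment.
    match = re.search(r"/\*(\w+)\((.*?)\)\*/", sql)
    if match is None:
        return None, None
    return match.group(1), json.loads(match.group(2))

sql = embed_persistent_config("LEN2", "LENGTH(name) AS len_name", {"default": -1})
print(sql)
print(extract_persistent_config(sql))  # → ('LEN2', {'default': -1})
```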
Custom transformers can be used in the same way as the standard transformers, like LabelEncoder.
This piece of code creates a ColumnTransformer, fits it, and stores the Transform-Only model.
IdentityTransformer, Length1Transformer, and Length2Transformer are inherited from CustomTransformer, while LabelEncoder is from bigframes.ml.preprocessing.
Loading and applying the model is done with the following code:
transform_model is the ColumnTransformer which we persisted above.
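The code listings for fitting, storing, loading, and applying the model were not captured in this extract. A rough sketch of the flow, based on the thread's description: the column name, dataset, and model names are hypothetical, the PR's custom transformers would be added as further ColumnTransformer entries, and the exact to_gbq / read_gbq_model behavior for Transform-Only models should be checked against the bigframes documentation.

```python
import bigframes.pandas as bpd
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import LabelEncoder

# Build a transformer; the PR's custom transformers (IdentityTransformer,
# Length1Transformer, Length2Transformer) would be added as more entries here.
transformer = ColumnTransformer([
    ("label", LabelEncoder(), ["category_col"]),  # hypothetical column name
])

df = bpd.read_gbq("my_dataset.my_table")  # hypothetical source table
transformer.fit(df)
transformer.to_gbq("my_dataset.transform_only_model", replace=True)

# Loading and applying the stored Transform-Only model:
transform_model = bpd.read_gbq_model("my_dataset.transform_only_model")
result = transform_model.transform(df)
```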
Here is the complete source code:
custom_transformers.zip