ENH: add dependency injection point to transform X & y together #167
base: master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
- Coverage   99.70%   99.26%   -0.44%
==========================================
  Files           6        5       -1
  Lines         669      678       +9
==========================================
+ Hits          667      673       +6
- Misses          2        5       +3

Continue to review full report at Codecov.
@stsievert if you are able to, a review of this new interface would be much appreciated. The best place to start is probably sections 6 and 7 of this notebook.
scikeras/utils/transformers.py
Outdated
    from typing import Dict, Optional, Tuple, Union

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils.class_weight import compute_sample_weight


    class ClassWeightDataTransformer(BaseEstimator, TransformerMixin):
        """Default dataset_transformer for KerasClassifier.

        This transformer implements handling of the `class_weight` parameter
        for single output classifiers.
        """

        def __init__(self, class_weight: Optional[Union[str, Dict[int, float]]] = None):
            self.class_weight = class_weight

        def fit(
            self,
            data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]],
            dummy: None = None,
        ) -> "ClassWeightDataTransformer":
            return self

        def transform(
            self, data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]
        ) -> Tuple[np.ndarray, Union[np.ndarray, None], Union[np.ndarray, None]]:
            X, y, sample_weight = data
            if self.class_weight is None or y is None:
                return (X, y, sample_weight)
            # Fold class_weight into sample_weight so that Keras receives a
            # single per-sample weighting.
            sample_weight = 1 if sample_weight is None else sample_weight
            sample_weight *= compute_sample_weight(class_weight=self.class_weight, y=y)
            return (X, y, sample_weight)
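For illustration, a minimal usage sketch with hypothetical toy data, reusing the imports above:

    t = ClassWeightDataTransformer(class_weight="balanced")
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 0, 1])  # imbalanced labels
    X_out, y_out, sw = t.fit_transform((X, y, None))
    # sw gives the minority class a proportionally larger weight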
There is another option here, that may be cleaner, which would be to add another dependency injection point that runs in "parallel" to `feature_encoder` and `target_encoder`, specifically for `sample_weight`. But since it also makes sense that one would need at least `y` to process `sample_weight`, it might be a bit redundant.
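A hypothetical sketch of what such a parallel injection point could look like; `SampleWeightEncoder` is illustrative only and not part of SciKeras:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils.class_weight import compute_sample_weight

    class SampleWeightEncoder(BaseEstimator, TransformerMixin):
        """Hypothetical hook running alongside feature_encoder/target_encoder."""

        def __init__(self, class_weight=None):
            self.class_weight = class_weight

        def fit(self, y, dummy=None):
            return self

        def transform(self, y):
            # It still needs at least y, which is why a separate hook
            # may be redundant.
            if self.class_weight is None:
                return None
            return compute_sample_weight(class_weight=self.class_weight, y=y)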
docs/source/advanced.rst
Outdated
    .. code-block::

                      ↗ feature_encoder ↘
        your data → sklearn-ecosystem → SciKeras dataset_transformer → Keras
I think this diagram is really useful.
Would this be a better diagram?
                                   ↗ feature_encoder ↘
    SciKeras.fit(features, labels)                     dataset_transformer → Keras.fit(dataset)
                                   ↘ target_encoder ↗
I do like the addition of `.fit`. My only worry is that users might think that SciKeras always creates a `tf.data.Dataset`, which is not the case; by default it gives numpy arrays to `Keras.fit`. Do you think `Keras.fit(dataset or np.array)` makes that clear? It could also be dicts of lists or something, but that's at least more uncommon.
What can `dataset_transformer` return? Only datasets/ndarrays? Or does it support the other arguments of `Keras.fit`?
Anything that `Keras.fit` will accept. Internally, it looks something like this:

    X, y, sample_weight = dataset_transformer.fit_transform((X, y, sample_weight))
    model.fit(x=X, y=y, sample_weight=sample_weight)  # aka Keras.fit
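As a hedged illustration of "anything that `Keras.fit` will accept": a transformer could fold everything into a `tf.data.Dataset` and return it in the `X` slot (the class name and batch size here are arbitrary choices, not part of SciKeras):

    import tensorflow as tf
    from sklearn.base import BaseEstimator, TransformerMixin

    class ToDatasetTransformer(BaseEstimator, TransformerMixin):
        def fit(self, data, dummy=None):
            return self

        def transform(self, data):
            X, y, sample_weight = data
            dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(32)
            # Keras.fit accepts x=dataset with y and sample_weight left as None
            return (dataset, None, None)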
Maybe `Keras.fit(data)` is a good way to specify this? That way there's no confusion in interpreting `dataset` as `tf.data.Dataset`. I can also add a small code block like in #167 (comment) if that helps explain it.
@stsievert thank you for the review, the feedback was very valuable. I think I was able to incorporate most of the smaller suggestions. For the larger stuff, I moved most of the pseudocode and general background out of the notebook and into the main docs, and added some links back to the docs instead of duplicating information, as you suggested. Having read the documentation, do you think this …
docs/source/advanced.rst
Outdated
    (eg. to get the number of outputs) but *will* be able to inject data into the model building
    function (more on this below). On the other hand, ``data_transformer`` *will* get access to
    the built Model, but it cannot pass any data to the model building function.
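For context, a hedged sketch of what "inject data into the model building function" means here: SciKeras can pass a `meta` dict of data-derived attributes to the build function. The key names follow the SciKeras docs, but treat the details as an assumption:

    import tensorflow as tf

    # Sketch: `meta` carries attributes such as "n_features_in_" and
    # "n_outputs_" that the encoders derived from the data.
    def get_model(meta):
        model = tf.keras.Sequential(
            [tf.keras.layers.Dense(meta["n_outputs_"], input_shape=(meta["n_features_in_"],))]
        )
        model.compile(loss="mse")
        return model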
One important thing to note is that feature_encoder and target_encoder are run before building the Keras Model, while data_transformer is run after the Model is built.
This isn't clear in the example (though it's clear in the edit I've made).
> This means that the former two will not have access to the Model
> (eg. to get the number of outputs) but will be able to inject data into the model building
> function (more on this below). On the other hand, ``data_transformer`` will get access to
> the built Model, but it cannot pass any data to the model building function.

I would only say "the output of `dataset_transformer` gets passed directly to `tf.keras.Model.fit` through `self.model_.fit`." All the detail about the number of outputs etc. is confusing. I think I'd refer directly to the advanced examples.
> This isn't clear in the example (though it's clear in the edit I've made).

Thank you for pointing that out! The example is much better with this detail.

> I would only say

That seems reasonable, I'll cut it down to roughly what you are suggesting.
Thanks. Nothing I say is hard and fast; all my comments are illustrations. If I want a change to be hard and fast, I'll submit a PR.
docs/source/advanced.rst
Outdated
    As per the table above, if you find that your target is classified as
    ``"multiclass-multioutput"`` or ``"unknown"``, you will have to implement your own data processing routine.

    Whole dataset manipulation via data_transformer
I think this section should go directly below the paragraph on dataset transformers.
I'm also unsure about the level of nesting here. It looks like "multi-input" and "whole dataset" are on the same level, something I wouldn't have expected with the titles.
It might be worth collapsing this section into the "data transformers" section. This section really only mentions the signature.
I'll move it as you suggest. It then duplicates the paragraph discussed in #167 (comment), which warrants removing the latter.
docs/source/advanced.rst
Outdated
    This is the last step before passing the data to Keras, and it allows for the greatest
    degree of customization because SciKeras does not make any assumptions about the output data
    and passes it directly to :py:func:`tensorflow.keras.Model.fit`.
    Its signature is ``dataset_transformer.fit_transform((X, y, sample_weight))``,
Its signature is this:

    from sklearn.base import BaseEstimator

    class DatasetTransformer(BaseEstimator):
        def fit(self, X, y, sample_weight=None) -> "DatasetTransformer":
            ...
            return self

        def transform(self, X, y, sample_weight):  # return a valid input for keras.Model.fit
            ...
            return X, y, sample_weight  # option 1
            return tensorflow_dataset  # option 2
I may have designed it wrong, and I'm not a fan of the tuple input either (especially because `fit` requires a dummy argument), but I believe it is required so that these can be chained in a pipeline. With the signature you are proposing, this fails:

    from sklearn.base import BaseEstimator
    from sklearn.pipeline import make_pipeline

    class DatasetTransformer(BaseEstimator):
        def fit(self, X, y, sample_weight=None) -> "DatasetTransformer":
            return self

        def transform(self, X, y, sample_weight):  # return a valid input for keras.Model.fit
            return X, y, sample_weight

    p = make_pipeline(DatasetTransformer(), DatasetTransformer())
    X = [1]
    y = [1]
    p.fit_transform(X, y)

It fails because scikit-learn pipelines only thread `X` through `transform`; `y` and `sample_weight` are never passed along the chain. If instead you accept a tuple as your input, then you can chain them:

    class DatasetTransformer(BaseEstimator):
        def fit(self, data, dummy=None) -> "DatasetTransformer":
            return self

        def transform(self, data):  # return a valid input for keras.Model.fit
            X, y, sample_weight = data
            return X, y, sample_weight

    p = make_pipeline(DatasetTransformer(), DatasetTransformer())
    X = [1]
    y = [1]
    p.fit_transform((X, y, None))

Am I missing something?
With the example, I only meant "illustrate the signature with code, not with words." I am unsure what's happening with tuple inputs.
Oh I see. Yes, that is a good suggestion.
Maybe it'd be clearer to have the user return a dict. I'd also modify the tuple input to this transform; why not pass keyword arguments?

    ValidKerasInput = Union[np.ndarray, tf.Dataset, ...]

    def transform(
        self,
        X: ValidKerasInput,
        y: ValidKerasInput,
        sample_weight: Optional[ValidKerasInput] = None,
    ) -> dict[str, ValidKerasInput]:
        ...
        return {"X": X, "y": y, "sample_weight": None}

I'm not sure this signature is correct.
> How would it be messy?

It's minor, but it would now be a dict with dozens of keys, some values being arrays, etc. It can easily be documented, but I can see how at first glance it might be more confusing than a 3-element tuple or a 3-key dict.

> Doesn't a user-passed kwarg already get passed to fit?

Yes, but these parameters can't be calculated based on the data, which becomes especially problematic if doing cross-validation and such. This is the basic problem in #131. Of course, if users want to pass a fixed parameter like `batch_size=42` they can do so via `fit` (or better yet, via the constructor).
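For instance, a hedged sketch of passing a fixed, data-independent parameter; the model-building function is a minimal stand-in:

    import tensorflow as tf
    from scikeras.wrappers import KerasClassifier

    def get_model():
        # trivial stand-in model for illustration
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
        model.compile(loss="binary_crossentropy")
        return model

    # Fixed parameters can be set in the constructor ...
    clf = KerasClassifier(model=get_model, batch_size=42)
    # ... or passed through fit:
    # clf.fit(X, y, batch_size=42)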
> It's minor, but it would now be a dict with dozens of keys,

Why would it have to be dozens of keys? I'm thinking of this code:

    # line 885 of wrappers.py
    transform_kwargs = self.dataset_transformer_.transform(x=X, y=y, sample_weight=sample_weight)
    kwargs.update(transform_kwargs)
    self._fit_keras_model(**kwargs)
I think that implementation in #167 (comment) is not compatible with the sklearn ecosystem, including pipelines, which I think are an important feature. This would be fine:

    self.dataset_transformer_.transform(dict(x=X, y=y, sample_weight=sample_weight))

Regarding the multiple keys, what I was thinking was to pass all `Model.fit` keys. In `_fit_keras_model`:

Lines 480 to 494 in 6eee3c4

    params = self.get_params()
    fit_args = route_params(params, destination="fit", pass_filter=self._fit_kwargs)
    fit_args["epochs"] = initial_epoch + epochs
    fit_args["initial_epoch"] = initial_epoch
    fit_args.update(kwargs)
    fit_args["x"] = X
    fit_args["y"] = y
    fit_args["sample_weight"] = sample_weight
    fit_args = self.dataset_transformer_.transform(fit_args)
    if self._random_state is not None:
        with TFRandomState(self._random_state):
            hist = self.model_.fit(**fit_args)
This means that users can not only add keys, but also modify existing keys, to their liking.
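For example, a hedged sketch of a transformer that edits an existing key, assuming the dict-based interface above; the batch-size rule is an arbitrary illustration:

    from sklearn.preprocessing import FunctionTransformer

    def adjust_fit_args(fit_args):
        fit_args = dict(fit_args)  # copy so the input is not mutated
        # derive batch_size from the data, overriding any existing value
        fit_args["batch_size"] = min(32, len(fit_args["x"]))
        return fit_args

    dataset_transformer = FunctionTransformer(adjust_fit_args)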
Regarding the "dozens of keys" comment, I wasn't referring to the keys assigned via `kwargs.update(transform_kwargs)` above, I was more thinking of when a user creates a transformer like:

    def transform(data: Dict[str, Any]) -> Dict[str, Any]:
        len(data.keys())  # many keys, array-values, etc.
        ...

    # for context:
    class SomeCustomEstimator(BaseWrapper):
        @property
        def data_transformer(self):
            return FunctionTransformer(transform)
> This means that users can not only add keys, but also modify existing keys, to their liking.

Yes, I think that's a good idea. It also leads to simple documentation (something like "the dataset transformer is passed a dict. Normally, this dictionary is unmodified and passed directly to `keras.Model.fit`. If desired, it's possible to modify this dictionary through a Scikit-learn transformation: [example].").
📝 Docs preview for commit f5df4c4 at: https://www.adriangb.com/scikeras/refs/pull/167/merge/
Resolves #160