ENH: add dependency injection point to transform X & y together #167
base: master
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #167      +/-   ##
==========================================
- Coverage   99.70%   99.26%   -0.44%
==========================================
  Files           6        5       -1
  Lines         669      678       +9
==========================================
+ Hits          667      673       +6
- Misses          2        5       +3

Continue to review full report at Codecov.
@stsievert if you are able to, a review of this new interface would be much appreciated. The best place to start is probably sections 6 and 7 of this notebook.
scikeras/utils/transformers.py
Outdated
    from typing import Dict, Optional, Tuple, Union

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils.class_weight import compute_sample_weight


    class ClassWeightDataTransformer(BaseEstimator, TransformerMixin):
        """Default dataset_transformer for KerasClassifier.

        This transformer implements handling of the `class_weight` parameter
        for single output classifiers.
        """

        def __init__(self, class_weight: Optional[Union[str, Dict[int, float]]] = None):
            self.class_weight = class_weight

        def fit(
            self,
            data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]],
            dummy: None = None,
        ) -> "ClassWeightDataTransformer":
            return self

        def transform(
            self, data: Tuple[np.ndarray, Optional[np.ndarray], Optional[np.ndarray]]
        ) -> Tuple[np.ndarray, Union[np.ndarray, None], Union[np.ndarray, None]]:
            X, y, sample_weight = data
            if self.class_weight is None or y is None:
                return (X, y, sample_weight)
            # Fold class_weight into sample_weight so that Keras receives a
            # single per-sample weighting.
            sample_weight = 1 if sample_weight is None else sample_weight
            sample_weight *= compute_sample_weight(class_weight=self.class_weight, y=y)
            return (X, y, sample_weight)
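For illustration, a minimal usage sketch with hypothetical toy data, reusing the imports above:

    t = ClassWeightDataTransformer(class_weight="balanced")
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 0, 1])  # imbalanced labels
    X_out, y_out, sw = t.fit_transform((X, y, None))
    # sw gives the minority class a proportionally larger weight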
There is another option here, that may be cleaner, which would be to add another dependency injection point that runs in "parallel" to `feature_encoder` and `target_encoder`, specifically for `sample_weight`. But since it also makes sense that one would need at least `y` to process `sample_weight`, it might be a bit redundant.
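A hypothetical sketch of what such a parallel injection point could look like; `SampleWeightEncoder` is illustrative only and not part of SciKeras:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils.class_weight import compute_sample_weight

    class SampleWeightEncoder(BaseEstimator, TransformerMixin):
        """Hypothetical hook running alongside feature_encoder/target_encoder."""

        def __init__(self, class_weight=None):
            self.class_weight = class_weight

        def fit(self, y, dummy=None):
            return self

        def transform(self, y):
            # It still needs at least y, which is why a separate hook
            # may be redundant.
            if self.class_weight is None:
                return None
            return compute_sample_weight(class_weight=self.class_weight, y=y)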
docs/source/advanced.rst
Outdated
    .. code-block::

                      ↗ feature_encoder ↘
        your data → sklearn-ecosystem → SciKeras dataset_transformer → Keras
I think this diagram is really useful.
Would this be a better diagram?
                                   ↗ feature_encoder ↘
    SciKeras.fit(features, labels)                     dataset_transformer → Keras.fit(dataset)
                                   ↘ target_encoder ↗
I do like the addition of `.fit`. My only worry is that users might think that SciKeras always creates a `tf.data.Dataset`, which is not the case; by default it gives numpy arrays to `Keras.fit`. Do you think `Keras.fit(dataset or np.array)` makes that clear? It could also be dicts of lists or something, but that's at least more uncommon.
What can `dataset_transformer` return? Only datasets/ndarrays? Or does it support the other arguments of `Keras.fit`?
Anything that `Keras.fit` will accept. Internally, it looks something like this:

    X, y, sample_weight = dataset_transformer.fit_transform((X, y, sample_weight))
    model.fit(x=X, y=y, sample_weight=sample_weight)  # aka Keras.fit
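As a hedged illustration of "anything that `Keras.fit` will accept": a transformer could fold everything into a `tf.data.Dataset` and return it in the `X` slot (the class name and batch size here are arbitrary choices, not part of SciKeras):

    import tensorflow as tf
    from sklearn.base import BaseEstimator, TransformerMixin

    class ToDatasetTransformer(BaseEstimator, TransformerMixin):
        def fit(self, data, dummy=None):
            return self

        def transform(self, data):
            X, y, sample_weight = data
            dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(32)
            # Keras.fit accepts x=dataset with y and sample_weight left as None
            return (dataset, None, None)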
Maybe `Keras.fit(data)` is a good way to specify this? That way there's no confusion in interpreting `dataset` as `tf.data.Dataset`. I can also add a small code block like in #167 (comment) if that helps explain it.
@stsievert thank you for the review, the feedback was very valuable. I think I was able to incorporate most of the smaller suggestions. For the larger stuff, I moved most of the pseudocode and general background out of the notebook and into the main docs, and added some links back to the docs instead of duplicating information, as you suggested. Having read the documentation, do you think this …
docs/source/advanced.rst
Outdated
    (eg. to get the number of outputs) but *will* be able to inject data into the model building
    function (more on this below). On the other hand, ``data_transformer`` *will* get access to
    the built Model, but it cannot pass any data to the model building function.
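For context, a hedged sketch of what "inject data into the model building function" means here: SciKeras can pass a `meta` dict of data-derived attributes to the build function. The key names follow the SciKeras docs, but treat the details as an assumption:

    import tensorflow as tf

    # Sketch: `meta` carries attributes such as "n_features_in_" and
    # "n_outputs_" that the encoders derived from the data.
    def get_model(meta):
        model = tf.keras.Sequential(
            [tf.keras.layers.Dense(meta["n_outputs_"], input_shape=(meta["n_features_in_"],))]
        )
        model.compile(loss="mse")
        return model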
One important thing to note is that feature_encoder and target_encoder are run before building the Keras Model, while data_transformer is run after the Model is built.
This isn't clear in the example (though it's clear in the edit I've made).
> This means that the former two will not have access to the Model
> (eg. to get the number of outputs) but will be able to inject data into the model building
> function (more on this below). On the other hand, ``data_transformer`` will get access to
> the built Model, but it cannot pass any data to the model building function.

I would only say "the output of `dataset_transformer` gets passed directly to `tf.keras.Model.fit` through `self.model_.fit`." All the detail about the number of outputs etc. is confusing. I think I'd refer directly to the advanced examples.
> This isn't clear in the example (though it's clear in the edit I've made).

Thank you for pointing that out! The example is much better with this detail.

> I would only say

That seems reasonable, I'll cut it down to roughly what you are suggesting.
Thanks. Nothing I say is hard and fast; all my comments are illustrations. If I want a change to be hard and fast, I'll submit a PR.
docs/source/advanced.rst
Outdated
    As per the table above, if you find that your target is classified as
    ``"multiclass-multioutput"`` or ``"unknown"``, you will have to implement your own data processing routine.

    Whole dataset manipulation via data_transformer
I think this section should go directly below the paragraph on dataset transformers.
I'm also unsure about the level of nesting here. It looks like "multi-input" and "whole dataset" are on the same level, something I wouldn't have expected with the titles.
It might be worth collapsing this section into the "data transformers" section. This section really only mentions the signature.
I'll move it as you suggest. It then duplicates the paragraph discussed in #167 (comment), which warrants removing the latter.
docs/source/advanced.rst
Outdated
    This is the last step before passing the data to Keras, and it allows for the greatest
    degree of customization because SciKeras does not make any assumptions about the output data
    and passes it directly to :py:func:`tensorflow.keras.Model.fit`.
    Its signature is ``dataset_transformer.fit_transform((X, y, sample_weight))``,
Its signature is this:

    from sklearn.base import BaseEstimator

    class DatasetTransformer(BaseEstimator):
        def fit(self, X, y, sample_weight=None) -> "DatasetTransformer":
            ...
            return self

        def transform(self, X, y, sample_weight):  # return a valid input for keras.Model.fit
            ...
            return X, y, sample_weight  # option 1
            return tensorflow_dataset  # option 2
I may have designed it wrong, and I'm not a fan of the tuple input either (especially because `fit` requires a dummy argument), but I believe it is required so that these can be chained in a pipeline. With the signature you are proposing, this fails:

    from sklearn.base import BaseEstimator
    from sklearn.pipeline import make_pipeline

    class DatasetTransformer(BaseEstimator):
        def fit(self, X, y, sample_weight=None) -> "DatasetTransformer":
            return self

        def transform(self, X, y, sample_weight):  # return a valid input for keras.Model.fit
            return X, y, sample_weight

    p = make_pipeline(DatasetTransformer(), DatasetTransformer())
    X = [1]
    y = [1]
    p.fit_transform(X, y)

It fails because scikit-learn pipelines only thread `X` through `transform`; `y` and `sample_weight` are never passed along the chain. If instead you accept a tuple as your input, then you can chain them:

    class DatasetTransformer(BaseEstimator):
        def fit(self, data, dummy=None) -> "DatasetTransformer":
            return self

        def transform(self, data):  # return a valid input for keras.Model.fit
            X, y, sample_weight = data
            return X, y, sample_weight

    p = make_pipeline(DatasetTransformer(), DatasetTransformer())
    X = [1]
    y = [1]
    p.fit_transform((X, y, None))

Am I missing something?
With the example, I only meant "illustrate the signature with code, not with words." I am unsure what's happening with tuple inputs.
Oh I see. Yes, that is a good suggestion.
Maybe it'd be clearer to have the user return a dict. I'd also modify the tuple input to this transform; why not pass keyword arguments?

    ValidKerasInput = Union[np.ndarray, tf.Dataset, ...]

    def transform(
        self,
        X: ValidKerasInput,
        y: ValidKerasInput,
        sample_weight: Optional[ValidKerasInput] = None,
    ) -> dict[str, ValidKerasInput]:
        ...
        return {"X": X, "y": y, "sample_weight": None}

I'm not sure this signature is correct.
> How would it be messy?

It's minor, but it would now be a dict with dozens of keys, some values being arrays, etc. It can easily be documented, but I can see how at first glance it might be more confusing than a 3-element tuple or a 3-key dict.

> Doesn't a user-passed kwarg already get passed to fit?

Yes, but these parameters can't be calculated based on the data, which becomes especially problematic if doing cross-validation and such. This is the basic problem in #131. Of course, if users want to pass a fixed parameter like `batch_size=42` they can do so via `fit` (or better yet, via the constructor).
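For instance, a hedged sketch of passing a fixed, data-independent parameter; the model-building function is a minimal stand-in:

    import tensorflow as tf
    from scikeras.wrappers import KerasClassifier

    def get_model():
        # trivial stand-in model for illustration
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
        model.compile(loss="binary_crossentropy")
        return model

    # Fixed parameters can be set in the constructor ...
    clf = KerasClassifier(model=get_model, batch_size=42)
    # ... or passed through fit:
    # clf.fit(X, y, batch_size=42)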
> It's minor, but it would now be a dict with dozens of keys,

Why would it have to be dozens of keys? I'm thinking of this code:

    # line 885 of wrappers.py
    transform_kwargs = self.dataset_transformer_.transform(x=X, y=y, sample_weight=sample_weight)
    kwargs.update(transform_kwargs)
    self._fit_keras_model(**kwargs)
I think that implementation in #167 (comment) is not compatible with the sklearn ecosystem, including pipelines, which I think are an important feature. This would be fine:

    self.dataset_transformer_.transform(dict(x=X, y=y, sample_weight=sample_weight))

Regarding the multiple keys, what I was thinking was to pass all `Model.fit` keys. In `_fit_keras_model`:

Lines 480 to 494 in 6eee3c4

    params = self.get_params()
    fit_args = route_params(params, destination="fit", pass_filter=self._fit_kwargs)
    fit_args["epochs"] = initial_epoch + epochs
    fit_args["initial_epoch"] = initial_epoch
    fit_args.update(kwargs)
    fit_args["x"] = X
    fit_args["y"] = y
    fit_args["sample_weight"] = sample_weight
    fit_args = self.dataset_transformer_.transform(fit_args)
    if self._random_state is not None:
        with TFRandomState(self._random_state):
            hist = self.model_.fit(**fit_args)
This means that users can not only add keys, but also modify existing keys, to their liking.
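For example, a hedged sketch of a transformer that edits an existing key, assuming the dict-based interface above; the batch-size rule is an arbitrary illustration:

    from sklearn.preprocessing import FunctionTransformer

    def adjust_fit_args(fit_args):
        fit_args = dict(fit_args)  # copy so the input is not mutated
        # derive batch_size from the data, overriding any existing value
        fit_args["batch_size"] = min(32, len(fit_args["x"]))
        return fit_args

    dataset_transformer = FunctionTransformer(adjust_fit_args)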
Regarding the "dozens of keys" comment, I wasn't referring to the keys assigned via `kwargs.update(transform_kwargs)` above, I was more thinking of when a user creates a transformer like:

    def transform(data: Dict[str, Any]) -> Dict[str, Any]:
        len(data.keys())  # many keys, array-values, etc.
        ...

    # for context:
    class SomeCustomEstimator(BaseWrapper):
        @property
        def data_transformer(self):
            return FunctionTransformer(transform)
> This means that users can not only add keys, but also modify existing keys, to their liking.

Yes, I think that's a good idea. It also leads to simple documentation (something like "the dataset transformer is passed a dict. Normally, this dictionary is unmodified and passed directly to `keras.Model.fit`. If desired, it's possible to modify this dictionary through a Scikit-learn transformation: [example].").
📝 Docs preview for commit f5df4c4 at: https://www.adriangb.com/scikeras/refs/pull/167/merge/
Resolves #160