RFC: Composable input/output pipeline #234
Codecov Report

```
@@           Coverage Diff           @@
##           master     #234   +/-   ##
=======================================
  Coverage   98.90%   98.90%
=======================================
  Files           6        6
  Lines         728      728
=======================================
  Hits          720      720
  Misses          8        8
```

Continue to review the full report at Codecov.
📝 Docs preview for commit 98d279b at: https://www.adriangb.com/scikeras/refs/pull/234/merge/
Thanks for this RFC – this is easier to review than a PR. Here are some questions:
- What issue are you trying to solve with this RFC? What's an illustration of that issue?
- Is that issue user-facing?
- How does this RFC solve that (user-facing?) issue?
> If the data comes in as a `tf.data.Dataset`, the first pipeline is skipped. If not, it is run and the output is converted to a `Dataset`.
> The second pipeline is then run in all cases.
>
> These pipelines will consist of chained transformers implementing a Scikit-Learn-like interface, but without being restricted to the exact Scikit-Learn API.
> but without being restricted to the exact Scikit-Learn API.
By "exact Scikit-Learn API," you mean "the exact names of Scikit-learn transformers," right? And the proposed "Scikit-learn-like interface" is public, correct?
If the answers to both of those questions are affirmative, 👎 to that interface. How can this RFC be reworked to conform with the "exact Scikit-Learn API"?
By "exact Scikit-Learn API" I mean roughly:

- having a `fit` method with the signature `fit(X, y=None) -> self`
- having a `transform` method with the signature `transform(X) -> X'`
- having an `inverse_transform` method with the signature `inverse_transform(X') -> X`

And no other methods (that are part of the API). Also, `inverse_transform(transform(X)) == X` is not a requirement of the API, but it certainly is the spirit of it (and how most sklearn transformers are implemented, when it makes sense to do so).
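For concreteness, here is a minimal sketch of a transformer that follows that API (my own illustration, not code from the RFC):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Toy transformer adhering to the 'exact Scikit-Learn API'."""

    def fit(self, X, y=None):
        # Nothing to learn here; real transformers would store fitted
        # state as attributes (e.g. self.mean_) and return self.
        return self

    def transform(self, X):
        return np.log1p(X)

    def inverse_transform(self, Xt):
        # Honors the "spirit": inverse_transform(transform(X)) == X
        return np.expm1(Xt)
```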
> And the proposed "Scikit-learn-like interface" is public, correct?
Yes, it would be public and consists of `ArrayTransformer` and `DatasetTransformer`. These can be base classes, protocols/interfaces, or just duck typing (an implementation detail).

The idea is that users could take our default data validations/transformations and mix/match them with their own for use cases like multi-output models or multi-output class reweighting.
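As a hypothetical illustration of that mix/match idea (the method names follow the `DatasetTransformer` interface quoted further down; the `dataset_pipeline` parameter and `DefaultValidation` name are my inventions):

```python
import tensorflow as tf


class ClassWeightTransformer:
    """Hypothetical user-supplied DatasetTransformer that adds a
    per-sample weight column to each (x, y) element."""

    def set_model(self, model) -> None:
        self.model = model

    def transform_input(self, data: tf.data.Dataset, *, initialize: bool = True) -> tf.data.Dataset:
        # Upweight the positive class by 2x. (transform_output omitted for brevity.)
        return data.map(lambda x, y: (x, y, tf.where(y == 1, 2.0, 1.0)))


# Hypothetical composition: SciKeras' default steps plus a custom one.
# est = BaseWrapper(model_fn, dataset_pipeline=[DefaultValidation(), ClassWeightTransformer()])
```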
This proposed interface violates the sklearn interface in several ways:

- by passing `X`, `y`, and `sample_weight` in `ArrayTransformer`
- by passing a `tf.data.Dataset` object in `DatasetTransformer`
- because these really aren't inverse transformations: the forward/input transformation transforms `X` & `y`, while the output transformation transforms `y'` (the y predictions) and/or `y` itself, so `inverse_transform(transform(X)) != X`
- by allowing side effects to be applied to the `model` (this is an addition to the API; it doesn't strictly break the implementation, but it does break the spirit)
> How can this RFC be reworked to conform with the "exact Scikit-Learn API"?
I don't think it can be easily reworked: there are too many limitations in the sklearn transformer API (and the authors have said as much themselves).

The good news is that the Scikit-Learn API is a strict subset of this API, so we could provide wrappers to convert the Scikit-Learn transformer API into this one (e.g. by specifying what `X` is to your transformer and dispatching the methods).
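Such a wrapper might look roughly like this (a sketch under assumptions: the `transform_input`/`transform_output` names come from the interface quoted below, their exact signatures are my guess, and `SklearnTargetAdapter` is a made-up name):

```python
class SklearnTargetAdapter:
    """Hypothetical adapter lifting an sklearn transformer into the RFC's
    ArrayTransformer-style interface, declaring that the sklearn
    transformer's "X" is SciKeras' target y."""

    def __init__(self, sk_transformer):
        self.sk_transformer = sk_transformer

    def set_model(self, model) -> None:
        self.model = model

    def transform_input(self, X, y, sample_weight):
        # Dispatch fit_transform onto y; X and sample_weight pass through.
        return X, self.sk_transformer.fit_transform(y), sample_weight

    def transform_output(self, y_pred_proba, y):
        # Dispatch inverse_transform onto the prediction side.
        return self.sk_transformer.inverse_transform(y_pred_proba), y
```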
But let me ask: why is it important to adhere to the Scikit-Learn API? I know there are good reasons; I just want to understand which ones you are thinking of.
My thoughts are that if we can't interoperate directly (i.e. use `make_pipeline` and other facilities of sklearn), there is not much value in being semi-compatible. I also think there aren't that many things (beyond pipelines) that would be useful here. But this may be short-sighted; I'm open to other opinions.
This library has one job: to bring the Scikit-learn API to Keras. That means there needs to be a really good reason to break the Scikit-learn API.
That reason needs to consider alternative implementations, including the one that exactly follows the Scikit-learn API. I imagine some questions that need to be answered include the following:
- What effect does the implementation have on developers?
- What effect does the implementation have on the user?
- What specific issues make the implementation inappropriate?
> ...
>
> ## Issues this can potentially resolve
These issues seem to be around modular validation/transforming. This RFC proposes one solution. What are other solutions? Why does this RFC represent the best solution?
> What are other solutions?

This is, of course, a great question. I think general solutions would be variations of the same idea, perhaps less structured (e.g. hardcoding that `tf.data.Dataset` inputs should skip `BaseWrapper._validate_data`). I'd have to think a bit to see if I can come up with any other structured approaches that might also solve the issues.
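For contrast, that less structured alternative would look roughly like this (a sketch; `_fit_keras_model` is a made-up helper name, and `_validate_data` internals are elided):

```python
import tensorflow as tf


class BaseWrapper:
    def fit(self, X, y=None, sample_weight=None):
        if isinstance(X, tf.data.Dataset):
            # Hardcoded bypass: Dataset inputs skip array validation.
            data = X
        else:
            X, y = self._validate_data(X, y)  # existing array checks
            data = tf.data.Dataset.from_tensor_slices((X, y))
        return self._fit_keras_model(data)
```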
Why not have two Scikit-learn transformers, one for validation and one to change the data?
We would need:

- Interface for validation using array-like data
- Interface for data changing using array-like data
- Interface for validation using `Dataset`
- Interface for data changing using `Dataset`

I think this is too many interfaces. The only difference between a "validation" and a "transformation" is that the transformation needs to return the data, while the validation does not. So by having the validation return the data, we can collapse those two concepts into one: validation transformers simply inspect the data but do not modify it.
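In other words, under the collapsed concept a validation step is just a transformer whose output equals its input, e.g. (an illustrative sketch; the `transform_input` signature is assumed from the discussion above):

```python
import numpy as np


class FiniteCheckValidator:
    """A 'validation' under the collapsed interface: inspect the data,
    raise on problems, and return it unchanged."""

    def transform_input(self, X, y, sample_weight):
        if not np.all(np.isfinite(X)):
            raise ValueError("X contains NaN or inf values")
        return X, y, sample_weight  # pass through untouched
```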
What use case/issue motivates having separate classes for Dataset/array-like inputs? Why not collapse it into one class?
Why is 4 interfaces "too many"? Why is the current framework with `target_encoder_` and `feature_encoder_` not sufficient?
I'm asking questions to get answers encoded in the RFC (an RFC is a really good place to encode these decisions).
Good to know it helps! And thank you for reviewing.
This solves some user-facing issues, as well as allowing us to rework things internally to be cleaner and more flexible for users. There are several different issues, but I think the clearest one is #160: the general user request is to be able to use `tf.data.Dataset` inputs. We could hack this together as is (by hardcoding bypasses to `BaseWrapper._validate_data`).

The other issues linked are generally around the theme of more validations (#106, #143) or figuring out how to clean up our current validations (#111, #209). These two themes are intertwined.
This RFC resolves this by providing a unified interface for these validations/transformations that is modular, composable and public. I envision it helping our users (since they can disable or add validation/transformation steps) as well as us (the developers) because we can organize our default validation/transformations better.
||
class ArrayTransformer: | ||
|
||
def set_model(self, model: "BaseWrapper") -> None: |
👎

Why not follow the Scikit-Learn API?

```python
class ArrayTransformer(BaseEstimator):
    def __init__(self, model=None):
        self.model = model
    ...

a = ArrayTransformer(model=model)  # option 0

a = ArrayTransformer()
a.model = model  # option 1

a = ArrayTransformer()
a.set_params(model=model)  # option 2
```

Yes, TF uses `set_model`. Why should their boilerplate be followed?
This would require users to give us a class to initialize (this object is being passed to the constructor that creates `model`).

I guess we could use parameter routing to allow setting any other parameters. Does this look any better?

```python
class MyTransf:
    def __init__(self, model, other):
        self.other = other
    ...

est = BaseWrapper(..., pipeline_param=[MyTransf], pipeline_param__0__other=True)
# or, relying on dict ordering to know what order to run the pipeline in:
est = BaseWrapper(..., pipeline_param={"tfname": MyTransf}, pipeline_param__tfname__other=True)
# or, same as an sklearn pipeline, using a tuple to set the name:
est = BaseWrapper(..., pipeline_param=[("tfname", MyTransf)], pipeline_param__tfname__other=True)
```
An alternative would be to require users to bind any other parameters using `functools.partial` or something.
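A sketch of that route (reusing the hypothetical `pipeline_param` from the routing example above; the user binds `other` up front and SciKeras supplies only `model` at construction time):

```python
from functools import partial

# MyTransf(model, other=True) is constructed later by SciKeras,
# which only needs to pass `model`.
est = BaseWrapper(..., pipeline_param=[partial(MyTransf, other=True)])
```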
> ```python
>     def transform_input(self, data: tf.data.Dataset, *, initialize: bool = True) -> tf.data.Dataset:
>         return data
>
>     def transform_output(self, y_pred_proba: np.ndarray, y: Union[np.ndarray, None]) -> Tuple[np.ndarray, Union[np.ndarray, None]]:
> ```
Why not rename these functions `inverse_transform`?
Because their signatures and usage do not match `inverse_transform`'s signature. They wouldn't work in an sklearn `Pipeline`, nor would they be chainable with sklearn estimators.
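The mismatch is visible from the call shapes alone (sklearn's contract vs. the method quoted above; variable names are illustrative):

```python
# sklearn: one array in, one array out, chainable in a Pipeline
X = transformer.inverse_transform(Xt)

# this RFC: two arrays in, a tuple out, so not Pipeline-chainable
y_pred, y = transformer.transform_output(y_pred_proba, y)
```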
This is an attempt to collect ideas from various issues / PRs into a coherent framework and generalize our current input/output transformers.
@stsievert would you mind taking a look?