ENH: add dependency injection point to transform X & y together #167

Open

adriangb wants to merge 50 commits into master from whole-dataset-transformer

Changes from 23 commits (of 50 total)

Commits
1d718ff
ENH: add dependancy injection point to transform X & y together
adriangb Jan 16, 2021
1a6037e
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 16, 2021
28535b2
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 19, 2021
c170f4b
Extend data transformer notebook with examples of data_transformer usage
adriangb Jan 21, 2021
cd5f415
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 21, 2021
b7fb34c
run entire notebook
adriangb Jan 21, 2021
d3357e4
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 21, 2021
bc92cff
Update docstring
adriangb Jan 22, 2021
45887e4
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 22, 2021
5b8e133
typo
adriangb Jan 22, 2021
2f2b7a5
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 24, 2021
6ee6425
Test pipeline, move notebook to markdown
adriangb Jan 24, 2021
a3092c2
fix undef transformer
adriangb Jan 24, 2021
8f92591
remove unused dummy transformer
adriangb Jan 24, 2021
fa728c1
Remove unused import
adriangb Jan 24, 2021
6fdea0d
remove empty cell
adriangb Jan 24, 2021
8aba7cb
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 24, 2021
6675889
Fix typos
adriangb Jan 24, 2021
c317699
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 24, 2021
5acbd0f
add comment
adriangb Jan 24, 2021
5d9e02b
print all data
adriangb Jan 24, 2021
9b43e9c
Update data transformer docs
adriangb Jan 24, 2021
0fbecd0
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 24, 2021
deb4858
Finish sentence
adriangb Jan 24, 2021
981e61c
PR feedback
adriangb Jan 25, 2021
0d55306
fix error
adriangb Jan 25, 2021
3cf1ed5
use embedded links, ref links seem to be broken
adriangb Jan 25, 2021
a198eb3
spacing
adriangb Jan 25, 2021
047d430
fix code block
adriangb Jan 25, 2021
5413015
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 27, 2021
e71625e
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 27, 2021
f4c0dcc
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 27, 2021
54cfc43
PR feedback
adriangb Jan 27, 2021
1742ef4
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 27, 2021
d03248f
use code block for signature
adriangb Jan 27, 2021
87452ff
remove dummy parameter
adriangb Jan 27, 2021
d2b4402
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 28, 2021
034fc7f
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 29, 2021
491e0b1
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 31, 2021
3f8f9b4
re-add dummy
adriangb Jan 31, 2021
f569b48
Merge branch 'master' into whole-dataset-transformer
adriangb Jan 31, 2021
8dafe1b
Merge branch 'whole-dataset-transformer' of https://github.com/adrian…
adriangb Jan 31, 2021
f918966
Merge master
adriangb Feb 16, 2021
fd62b82
Use dicts, add more examples
adriangb Feb 16, 2021
6eee3c4
fix broken test
adriangb Feb 16, 2021
5ca7da8
update docs
adriangb Feb 16, 2021
5bd222e
add clarifying comment in docs
adriangb Feb 16, 2021
f560687
update TOC
adriangb Feb 16, 2021
2353408
Merge branch 'master' into whole-dataset-transformer
adriangb Feb 16, 2021
f5df4c4
Merge branch 'master' into whole-dataset-transformer
adriangb Feb 20, 2021
58 changes: 52 additions & 6 deletions docs/source/advanced.rst
@@ -178,11 +178,50 @@ This is basically the same as calling :py:func:`~scikeras.wrappers.BaseWrapper.g
Data Transformers
^^^^^^^^^^^^^^^^^

Keras supports a much wider range of inputs/outputs than Scikit-Learn does. E.g.,
in a text classification task, you might have an array that contains
the integers representing the tokens for each sample, and another
array containing the number of tokens of each sample.

In order to reconcile Keras' expanded input/output support with Scikit-Learn's more
limited options, SciKeras introduces "data transformers". These are really just
dependency injection points where you can declare custom data transformations,
for example to split an array into a list of arrays, join `X` & `y` into a `Dataset`, etc.
To keep these transformations in a familiar format, they are implemented as
sklearn-style transformers. You can think of this setup as an sklearn Pipeline:

.. code-block::

                                               ↗ feature_encoder ↘
      your data → sklearn-ecosystem → SciKeras                     dataset_transformer → Keras
                                               ↘ target_encoder  ↗
Collaborator:
I think this diagram is really useful.

Would this be a better diagram?

                                   ↗ feature_encoder ↘
    SciKeras.fit(features, labels)                    dataset_transformer → Keras.fit(dataset)
                                   ↘ target_encoder  ↗ 

Owner Author (adriangb):
I do like the addition of .fit. My only worry is that users might think that SciKeras always creates a tf.data.Dataset, which is not the case; by default it gives numpy arrays to Keras.fit. Do you think Keras.fit(dataset or np.array) makes that clear? It could also be dicts of lists or something, but that's at least more uncommon.

Collaborator:
What can dataset_transformer return? Only datasets/ndarrays? Or does it support the other arguments of Keras.fit?

Owner Author (adriangb):
Anything that Keras.fit will accept. Internally, it looks something like this:

X, y, sample_weight = dataset_transformer.fit_transform((X, y, sample_weight))
model.fit(x=X, y=y, sample_weight=sample_weight)  # aka Keras.fit

Owner Author (adriangb), Jan 25, 2021:
Maybe Keras.fit(data) is a good way to specify this? That way there's no confusion in interpreting dataset as tf.data.Dataset. I can also add a small code block like in #167 (comment) if that helps explain it.



As you can see, there are two stages of data transformation within SciKeras:

- Target/Feature transformations:

  - feature_encoder: Handles transformations to the features (`X`). This can be used
    to implement multi-input models (see the sketch after this list).
  - target_encoder: Handles transformations to the target (`y`). This can be used
    to implement non-integer labels (e.g. strings) as well as multi-output models.

- Whole dataset transformations:

  - dataset_transformer: This is the last step before passing the data to Keras.
    It can be used to implement conversion to a `Dataset`, amongst other things.
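For instance, a `feature_encoder` for a two-input Model might look like the following
minimal sketch (the class name and the split point are illustrative, not part of SciKeras):

.. code-block:: python

   from sklearn.preprocessing import FunctionTransformer
   from scikeras.wrappers import KerasClassifier

   class MultiInputClassifier(KerasClassifier):

       @property
       def feature_encoder(self):
           # Split the single feature matrix Scikit-Learn hands us into a
           # list of two arrays, one per input of the underlying Keras Model.
           return FunctionTransformer(func=lambda X: [X[:, :2], X[:, 2:]])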

`feature_encoder` and `target_encoder` are run before building the Keras Model,
while `dataset_transformer` is run after the Model is built. This means that the
former two will not have access to the Model (e.g. to get the number of outputs)
but *will* be able to inject data into the model building function (more on this
below). `dataset_transformer`, on the other hand, *will* get access to the built
Model, but it cannot pass any data to model building.

Although you could just implement everything in `dataset_transformer`,
having several distinct dependency injection points allows for more modularity,
for example keeping the default processing of string-encoded labels while converting
the data to a `Dataset` before passing it to Keras.
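As a rough illustration, a `dataset_transformer` that converts the data into a
`tf.data.Dataset` could be sketched as follows, assuming the `(X, y, sample_weight)`
tuple protocol shown in the discussion above (the class name and batch size are
illustrative):

.. code-block:: python

   import tensorflow as tf
   from sklearn.base import BaseEstimator, TransformerMixin

   class ToDataset(BaseEstimator, TransformerMixin):
       """Pack (X, y, sample_weight) into a tf.data.Dataset for Keras."""

       def fit(self, data, y=None):
           return self  # stateless: nothing to learn from the data

       def transform(self, data):
           X, y, sample_weight = data
           if y is None:
               # Prediction time: only features are available.
               return (X, None, None)
           tensors = (X, y) if sample_weight is None else (X, y, sample_weight)
           dataset = tf.data.Dataset.from_tensor_slices(tensors).batch(32)
           # Keras accepts the Dataset in the "x" slot: Keras.fit(x=dataset).
           return (dataset, None, None)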

Multi-input and output models
+++++++++++++++++++++++++++++

Scikit-Learn natively supports multiple outputs, although it technically
requires them to be arrays of equal length
@@ -208,11 +247,11 @@ type, and implements basic handling of the following cases out of the box:
+--------------------------+--------------+----------------+----------------+---------------+
| "binary"                 | [1, 0, 1]    | 1              | 1 or 2         | Yes           |
+--------------------------+--------------+----------------+----------------+---------------+
| "multilabel-indicator"   | [[1, 1],     | 1 or >1        | 2 per target   | Single output |
|                          |              |                |                |               |
|                          | [0, 1],      |                |                | only          |
|                          |              |                |                |               |
|                          | [1, 0]]      |                |                |               |
+--------------------------+--------------+----------------+----------------+---------------+
| "multiclass-multioutput" | [[1, 1],     | >1             | >=2 per target | No            |
|                          |              |                |                |               |
@@ -232,6 +271,13 @@
If you find that your target is classified as ``"multiclass-multioutput"`` or ``"unknown"``, you will have to
implement your own data processing routine.
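
To see how a given target will be classified, you can check it with Scikit-Learn's
`type_of_target` utility, which is what this determination is based on:

.. code-block:: python

   from sklearn.utils.multiclass import type_of_target

   print(type_of_target([1, 0, 1]))         # "binary"
   print(type_of_target([[1, 1], [0, 1]]))  # "multilabel-indicator"
   print(type_of_target([[1, 2], [0, 1]]))  # "multiclass-multioutput"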

In addition to converting data, `feature_encoder` and `target_encoder` allow you to inject data
into your model construction method. This is useful if, for example, you use `target_encoder` to
dynamically determine how many outputs your model should have based on the data, and then use
this information to assign the right number of outputs in your Model. To return data from
`feature_encoder` or `target_encoder`, provide a transformer with a `get_metadata` method that
returns a dictionary; this dictionary will be injected into your model building function via
the `meta` parameter.
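
A minimal sketch of this mechanism (the encoder class, the `user_n_outputs_` key, and the
model-building function are illustrative, not part of SciKeras):

.. code-block:: python

   import numpy as np
   from sklearn.base import BaseEstimator, TransformerMixin
   from tensorflow import keras

   class OutputCountingEncoder(BaseEstimator, TransformerMixin):
       """Pass y through unchanged, but record how many outputs it implies."""

       def fit(self, y):
           self.n_outputs_ = 1 if np.ndim(y) == 1 else np.shape(y)[1]
           return self

       def transform(self, y):
           return y

       def get_metadata(self):
           # Merged into the `meta` dict given to the model-building function.
           return {"user_n_outputs_": self.n_outputs_}

   def get_model(meta):
       # `meta` now contains "user_n_outputs_" alongside SciKeras' own metadata.
       model = keras.Sequential([
           keras.layers.Dense(32, activation="relu"),
           keras.layers.Dense(meta["user_n_outputs_"]),
       ])
       model.compile(loss="mse")
       return model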

For complete examples implementing custom data processing, see the examples in the
:ref:`tutorials` section.
