Here we outline some more complicated transformers and transformers that require additional processing to work nicely with Pipelines and Feature Unions.
The LabelEncoder
is synonymous with StringIndexer
in Spark, however there are a couple of unique features of the scikit transformer that we need to account for:
LabelEncoder
only opperates on a single feature at a time- The output of the
LabelEncoder
is a numpy array of shape (1,n) instead of (n,1), which is required for further processing like One-Hot-Encoding
Here is what an example Pipeline looks like for a LabelEncoder
# Create a dataframe with some a categorical and a continuous feature
df = pd.DataFrame(np.array([ ['Alice', 32], ['Jack', 18], ['Bob',34]]), columns=['name', 'age'])
# Define our feature extractor
feature_extractor_tf = FeatureExtractor(input_scalars=['name'],
output_vector='name_continuous_feature',
output_vector_items=['name_label_encoded'])
# Label Encoder for x1 Label
label_encoder_tf = LabelEncoder()
label_encoder_tf.mlinit(input_features = feature_extractor_tf.output_vector_items, output_features='name_label_le')
# Reshape the output of the LabelEncoder to N-by-1 array
reshape_le_tf = ReshapeArrayToN1()
# Create our pipeline object and initialize MLeap Serialization
le_pipeline = Pipeline([(feature_extractor_tf.name, feature_extractor_tf),
(label_encoder_tf.name, label_encoder_tf),
(reshape_le_tf.name, reshape_le_tf)
])
le_pipeline.mlinit()
# Transform our DataFrame
le_pipeline.fit_transform(df)
# output
array([[0],
[2],
[1]])
Next step is to combine the label indexer with a OneHotEncoder
We'll continue the example above to demonstrate how the out-of-the-box Scikit OneHotEncoder works.
## Vector Assembler for x1 One Hot Encoder
one_hot_encoder_tf = OneHotEncoder(sparse=False) # Make sure to set sparse=False
one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf, output_features = '{}_one_hot_encoded'.format(label_encoder_tf.output_features))
#
# Construct our pipeline
one_hot_encoder_pipeline_x0 = Pipeline([
(feature_extractor_tf.name, feature_extractor_tf),
(label_encoder_tf.name, label_encoder_tf),
(reshape_le_tf.name, reshape_le_tf),
(one_hot_encoder_tf.name, one_hot_encoder_tf)
])
one_hot_encoder_pipeline_x0.mlinit()
# Execute our LabelEncoder + OneHotEncoder pipeline on our dataframe
one_hot_encoder_pipeline_x0.fit_transform(df)
matrix([[ 1., 0., 0.],
[ 0., 0., 1.],
[ 0., 1., 0.]])
One of the short-comings of Scikit's OneHotEncoder is that it's missing a drop_last
functionality that's required in ML pipelines.
MLeap comes with it's own OneHotEncoder that enables that function
Very similar to the Scikit OneHotEncoder, except we set an additional drop_last
attribute.
from mleap.sklearn.extensions.data import OneHotEncoder
## Vector Assembler for x1 One Hot Encoder
one_hot_encoder_tf = OneHotEncoder(sparse=False, drop_last=True) # Make sure to set sparse=False
one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf, output_features = '{}_one_hot_encoded'.format(label_encoder_tf.output_features))
#
# Construct our pipeline
one_hot_encoder_pipeline_x0 = Pipeline([
(feature_extractor_tf.name, feature_extractor_tf),
(label_encoder_tf.name, label_encoder_tf),
(reshape_le_tf.name, reshape_le_tf),
(one_hot_encoder_tf.name, one_hot_encoder_tf)
])
one_hot_encoder_pipeline_x0.mlinit()
# Execute our LabelEncoder + OneHotEncoder pipeline on our dataframe
one_hot_encoder_pipeline_x0.fit_transform(df)
matrix([[ 1., 0.],
[ 0., 0.],
[ 0., 1.]])