This package contains libraries to integrate MLeap with:
- PySpark
- Scikit-Learn
- TensorFlow (coming soon)
Install MLeap from PyPI:

```
$ pip install mleap
```
MLeap's PySpark library provides serialization and de-serialization to and from Bundle.ML. There is 100% parity between MLeap's PySpark and Scala/Spark support, and all of the supported transformers are listed in the MLeap documentation.
We have both a basic tutorial and an advanced demo of serializing and de-serializing with PySpark, but in short: you can continue to write ML pipelines as you normally would, and we provide the following interface for serialization and de-serialization:
```python
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
from pyspark.ml import Pipeline

# Define your pipeline
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)

# Fit your pipeline
fittedPipeline = featurePipeline.fit(df)

# Serialize your fitted pipeline; the transformed DataFrame supplies the schema
fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip", fittedPipeline.transform(df))
```
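De-serialization is the symmetric operation; a minimal sketch, assuming the bundle written above and the `mleap.pyspark` imports from the previous snippet (which attach `deserializeFromBundle` to `PipelineModel`):

```python
from pyspark.ml import PipelineModel

# Load the MLeap Bundle back into a native Spark ML PipelineModel
deserializedPipeline = PipelineModel.deserializeFromBundle("jar:file:/tmp/pyspark.example.zip")
deserializedPipeline.transform(df)
```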
MLeap's PySpark extensions also include a `StringMap` transformer, which maps a string column to a double column given a dict of label mappings:

```python
from mleap.pyspark.feature.string_map import StringMap  # assumed import path

# dict of label mappings; handleInvalid='keep' sends unmapped keys to defaultValue
labels = {'a': 1.0}
string_map_transformer = StringMap(
    labels, 'key_col', 'value_col', handleInvalid='keep', defaultValue=0.0)
```
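A minimal usage sketch, assuming an active `SparkSession` named `spark` (the column name `key_col` matches the constructor argument above):

```python
df = spark.createDataFrame([('a',), ('b',)], ['key_col'])

# 'a' is in the label dict and maps to 1.0; 'b' is not, so with
# handleInvalid='keep' it receives the defaultValue of 0.0
string_map_transformer.transform(df).show()
```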
MLeap also ships a `MathUnary` extension transformer that applies a unary math operation (here, sine) to an input column. Example usage (the import path is an assumption; check your mleap version's package layout):

```python
import pandas as pd
from mleap.pyspark.feature.math_unary import MathUnary, UnaryOperation  # assumed path

# single-column input frame
input_dataframe = pd.DataFrame([[0.1], [0.2], [0.3]], columns=['f1'])

sin_transformer = MathUnary(
    operation=UnaryOperation.Sin,
    inputCol="f1",
    outputCol="sin(f1)",
)
sin_transformer.transform(input_dataframe)
```
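The output column then holds the element-wise sine of `f1`; as a quick cross-check with plain Python:

```python
import math

# expected values for the 'sin(f1)' column above
print([math.sin(x) for x in (0.1, 0.2, 0.3)])
# [0.09983341664682815, 0.19866933079506122, 0.29552020666133955]
```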
MLeap's Scikit-Learn library provides serialization to Bundle.ML (de-serialization coming soon). There is already parity between the math that scikit-learn and Spark transformers execute, and MLeap takes advantage of that to provide a common serialization format for the two technologies.
A simple example is the StandardScaler transformer, which normalizes the data given a mean and standard deviation. Both Spark and scikit-learn perform the standard normal transform on the data, and both can be serialized to the following format:
```json
{
  "op": "standard_scaler",
  "attributes": {
    "mean": {
      "double": [0.2354223, 1.34502332],
      "shape": {
        "dimensions": [{
          "size": 2,
          "name": ""
        }]
      },
      "type": "tensor"
    },
    "std": {
      "double": [0.13842223, 0.78320249],
      "shape": {
        "dimensions": [{
          "size": 2,
          "name": ""
        }]
      },
      "type": "tensor"
    }
  }
}
```
Scikit-Learn pipelines, just like Spark Pipelines, can be serialized to an MLeap Bundle and deployed to an MLeap runtime environment.
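As an illustration, here is a hedged sketch of serializing a scikit-learn StandardScaler pipeline with mleap's scikit-learn extensions (the column names and `/tmp` output path are illustrative, and exact signatures may vary by mleap version):

```python
import numpy as np
import pandas as pd

from mleap.sklearn.pipeline import Pipeline
from mleap.sklearn.preprocessing.data import FeatureExtractor
from sklearn.preprocessing import StandardScaler

input_features = ['a', 'b', 'c']

# Assemble the scalar input columns into a single feature vector
feature_extractor_tf = FeatureExtractor(input_scalars=input_features,
                                        output_vector='unscaled_continuous_features',
                                        output_vector_features=input_features)

# A standard scikit-learn scaler, extended by mleap with mlinit/serialize_to_bundle
standard_scaler_tf = StandardScaler(with_mean=True, with_std=True)
standard_scaler_tf.mlinit(prior_tf=feature_extractor_tf,
                          output_features='scaled_continuous_features')

standard_scaler_pipeline = Pipeline([(feature_extractor_tf.name, feature_extractor_tf),
                                     (standard_scaler_tf.name, standard_scaler_tf)])
standard_scaler_pipeline.mlinit()

df = pd.DataFrame(np.random.randn(10, 3), columns=input_features)
standard_scaler_pipeline.fit_transform(df)

# Write the fitted pipeline out as an MLeap Bundle under /tmp/mleap-bundle
standard_scaler_pipeline.serialize_to_bundle('/tmp', 'mleap-bundle', init=True)
```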
You can also take your scikit-learn pipelines and deploy them to a Spark cluster, because MLeap can de-serialize them into a Spark ML Pipeline and execute them on Spark DataFrames.
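That hand-off uses the same `deserializeFromBundle` call shown earlier; a minimal sketch, assuming the bundle written above has first been zipped to `/tmp/mleap-bundle.zip`:

```python
from pyspark.ml import PipelineModel

# Load the scikit-learn-exported bundle as a native Spark ML PipelineModel
sklearn_pipeline = PipelineModel.deserializeFromBundle("jar:file:/tmp/mleap-bundle.zip")
sklearn_pipeline.transform(spark_df)  # spark_df: a Spark DataFrame with columns a, b, c
```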
Documentation is available on the MLeap docs page.
Contributions are welcome! Make sure all Python tests pass; you can run them from the top-level Makefile:

```
make py37_test
```
If you'd rather use the inner `python/Makefile`, remember to set SCALA_CLASS_PATH by running:

```
source scripts/scala_classpath_for_python.sh
cd python/ && make test
```