Tools for Production Level Machine Learning

[NOTE: This repo is a constant work in progress. Any feedback is greatly appreciated 😄]

A curated list of useful tools and references for production level machine learning, both open source and proprietary. Also included are some pointers on the tradeoffs of choosing between them.

Data

Data handling is a broad topic that requires a number of tools for processing, storage, privacy, pipelines, and more.

Storage

Open Source & SaaS Services

Object store: Store binary data (images, sound files, compressed texts)
- Amazon S3
- Ceph
Database: Store metadata (file paths, labels, user activity, etc).
- Postgres is the right choice for most applications, with best-in-class SQL and great support for unstructured JSON.
Data Lake: Aggregate features which are not obtainable from the database (e.g. logs)
- Amazon Redshift
Feature Store: Store, access, and share machine learning features. Feature extraction can be computationally expensive and nearly impossible to scale, so re-using features across models and teams is key to high-performance ML teams.
- FEAST: Only for Google Cloud
- Michelangelo Palette: Specific part of a proprietary project within Uber
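The re-use argument behind a feature store can be illustrated with a minimal sketch. This is not how FEAST or Michelangelo Palette work internally; the class `InMemoryFeatureStore`, the `text_length` extractor, and the two "model" consumers are all hypothetical names invented for illustration. The idea shown is only that a feature is computed once per entity and then served from the store to every consumer.

```python
from typing import Callable, Dict, Hashable, Tuple


class InMemoryFeatureStore:
    """Toy feature store: computes each feature once per entity and reuses it."""

    def __init__(self) -> None:
        self._extractors: Dict[str, Callable[[Hashable], float]] = {}
        self._cache: Dict[Tuple[str, Hashable], float] = {}
        self.compute_calls = 0  # counts how often an extractor actually ran

    def register(self, name: str, extractor: Callable[[Hashable], float]) -> None:
        """Make an extraction function available under a shared feature name."""
        self._extractors[name] = extractor

    def get(self, name: str, entity_id: Hashable) -> float:
        """Return the feature value, computing it only on the first request."""
        key = (name, entity_id)
        if key not in self._cache:
            self.compute_calls += 1
            self._cache[key] = self._extractors[name](entity_id)
        return self._cache[key]


def text_length(doc: str) -> float:
    # Stand-in for an expensive extraction step (hypothetical example feature).
    return float(len(doc))


store = InMemoryFeatureStore()
store.register("text_length", text_length)

# Two hypothetical consumers ("models") ask for the same feature.
feature_for_model_a = store.get("text_length", "hello world")
feature_for_model_b = store.get("text_length", "hello world")
```

After both calls, `store.compute_calls` is 1: the second consumer is served from the cache. A real feature store adds persistence, versioning, and access control on top of this idea.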

Pipelines

Open Source

Machine Learning Development Frameworks

There are a number of development frameworks out there. There are fundamental libraries as well as derivative APIs (e.g. Keras) which simplify the interface.

Open Source

TensorFlow: Fundamental tool for deep learning, well supported by Google & community
PyTorch: Fundamental tool based upon Torch, developed and well supported by Facebook & community
Keras: Simplified API for easier development
scikit-learn
DeepDetect

Model / Experiment Management

Open Source

Polyaxon: reproducible machine learning at scale
Datmo: replicable model versions
MLflow: machine learning experiment tracking
ModelDB: system for managing machine learning models for scikit-learn & spark.ml
DVC: replicable ETL and feature extraction pipelines
CookieCutter Data Science: replicable file structures for data projects
Docker CookieCutter Data Science: fork of above to run cookie-cutter project in a Docker container
Duct Tape: replicable running of code
Dynamic Training Bench: TensorFlow training and tuning
Sacred: reproduce experiments, with a GUI to track them
Pachyderm: reproducible way to version data and ETL pipelines
Django Estimators: specific to django and scikit-learn estimators
MAX: model template for tracking model types
Kinoa: save experiment results easily
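What the tools above have in common is that they record the parameters and results of each run so it can be found and reproduced later. The sketch below shows that core idea using only the Python standard library. It does not use the API of any tool in the list; `log_run`, `best_run`, the `runs` directory, and the parameter and metric values are all hypothetical placeholders.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def log_run(run_dir: Path, params: dict, metrics: dict) -> Path:
    """Write one experiment record to run_dir and return its path."""
    # Deterministic ID from the parameters, so identical configs are easy to spot.
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    run_id = hashlib.sha256(encoded).hexdigest()[:12]
    record = {
        "run_id": run_id,
        "logged_at": time.time(),
        "params": params,
        "metrics": metrics,
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / f"{run_id}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


def best_run(run_dir: Path, metric: str) -> dict:
    """Return the logged record with the highest value for `metric`."""
    records = [json.loads(p.read_text()) for p in run_dir.glob("*.json")]
    return max(records, key=lambda r: r["metrics"][metric])


# Placeholder values for illustration only, not real results.
runs = Path(tempfile.mkdtemp()) / "runs"
log_run(runs, {"lr": 0.1, "depth": 3}, {"accuracy": 0.81})
log_run(runs, {"lr": 0.01, "depth": 5}, {"accuracy": 0.87})
winner = best_run(runs, "accuracy")
```

Because the run ID is derived from the parameters, re-running the same configuration overwrites its earlier record. A dedicated tool would typically also capture the code version, the data version, and the environment.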

Continuous Integration

SaaS Tools

Argo: Open source Kubernetes-native workflow engine for orchestrating parallel jobs (includes workflows, events, CI and CD).
CircleCI: Language-inclusive support, custom environments, flexible resource allocation; used by Instacart, Lyft, and StackShare.
Travis CI
Buildkite: Fast and stable builds, open source agent that runs on almost any machine and architecture, freedom to use your own tools and services

Open Source

Jenkins: Open source, self-hosted build system
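For machine learning projects, a CI step often checks model quality in addition to running unit tests. The following is a minimal sketch of such a check, not a feature of any CI tool listed here. The function names `quality_gate` and `main`, the metric names, and the threshold values are hypothetical. It relies only on the common convention that a CI step fails when its command exits with a non-zero code.

```python
import sys


def quality_gate(metrics: dict, thresholds: dict) -> list:
    """Return a list of human-readable failures; empty means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif value < minimum:
            failures.append(f"{name}: {value} is below the minimum {minimum}")
    return failures


def main(metrics: dict, thresholds: dict) -> int:
    """Print any failures to stderr and return a process exit code."""
    failures = quality_gate(metrics, thresholds)
    for failure in failures:
        print(f"FAILED {failure}", file=sys.stderr)
    # CI systems generally treat a non-zero exit code as a failed step.
    return 1 if failures else 0


# Placeholder numbers for illustration only.
exit_code = main({"accuracy": 0.87}, {"accuracy": 0.85, "recall": 0.80})
```

In a real pipeline the script would end with `sys.exit(main(...))`, with the metrics loaded from the output of the training or evaluation job. Here `exit_code` is non-zero because the `recall` metric was never reported.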

Training for Machine Learning / Deep Learning

Open Source

SystemML - for big data applications using Spark
FfDL

For Production Systems / Model Serving

Open Source

End-to-End

SaaS Proprietary

References

https://www.researchgate.net/publication/229869757_An_Introduction_to_Model_Versioning

asampat3090/production-level-machine-learning
