[NOTE: This repo is a constant work in progress. Any feedback is greatly appreciated 😄]
A curated list of useful tools and references for production-level machine learning, both open source and proprietary. Also included are some pointers for tradeoffs of choosing
Data handling is a broad topic that requires a number of tools for processing, storage, privacy, pipelines, and more.
Open Source & SaaS Services
- Object store: Store binary data (images, sound files, compressed texts)
- Database: Store metadata (file paths, labels, user activity, etc).
- Postgres is the right choice for most applications, with best-in-class SQL and great support for unstructured JSON.
- Data Lake: Aggregate features that are not obtainable from the database (e.g. logs)
- Feature Store: Store, access, and share machine learning features. Feature extraction can be computationally expensive and nearly impossible to scale, so re-using features across models and teams is key to high-performance ML teams.
- FEAST: Only for Google Cloud
- Michelangelo Palette: Specific part of a proprietary project within Uber
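The split between an object store (binary data) and a database (metadata) described above can be sketched in a few lines. This is only an illustration under stated assumptions: a local directory stands in for the object store, the standard-library `sqlite3` module stands in for Postgres, and the names `put_blob`, `record_metadata`, `load_by_label`, and the `samples` table are hypothetical, not part of any tool listed here.

```python
import hashlib
import json
import sqlite3
import tempfile
from pathlib import Path


def make_db() -> sqlite3.Connection:
    """Create an in-memory metadata database (a stand-in for Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (blob_key TEXT PRIMARY KEY, label TEXT, extra TEXT)"
    )
    return conn


def put_blob(store_dir: Path, data: bytes) -> str:
    """Write raw bytes to the stand-in object store; the key is the SHA-256 hex digest."""
    key = hashlib.sha256(data).hexdigest()
    (store_dir / key).write_bytes(data)
    return key


def record_metadata(conn: sqlite3.Connection, key: str, label: str, extra: dict) -> None:
    """Store the blob key, its label, and free-form metadata serialized as JSON text."""
    conn.execute(
        "INSERT OR REPLACE INTO samples (blob_key, label, extra) VALUES (?, ?, ?)",
        (key, label, json.dumps(extra)),
    )
    conn.commit()


def load_by_label(conn: sqlite3.Connection, store_dir: Path, label: str) -> list:
    """Query metadata first, then fetch only the matching blobs from the store."""
    rows = conn.execute(
        "SELECT blob_key FROM samples WHERE label = ? ORDER BY blob_key", (label,)
    ).fetchall()
    return [(store_dir / key).read_bytes() for (key,) in rows]


if __name__ == "__main__":
    store = Path(tempfile.mkdtemp())  # hypothetical local "bucket"
    db = make_db()
    key = put_blob(store, b"fake image bytes")
    record_metadata(db, key, "cat", {"width": 64, "height": 64})
    print(len(load_by_label(db, store, "cat")))
```

The design point is that queries run against small metadata rows, and the large binary payloads are only read once the relevant keys are known.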
Open Source
There are a number of development frameworks out there. There are fundamental libraries as well as derivative APIs (e.g. Keras) that simplify the interface.
Open Source
- TensorFlow: Fundamental tool for deep learning, well supported by Google & the community
- PyTorch: Fundamental tool based on Torch, developed and well supported by Facebook & the community
- Keras: Simplified API for easier development
- scikit-learn
- DeepDetect
Open Source
- Polyaxon: reproducible machine learning at scale
- Datmo: replicable model versions
- MLflow: machine learning experiment tracking
- ModelDB: system for managing machine learning models for scikit-learn & spark.ml
- DVC: replicable ETL and feature extraction pipelines
- CookieCutter Data Science: replicable file structures for data projects
- Docker CookieCutter Data Science: fork of above to run cookie-cutter project in a Docker container
- Duct Tape: replicable running of code
- Dynamic Training Bench: TensorFlow training and tuning
- Sacred: reproduce experiments with a GUI to track
- Pachyderm: reproducible way to version data and ETL pipelines
- Django Estimators: specific to Django and scikit-learn estimators
- MAX: model template for tracking model types
- Kinoa: save experiment results easily
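What the tools above share is the idea of recording each run's parameters and results so that experiments can be compared and reproduced. The following is a minimal, standard-library sketch of that idea only; it does not use the API of any listed tool, and `fingerprint`, `log_run`, `best_run`, and the JSON-lines log format are assumptions made for illustration.

```python
import hashlib
import json
import time
from pathlib import Path


def fingerprint(params: dict) -> str:
    """Short stable hash of a parameter dict; key order does not affect the result."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def log_run(log_path: Path, params: dict, metrics: dict) -> dict:
    """Append one run (config ID, timestamp, params, metrics) as a JSON line."""
    record = {
        "config_id": fingerprint(params),
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def best_run(log_path: Path, metric: str) -> dict:
    """Return the logged run with the highest value for the given metric."""
    with log_path.open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


if __name__ == "__main__":
    log_file = Path("runs.jsonl")  # hypothetical log location
    log_run(log_file, {"lr": 0.1, "depth": 3}, {"accuracy": 0.81})
    log_run(log_file, {"lr": 0.01, "depth": 5}, {"accuracy": 0.86})
    print(best_run(log_file, "accuracy")["params"])
```

Dedicated trackers add what this sketch leaves out, such as code and data versioning, artifact storage, and a UI for comparing runs.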
SaaS Tools
- Argo: Open source Kubernetes-native workflow engine for orchestrating parallel jobs (includes workflows, events, CI, and CD).
- CircleCI: Language-inclusive support, custom environments, and flexible resource allocation; used by Instacart, Lyft, and StackShare.
- Travis CI
- Buildkite: Fast and stable builds, an open source agent that runs on almost any machine and architecture, and freedom to use your own tools and services
Open Source
- Jenkins: Open source, on-device build system
Open Source
SaaS Proprietary
- TensorFlow Extended (TFX)
- Michelangelo
- Google Cloud AI Platform
- Amazon SageMaker
- Neptune
- FloydHub
- Paperspace
- Determined AI
- Domino Data Lab
References