curated list of awesome open source repositories for data pipelining and machine learning in production.

asampat3090/production-level-machine-learning

Tools for Production Level Machine Learning

[NOTE: This repo is a constant work in progress. Any feedback is greatly appreciated 😄]

A curated list of useful tools and references for production level machine learning, both open source and proprietary. Also included are some pointers for tradeoffs of choosing

Data

Data handling is a broad topic that requires a number of tools for processing, storage, privacy, pipelines, and more.

Storage

Open Source & SaaS Services

  • Object store: Store binary data (images, sound files, compressed texts)
  • Database: Store metadata (file paths, labels, user activity, etc).
    • Postgres is the right choice for most applications, with best-in-class SQL and great support for unstructured JSON.
  • Data Lake: to aggregate features which are not obtainable from database (e.g. logs)
  • Feature Store: store, access, and share machine learning features. Feature extraction can be computationally expensive and nearly impossible to scale, so reusing features across models and teams is key to high-performing ML teams.
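
As a rough illustration of how these storage roles fit together, here is a minimal sketch using only the Python standard library. It rests on several assumptions: an in-memory dict stands in for the object store, `sqlite3` stands in for the metadata database (only so the sketch runs without a server; Postgres is the recommendation above), and the names `put_object`, `add_sample`, `get_feature`, and `byte_length_feature` are hypothetical, not taken from any tool in this list.

```python
import hashlib
import json
import sqlite3

# Stand-in for an object store: maps a content hash to the raw bytes.
object_store = {}

def put_object(data: bytes) -> str:
    """Store binary data under its SHA-256 digest and return that key."""
    key = hashlib.sha256(data).hexdigest()
    object_store[key] = data
    return key

# Stand-in for the metadata database (assumption: sqlite3 instead of Postgres).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE samples (object_key TEXT PRIMARY KEY, label TEXT, extra TEXT)"
)

def add_sample(data: bytes, label: str, extra: dict) -> str:
    """Put the bytes in the object store and the metadata in the database."""
    key = put_object(data)
    db.execute(
        "INSERT OR REPLACE INTO samples VALUES (?, ?, ?)",
        (key, label, json.dumps(extra)),  # unstructured metadata kept as JSON text
    )
    return key

# Hypothetical feature store: caches features per (object key, feature name)
# so that several models can reuse a feature without recomputing it.
feature_store = {}
stats = {"computed": 0}  # counts how often a feature was actually computed

def byte_length_feature(data: bytes) -> int:
    """Toy feature; a real one might be an embedding or a spectrogram."""
    stats["computed"] += 1
    return len(data)

def get_feature(key: str, name: str, compute):
    """Return the cached feature, computing it only on the first request."""
    cache_key = (key, name)
    if cache_key not in feature_store:
        feature_store[cache_key] = compute(object_store[key])
    return feature_store[cache_key]
```

Calling `get_feature` twice with the same key and feature name runs the extraction function once; the second call is served from the cache, which is the reuse the feature store bullet describes.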

Pipelines

Open Source

Machine Learning Development Frameworks

There are a number of development frameworks out there. There are fundamental libraries as well as derivative APIs (e.g. Keras) which simplify the interface.

Open Source

  • TensorFlow: Fundamental tool for deep learning, well supported by Google & community
  • PyTorch: Fundamental tool based upon Torch, developed and well supported by Facebook & community
  • Keras: Simplified API for easier development
  • Scikit Learn
  • DeepDetect

Model / Experiment Management

Open Source

  • Polyaxon: reproducible machine learning at scale
  • Datmo: replicable model versions
  • MLflow: machine learning experiment tracking
  • ModelDB: system for managing machine learning models for scikit-learn & spark.ml
  • DVC: replicable ETL and feature extraction pipelines
  • CookieCutter Data Science: replicable file structures for data projects
  • Docker CookieCutter Data Science: fork of above to run cookie-cutter project in a Docker container
  • Duct Tape: replicable running of code
  • Dynamic Training Bench: TensorFlow training and tuning
  • Sacred: reproduce experiments, with a GUI to track them
  • Pachyderm: reproducible way to version data and ETL pipelines
  • Django Estimators: specific to Django and scikit-learn estimators
  • MAX: model template for tracking model types
  • Kinoa: save experiment results easily
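
To make concrete what these tools keep track of, here is a minimal sketch of experiment logging using only the Python standard library. It is not the API of any tool listed above; `log_run` and `best_run` are hypothetical names, and the one-JSON-file-per-run layout is an assumption chosen for brevity. The idea is to record parameters, metrics, and a hash of the input data for each run so that runs can be compared and repeated later.

```python
import hashlib
import json
import time
from pathlib import Path

def log_run(run_dir: Path, params: dict, metrics: dict, data_file: Path) -> Path:
    """Write one JSON record with what is needed to compare or repeat a run."""
    record = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        # Hash of the input data, so a changed dataset can be detected later.
        "data_sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    # The file name is derived from the parameters, so re-running the same
    # configuration overwrites its earlier record instead of adding a new one.
    run_id = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:12]
    out = run_dir / f"{run_id}.json"
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out

def best_run(run_dir: Path, metric: str) -> dict:
    """Return the logged run with the highest value for the given metric."""
    runs = [json.loads(p.read_text()) for p in run_dir.glob("*.json")]
    return max(runs, key=lambda r: r["metrics"][metric])
```

The dedicated tools add what this sketch leaves out, such as code versioning, artifact storage, and a UI for browsing runs.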

Continuous Integration

SaaS Tools

  • Argo: Open source Kubernetes-native workflow engine for orchestrating parallel jobs (includes workflows, events, CI and CD).
  • CircleCI: Language-inclusive support, custom environments, flexible resource allocation; used by Instacart, Lyft, and StackShare.
  • Travis CI
  • Buildkite: Fast and stable builds; open source agent runs on almost any machine and architecture; freedom to use your own tools and services

Open Source

  • Jenkins: Open source build system that runs on your own machines

Training for Machine Learning / Deep Learning

Open Source

For Production Systems / Model Serving

Open Source

End-to-End

SaaS Proprietary

References
