chartpath/reyearn

Reyearn

A server that aims to help people evolve NLP models in production by tightly looping the development pipeline: ingestion -> annotation -> training -> testing -> deploying. It takes an ensemble-of-models approach. For more on the philosophy, read the vision doc.

Made by and for generalists who need a scalable and production-ready NLP solution out of the box, with a friendly pathway towards understanding the underlying concepts. It won't be all things data engineering or all things ML, but should have sane default solutions to common needs. See the roadmap below and open some issues!

Features

  • Can be installed into an existing application's PostgreSQL database
  • Built with multi-tenancy in mind so observations can be siloed by organization
  • Everything is async and distributed
  • Runtime DAG-based ingestion pipeline to load training data into the database
  • Observations and annotations can be added directly through the API
  • New predictions optionally persisted as annotations and observations for future processing
  • Annotations optionally trigger the model training pipeline to run without blocking
  • Previously predicted annotations can be confirmed or rejected to determine if they will get picked up in the next training run
  • If the newly trained model is more accurate than the previous one, it will be injected as the new live model via hot-reloading
  • Many different models can be used in parallel
  • Models can be trained on multiple labels
  • Naive Bayes with TF-IDF vectors is the initial text classification algorithm
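
The initial classifier described above can be sketched with scikit-learn's standard building blocks. The sample texts and labels here are illustrative only, not taken from Reyearn's schema:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical labels, for illustration only)
texts = [
    "refund my order please",
    "cancel my subscription",
    "great product, love it",
    "works perfectly, thanks",
]
labels = ["complaint", "complaint", "praise", "praise"]

# TF-IDF turns raw text into weighted term vectors; MultinomialNB then
# fits per-class term distributions over those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["please cancel and refund"]))
```

Because the whole thing is a single `Pipeline`, the trained artifact can be persisted and hot-swapped as one object, which fits the hot-reloading flow described above.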

Usage

Assuming PostgreSQL is installed and running on the default port with no username or password (the default for Homebrew builds on macOS), this sequence of commands should get you going.

```shell
$ pip install poetry
$ poetry install
...
$ poetry shell
(reyearn-venv)$ createdb reyearn_dev
(reyearn-venv)$ cd db && alembic upgrade head
...
(reyearn-venv)$ cd .. && python -m server
```

To set up a data import, see the data import readme. Then, to ingest the data, run the importer pipeline manually with python -m dags.importer. The server and the importer and trainer DAGs all have debug configs for VS Code.

The model trainer can be run standalone with python -m dags.trainer.

To explore the API, once the server is running, go to http://127.0.0.1:8000/docs in your browser and call the endpoints.

Implementation Details

Data Workflows

Prefect is used for parallel and distributed execution of workflows (DAGs). It uses Dask for running serialized distributed tasks that integrate well with the Python data science ecosystem.

API Tooling

FastAPI is the web glue library of choice because it's async (thanks to Starlette) and provides a lot of type safety and validation out of the box.

PostgreSQL Database

Tight integration with PostgreSQL is the main approach, with a thin layer of SQLAlchemy Core on top for metadata reflection and query generation. Class label hierarchies are implemented using the LTREE data type. Alembic is used for migrations.
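
LTREE stores each label as a dotted path, so a subtree of the class hierarchy can be selected with a single ancestor-match operator. A hypothetical illustration (table and label names are examples, not Reyearn's actual schema):

```sql
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE class_labels (
    id serial PRIMARY KEY,
    path ltree NOT NULL
);

INSERT INTO class_labels (path) VALUES
    ('intent'),
    ('intent.billing'),
    ('intent.billing.refund');

-- <@ matches descendants: returns the whole billing subtree
SELECT path FROM class_labels WHERE path <@ 'intent.billing';
```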

NLP Tooling

All model training either uses or is compatible with scikit-learn APIs. This is the same approach as dask-ml.
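
In practice, "compatible with scikit-learn APIs" means duck typing on fit/predict: any estimator exposing those methods can slot into the training code. A toy majority-class estimator (hypothetical, for illustration):

```python
from collections import Counter

class MajorityClassifier:
    """Minimal estimator following the scikit-learn fit/predict convention."""

    def fit(self, X, y):
        # Remember the most common label seen during training
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the majority label for every input
        return [self.majority_ for _ in X]

clf = MajorityClassifier().fit(["a", "b", "c"], ["spam", "spam", "ham"])
print(clf.predict(["anything"]))
```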

Roadmap/wishlist

This will probably be moved to an issue tracker. The list keeps expanding from todos and nice-to-haves observed during development. In no particular order:

  • model metadata and training endpoints
  • class label endpoints
  • integration tests
  • KNN search endpoint with btree_gist on pg_trgm
  • plugin system with base classes
  • POS tagging endpoint
  • fine-tune default classifier with better sample control
  • annotation type and range columns to support NER
  • NER endpoint
  • convert Naive Bayes to use HashingVectorizer
  • swap out joblib backend for dask-ml
  • gzip file upload observations
  • CLI to wrap DAGs and API
  • python and node client libraries
  • proper tutorial in docs
  • proper config management
  • basic JWT/security
  • PII obfuscation
  • custom result handlers in DAGs for failure triage
  • prodigy or equivalent annotation UI integration
  • experiment tracking/reporting
  • additional data ingestion sources (e.g. data lakes, distributed file systems)
  • additional ML algorithms
