chartpath/reyearn

Reyearn

A server that aims to help people evolve NLP models in production by tightly looping the development pipeline: ingestion -> annotation -> training -> testing -> deploying. It takes an ensemble-of-models approach. For more on the philosophy, read the vision doc.

Made by and for generalists who need a scalable and production-ready NLP solution out of the box, with a friendly pathway towards understanding the underlying concepts. It won't be all things data engineering or all things ML, but should have sane default solutions to common needs. See the roadmap below and open some issues!

Features

  • Can be installed into an existing application's PostgreSQL database
  • Built with multi-tenancy in mind so observations can be siloed by organization
  • Everything is async and distributed
  • Runtime DAG-based ingestion pipeline to load training data into the database
  • Observations and annotations can be added directly through the API
  • New predictions optionally persisted as annotations and observations for future processing
  • Annotations optionally trigger the model training pipeline to run without blocking
  • Previously predicted annotations can be confirmed or rejected to determine if they will get picked up in the next training run
  • If the newly trained model is more accurate than the previous one, it will be injected as the new live model via hot-reloading
  • Many different models can be used in parallel
  • Models can be trained on multiple labels
  • Naive Bayes with TF-IDF vectors is the initial text classification algorithm
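
The initial classifier described above can be sketched with scikit-learn's standard building blocks. The sample texts and labels here are illustrative only, not taken from Reyearn's schema:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical labels, for illustration only)
texts = [
    "refund my order please",
    "cancel my subscription",
    "great product, love it",
    "works perfectly, thanks",
]
labels = ["complaint", "complaint", "praise", "praise"]

# TF-IDF turns raw text into weighted term vectors; MultinomialNB then
# fits per-class term distributions over those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["please cancel and refund"]))
```

Because the whole thing is a single `Pipeline`, the trained artifact can be persisted and hot-swapped as one object, which fits the hot-reloading flow described above.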

Usage

Assuming PostgreSQL is installed and running on the default port with no username or password (the default for Homebrew builds on macOS), this sequence of commands should get you going.

```shell
$ pip install poetry
$ poetry install
...
$ poetry shell
(reyearn-venv)$ createdb reyearn_dev
(reyearn-venv)$ cd db && alembic upgrade head
...
(reyearn-venv)$ cd .. && python -m server
```

To set up a data import, see the data import readme. Then, to ingest the data, run the importer pipeline manually with python -m dags.importer. The server and the importer and trainer DAGs all have debug configs for VS Code.

The model trainer can be run standalone with python -m dags.trainer.

To explore the API, once the server is running, go to http://127.0.0.1:8000/docs in your browser and call the endpoints.

Implementation Details

Data Workflows

Prefect is used for parallel and distributed execution of workflows (DAGs). It uses Dask for running serialized distributed tasks that integrate well with the Python data science ecosystem.

API Tooling

FastAPI is the web glue library of choice because it's async (thanks to Starlette) and provides a lot of type safety and validation out of the box.

PostgreSQL Database

Tight integration with PostgreSQL is the main approach, with a thin layer of SQLAlchemy Core on top for metadata reflection and query generation. Class label hierarchies are implemented using the LTREE data type. Alembic is used for migrations.
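
LTREE stores each label as a dotted path, so a subtree of the class hierarchy can be selected with a single ancestor-match operator. A hypothetical illustration (table and label names are examples, not Reyearn's actual schema):

```sql
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE class_labels (
    id serial PRIMARY KEY,
    path ltree NOT NULL
);

INSERT INTO class_labels (path) VALUES
    ('intent'),
    ('intent.billing'),
    ('intent.billing.refund');

-- <@ matches descendants: returns the whole billing subtree
SELECT path FROM class_labels WHERE path <@ 'intent.billing';
```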

NLP Tooling

All model training either uses or is compatible with scikit-learn APIs. This is the same approach as dask-ml.
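
In practice, "compatible with scikit-learn APIs" means duck typing on fit/predict: any estimator exposing those methods can slot into the training code. A toy majority-class estimator (hypothetical, for illustration):

```python
from collections import Counter

class MajorityClassifier:
    """Minimal estimator following the scikit-learn fit/predict convention."""

    def fit(self, X, y):
        # Remember the most common label seen during training
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the majority label for every input
        return [self.majority_ for _ in X]

clf = MajorityClassifier().fit(["a", "b", "c"], ["spam", "spam", "ham"])
print(clf.predict(["anything"]))
```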

Roadmap/wishlist

This will probably be moved to an issue tracker. The list keeps expanding from todos and nice-to-haves observed during development. In no particular order:

  • model metadata and training endpoints
  • class label endpoints
  • integration tests
  • KNN search endpoint with btree_gist on pg_trgm
  • plugin system with base classes
  • POS tagging endpoint
  • fine-tune default classifier with better sample control
  • annotation type and range columns to support NER
  • NER endpoint
  • convert Naive Bayes to use HashingVectorizer
  • swap out joblib backend for dask-ml
  • gzip file upload observations
  • CLI to wrap DAGs and API
  • python and node client libraries
  • proper tutorial in docs
  • proper config management
  • basic JWT/security
  • PII obfuscation
  • custom result handlers in DAGs for failure triage
  • prodigy or equivalent annotation UI integration
  • experiment tracking/reporting
  • additional data ingestion sources (e.g. data lakes, distributed file systems)
  • additional ML algorithms
