Flesh out deployment-design.md
jeancochrane authored Oct 17, 2023
1 parent c0ffa5a · commit 5cfc7ca
Showing 1 changed file with 31 additions and 10 deletions.
docs/deployment-design.md

# Design doc: Model deployment pipeline

This doc proposes a simple CI/CD pipeline that we can use to deploy residential models and run experiments on them.

## Background

At a high level, our **existing process** for experimenting with changes to our models looks like this:

* Models run on an on-prem server
* Data scientists trigger model runs by SSHing into the server and running shell commands from cloned copies of this repo
* Model artifacts are saved using DVC
* Once an experiment run proves successful, data scientists commit the corresponding changes to model code

This process has the advantage of being simple, easy to maintain, and cheap. As a result, it has served our team well in the recent past, when we had only a few data scientists on staff and they needed to focus most of their effort on building a new model from the ground up. However, some of its **limitations** are becoming apparent as our team scales up and begins to expand its focus:

* Our on-prem server only has enough resources to run one model at a time, so only one data scientist may be running modeling experiments at a given time
* Further, our server has no notion of a job queue, so a data scientist who is waiting to run a model must notice that a previous run has completed and initiate their run manually
* Our on-prem server does not have a GPU, so it can't take advantage of GPU acceleration in libraries like XGBoost
* Model runs are decoupled from version control changes, so data scientists have to remember to commit their changes correctly
* Results of model runs are not easily accessible to PR reviewers

The design described below aims to remove these limitations while retaining as much simplicity, maintainability, and affordability as possible.

## Requirements

At a high level, a model deployment pipeline should:

* Integrate with our existing cloud infrastructure (GitHub and AWS)
* Trigger model runs from pull request branches
* Require code authors to approve model runs before they are initiated
* Run the model on ephemeral, cheap, and isolated cloud infrastructure
* Run multiple model runs simultaneously on separate hardware
* Report model statistics back to the pull request that triggered a run

## Design

Here is a rough sketch of a new model deployment pipeline (illustrative workflow code follows the list):

* Define a new workflow, `run-model.yaml`, that runs on:
  * Every commit to every pull request against the main branch
  * The `workflow_dispatch` event
* Set up the workflow so that it deploys to the `staging` environment and requires [manual approval](https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#using-environments-to-manually-trigger-workflow-jobs)
* Use the [`configure-aws-credentials`](https://github.com/aws-actions/configure-aws-credentials) action to authenticate with AWS
* Run Terraform to make sure an AWS Batch job queue and job definition exist for the PR
* Build and push a new Docker image to ECR
* Use the AWS CLI to [submit a job](https://docs.aws.amazon.com/cli/latest/reference/batch/submit-job.html) to the Batch queue
* Use the AWS CLI to [poll the job status](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/batch/describe-jobs.html) until it has a terminal status (`SUCCEEDED` or `FAILED`)
* Once the job has at least a `RUNNING` status, use the `logStreamName` response field to print a link to its logs
* _TK: Output?_
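
To make the steps above concrete, here is a minimal sketch of what the trigger, approval, and build portions of `run-model.yaml` could look like. The `staging` environment comes from the proposal, but the role ARN, region, image name, and Terraform directory are placeholder assumptions, and a `workflow_dispatch` run would need the PR number supplied as a workflow input (elided here).

```yaml
# .github/workflows/run-model.yaml -- illustrative sketch only
name: run-model

on:
  # Every commit to every pull request against the main branch
  pull_request:
    branches: [main]
  # Manual triggering
  workflow_dispatch:

jobs:
  run-model:
    runs-on: ubuntu-latest
    # Deploying to an environment with required reviewers is what
    # enforces manual approval before the model run starts
    environment: staging
    permissions:
      contents: read
      id-token: write  # required for OIDC authentication with AWS
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-model-run  # placeholder
          aws-region: us-east-1  # placeholder

      - name: Ensure a Batch job queue and job definition exist for this PR
        working-directory: terraform  # hypothetical location of our Terraform config
        run: |
          terraform init
          terraform apply -auto-approve \
            -var "pr_number=${{ github.event.pull_request.number }}"

      - name: Log in to ECR
        id: ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push a Docker image for the model
        run: |
          # Tag the image with the PR number so concurrent runs stay isolated
          IMAGE="${{ steps.ecr.outputs.registry }}/model-res:pr-${{ github.event.pull_request.number }}"
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
```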
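
Continuing the same job's `steps` list, the submit-and-poll steps might look something like this. The job queue and job definition names are hypothetical (whatever the Terraform step above creates for the PR), and rather than building a full console URL the sketch just prints the CloudWatch log group and stream once the job is running:

```yaml
      - name: Submit the model run to the Batch queue
        id: submit
        run: |
          JOB_ID=$(aws batch submit-job \
            --job-name "model-run-pr-${{ github.event.pull_request.number }}" \
            --job-queue "model-run-queue" \
            --job-definition "model-run-pr-${{ github.event.pull_request.number }}" \
            --query jobId --output text)
          echo "job-id=$JOB_ID" >> "$GITHUB_OUTPUT"

      - name: Poll the job status until it reaches a terminal state
        run: |
          while true; do
            STATUS=$(aws batch describe-jobs \
              --jobs "${{ steps.submit.outputs.job-id }}" \
              --query 'jobs[0].status' --output text)
            echo "Job status: $STATUS"
            # Once the job is running, surface its CloudWatch logs
            if [ "$STATUS" = "RUNNING" ] && [ -z "${LOGS_SHOWN:-}" ]; then
              STREAM=$(aws batch describe-jobs \
                --jobs "${{ steps.submit.outputs.job-id }}" \
                --query 'jobs[0].container.logStreamName' --output text)
              echo "Logs: log group /aws/batch/job, stream $STREAM"
              LOGS_SHOWN=1
            fi
            if [ "$STATUS" = "SUCCEEDED" ]; then exit 0; fi
            if [ "$STATUS" = "FAILED" ]; then exit 1; fi
            sleep 60
          done
```

A later iteration could replace the final log line with a step that posts model statistics back to the PR, which is the open _TK: Output?_ question above.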

## Tasks

* Add Docker image definition for the model
* Add GitHub workflow to deploy and run an AWS Batch job on commits to PRs
* _TK: Show better output?_
