Add Dockerfile and build-and-run-model workflow for CI model runs #9
Changes from 19 commits
06a7bc9
8eb4fa7
36e84e5
1d8bb4a
299b40b
f2391db
d9b0712
7fff710
2315f8e
8bd4983
a5a64f5
190210f
e11dfb1
82c3081
82d6155
e801405
33f83a5
4756f53
41ebdfb
a852f22
3f2114e
fd2fcdd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# Workflow that builds a Docker image containing the model code, | ||
# pushes it to the GitHub Container Registry, and then optionally uses | ||
# that container image to run the model using an AWS Batch job. | ||
# | ||
# Images are built on every commit to a PR or main branch in order to ensure | ||
# that the build continues to work properly, but Batch jobs are gated behind | ||
# a `deploy` environment that requires manual approval from a | ||
# @ccao-data/core-team member. | ||
|
||
name: build-and-run-model | ||
|
||
on: | ||
pull_request: | ||
types: [opened, reopened, synchronize, closed] | ||
workflow_dispatch: | ||
push: | ||
branches: [master] | ||
|
||
jobs: | ||
build-and-run-model: | ||
permissions: | ||
# contents:read and id-token:write permissions are needed to interact | ||
# with GitHub's OIDC Token endpoint so that we can authenticate with AWS | ||
contents: read | ||
id-token: write | ||
# While packages:write is usually not required for workflows, it is | ||
# required in order to allow the reusable called workflow to push to | ||
# GitHub Container Registry | ||
packages: write | ||
uses: ccao-data/actions/.github/workflows/build-and-run-batch-job.yaml@jeancochrane/add-batch-and-terraform-workflows-and-actions | ||
with: | ||
ref: jeancochrane/add-batch-and-terraform-workflows-and-actions | ||
vcpu: "16.0" | ||
memory: "65536" | ||
role-duration-seconds: 14400 # Worst-case time for a full model run | ||
secrets: | ||
AWS_IAM_ROLE_TO_ASSUME_ARN: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }} | ||
AWS_ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
FROM rocker/r-ver:4.3.1 | ||
|
||
# Use PPM for binary installs | ||
ENV RENV_CONFIG_REPOS_OVERRIDE "https://packagemanager.posit.co/cran/__linux__/jammy/latest" | ||
ENV RENV_PATHS_LIBRARY renv/library | ||
|
||
# Install system dependencies | ||
RUN apt-get update && apt-get install --no-install-recommends -y \ | ||
libcurl4-openssl-dev libssl-dev libxml2-dev libgit2-dev git \ | ||
libudunits2-dev python3-dev python3-pip libgdal-dev libgeos-dev \ | ||
libproj-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev pandoc | ||
|
||
# Install pipenv for Python dependencies | ||
RUN pip install pipenv | ||
|
||
# Copy pipenv files into the image. The reason this is a separate step from | ||
# the later step that adds files from the working directory is because we want | ||
# to avoid having to reinstall dependencies every time a file in the directory | ||
# changes, as Docker will bust the cache of every layer following a layer that | ||
# needs to change | ||
COPY Pipfile . | ||
COPY Pipfile.lock . | ||
|
||
# Install Python dependencies | ||
RUN pipenv install --system --deploy | ||
|
||
# Copy R bootstrap files into the image | ||
COPY renv.lock . | ||
COPY .Rprofile . | ||
COPY renv/ renv/ | ||
|
||
# Install R dependencies | ||
RUN Rscript -e 'renv::restore()' | ||
|
||
# Copy the directory into the container | ||
ADD ./ model-condo-avm/ | ||
|
||
# Copy R dependencies into the app directory | ||
RUN rm -Rf model-condo-avm/renv | ||
RUN mv renv model-condo-avm/ | ||
|
||
# Set the working directory to the app dir | ||
WORKDIR model-condo-avm/ | ||
|
||
CMD dvc pull && dvc repro |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
--- | ||
title: "Table of Contents" | ||
output: | ||
output: | ||
dfsnow marked this conversation as resolved.
|
||
github_document: | ||
toc: true | ||
toc_depth: 3 | ||
|
@@ -46,7 +46,7 @@ The repository itself contains the [code](./pipeline) and [data](./input) for th | |
|
||
## Differences Compared to the Residential Model | ||
|
||
The Cook County Assessor's Office ***does not track characteristic data for condominiums***. Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics. | ||
The Cook County Assessor's Office ***does not track characteristic data for condominiums***. Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics. | ||
|
||
The only information our office has about individual condominium units is their age, location, sale date/price, and percentage of ownership. This makes modeling condos particularly challenging, as the number of usable features is quite small. Fortunately, condos have two qualities which make modeling a bit easier: | ||
|
||
|
@@ -106,11 +106,9 @@ ccao::vars_dict %>% | |
) | ||
) %>% | ||
mutate(`Unique to Condo Model` = ifelse( | ||
var_name_model != "loc_tax_municipality_name" & ( | ||
We no longer need this conditional now that both repos are (temporarily) using |
||
var_name_model %in% condo_unique_preds | | ||
`Feature Name` %in% | ||
c("Condominium Building Year Built", "Condominium % Ownership") | ||
), | ||
var_name_model %in% condo_unique_preds | | ||
`Feature Name` %in% | ||
c("Condominium Building Year Built", "Condominium % Ownership"), | ||
"X", "" | ||
)) %>% | ||
arrange(desc(`Unique to Condo Model`), Category) %>% | ||
|
@@ -124,7 +122,7 @@ For the most part, condos are valued the same way as single- and multi-family re | |
|
||
However, because the CCAO has so [little information about individual units](#differences-compared-to-the-residential-model), we must rely on the [condominium percentage of ownership](#features-used) to differentiate between units in a building. This feature is effectively the proportion of the building's overall value held by a unit. It is created when a condominium declaration is filed with the County (usually by the developer of the building). The critical assumption underlying the condo valuation process is that percentage of ownership correlates with current market value. | ||
|
||
Percentage of ownership is used in two ways: | ||
Percentage of ownership is used in two ways: | ||
|
||
1. It is used directly as a predictor/feature in the regression model to estimate differing unit values within the same building. | ||
2. It is used to reapportion unit values directly, i.e., the value of a unit is ultimately equal to `% of ownership * total building value`. | |
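
As a rough illustration of the reapportionment step, the relationship `% of ownership * total building value` can be sketched in Python. The function name `reapportion_building_value`, the unit labels, and the dollar figures are hypothetical and are not taken from the model's code; the sketch also assumes that ownership shares are expressed as fractions that sum to 1 within a building.

```python
def reapportion_building_value(total_building_value, ownership_pct):
    """Split a building's total value across units by percentage of ownership.

    `ownership_pct` maps a (hypothetical) unit label to its share of the
    building, expressed as a fraction between 0 and 1.
    """
    total_share = sum(ownership_pct.values())
    # Assumption: shares within one building should sum to 1
    if abs(total_share - 1.0) > 1e-6:
        raise ValueError(f"ownership shares sum to {total_share}, expected 1.0")
    return {
        unit: share * total_building_value
        for unit, share in ownership_pct.items()
    }


# Illustrative numbers only, not real assessment data
unit_values = reapportion_building_value(
    1_000_000,
    {"unit_1A": 0.50, "unit_1B": 0.25, "unit_2A": 0.25},
)
```

Under these assumptions, a unit holding half of the ownership receives half of the building's total value, regardless of any unit-level characteristics.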
|
@@ -186,7 +184,7 @@ The condo model relies on sales within the same building to calculate [strata](# | |
|
||
Fortunately, buildings without any recent sales are relatively rare, as condos have a higher turnover rate than single- and multi-family properties. Smaller buildings with low turnover are the most likely to lack recent sales. | |
|
||
### Buildings Without Sales | ||
### Buildings Without Sales | ||
|
||
When no sales have occurred in a building in the 5 years prior to assessment, the building's strata features are imputed. The model will look at nearby buildings that have similar unit counts/age and then try to assign an appropriate strata to the target building. | ||
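
The README text above does not say how the imputation is implemented, so the following is only a nearest-neighbor sketch of the general idea, not the model's actual method. The function `impute_strata`, the dictionary keys (`lon`, `lat`, `units`, `age`, `strata`), the unscaled distance measure, and the sample buildings are all assumptions made for illustration.

```python
import math


def impute_strata(target, candidates, k=3):
    """Pick a stratum for `target` by majority vote among the k most similar buildings.

    Each building is a dict with hypothetical keys: lon, lat, units, age,
    and (for candidates only) strata.
    """
    def dissimilarity(a, b):
        # Unweighted Euclidean distance over raw features; a real
        # implementation would likely scale each feature first.
        return math.sqrt(
            (a["lon"] - b["lon"]) ** 2
            + (a["lat"] - b["lat"]) ** 2
            + (a["units"] - b["units"]) ** 2
            + (a["age"] - b["age"]) ** 2
        )

    nearest = sorted(candidates, key=lambda c: dissimilarity(target, c))[:k]
    votes = {}
    for building in nearest:
        votes[building["strata"]] = votes.get(building["strata"], 0) + 1
    # Ties go to the stratum seen first among the nearest buildings
    return max(votes, key=votes.get)


# Illustrative buildings only, not real data
target = {"lon": -87.63, "lat": 41.88, "units": 12, "age": 40}
candidates = [
    {"lon": -87.63, "lat": 41.88, "units": 10, "age": 38, "strata": 4},
    {"lon": -87.64, "lat": 41.89, "units": 14, "age": 45, "strata": 4},
    {"lon": -87.62, "lat": 41.87, "units": 200, "age": 5, "strata": 9},
    {"lon": -87.65, "lat": 41.90, "units": 11, "age": 41, "strata": 5},
]
imputed = impute_strata(target, candidates)
```

In this sketch the large, new building is never among the nearest neighbors of the small, older target, which mirrors the description of matching on similar unit counts and age.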
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -152,12 +152,12 @@ the 2023 assessment model. | |
| Percent Population Mobility, Moved From Within Same County in Past Year | acs5 | numeric | | | ||
| Longitude | loc | numeric | | | ||
| Latitude | loc | numeric | | | ||
| Municipality Name | loc | character | | | ||
This is just an artifact of removing the |
||
| FEMA Special Flood Hazard Area | loc | logical | | | ||
| First Street Factor | loc | numeric | | | ||
| First Street Risk Direction | loc | numeric | | | ||
| School Elementary District GEOID | loc | character | | | ||
| School Secondary District GEOID | loc | character | | | ||
| Municipality Name | loc | character | | | ||
| CMAP Walkability Score (No Transit) | loc | numeric | | | ||
| CMAP Walkability Total Score | loc | numeric | | | ||
| Airport Noise DNL | loc | numeric | | | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,7 +47,7 @@ stages: | |
cache: false | ||
- output/workflow/recipe/model_workflow_recipe.rds: | ||
cache: false | ||
frozen: true | ||
I'm pretty sure this step was erroneously marked as
This is definitely an error. The only step that should really ever be frozen is the ingest step. |
||
|
||
assess: | ||
cmd: Rscript pipeline/02-assess.R | ||
desc: > | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,6 +9,7 @@ | |
suppressPackageStartupMessages({ | ||
library(arrow) | ||
library(aws.s3) | ||
library(aws.ec2metadata) | ||
As we discovered in ccao-data/model-res-avm#26, this package is necessary in order to allow the |
||
library(ccao) | ||
library(dplyr) | ||
library(here) | ||
|
This is probably not correct, but I'm setting it to match the value for the res model for now until we improve model performance and get a better sense of our worst-case times.