Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dockerfile and build-and-run-model workflow for CI model runs #9

Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
06a7bc9
Simplify Pipenv dependencies
jeancochrane Nov 9, 2023
8eb4fa7
Add Dockerfile
jeancochrane Nov 9, 2023
36e84e5
Add build-and-run-model workflow
jeancochrane Nov 9, 2023
1d8bb4a
Temporarily pin build-and-run-model to development version of build-a…
jeancochrane Nov 9, 2023
299b40b
Temporarily restrict dvc repro to the train step, for testing
jeancochrane Nov 9, 2023
f2391db
Revert "Temporarily restrict dvc repro to the train step, for testing"
jeancochrane Nov 9, 2023
d9b0712
Unfreeze train step in dvc.yaml
jeancochrane Nov 9, 2023
7fff710
Fix typo in loc_cook_municipality_name feature in params.yaml
jeancochrane Nov 9, 2023
2315f8e
Update README to reflect change in municipality name feature
jeancochrane Nov 9, 2023
8bd4983
Lint README.Rmd
jeancochrane Nov 13, 2023
a5a64f5
Add print statements to finalize step to debug S3 failures
jeancochrane Nov 13, 2023
190210f
Add aws.ec2metadata packaage so that ECS can authenticate with S3
jeancochrane Nov 13, 2023
e11dfb1
Revert "Add print statements to finalize step to debug S3 failures"
jeancochrane Nov 13, 2023
82c3081
Pin build-and-run-model to main branch of ccao-data/actions
jeancochrane Nov 13, 2023
82d6155
Rename loc_cook_municipality_name back to loc_tax_municipality_name
jeancochrane Nov 13, 2023
e801405
Revert "Pin build-and-run-model to main branch of ccao-data/actions"
jeancochrane Nov 13, 2023
33f83a5
Add ingest dependencies to renv.lock
jeancochrane Nov 14, 2023
4756f53
Revert "Add ingest dependencies to renv.lock"
jeancochrane Nov 14, 2023
41ebdfb
Revert "Rename loc_cook_municipality_name back to loc_tax_municipalit…
jeancochrane Nov 14, 2023
a852f22
Revert "Revert "Pin build-and-run-model to main branch of ccao-data/a…
jeancochrane Nov 14, 2023
3f2114e
Revert "Revert "Revert "Pin build-and-run-model to main branch of cca…
jeancochrane Nov 14, 2023
fd2fcdd
Revert "Revert "Revert "Revert "Pin build-and-run-model to main branc…
jeancochrane Nov 14, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions .github/workflows/build-and-run-model.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Workflow that builds a Docker image containing the model code,
# pushes it to the GitHub Container Registry, and then optionally uses
# that container image to run the model using an AWS Batch job.
#
# Images are built on every commit to a PR or main branch in order to ensure
# that the build continues to work properly, but Batch jobs are gated behind
# a `deploy` environment that requires manual approval from a
# @ccao-data/core-team member.

name: build-and-run-model

on:
pull_request:
types: [opened, reopened, synchronize, closed]
workflow_dispatch:
push:
branches: [master]

jobs:
build-and-run-model:
permissions:
# contents:read and id-token:write permissions are needed to interact
# with GitHub's OIDC Token endpoint so that we can authenticate with AWS
contents: read
id-token: write
# While packages:write is usually not required for workflows, it is
# required in order to allow the reusable called workflow to push to
# GitHub Container Registry
packages: write
uses: ccao-data/actions/.github/workflows/build-and-run-batch-job.yaml@jeancochrane/add-batch-and-terraform-workflows-and-actions
with:
ref: jeancochrane/add-batch-and-terraform-workflows-and-actions
vcpu: "16.0"
memory: "65536"
role-duration-seconds: 14400 # Worst-case time for a full model run
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not correct, but I'm setting it to match the value for the res model for now until we improve model performance and get a better sense of our worst-case times.

secrets:
AWS_IAM_ROLE_TO_ASSUME_ARN: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
AWS_ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }}
45 changes: 45 additions & 0 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
FROM rocker/r-ver:4.3.1

# Use PPM for binary installs
ENV RENV_CONFIG_REPOS_OVERRIDE "https://packagemanager.posit.co/cran/__linux__/jammy/latest"
ENV RENV_PATHS_LIBRARY renv/library

# Install system dependencies
RUN apt-get update && apt-get install --no-install-recommends -y \
libcurl4-openssl-dev libssl-dev libxml2-dev libgit2-dev git \
libudunits2-dev python3-dev python3-pip libgdal-dev libgeos-dev \
libproj-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev pandoc

# Install pipenv for Python dependencies
RUN pip install pipenv

# Copy pipenv files into the image. The reason this is a separate step from
# the later step that adds files from the working directory is because we want
# to avoid having to reinstall dependencies every time a file in the directory
# changes, as Docker will bust the cache of every layer following a layer that
# needs to change
COPY Pipfile .
COPY Pipfile.lock .

# Install Python dependencies
RUN pipenv install --system --deploy

# Copy R bootstrap files into the image
COPY renv.lock .
COPY .Rprofile .
COPY renv/ renv/

# Install R dependencies
RUN Rscript -e 'renv::restore()'

# Copy the directory into the container
ADD ./ model-condo-avm/

# Copy R dependencies into the app directory
RUN rm -Rf model-condo-avm/renv
RUN mv renv model-condo-avm/

# Set the working directory to the app dir
WORKDIR model-condo-avm/

CMD dvc pull && dvc repro
6 changes: 1 addition & 5 deletions Pipfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,7 @@ verify_ssl = true
name = "pypi"

[packages]
numpy = "*"
pandas = "*"
scipy = "*"
scikit-learn = "*"
dvc = "*"
"dvc[s3]" = "*"

[dev-packages]

Expand Down
2,087 changes: 1,131 additions & 956 deletions Pipfile.lock

Large diffs are not rendered by default.

16 changes: 7 additions & 9 deletions README.Rmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: "Table of Contents"
output:
output:
dfsnow marked this conversation as resolved.
Show resolved Hide resolved
github_document:
toc: true
toc_depth: 3
Expand Down Expand Up @@ -46,7 +46,7 @@ The repository itself contains the [code](./pipeline) and [data](./input) for th

## Differences Compared to the Residential Model

The Cook County Assessor's Office ***does not track characteristic data for condominiums***. Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics.
The Cook County Assessor's Office ***does not track characteristic data for condominiums***. Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics.

The only information our office has about individual condominium units is their age, location, sale date/price, and percentage of ownership. This makes modeling condos particularly challenging, as the number of usable features is quite small. Fortunately, condos have two qualities which make modeling a bit easier:

Expand Down Expand Up @@ -106,11 +106,9 @@ ccao::vars_dict %>%
)
) %>%
mutate(`Unique to Condo Model` = ifelse(
var_name_model != "loc_tax_municipality_name" & (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need this conditional now that both repos are (temporarily) using loc_cook_municipality_name.

var_name_model %in% condo_unique_preds |
`Feature Name` %in%
c("Condominium Building Year Built", "Condominium % Ownership")
),
var_name_model %in% condo_unique_preds |
`Feature Name` %in%
c("Condominium Building Year Built", "Condominium % Ownership"),
"X", ""
)) %>%
arrange(desc(`Unique to Condo Model`), Category) %>%
Expand All @@ -124,7 +122,7 @@ For the most part, condos are valued the same way as single- and multi-family re

However, because the CCAO has so [little information about individual units](#differences-compared-to-the-residential-model), we must rely on the [condominium percentage of ownership](#features-used) to differentiate between units in a building. This feature is effectively the proportion of the building's overall value held by a unit. It is created when a condominium declaration is filed with the County (usually by the developer of the building). The critical assumption underlying the condo valuation process is that percentage of ownership correlates with current market value.

Percentage of ownership is used in two ways:
Percentage of ownership is used in two ways:

1. It is used directly as a predictor/feature in the regression model to estimate differing unit values within the same building.
2. It is used to reapportion unit values directly i.e. the value of a unit is ultimately equal to `% of ownership * total building value`.
Expand Down Expand Up @@ -186,7 +184,7 @@ The condo model relies on sales within the same building to calculate [strata](#

Fortunately, buildings without any recent sales are relatively rare, as condos have a higher turnover rate than single and multi-family property. Smaller buildings with low turnover are the most likely to not have recent sales.

### Buildings Without Sales
### Buildings Without Sales

When no sales have occurred in a building in the 5 years prior to assessment, the building's strata features are imputed. The model will look at nearby buildings that have similar unit counts/age and then try to assign an appropriate strata to the target building.

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -152,12 +152,12 @@ the 2023 assessment model.
| Percent Population Mobility, Moved From Within Same County in Past Year | acs5 | numeric | |
| Longitude | loc | numeric | |
| Latitude | loc | numeric | |
| Municipality Name | loc | character | |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an artifact of removing the var_name_model != "loc_tax_municipality_name" conditional in README.Rmd above. I'm not sure why it changes the order of the feature in the table, but it represents the reverse of what happened in #4.

| FEMA Special Flood Hazard Area | loc | logical | |
| First Street Factor | loc | numeric | |
| First Street Risk Direction | loc | numeric | |
| School Elementary District GEOID | loc | character | |
| School Secondary District GEOID | loc | character | |
| Municipality Name | loc | character | |
| CMAP Walkability Score (No Transit) | loc | numeric | |
| CMAP Walkability Total Score | loc | numeric | |
| Airport Noise DNL | loc | numeric | |
Expand Down
2 changes: 1 addition & 1 deletion dvc.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ stages:
cache: false
- output/workflow/recipe/model_workflow_recipe.rds:
cache: false
frozen: true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this step was erroneously marked as frozen; we don't cache intermediate outputs so freezing the train step leads to an error. Billy seemed to think this was reasonable but let me know if I'm off base!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely an error. The only step that should really ever be frozen is the ingest step.


assess:
cmd: Rscript pipeline/02-assess.R
desc: >
Expand Down
4 changes: 2 additions & 2 deletions params.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -178,7 +178,7 @@ model:
"char_full_baths",
"loc_longitude",
"loc_latitude",
"loc_tax_municipality_name",
"loc_cook_municipality_name",
"loc_env_flood_fema_sfha",
"loc_env_flood_fs_factor",
"loc_env_flood_fs_risk_direction",
Expand Down Expand Up @@ -244,7 +244,7 @@ model:
categorical: [
"meta_township_code",
"meta_nbhd_code",
"loc_tax_municipality_name",
"loc_cook_municipality_name",
"loc_school_elementary_district_geoid",
"loc_school_secondary_district_geoid",
"time_sale_quarter_of_year",
Expand Down
2 changes: 1 addition & 1 deletion pipeline/01-train.R
Original file line number Diff line number Diff line change
Expand Up @@ -347,7 +347,7 @@ test %>%
select(
meta_year, meta_pin, meta_pin10, meta_class, meta_card_num, meta_lline_num,
meta_triad_code, meta_township_code, meta_nbhd_code,
loc_tax_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_cook_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_census_tract_geoid, loc_school_elementary_district_geoid,
loc_school_secondary_district_geoid, loc_school_unified_district_geoid,
char_unit_sf, char_building_sf, char_full_baths, char_half_baths,
Expand Down
2 changes: 1 addition & 1 deletion pipeline/02-assess.R
Original file line number Diff line number Diff line change
Expand Up @@ -293,7 +293,7 @@ assessment_data_pin <- assessment_data_merged %>%
# Keep locations, prior year values, and indicators
loc_longitude, loc_latitude,
starts_with(c(
"loc_property_", "loc_tax_", "loc_ward_", "loc_chicago_",
"loc_property_", "loc_cook_", "loc_ward_", "loc_chicago_",
"loc_census", "loc_school_", "prior_", "ind_"
)),
meta_pin10_5yr_num_sale,
Expand Down
6 changes: 3 additions & 3 deletions pipeline/03-evaluate.R
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ col_rename_dict <- c(
"geography_id" = "meta_township_code",
"geography_id" = "meta_township_name",
"geography_id" = "meta_nbhd_code",
"geography_id" = "loc_tax_municipality_name",
"geography_id" = "loc_cook_municipality_name",
"geography_id" = "loc_ward_num",
"geography_id" = "loc_census_puma_geoid",
"geography_id" = "loc_census_tract_geoid",
Expand Down Expand Up @@ -80,7 +80,7 @@ if (run_type == "full") {
assessment_data_pin <- read_parquet(paths$output$assessment_pin$local) %>%
select(
meta_pin, meta_class, meta_triad_code, meta_township_code, meta_nbhd_code,
loc_tax_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_cook_municipality_name, loc_ward_num, loc_census_puma_geoid,
loc_census_tract_geoid, loc_school_elementary_district_geoid,
loc_school_secondary_district_geoid, loc_school_unified_district_geoid,
char_total_bldg_sf, char_unit_sf, prior_far_tot, prior_near_tot,
Expand Down Expand Up @@ -343,7 +343,7 @@ gen_agg_stats_quantile <- function(data, truth, estimate,
# class or no class option for each level
geographies_quosures <- rlang::quos(
meta_township_code,
meta_nbhd_code, loc_tax_municipality_name,
meta_nbhd_code, loc_cook_municipality_name,
loc_ward_num, loc_census_puma_geoid, loc_census_tract_geoid,
loc_school_elementary_district_geoid, loc_school_secondary_district_geoid,
loc_school_unified_district_geoid,
Expand Down
1 change: 1 addition & 0 deletions pipeline/05-finalize.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
suppressPackageStartupMessages({
library(arrow)
library(aws.s3)
library(aws.ec2metadata)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discovered in ccao-data/model-res-avm#26, this package is necessary in order to allow the aws.s3 package to authenticate using credentials in an ECS environment.

library(ccao)
library(dplyr)
library(here)
Expand Down
11 changes: 11 additions & 0 deletions renv.lock
Original file line number Diff line number Diff line change
Expand Up @@ -172,6 +172,17 @@
],
"Hash": "89761c5b8e6ce66e7176cbb70a892df8"
},
"aws.ec2metadata": {
"Package": "aws.ec2metadata",
"Version": "0.2.0",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
"curl",
"jsonlite"
],
"Hash": "2e630add89f47540d9da2240710e2702"
},
"aws.s3": {
"Package": "aws.s3",
"Version": "0.3.21",
Expand Down