Update dbt documentation to document Python models (#501)
* Update dbt documentation to document Python models

* Small edits to new-dbt-model.md

* Fix link in `new-dbt-model.md`

* Small typo in `new-dbt-model.md`

* Clean up Python model dep instructions in new-dbt-model.md

* Remove unnecessary materialization from reporting.ratio_stats

* Update new-dbt-model issue template to clarify that only SQL models need materialization

* Revert "Remove unnecessary materialization from reporting.ratio_stats"

This reverts commit 22226d2.
jeancochrane authored Jun 12, 2024
1 parent b3bc8e2 commit fd514fa
Showing 2 changed files with 136 additions and 35 deletions.
`.github/ISSUE_TEMPLATE/new-dbt-model.md` — 63 additions, 35 deletions
@@ -16,6 +16,9 @@ _(Brief description of the task here.)_

* **Name**: _(What should the model be called? See [Model
naming](/ccao-data/data-architecture/tree/master/dbt#model-naming) for guidance.)_
* **Model type**: _(SQL or Python? See [Model type (SQL or
Python)](/ccao-data/data-architecture/tree/master/dbt#model-type-sql-or-python)
for guidance.)_
* **Materialization**: _(Should the model be a table or a view? See [Model
materialization](/ccao-data/data-architecture/tree/master/dbt#model-materialization) for
guidance.)_
@@ -35,14 +38,17 @@ Otherwise, delete it in favor of the long checklist in the following section.)_
- [ ] Confirm that a subdirectory for this model's database exists in
the `dbt/models/` directory, and if not, create one, add a new `schema.yml`
file, and update `dbt_project.yml` to document the `+schema`
- [ ] Define the SQL query or Python script that creates the model in the model
subdirectory, following any existing file naming schema
- [ ] Use `source()` and `ref()` to reference other models where possible
- [ ] _[SQL models only]_ Optionally configure model materialization in the
query file
- [ ] Update the `schema.yml` file in the subfolder of `dbt/models/` to point
to the new model definition
- [ ] _[Python models only]_ Configure any third-party pure Python packages
- [ ] Add tests to the model schema definition in `schema.yml`
- [ ] _[SQL models only]_ If your model definition requires any new macros, make
sure those macros are tested in `dbt/macros/tests/test_all.sql`
- [ ] Commit your changes to a branch and open a pull request

## Checklist
@@ -78,29 +84,32 @@ models:
+schema: census
```
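(The example above is partially collapsed in the diff view; a `dbt_project.yml`
schema entry generally looks something like the sketch below, where the project
key is illustrative, not taken from this commit.)

```yaml
# Hypothetical dbt_project.yml excerpt -- the project key ("data_architecture")
# is illustrative; use the project's actual name
models:
  data_architecture:
    census:
      +schema: census
```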
- [ ] Define the SQL query or Python script that creates the model in the
appropriate subfolder of the `dbt/models/` directory. For example, if you're
adding a view to the `default` schema, then the model definition file should
live in `dbt/models/default`. The file should have the same name as the model
that appears in Athena. A period in the model name should separate the
entity name from the database namespace (e.g. `default.vw_pin_universe.sql`).
All views should have a name prefixed with `vw_`.

```bash
# SQL view example
touch dbt/models/default/default.vw_new_model.sql
# SQL table example
touch dbt/models/proximity/proximity.new_model.sql
# Python model example
touch dbt/models/proximity/proximity.new_model.py
```

- [ ] Use
[`source()`](https://docs.getdbt.com/reference/dbt-jinja-functions/source)
and [`ref()`](https://docs.getdbt.com/reference/dbt-jinja-functions/ref) to
reference other models where possible in your query or script.

```sql
-- SQL view or table example
-- Either dbt/models/default/default.vw_new_model.sql
-- or dbt/models/default/default.new_model.sql
select pin10, year
from {{ source('raw', 'foobar') }}
join {{ ref('default.vw_pin_universe') }}
using (pin10, year)
```

```python
# Python model example
# dbt/models/default/default.new_model.py
import pandas as pd


def model(dbt, spark_session):
    raw_foobar = dbt.source("raw", "foobar")
    vw_pin_universe = dbt.ref("default.vw_pin_universe")
    result = pd.merge(raw_foobar, vw_pin_universe, on=["pin10", "year"])
    dbt.write(result[["pin10", "year"]])
```

- [ ] _[SQL models only]_ Optionally configure model materialization. If the
  output of the query should be a view, no action is necessary, since the
  default for all models in this repository is to materialize as views; but if
  the output should be a table, with table data stored in S3, then you'll need
  to add a config block to the top of the query file to configure
  materialization.

```diff
# Table example
--- dbt/models/default/default.new_model.sql
+++ dbt/models/default/default.new_model.sql
+ {{
+   config(
+     materialized='table',
+     partitioned_by=['year'],
+     bucketed_by=['pin10'],
+     bucket_count=1
+   )
+ }}
select pin10, year
from {{ source('raw', 'foobar') }}
join {{ ref('default.vw_pin_universe') }}
using (pin10, year)
```
- [ ] Update the `schema.yml` file in the subfolder of `dbt/models/` to
  document the new entities that your model creates
  (models, sources, columns, etc). See
  [Model description](/ccao-data/data-architecture/tree/master/dbt#model-description)
  and [Column descriptions](/ccao-data/data-architecture/tree/master/dbt#column-descriptions)
  for specific guidance on doc locations and using docs blocks.

```diff
# Table example (only the model name would change for a view)
# (middle of the example collapsed in the diff view)
data_tests:
```
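Since the middle of that example is collapsed in the diff view, here is a
hedged sketch of what a complete `schema.yml` documentation entry might look
like (the model name, doc blocks, and columns are illustrative, not from this
commit):

```yaml
# Hypothetical schema.yml entry -- model name, doc blocks, and columns are
# illustrative
models:
  - name: default.new_model
    description: '{{ doc("new_model") }}'
    columns:
      - name: pin10
        description: '{{ doc("column_pin10") }}'
      - name: year
        description: '{{ doc("column_year") }}'
```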

- [ ] _[Python models only]_ If you need any third-party pure Python packages
that are not [preinstalled in the Athena PySpark
environment](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html),
follow [the docs for configuring Python model
dependencies](/ccao-data/data-architecture/tree/master/dbt#a-note-on-third-party-pure-python-dependencies-for-python-models).

- [ ] Add tests to your new model definition in `schema.yml`.

```diff
# (example collapsed in the diff view)
data_tests:
```
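This example is also collapsed in the diff view; a sketch of column-level
tests might look like the following (column names are illustrative; `not_null`
is a dbt built-in test):

```yaml
# Hypothetical test configuration -- column names are illustrative
models:
  - name: default.new_model
    columns:
      - name: pin10
        data_tests:
          - not_null
      - name: year
        data_tests:
          - not_null
```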

- [ ] _[SQL models only]_ If your model definition requires any new macros, make
sure those macros are tested in `dbt/macros/tests/test_all.sql`. If any tests
need implementing, follow the pattern set by existing tests to implement them.

- [ ] Commit your changes to a branch and open a pull request to build your
model and run tests in a CI environment.
`dbt/README.md` — 73 additions, 0 deletions
@@ -331,6 +331,79 @@ to the DAG.
There are a few subtleties to consider when requesting a new model, outlined
below.

### Model type (SQL or Python)

We default to SQL models, since they are simple and well-supported, but in
some cases we make use of [Python
models](https://docs.getdbt.com/docs/build/python-models) instead.
Prefer a Python model if all of the following conditions are true:

* The model requires complex transformations that are simpler to express using
pandas than using SQL
* The model only depends on (i.e. joins to) other models materialized as tables,
and does not depend on any models materialized as views
* The model's pandas code only imports third-party packages that are either
[preinstalled in the Athena PySpark
environment](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html)
or that are pure Python (i.e. that do not include any C extensions or code in
other languages)
  * The most common packages that we need that are _not_ pure Python are
    geospatial analysis packages like `geopandas`
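
For reference, a Python model is just a file that defines a `model()`
function. The sketch below mirrors the example in the issue template elsewhere
in this commit; the `ref()` names and columns are illustrative:

```python
# Minimal sketch of a dbt Python model -- ref() names and columns are
# illustrative
import pandas as pd


def model(dbt, spark_session):
    # Load upstream models; per the conditions above, Python models should
    # only depend on models materialized as tables
    foo = dbt.ref("default.foo")
    bar = dbt.ref("default.bar")
    # Transformations that would be awkward in SQL are straightforward in pandas
    result = pd.merge(foo, bar, on=["pin10", "year"])
    dbt.write(result)
```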

#### A note on third-party pure Python dependencies for Python models

If your Python model needs to use a third-party pure Python package that is not
[preinstalled in the Athena PySpark
environment](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html),
you can configure the dependency to be automatically deployed to our S3 bucket
that stores PySpark dependencies as part of the dbt build workflow on GitHub
Actions. Follow these steps to include your dependency:

1. Update the `config.packages` array on your model definition in your
model's `schema.yml` file to add elements for each of the packages
you want to install
   * Make sure to provide a specific version for each package so that our
     builds are deterministic
   * Unlike a typical `pip install` call, the dependency resolver will _not_
     automatically install your dependency's own dependencies, so check the
     dependency's documentation to see whether you need to manually specify
     any transitive dependencies for your dependency to work (see the second
     example below)

```yaml
# Example -- replace `database_name.table_name` with your model name,
# `dependency_name` with your dependency name, and `X.Y.Z` with the version
# of the dependency you want to install
models:
  - name: database_name.table_name
    config:
      packages:
        - "dependency_name==X.Y.Z"
```
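
As a concrete instance of the transitive-dependency caveat above, a package
whose own dependency is also pure Python would need both pinned explicitly
(the package names below are illustrative):

```yaml
# Hypothetical example -- `somepackage` depends on `anotherpackage`, which
# must be pinned manually because transitive dependencies are not resolved
models:
  - name: database_name.table_name
    config:
      packages:
        - "somepackage==1.2.3"
        - "anotherpackage==4.5.6"
```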
2. Add an `sc.addPyFile` call to the top of the Python code that represents your
model's query definition so that PySpark will make the dependency available
in the context of your code

```python
# Example -- replace `dependency_name` with your dependency name and `X.Y.Z`
# with the version of the dependency you want to import
# type: ignore
sc.addPyFile(  # noqa: F821
    "s3://ccao-athena-dependencies-us-east-1/dependency_name==X.Y.Z.zip"
)
```

3. Call `import dependency_name` as normal in your script to make use of the
dependency

```python
# Example -- replace `dependency_name` with your dependency name
import dependency_name
```

See the `reporting.ratio_stats` model for an example of this type of
configuration.

### Model materialization

There are a number of different ways of materializing tables in Athena
