Update dbt documentation to document Python models (#501)
* Update dbt documentation to document Python models

* Small edits to new-dbt-model.md

* Fix link in `new-dbt-model.md`

* Small typo in `new-dbt-model.md`

* Clean up Python model dep instructions in new-dbt-model.md

* Remove unnecessary materialization from reporting.ratio_stats

* Update new-dbt-model issue template to clarify that only SQL models need materialization

* Revert "Remove unnecessary materialization from reporting.ratio_stats"

This reverts commit 22226d2.
jeancochrane authored Jun 12, 2024
1 parent b3bc8e2 commit fd514fa
Showing 2 changed files with 136 additions and 35 deletions.
`.github/ISSUE_TEMPLATE/new-dbt-model.md` — 63 additions, 35 deletions
@@ -16,6 +16,9 @@ _(Brief description of the task here.)_

* **Name**: _(What should the model be called? See [Model
naming](/ccao-data/data-architecture/tree/master/dbt#model-naming) for guidance.)_
* **Model type**: _(SQL or Python? See [Model type (SQL or
Python)](/ccao-data/data-architecture/tree/master/dbt#model-type-sql-or-python)
for guidance.)_
* **Materialization**: _(Should the model be a table or a view? See [Model
materialization](/ccao-data/data-architecture/tree/master/dbt#model-materialization) for
guidance.)_
@@ -35,14 +38,17 @@ Otherwise, delete it in favor of the long checklist in the following section.)_
- [ ] Confirm that a subdirectory for this model's database exists in
the `dbt/models/` directory, and if not, create one, add a new `schema.yml`
file, and update `dbt_project.yml` to document the `+schema`
- [ ] Define the SQL query or Python script that creates the model in the model
subdirectory, following any existing file naming schema
- [ ] Use `source()` and `ref()` to reference other models where possible
- [ ] _[SQL models only]_ Optionally configure model materialization in the
query file
- [ ] Update the `schema.yml` file in the subfolder of `dbt/models/` to point
to the new model definition
- [ ] _[Python models only]_ Configure any third-party pure Python packages
- [ ] Add tests to the model schema definition in `schema.yml`
- [ ] _[SQL models only]_ If your model definition requires any new macros, make
sure those macros are tested in `dbt/macros/tests/test_all.sql`
- [ ] Commit your changes to a branch and open a pull request

## Checklist
@@ -78,29 +84,32 @@ models:
+schema: census
```
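(The example above is partially collapsed in the diff view; a `dbt_project.yml`
schema entry generally looks something like the sketch below, where the project
key is illustrative, not taken from this commit.)

```yaml
# Hypothetical dbt_project.yml excerpt -- the project key ("data_architecture")
# is illustrative; use the project's actual name
models:
  data_architecture:
    census:
      +schema: census
```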
- [ ] Define the SQL query or Python script that creates the model in the
appropriate subfolder of the `dbt/models/` directory. For example, if you're
adding a view to the `default` schema, then the model definition file should
live in `dbt/models/default`. The file should have the same name as the model
that appears in Athena. A period in the model name should separate the
entity name from the database namespace (e.g. `default.vw_pin_universe.sql`).
All views should have a name prefixed with `vw_`.

```bash
# SQL view example
touch dbt/models/default/default.vw_new_model.sql
# SQL table example
touch dbt/models/proximity/proximity.new_model.sql
# Python model example
touch dbt/models/proximity/proximity.new_model.py
```

- [ ] Use
[`source()`](https://docs.getdbt.com/reference/dbt-jinja-functions/source)
and [`ref()`](https://docs.getdbt.com/reference/dbt-jinja-functions/ref) to
reference other models where possible in your query or script.

```sql
-- SQL view or table example
-- Either dbt/models/default/default.vw_new_model.sql
-- or dbt/models/default/default.new_model.sql
select pin10, year
from {{ source('raw', 'foobar') }}
join {{ ref('default.vw_pin_universe') }}
using (pin10, year)
```

```python
# Python model example
# dbt/models/default/default.new_model.py
import pandas as pd


def model(dbt, spark_session):
    raw_foobar = dbt.source("raw", "foobar")
    vw_pin_universe = dbt.ref("default.vw_pin_universe")
    result = pd.merge(raw_foobar, vw_pin_universe, on=["pin10", "year"])
    dbt.write(result[["pin10", "year"]])
```

- [ ] _[SQL models only]_ Optionally configure model materialization. If the
  output of the query should be a view, no action is necessary, since the
  default for all models in this repository is to materialize as views; but if
  the output should be a table, with table data stored in S3, then you'll need
  to add a config block to the top of the query file to configure
  materialization.

```diff
# Table example
--- dbt/models/default/default.new_model.sql
+++ dbt/models/default/default.new_model.sql
+ {{
+   config(
+     materialized='table',
+     partitioned_by=['year'],
+     bucketed_by=['pin10'],
+     bucket_count=1
+   )
+ }}
select pin10, year
from {{ source('raw', 'foobar') }}
join {{ ref('default.vw_pin_universe') }}
using (pin10, year)
```
- [ ] Update the `schema.yml` file in the subfolder of `dbt/models/` to
  document the new entities that your model creates
  (models, sources, columns, etc). See
  [Model description](/ccao-data/data-architecture/tree/master/dbt#model-description)
  and [Column descriptions](/ccao-data/data-architecture/tree/master/dbt#column-descriptions)
  for specific guidance on doc locations and using docs blocks.

```diff
# Table example (only the model name would change for a view)
# (middle of the example collapsed in the diff view)
data_tests:
```
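Since the middle of that example is collapsed in the diff view, here is a
hedged sketch of what a complete `schema.yml` documentation entry might look
like (the model name, doc blocks, and columns are illustrative, not from this
commit):

```yaml
# Hypothetical schema.yml entry -- model name, doc blocks, and columns are
# illustrative
models:
  - name: default.new_model
    description: '{{ doc("new_model") }}'
    columns:
      - name: pin10
        description: '{{ doc("column_pin10") }}'
      - name: year
        description: '{{ doc("column_year") }}'
```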

- [ ] _[Python models only]_ If you need any third-party pure Python packages
that are not [preinstalled in the Athena PySpark
environment](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html),
follow [the docs for configuring Python model
dependencies](/ccao-data/data-architecture/tree/master/dbt#a-note-on-third-party-pure-python-dependencies-for-python-models).

- [ ] Add tests to your new model definition in `schema.yml`.

```diff
# (example collapsed in the diff view)
data_tests:
```
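This example is also collapsed in the diff view; a sketch of column-level
tests might look like the following (column names are illustrative; `not_null`
is a dbt built-in test):

```yaml
# Hypothetical test configuration -- column names are illustrative
models:
  - name: default.new_model
    columns:
      - name: pin10
        data_tests:
          - not_null
      - name: year
        data_tests:
          - not_null
```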

- [ ] _[SQL models only]_ If your model definition requires any new macros, make
sure those macros are tested in `dbt/macros/tests/test_all.sql`. If any tests
need implementing, follow the pattern set by existing tests to implement them.

- [ ] Commit your changes to a branch and open a pull request to build your
model and run tests in a CI environment.
`dbt/README.md` — 73 additions, 0 deletions
@@ -331,6 +331,79 @@ to the DAG.
There are a few subtleties to consider when requesting a new model, outlined
below.

### Model type (SQL or Python)

We default to SQL models, since they are simple and well-supported, but in
some cases we make use of [Python
models](https://docs.getdbt.com/docs/build/python-models) instead.
Prefer a Python model if all of the following conditions are true:

* The model requires complex transformations that are simpler to express using
pandas than using SQL
* The model only depends on (i.e. joins to) other models materialized as tables,
and does not depend on any models materialized as views
* The model's pandas code only imports third-party packages that are either
[preinstalled in the Athena PySpark
environment](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html)
or that are pure Python (i.e. that do not include any C extensions or code in
other languages)
  * The most common packages that we need that are _not_ pure Python are
    geospatial analysis packages like `geopandas`
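
For reference, a Python model is just a file that defines a `model()`
function. The sketch below mirrors the example in the issue template elsewhere
in this commit; the `ref()` names and columns are illustrative:

```python
# Minimal sketch of a dbt Python model -- ref() names and columns are
# illustrative
import pandas as pd


def model(dbt, spark_session):
    # Load upstream models; per the conditions above, Python models should
    # only depend on models materialized as tables
    foo = dbt.ref("default.foo")
    bar = dbt.ref("default.bar")
    # Transformations that would be awkward in SQL are straightforward in pandas
    result = pd.merge(foo, bar, on=["pin10", "year"])
    dbt.write(result)
```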

#### A note on third-party pure Python dependencies for Python models

If your Python model needs to use a third-party pure Python package that is not
[preinstalled in the Athena PySpark
environment](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html),
you can configure the dependency to be automatically deployed to our S3 bucket
that stores PySpark dependencies as part of the dbt build workflow on GitHub
Actions. Follow these steps to include your dependency:

1. Update the `config.packages` array on your model definition in your
model's `schema.yml` file to add elements for each of the packages
you want to install
   * Make sure to provide a specific version for each package so that our
     builds are deterministic
   * Unlike a typical `pip install` call, the dependency resolver will _not_
     automatically install your dependency's own dependencies, so check the
     dependency's documentation to see whether you need to manually specify
     any transitive dependencies for your dependency to work (see the second
     example below)

```yaml
# Example -- replace `database_name.table_name` with your model name,
# `dependency_name` with your dependency name, and `X.Y.Z` with the version
# of the dependency you want to install
models:
  - name: database_name.table_name
    config:
      packages:
        - "dependency_name==X.Y.Z"
```
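
As a concrete instance of the transitive-dependency caveat above, a package
whose own dependency is also pure Python would need both pinned explicitly
(the package names below are illustrative):

```yaml
# Hypothetical example -- `somepackage` depends on `anotherpackage`, which
# must be pinned manually because transitive dependencies are not resolved
models:
  - name: database_name.table_name
    config:
      packages:
        - "somepackage==1.2.3"
        - "anotherpackage==4.5.6"
```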
2. Add an `sc.addPyFile` call to the top of the Python code that represents your
model's query definition so that PySpark will make the dependency available
in the context of your code

```python
# Example -- replace `dependency_name` with your dependency name and `X.Y.Z`
# with the version of the dependency you want to import
# type: ignore
sc.addPyFile(  # noqa: F821
    "s3://ccao-athena-dependencies-us-east-1/dependency_name==X.Y.Z.zip"
)
```

3. Call `import dependency_name` as normal in your script to make use of the
dependency

```python
# Example -- replace `dependency_name` with your dependency name
import dependency_name
```

See the `reporting.ratio_stats` model for an example of this type of
configuration.

### Model materialization

There are a number of different ways of materializing tables in Athena
