Implement microbatch incremental strategy #825

benc-db · 2024-10-11T22:09:18Z

Resolves #824

Description

Implements the microbatch incremental strategy: https://docs.getdbt.com/docs/build/incremental-microbatch

The core idea is that dbt determines slices of time and breaks one insert into multiple statements. We run a replace-where for each slice, so any existing data in that slice is replaced by its newest version. This makes backfills much easier for users, and on failure only the slices that failed need to be rerun.
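
To make the slicing concrete, here is a minimal Python sketch; it is not the adapter's implementation. The helper names `daily_slices` and `replace_where_clause`, the one-day batch size, and the `event_ts` column are all assumptions made for illustration. In the real strategy, dbt-core computes the batch boundaries.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def daily_slices(start: datetime, end: datetime) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [lower, upper) windows of at most one day (hypothetical helper)."""
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=1), end)
        yield cursor, upper
        cursor = upper


def replace_where_clause(event_time: str, lower: datetime, upper: datetime) -> str:
    """Render the condition that scopes one slice (illustrative, not the adapter's SQL)."""
    return (
        f"cast({event_time} as TIMESTAMP) >= '{lower}' "
        f"and cast({event_time} as TIMESTAMP) < '{upper}'"
    )


if __name__ == "__main__":
    # Each printed condition would scope one independent replace-where statement.
    for lower, upper in daily_slices(datetime(2024, 10, 1), datetime(2024, 10, 3, 12)):
        print(replace_where_clause("event_ts", lower, upper))
```

Because each slice is scoped by its own condition, a slice that fails can be rerun without touching the others.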

I have to cast the column to TIMESTAMP: if your event_time column is a date, Databricks casts the conditions to date, and the statement ends up looking like the following, which can never match a row:
replace where date >= X and date < X
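
The arithmetic behind that empty window can be sketched in plain Python. This only illustrates one way both bounds can land on the same date, assuming a batch window shorter than a day; it does not reproduce Databricks' actual casting rules, and the function names are hypothetical.

```python
from datetime import date, datetime
from typing import Tuple


def truncate_bounds(start: datetime, end: datetime) -> Tuple[date, date]:
    """Drop the time of day from both bounds, as a cast to DATE would (assumption)."""
    return start.date(), end.date()


def window_is_empty(lower, upper) -> bool:
    """A half-open window [lower, upper) holds no values when lower >= upper."""
    return lower >= upper


# Hypothetical 12-hour batch window.
start = datetime(2024, 10, 11, 0, 0)
end = datetime(2024, 10, 11, 12, 0)

if __name__ == "__main__":
    print(window_is_empty(start, end))                    # timestamp comparison
    print(window_is_empty(*truncate_bounds(start, end)))  # date comparison
```

Comparing as timestamps keeps the two bounds distinct, which is why the cast is applied to the column rather than letting the bounds be narrowed.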

I also hit an issue with column comments, which I think was introduced in dbt-core 1.9.0b2, and have fixed it here.

Checklist

I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

benc-db · 2024-10-15T16:39:50Z

dbt/include/databricks/macros/materializations/incremental/strategies.sql

+  {%- if end_time -%}
+    {%- do incremental_predicates.append("cast(" ~ event_time ~ " as TIMESTAMP) < '" ~ end_time ~ "'") -%}
+  {%- endif -%}
+  {%- do arg_dict.update({'incremental_predicates': incremental_predicates}) -%}


Prepares for the replace_where strategy by adding

cast(<event_time> as TIMESTAMP) >= <start_time> and cast(<event_time> as TIMESTAMP) < <end_time>

as an incremental predicate.
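
For readers less familiar with Jinja, the same predicate assembly can be sketched in Python. The function name `build_microbatch_predicates` is hypothetical, and the `start_time` branch is inferred from the description above, since the diff excerpt only shows the `end_time` branch.

```python
from typing import List, Optional


def build_microbatch_predicates(
    event_time: str,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
    existing: Optional[List[str]] = None,
) -> List[str]:
    """Append the batch bounds to any predicates already configured on the model."""
    predicates = list(existing or [])
    if start_time:
        # Lower bound is inclusive (inferred from the description, not shown in the diff).
        predicates.append(f"cast({event_time} as TIMESTAMP) >= '{start_time}'")
    if end_time:
        # Upper bound is exclusive, matching the macro line in the diff.
        predicates.append(f"cast({event_time} as TIMESTAMP) < '{end_time}'")
    return predicates


if __name__ == "__main__":
    print(" and ".join(
        build_microbatch_predicates("event_ts", "2024-10-01", "2024-10-02")
    ))
```

Keeping the upper bound exclusive means adjacent batches never overlap on a boundary timestamp.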

benc-db · 2024-10-15T16:41:12Z

dbt/adapters/databricks/impl.py

-                and columns[name]["description"] != (column.comment or "")
-            ):
-                return_columns[name] = columns[name]
+            if name in columns:


I'm not sure exactly what introduces this uncertainty, but I've observed experimentally that config_column is sometimes a dict and sometimes a ColumnInfo, and these types have different access methods for getting the description.
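
One way to tolerate both shapes is a small accessor that checks the type before reading the description. This is a sketch under assumptions, not the code in impl.py: `ColumnInfoStandIn` is a stand-in dataclass rather than dbt's real ColumnInfo, and `get_description` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ColumnInfoStandIn:
    """Stand-in for an object-style column config (not dbt's actual class)."""
    name: str
    description: str = ""


def get_description(config_column: Any) -> str:
    """Return the column description whether the config is a dict or an object."""
    if isinstance(config_column, dict):
        return config_column.get("description") or ""
    # Fall back to attribute access, tolerating a missing or None attribute.
    return getattr(config_column, "description", None) or ""


if __name__ == "__main__":
    print(get_description({"name": "id", "description": "primary key"}))
    print(get_description(ColumnInfoStandIn(name="id", description="primary key")))
```

Normalizing to an empty string keeps the comparison against an existing column comment simple when no description is set.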

benc-db · 2024-10-15T16:42:26Z

tests/functional/adapter/microbatch/fixtures.py

+        relation: True
+        columns: True
+    description: This is a microbatch model
+    columns:


Added my own schema with column comments to supplement the included tests, since column comments originally broke my implementation even though it passed the included functional tests.

jackyhu-db · 2024-10-15T17:29:20Z

dbt/adapters/databricks/__version__.py

@@ -1 +1 @@
-version: str = "1.8.7"
+version: str = "1.9.0b1"


Should the version bump be done in a separate PR?

I can if you would prefer. My reasoning is that this was simply a miss (the 1.9.latest branch should have had this version from the start), and doing it here saves another round of integration tests just to merge the version bump (which is its own issue that I should address at some point). When we release 1.9.0, that will get its own version PR.

benc-db requested review from andrefurlan-db and rcypher-databricks as code owners October 11, 2024 22:09

benc-db commented Oct 11, 2024

benc-db added 4 commits October 11, 2024 15:11

wip

6812460

fix failing test

fb62fac

cast

4834834

fix column issue

23d7283

benc-db force-pushed the microbatch_investigation branch from f5d145c to 23d7283 Compare October 11, 2024 22:11

benc-db had a problem deploying to azure-prod October 11, 2024 22:12 — with GitHub Actions Error

undo removal

bd4be24

benc-db temporarily deployed to azure-prod October 11, 2024 22:16 — with GitHub Actions Inactive

changelog

8993064

benc-db requested review from eric-wang-1990 and jackyhu-db October 15, 2024 16:37

benc-db commented Oct 15, 2024

up version to prepare for beta release

28e3cfd

benc-db had a problem deploying to azure-prod October 15, 2024 17:22 — with GitHub Actions Error

benc-db had a problem deploying to azure-prod October 15, 2024 17:22 — with GitHub Actions Failure

jackyhu-db approved these changes Oct 15, 2024

jackyhu-db reviewed Oct 15, 2024

check for attr presence

62b47db

benc-db had a problem deploying to azure-prod October 15, 2024 17:35 — with GitHub Actions Failure

benc-db temporarily deployed to azure-prod October 15, 2024 17:35 — with GitHub Actions Inactive

benc-db had a problem deploying to azure-prod October 15, 2024 17:56 — with GitHub Actions Failure

benc-db had a problem deploying to azure-prod October 15, 2024 18:26 — with GitHub Actions Failure

benc-db had a problem deploying to azure-prod October 15, 2024 19:58 — with GitHub Actions Failure

benc-db temporarily deployed to azure-prod October 15, 2024 20:07 — with GitHub Actions Inactive

benc-db temporarily deployed to azure-prod October 15, 2024 20:45 — with GitHub Actions Inactive

benc-db merged commit d0378d2 into 1.9.latest Oct 15, 2024
21 checks passed
