Implement microbatch incremental strategy #825

Merged
merged 8 commits into from
Oct 15, 2024

Conversation

benc-db
Collaborator

@benc-db benc-db commented Oct 11, 2024

Resolves #824

Description

Implements the microbatch incremental strategy: https://docs.getdbt.com/docs/build/incremental-microbatch

The core idea is that dbt splits an insert into time slices and issues one statement per slice; for each slice we run a replace-where, so any existing data in that window is replaced by the newest version of that data. This makes backfills much easier for users and, on failure, only the slices that failed need to be rerun.
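
As a rough illustration of the slicing idea (not dbt's actual batching code), the sketch below splits a time range into fixed-size windows and collects the ones that fail so that only those need a rerun. The names `batch_windows`, `run_batches`, and the `run_slice` callback are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterator


def batch_windows(
    start: datetime, end: datetime, size: timedelta
) -> Iterator[tuple[datetime, datetime]]:
    """Yield half-open [slice_start, slice_end) windows covering [start, end)."""
    if size <= timedelta(0):
        raise ValueError("size must be positive")
    current = start
    while current < end:
        upper = min(current + size, end)
        yield current, upper
        current = upper


def run_batches(
    windows: list[tuple[datetime, datetime]],
    run_slice: Callable[[datetime, datetime], None],
) -> list[tuple[datetime, datetime]]:
    """Run each window independently and return the windows that failed."""
    failed = []
    for lower, upper in windows:
        try:
            # In the real strategy this would be one replace-where statement.
            run_slice(lower, upper)
        except Exception:
            failed.append((lower, upper))
    return failed
```

Calling `run_batches` again with the returned list retries only the failed slices, which is the behaviour described above.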

I have to cast the column to TIMESTAMP because, if your event_time column is a date, Databricks casts the batch bounds to date as well, and the condition can end up looking like
replace where date >= X and date < X
which matches no rows.
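
To make the truncation effect concrete, here is a small Python sketch. It assumes an hourly batch (my assumption, chosen so that both bounds fall on the same day) and only mimics the date coercion described above; it does not call Databricks.

```python
from datetime import datetime

# Hypothetical bounds for a single hourly batch.
start_time = datetime(2024, 10, 11, 9, 0)
end_time = datetime(2024, 10, 11, 10, 0)

# If both bounds are coerced to a date, they collapse to the same value.
start_date, end_date = start_time.date(), end_time.date()


def in_truncated_window(event_date):
    """Mimics `date >= X and date < X`, which no value can satisfy."""
    return start_date <= event_date < end_date


def in_timestamp_window(event_ts):
    """Mimics comparing as TIMESTAMP, which keeps the window non-empty."""
    return start_time <= event_ts < end_time
```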

I also hit an issue with column comments that I think was introduced in dbt-core 1.9.0b2 that I have fixed here.

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

{%- if end_time -%}
{%- do incremental_predicates.append("cast(" ~ event_time ~ " as TIMESTAMP) < '" ~ end_time ~ "'") -%}
{%- endif -%}
{%- do arg_dict.update({'incremental_predicates': incremental_predicates}) -%}
Collaborator Author

Preps for replace_where strategy by adding

cast(<event_time> as TIMESTAMP) >= <start_time> and cast(<event_time> as TIMESTAMP) < <end_time>

as an incremental predicate.
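
For readers who prefer Python to Jinja, the following is a minimal sketch of the same predicate construction. The function name `microbatch_predicates` is hypothetical and this is not the adapter's code; the string format simply mirrors the macro lines shown in the diff.

```python
from typing import Optional


def microbatch_predicates(
    event_time: str,
    start_time: Optional[str],
    end_time: Optional[str],
) -> list[str]:
    """Build the TIMESTAMP-cast bounds for one batch; skip any missing bound."""
    predicates = []
    if start_time:
        predicates.append(f"cast({event_time} as TIMESTAMP) >= '{start_time}'")
    if end_time:
        predicates.append(f"cast({event_time} as TIMESTAMP) < '{end_time}'")
    return predicates
```

Joining the result with `" and "` gives the condition that a replace-where statement would use for that slice.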

and columns[name]["description"] != (column.comment or "")
):
return_columns[name] = columns[name]
if name in columns:
Collaborator Author

Not sure exactly what introduces this uncertainty, but I've experimentally observed that config_column is sometimes a dict and sometimes a ColumnInfo, and these types have different access methods for getting the description.
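
One way to handle both shapes is a small accessor like the sketch below. `FakeColumnInfo` is a stand-in defined here for illustration, not dbt's real ColumnInfo class, and `get_description` is a hypothetical helper rather than the code in this PR.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeColumnInfo:
    """Stand-in for an object-style column config (not dbt's ColumnInfo)."""

    name: str
    description: str = ""


def get_description(config_column: Any) -> str:
    """Return the description whether the column is a dict or an object."""
    if isinstance(config_column, dict):
        return config_column.get("description") or ""
    # Fall back to attribute access for object-style configs.
    return getattr(config_column, "description", None) or ""
```

Normalizing a missing description to an empty string keeps the comparison against `column.comment or ""` in the diff well defined.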

relation: True
columns: True
description: This is a microbatch model
columns:
Collaborator Author

Added my own schema with column comments to supplement the included tests, since column comments originally broke my implementation even though it passed the included functional tests.

@@ -1 +1 @@
-version: str = "1.8.7"
+version: str = "1.9.0b1"
Collaborator

Should the version bump be done in a separate PR?

Collaborator Author

I can if you would prefer. My reasoning is that this is really just a miss (the 1.9.latest branch should have had this from the start), and doing it here saves another round of integration tests running just to merge the version (which is its own issue that I should address at some point). When we release 1.9.0, that will be its own version PR.

2 participants