dbt/BigQuery: Incomplete Column Level Lineage when using deduplication macros #11670

Starkie · 2024-10-18T12:34:16Z

Hey all!

Describe the bug
We have a dbt core project that targets a BigQuery data warehouse. When ingesting the metadata into DataHub with include_column_lineage=true and prefer_sql_parser_lineage=true, the column-level lineage (CLL) is missing for some of the datasets. The table-level lineage is fine for all of them.

We've tracked down the issue to a row deduplication macro from dbt_utils. The problem seems to be in the generated SQL code:

select unique.*
from (
     select
         array_agg (
                 original
                     order by article_name desc
            limit 1
         )[offset(0)] unique
     from all_articles original
     group by id
)

When replacing it with a simple SELECT * FROM all_articles statement, the CLL is generated correctly.

We're not sure if this is specific to DataHub or should be reported to sqlglot instead. Let me know and I can create the issue there.

To Reproduce
We have created a repository with a small dbt project to reproduce it: https://github.com/Starkie/datahub-dbt-lineage-repro

Steps to reproduce the behavior:

Setup a dbt project that targets BigQuery
Add a model whose last statement has the following code. Replace the values between <> with the correct ones for your model:

select unique.*
from (
     select
         array_agg (
                 original
                     order by <table_column> desc
            limit 1
         )[offset(0)] unique
     from <table_name> original
     group by id
)

Execute dbt with DataHub's recommended commands:

dbt source snapshot-freshness
dbt build
cp target/run_results.json target/run_results_backup.json
dbt docs generate
cp target/run_results_backup.json target/run_results.json

Execute a dbt DataHub ingestion recipe that includes column-level lineage and uses the SQL parser for lineage.

pipeline_name: "kafka-inventory-01-dbt"

source:
  type: "dbt"
  config:
    env: dev

    # Coordinates
    manifest_path: "./target/manifest.json"
    catalog_path: "./target/catalog.json"
    run_results_paths: ["./target/run_results.json"]

    # Options
    convert_column_urns_to_lowercase: true
    target_platform: "bigquery"
    skip_sources_in_lineage: true
    prefer_sql_parser_lineage: true
    include_column_lineage: true
    infer_dbt_schemas: true

Expected behavior
We expected the column-level lineage between the source table and the final model to be generated.

Actual behavior

Desktop (please complete the following information):

OS: Windows 11
Browser Firefox
Version 131.03
DataHub version: 0.14.1
DataHub CLI version: 0.14.1

The text was updated successfully, but these errors were encountered:

jjoyce0510 · 2024-11-05T20:27:33Z

Thank you for all the details! This is super helpful. We can try to reproduce but I think it's not a bad idea to raise this directly on SQL glot as well to ensure the parser can in general handle these types of statements.

Starkie added the bug Bug report label Oct 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dbt/BigQuery: Incomplete Column Level Lineage when using deduplication macros #11670

dbt/BigQuery: Incomplete Column Level Lineage when using deduplication macros #11670

Starkie commented Oct 18, 2024

jjoyce0510 commented Nov 5, 2024

dbt/BigQuery: Incomplete Column Level Lineage when using deduplication macros #11670

dbt/BigQuery: Incomplete Column Level Lineage when using deduplication macros #11670

Comments

Starkie commented Oct 18, 2024

jjoyce0510 commented Nov 5, 2024