Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt/BigQuery: Incomplete Column Level Lineage when using deduplication macros #11670

Open
Starkie opened this issue Oct 18, 2024 · 1 comment
Labels
bug Bug report

Comments

@Starkie
Copy link
Contributor

Starkie commented Oct 18, 2024

Hey all!

Describe the bug
We have a dbt core project that targets a BigQuery data warehouse. When ingesting the metadata into DataHub with include_column_lineage=true and prefer_sql_parser_lineage=true, the column-level lineage (CLL) is missing for some of the datasets. The table-level lineage is fine for all of them.

We've tracked down the issue to a row deduplication macro from dbt_utils. The problem seems to be in the generated SQL code:

select unique.*
from (
     select
         array_agg (
                 original
                     order by article_name desc
            limit 1
         )[offset(0)] unique
     from all_articles original
     group by id
)

When replacing it with a simple SELECT * FROM all_articles statement, the CLL is generated correctly.

We're not sure if this is specific to DataHub or should be reported to sqlglot instead. Let me know and I can create the issue there.

To Reproduce
We have created a repository with a small dbt project to reproduce it: https://github.com/Starkie/datahub-dbt-lineage-repro

Steps to reproduce the behavior:

  1. Setup a dbt project that targets BigQuery
  2. Add a model whose last statement has the following code. Replace the values between <> with the correct ones for your model:
select unique.*
from (
     select
         array_agg (
                 original
                     order by <table_column> desc
            limit 1
         )[offset(0)] unique
     from <table_name> original
     group by id
)
  1. Execute dbt with DataHub's recommended commands:
dbt source snapshot-freshness
dbt build
cp target/run_results.json target/run_results_backup.json
dbt docs generate
cp target/run_results_backup.json target/run_results.json
  1. Execute a dbt DataHub ingestion recipe that includes column-level lineage and uses the SQL parser for lineage.
pipeline_name: "kafka-inventory-01-dbt"

source:
  type: "dbt"
  config:
    env: dev

    # Coordinates
    manifest_path: "./target/manifest.json"
    catalog_path: "./target/catalog.json"
    run_results_paths: ["./target/run_results.json"]

    # Options
    convert_column_urns_to_lowercase: true
    target_platform: "bigquery"
    skip_sources_in_lineage: true
    prefer_sql_parser_lineage: true
    include_column_lineage: true
    infer_dbt_schemas: true

Expected behavior
We expected the column-level lineage between the source table and the final model to be generated.

01-expected-lineage

Actual behavior
02-actual-lineage

Desktop (please complete the following information):

  • OS: Windows 11

  • Browser Firefox

  • Version 131.03

  • DataHub version: 0.14.1

  • DataHub CLI version: 0.14.1

@Starkie Starkie added the bug Bug report label Oct 18, 2024
@jjoyce0510
Copy link
Collaborator

Thank you for all the details! This is super helpful. We can try to reproduce but I think it's not a bad idea to raise this directly on SQL glot as well to ensure the parser can in general handle these types of statements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report
Projects
None yet
Development

No branches or pull requests

2 participants