
Commit

Merge branch 'datahub-project:master' into master
anshbansal authored Oct 15, 2024
2 parents b847102 + 1eec2c4 commit df564b0
Showing 21 changed files with 408 additions and 172 deletions.
12 changes: 12 additions & 0 deletions docs-website/sidebars.js
@@ -113,6 +113,18 @@ module.exports = {
id: "docs/automations/snowflake-tag-propagation",
className: "saasOnly",
},
{
label: "AI Classification",
type: "doc",
id: "docs/automations/ai-term-suggestion",
className: "saasOnly",
},
{
label: "AI Documentation",
type: "doc",
id: "docs/automations/ai-docs",
className: "saasOnly",
},
],
},
{
132 changes: 65 additions & 67 deletions docs/api/datahub-apis.md

Large diffs are not rendered by default.

36 changes: 36 additions & 0 deletions docs/automations/ai-docs.md
@@ -0,0 +1,36 @@
import FeatureAvailability from '@site/src/components/FeatureAvailability';

# AI Documentation

<FeatureAvailability saasOnly />

:::info

This feature is currently in closed beta. Reach out to your Acryl representative to get access.

:::

With AI-powered documentation, you can automatically generate documentation for tables and columns.

<p align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/_7DieZeZspY?si=Q5FkCA0gZPEFMj0Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
</p>

## Configuring

No configuration is required - just hit "Generate" on any table or column in the UI.

## How it works

Generating good documentation requires a holistic understanding of the data. Information we take into account includes, but is not limited to, the following (see the sketch after this list):

- Dataset name and any existing documentation
- Column name, type, description, and sample values
- Lineage relationships to upstream and downstream assets
- Metadata about other related assets
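
For illustration only, the snippet below sketches the kind of context such a generator might assemble for a single table. All names and fields are hypothetical and do not reflect the actual payload or prompts used by Acryl.

```python
# Hypothetical example of documentation-generation context. Every name below is
# invented for illustration; this is not the actual Acryl payload.
table_context = {
    "dataset": {
        "name": "fct_orders",
        "existing_documentation": "",
    },
    "columns": [
        {
            "name": "order_amount",
            "type": "NUMBER(38, 2)",
            "description": None,
            "sample_values": [129.99, 54.00, 870.25],
        },
    ],
    "lineage": {
        "upstreams": ["raw_orders"],
        "downstreams": ["finance_revenue_report"],
    },
    "related_assets": ["dim_customers"],
}
```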

Data privacy: Your metadata is not sent to any third-party LLMs. We use AWS Bedrock internally, which means all metadata remains within the Acryl AWS account. We do not fine-tune on customer data.

## Limitations

- This feature is powered by an LLM, which can produce inaccurate results. While we've taken steps to reduce the likelihood of hallucinations, they can still occur.
72 changes: 72 additions & 0 deletions docs/automations/ai-term-suggestion.md
@@ -0,0 +1,72 @@
import FeatureAvailability from '@site/src/components/FeatureAvailability';

# AI Glossary Term Suggestions

<FeatureAvailability saasOnly />

:::info

This feature is currently in closed beta. Reach out to your Acryl representative to get access.

:::

The AI Glossary Term Suggestion automation uses LLMs to suggest [Glossary Terms](../glossary/business-glossary.md) for tables and columns in your data.

This is useful for improving coverage of glossary terms across your organization, which is important for compliance and governance efforts.

This automation can:

- Automatically suggest glossary terms for tables and columns.
- Go beyond a predefined set of terms and work with your business glossary.
- Generate [proposals](../managed-datahub/approval-workflows.md) for owners to review, or automatically add terms to tables/columns.
- Automatically adjust to human-provided feedback and curation (coming soon).

## Prerequisites

- A business glossary with terms defined. Additional metadata, like documentation and existing term assignments, will improve the accuracy of our suggestions.

## Configuring

1. **Navigate to Automations**: Click on 'Govern' > 'Automations' in the navigation bar.

<p align="center">
<img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automations-nav-link.png"/>
</p>

2. **Create the Automation**: Click on 'Create' and select 'AI Glossary Term Suggestions'.

<p align="center">
<img width="40%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/ai-term-suggestion/automation-type.png"/>
</p>

3. **Configure the Automation**: Fill in the required fields. The main choices are (1) which glossary terms to use for suggestions and (2) which entities to generate suggestions for.

<p align="center">
<img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/ai-term-suggestion/automation-config.png"/>
</p>

4. Once it's enabled, that's it! You'll start to see terms show up in the UI, either on assets or on the proposals page.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/ai-term-suggestion/term-proposals.png"/>
</p>

## How it works

The automation will scan through all the datasets matched by the configured filters. For each one, it will generate suggestions.
If new entities are added that match the configured filters, those will also be classified within 24 hours.
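
As a rough sketch of that flow, the loop looks something like the following. The helper names and signatures here are hypothetical; the actual implementation is not public.

```python
# Illustrative sketch only: names and signatures are hypothetical and do not
# correspond to the actual Acryl implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DatasetInfo:
    name: str
    columns: List[str] = field(default_factory=list)


def suggest_terms(
    datasets: List[DatasetInfo],
    glossary_terms: List[str],
    matches_filter: Callable[[DatasetInfo], bool],
    llm_suggest: Callable[[DatasetInfo, List[str]], Dict[str, str]],
    auto_apply: bool = False,
) -> List[Tuple[str, str, str, str]]:
    """Return (dataset, column, term, action) tuples for every matched dataset."""
    results = []
    for dataset in datasets:
        # Only entities matched by the configured filters are scanned.
        if not matches_filter(dataset):
            continue
        # Suggestions are constrained to terms that already exist in the glossary.
        suggestions = llm_suggest(dataset, glossary_terms)
        # Proposals are routed to owners for review; "apply" adds terms directly.
        action = "apply" if auto_apply else "propose"
        for column, term in suggestions.items():
            results.append((dataset.name, column, term, action))
    return results
```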

We take into account the following metadata when generating suggestions:

- Dataset name and description
- Column name, type, description, and sample values
- Glossary term name, documentation, and hierarchy
- Feedback loop: existing assignments and accepted/rejected proposals (coming soon)

Data privacy: Your metadata is not sent to any third-party LLMs. We use AWS Bedrock internally, which means all metadata remains within the Acryl AWS account. We do not fine-tune on customer data.

## Limitations

- A single configured automation can classify at most 10k entities.
- We cannot do partial reclassification. If you add a new column to an existing table, we won't regenerate suggestions for that table.
33 changes: 16 additions & 17 deletions docs/automations/snowflake-tag-propagation.md
@@ -1,4 +1,3 @@

import FeatureAvailability from '@site/src/components/FeatureAvailability';

# Snowflake Tag Propagation Automation
@@ -20,22 +19,22 @@ both columns and tables back to Snowflake. This automation is available in DataH

1. **Navigate to Automations**: Click on 'Govern' > 'Automations' in the navigation bar.

<p align="left">
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automations-nav-link.png"/>
<p align="center">
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automations-nav-link.png"/>
</p>

2. **Create An Automation**: Click on 'Create' and select 'Snowflake Tag Propagation'.

<p align="left">
<img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-type.png"/>
<p align="center">
<img width="60%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-type.png"/>
</p>

3. **Configure Automation**: Fill in the required fields to connect to Snowflake, along with the name, description, and category.
Note that you can limit propagation based on specific Tags and Glossary Terms. If none are selected, then ALL Tags or Glossary Terms will be automatically
propagated to Snowflake tables and columns. Finally, click 'Save and Run' to start the automation.

<p align="left">
<img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-form.png"/>
<p align="center">
<img width="60%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/automation-form.png"/>
</p>

## Propagating for Existing Assets
Expand All @@ -46,13 +45,13 @@ Note that it may take some time to complete the initial back-filling process, de
To do so, navigate to the Automation you created in Step 3 above, click the 3-dot "More" menu

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
</p>

and then click "Initialize".

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-initialize.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-initialize.png"/>
</p>

This one-time step will kick off the back-filling process for existing tags and glossary terms. If you only want to begin propagating
Expand All @@ -68,21 +67,21 @@ that you no longer want propagated descriptions to be visible.
To do this, navigate to the Automation you created in Step 3 above, click the 3-dot "More" menu

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-more-menu.png"/>
</p>

and then click "Rollback".

<p align="left">
<img width="15%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-rollback.png"/>
<img width="20%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/automation-rollback.png"/>
</p>

This one-time step will remove all propagated tags and glossary terms from Snowflake. To simply stop propagating new tags, you can disable the automation.

## Viewing Propagated Tags

You can view propagated Tags (and corresponding DataHub URNs) inside the Snowflake UI to confirm the automation is working as expected.
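
If you prefer to verify from code rather than the UI, here is a rough sketch using the Snowflake Python connector. The connection parameters are placeholders, and reading `SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES` requires a role with access to the `ACCOUNT_USAGE` schema.

```python
# Rough sketch: list tag assignments recorded in ACCOUNT_USAGE.
# Replace the placeholder connection details with real values.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
)
try:
    cur = conn.cursor()
    cur.execute(
        """
        SELECT OBJECT_DATABASE, OBJECT_SCHEMA, OBJECT_NAME, COLUMN_NAME,
               TAG_NAME, TAG_VALUE
        FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
        ORDER BY OBJECT_NAME
        """
    )
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```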

<p align="left">
<img width="50%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/view-snowflake-tags.png"/>
<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/automation/saas/snowflake-tag-propagation/view-snowflake-tags.png"/>
</p>
Original file line number Diff line number Diff line change
@@ -383,13 +383,11 @@ def on_task_instance_running(
return

logger.debug(
f"DataHub listener got notification about task instance start for {task_instance.task_id} of dag {task_instance.dag_run.dag_id}"
f"DataHub listener got notification about task instance start for {task_instance.task_id} of dag {task_instance.dag_id}"
)

if not self.config.dag_filter_pattern.allowed(task_instance.dag_run.dag_id):
logger.debug(
f"DAG {task_instance.dag_run.dag_id} is not allowed by the pattern"
)
if not self.config.dag_filter_pattern.allowed(task_instance.dag_id):
logger.debug(f"DAG {task_instance.dag_id} is not allowed by the pattern")
return

if self.config.render_templates:
2 changes: 2 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/api/source.py
@@ -37,6 +37,7 @@
from datahub.ingestion.api.source_helpers import (
auto_browse_path_v2,
auto_fix_duplicate_schema_field_paths,
auto_fix_empty_field_paths,
auto_lowercase_urns,
auto_materialize_referenced_tags_terms,
auto_status_aspect,
@@ -444,6 +445,7 @@ def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
partial(
auto_fix_duplicate_schema_field_paths, platform=self._infer_platform()
),
partial(auto_fix_empty_field_paths, platform=self._infer_platform()),
browse_path_processor,
partial(auto_workunit_reporter, self.get_report()),
auto_patch_last_modified,
44 changes: 44 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/api/source_helpers.py
@@ -394,6 +394,50 @@ def auto_fix_duplicate_schema_field_paths(
)


def auto_fix_empty_field_paths(
stream: Iterable[MetadataWorkUnit],
*,
platform: Optional[str] = None,
) -> Iterable[MetadataWorkUnit]:
"""Count schema metadata aspects with empty field paths and emit telemetry."""

total_schema_aspects = 0
schemas_with_empty_fields = 0
empty_field_paths = 0

for wu in stream:
schema_metadata = wu.get_aspect_of_type(SchemaMetadataClass)
if schema_metadata:
total_schema_aspects += 1

updated_fields: List[SchemaFieldClass] = []
for field in schema_metadata.fields:
if field.fieldPath:
updated_fields.append(field)
else:
empty_field_paths += 1

if empty_field_paths > 0:
logger.info(
f"Fixing empty field paths in schema aspect for {wu.get_urn()} by dropping empty fields"
)
schema_metadata.fields = updated_fields
schemas_with_empty_fields += 1

yield wu

if schemas_with_empty_fields > 0:
properties = {
"platform": platform,
"total_schema_aspects": total_schema_aspects,
"schemas_with_empty_fields": schemas_with_empty_fields,
"empty_field_paths": empty_field_paths,
}
telemetry.telemetry_instance.ping(
"ingestion_empty_schema_field_paths", properties
)
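
# Illustrative usage sketch (hypothetical variable names): the helper wraps a
# workunit stream lazily, so it is normally applied via get_workunit_processors(),
# but it can also be called directly, e.g.:
#
#   fixed_stream = auto_fix_empty_field_paths(workunits, platform="snowflake")
#   for wu in fixed_stream:
#       ...  # fields with an empty fieldPath have been dropped from each schema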


def auto_empty_dataset_usage_statistics(
stream: Iterable[MetadataWorkUnit],
*,
Original file line number Diff line number Diff line change
@@ -223,15 +223,14 @@ def ingest_table(
)

customProperties = {
"number_of_files": str(get_file_count(delta_table)),
"partition_columns": str(delta_table.metadata().partition_columns),
"table_creation_time": str(delta_table.metadata().created_time),
"id": str(delta_table.metadata().id),
"version": str(delta_table.version()),
"location": self.source_config.complete_path,
}
if not self.source_config.require_files:
del customProperties["number_of_files"] # always 0
if self.source_config.require_files:
customProperties["number_of_files"] = str(get_file_count(delta_table))

dataset_properties = DatasetPropertiesClass(
description=delta_table.metadata().description,
2 changes: 1 addition & 1 deletion metadata-ingestion/src/datahub/ingestion/source/preset.py
@@ -56,7 +56,7 @@ class PresetConfig(SupersetConfig):
def remove_trailing_slash(cls, v):
return config_clean.remove_trailing_slashes(v)

@root_validator
@root_validator(skip_on_failure=True)
def default_display_uri_to_connect_uri(cls, values):
base = values.get("display_uri")
if base is None:
Original file line number Diff line number Diff line change
@@ -334,19 +334,26 @@ def _process_view_lineage(self, lineage_row: LineageRow) -> None:
)

def _process_copy_command(self, lineage_row: LineageRow) -> None:
source = self._lineage_v1._get_sources(
logger.debug(f"Processing COPY command for lineage row: {lineage_row}")
sources = self._lineage_v1._get_sources(
lineage_type=LineageCollectorType.COPY,
db_name=self.database,
source_schema=None,
source_table=None,
ddl=None,
filename=lineage_row.filename,
)[0]
)
logger.debug(f"Recognized sources: {sources}")
source = sources[0]
if not source:
logger.debug("Ignoring command since couldn't recognize proper source")
return
s3_urn = source[0].urn

logger.debug(f"Recognized s3 dataset urn: {s3_urn}")
if not lineage_row.target_schema or not lineage_row.target_table:
logger.debug(
f"Didn't find target schema (found: {lineage_row.target_schema}) or target table (found: {lineage_row.target_table})"
)
return
target = self._make_filtered_target(lineage_row)
if not target:
