
OOM (Out of Memory Errors) in DataHub 0.13+ when ingesting Bigquery metadata #11147

Open · vejeta opened this issue Aug 12, 2024 · 14 comments
Labels: bug (Bug report)

vejeta (Contributor) commented Aug 12, 2024

Describe the bug

We have an ingestion job running periodically in Kubernetes; it runs fine with DataHub 0.12.x versions.
[Screenshot: memory usage during ingestion with DataHub 0.12.0]

You can see the memory stays stable under 1GiB during the execution of the job.

However, with DataHub 0.13.x versions it always fails with error 137 (out of memory).
[Screenshot: pod terminated with OOM error 137 on DataHub 0.13.x]

We have tried increasing the memory to 20GiB, but there must be a memory leak, because it always runs out of memory.

To Reproduce
Steps to reproduce the behavior:

  • Kubernetes CronJob setup (see the resource spec sketch after the recipe below)
    • 2Gi memory
    • 2 CPU
  • Our deployment:
            - name: bigquery
              image: 'acryldata/datahub-ingestion:v0.13.3'
              imagePullPolicy: Always
              args: ["ingest", "-c", "/recipes/bq-recipe.yml"]    

Note: we have also tried the latest build from the latest commit on master, and the issue is still present.

  • Our BigQuery recipe
            source:
              type: bigquery
              config:
                project_on_behalf: {{ bq_slots_project }}
                project_id_pattern:
                  allow:
                    - .*{{ gcp_project }}
                dataset_pattern:
                  allow:
                    - {{ profile_dataset }}
                  deny:
                    - ^temp_.*
                    - .*_temp$
                    - .*_temp_.*
                    - .*-temp.*
                    - .*temporary.*
        
                use_exported_bigquery_audit_metadata: true
                bigquery_audit_metadata_datasets:
                  - {{ gcp_project }}.bigquery_audit_log
                use_date_sharded_audit_log_tables: true
                upstream_lineage_in_report: true
        
                include_usage_statistics: true
        
                capture_table_label_as_tag: true
                capture_dataset_label_as_tag: true
                extract_column_lineage: true
                convert_urns_to_lowercase: true
        
                profiling:
                  enabled: "true"
                  profile_table_size_limit: null
                  profile_table_row_limit: null

                  use_sampling: false
                  partition_profiling_enabled: false
                  include_field_mean_value: false
                  include_field_median_value: false
                  include_field_sample_values: false
                  include_field_stddev_value: false

        
                stateful_ingestion:
                  enabled: true
                  state_provider:
                    type: "datahub"
                    config:
                      datahub_api:
                        server: "http://our-gms:8080"
        
            pipeline_name: {{ pipeline }}
            sink:
              type: "datahub-rest"
              config:
                server: "http://our-gms:8080"
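
For reference, this is roughly how the 2Gi / 2 CPU setting maps onto the container spec of the CronJob (a sketch using the standard Kubernetes resources fields; the requests/limits split is an assumption, the values are the ones listed above):

    # Sketch of the container entry in the CronJob's pod template
    - name: bigquery
      image: 'acryldata/datahub-ingestion:v0.13.3'
      imagePullPolicy: Always
      args: ["ingest", "-c", "/recipes/bq-recipe.yml"]
      resources:
        requests:
          memory: "2Gi"   # the job runs fine within this on 0.12.x
          cpu: "2"
        limits:
          memory: "2Gi"   # exceeding this is what produces the 137 (OOMKilled) exit code
          cpu: "2"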

Expected behavior
Not having an Out of Memory error in DataHub 0.13.3.


Environment:

  • OS: Google Cloud GKE/Kubernetes
  • Version: DataHub 0.13.x

Additional context
Log summary from a successful execution with DataHub 0.12.x:

Pipeline finished with at least 5 warnings; produced 33546 events in 3 hours, 26 minutes and 30.91 seconds.

Source (bigquery) report (excerpt):
 'num_total_lineage_entries': {'host': 108696},
 'num_skipped_lineage_entries_missing_data': {'host': 14177},
 'num_skipped_lineage_entries_not_allowed': {'host': 90366},
 'num_lineage_entries_sql_parser_failure': {'host': 147},
 'num_skipped_lineage_entries_other': {},
 'num_lineage_total_log_entries': {'*': 108696},
 'num_lineage_parsed_log_entries': {'*': 108696},
 'lineage_metadata_entries': {'*': 99},
 'num_usage_total_log_entries': {'*': 175145},
 'num_usage_parsed_log_entries': {'*': 122613},
 'sampled': '10 sampled of at most 2880 entries.'},
 'total_query_log_entries': 0,
 'audit_log_api_perf': {'get_exported_log_entries': '397.101 seconds', 'list_log_entries': None},
 'sampled': '10 sampled of at most 3059 entries.'},
 'partition_info': {},
 'lineage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'lineage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_lineage_ingestion_enabled': "default=True description='Enable stateful lineage ingestion. This will store lineage window timestamps after successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. ' extra={}",
 'usage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'usage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_usage_ingestion_enabled': True,
 'start_time': '2024-08-11 00:00:08.984329 (3 hours, 26 minutes and 30.91 seconds ago)',
 'running_time': '3 hours, 26 minutes and 30.91 seconds'}

Sink (datahub-rest) report:
{'total_records_written': 33546,
 'records_written_per_second': 2,
 'warnings': [],
 'failures': [],
 'start_time': '2024-08-11 00:00:03.217296 (3 hours, 26 minutes and 36.68 seconds ago)',
 'current_time': '2024-08-11 03:26:39.897462 (now)',
 'total_duration_in_seconds': 12396.68,
 'gms_version': 'v0.13.0',
 'pending_requests': 0}
vejeta added the bug (Bug report) label on Aug 12, 2024
vejeta (Contributor, Author) commented Aug 22, 2024

Same behaviour in 0.14.x (tested in 0.14.0.1, 0.14.0.2)

vejeta changed the title from "OOM (Out of Memory Errors) in DataHub 0.13.x when ingesting Bigquery metadata" to "OOM (Out of Memory Errors) in DataHub 0.13+ when ingesting Bigquery metadata" on Aug 22, 2024
hsheth2 (Collaborator) commented Sep 16, 2024

@vejeta could you generate a memory profile? See https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/
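
Per that guide, profiling is switched on via a flags section in the recipe; a minimal sketch (the generate_memory_profiles option is the one described in the linked docs; the output path is just an example and must point at a writable location in the pod):

    source:
      type: bigquery
      config:
        # ... existing BigQuery source config ...
        include_usage_statistics: true
    flags:
      generate_memory_profiles: "/tmp/datahub-memray"   # example output directory
    sink:
      type: "datahub-rest"
      config:
        server: "http://our-gms:8080"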

vejeta (Contributor, Author) commented Sep 18, 2024

Thanks, @hsheth2 for the suggestion! I will do that with the latest version (0.14.1)

vejeta (Contributor, Author) commented Sep 25, 2024

Just out of curiosity, to see if there is an easier path: is there a published Docker image with the memray dependency included?
After enabling the memory profiling in my recipe, I am getting:

ERROR    {datahub.entrypoints:218} - Command failed: No module named 'memray' 

If that doesn't exist, no worries; I will test it with our own Docker image with memray included.

vejeta closed this as completed on Sep 25, 2024
vejeta reopened this on Sep 25, 2024
pedro93 (Collaborator) commented Sep 25, 2024

memray is installed when you install the datahub debug sub-package:

"debug": list(debug_requirements),

Check these docs on how to use it: https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/#how-to-use (opt for the CLI option)
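
In practice that can be as simple as the following (a sketch; the debug extra is the one referenced above, and adding the bigquery extra for the source is an assumption):

    # Install the CLI with the debug extra, which pulls in memray
    pip install 'acryl-datahub[debug,bigquery]'
    # Then run the ingestion as usual; profiles land in the folder configured under flags.generate_memory_profiles
    datahub ingest -c /recipes/bq-recipe.yml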

vejeta (Contributor, Author) commented Oct 2, 2024

@hsheth2, @pedro93 I have generated the memory profiling file. It is 801 MB in size.

Should I provide the output of running memray over it somehow? Which output would be most useful?

hsheth2 (Collaborator) commented Oct 2, 2024

For starters, if you could run memray flamegraph file.bin to generate an html file, that'd be helpful. Hopefully that gives us enough info, but we might need the full file if there isn't enough in the flamegraph.
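
Roughly (file names here are placeholders; -o just sets the output HTML path):

    # Turn the memray capture into an HTML flamegraph
    memray flamegraph -o memray-flamegraph-bigquery.html file.bin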

Also - we have separately been working on a BigQuery lineage/usage v2 implementation (enabled with use_queries_v2). If the bottleneck is from that, we'll probably ask you to enable the new version, since it is a bit more optimized with regard to runtime and memory usage.
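
For reference, that toggle would sit under the source config in the recipe, e.g. (a sketch; only the flag name comes from the note above):

    source:
      type: bigquery
      config:
        # ... existing BigQuery source config ...
        use_queries_v2: true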

vejeta (Contributor, Author) commented Oct 2, 2024

@hsheth2 I see. Thanks a lot. This has been generated with 0.13.3.

I tried with 0.14.1, but our GMS is not on 0.14.x yet, so there are some incompatibilities when sending the metadata.

Hopefully, it is enough to locate the bottleneck.
If use_queries_v2 is for 0.14.x, we will need to perform an upgrade first, which will take a bit of time.

Attached is the flamegraph:

memray-flamegraph-bigquery-2024_10_01-20_23_13.html.gz

hsheth2 (Collaborator) commented Oct 2, 2024

@vejeta looking at the flamegraph, it looks like the OOM is being caused by our SQL parser.

If that's the case, use_queries_v2 probably won't help. We made a few tweaks to the SQL parser since 0.13.3 which might help, but I'm not too confident on that.

If you could run with datahub --debug ingest ..., the last ~1000 lines of the log should have some additional details about the exact SQL query that we're having issues with.
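
One way to capture that (the --debug invocation is the one above; file names are just examples):

    # Keep the full debug log, then extract the tail for sharing
    datahub --debug ingest -c /recipes/bq-recipe.yml 2>&1 | tee bigquery_job_debug.log
    tail -n 1000 bigquery_job_debug.log > bigquery_job_debug_tail.log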

vejeta (Contributor, Author) commented Oct 9, 2024

Thanks a lot @hsheth2, here are the last 4000 lines with --debug.

I had to "anonymize" the field names; hopefully it still gives a clue about what introduced the memory leak from 0.12.x to 0.13+.

bigquery_job_4000lines.txt.gz

edulodgify commented

Hi, we are facing the same issue. All the info is in ticket #11597, and I am also attaching the anonymized logs captured with debug mode:
acryl.log

hsheth2 (Collaborator) commented Oct 18, 2024

Now that #11432 and #11645 are both part of 0.14.1.3 - would you guys mind giving this another try?

vejeta (Contributor, Author) commented Oct 20, 2024

Thanks a lot @hsheth2 for the quick turnaround on this issue. I am going to work on migrating from 0.13.x to the latest 0.14.x so I can report back.

Will the tag 0.14.1.3 be published soon?

hsheth2 (Collaborator) commented Oct 22, 2024

That's the CLI version, which is published here https://pypi.org/project/acryl-datahub/0.14.1.3/
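
For anyone who wants to test against the same build, pinning the ingestion environment to that release looks like this (the bigquery extra is shown as an example for this source):

    pip install 'acryl-datahub[bigquery]==0.14.1.3'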
