
OOM (Out of Memory Errors) in DataHub 0.13+ when ingesting Bigquery metadata #11147

Open · vejeta opened this issue Aug 12, 2024 · 14 comments
Labels: bug (Bug report)

vejeta (Contributor) commented Aug 12, 2024

Describe the bug

We have an ingestion job running periodically in Kubernetes; it runs fine with DataHub 0.12.x versions.
[Screenshot: memory usage during ingestion with DataHub 0.12.0]

You can see the memory stays stable under 1GiB during the execution of the job.

However, with DataHub 0.13.x versions it always fails with error 137 (out of memory).
[Screenshot: pod terminated with OOM error 137 on DataHub 0.13.x]

We have tried increasing the memory to 20GiB, but there must be a memory leak, because it always runs out of memory.

To Reproduce
Steps to reproduce the behavior:

  • Kubernetes CronJob setup (see the resource spec sketch after the recipe below)
    • 2Gi memory
    • 2 CPU
  • Our deployment:
            - name: bigquery
              image: 'acryldata/datahub-ingestion:v0.13.3'
              imagePullPolicy: Always
              args: ["ingest", "-c", "/recipes/bq-recipe.yml"]    

Note: we have also tried the latest build from the latest commit on master, and the issue is still present.

  • Our BigQuery recipe
            source:
              type: bigquery
              config:
                project_on_behalf: {{ bq_slots_project }}
                project_id_pattern:
                  allow:
                    - .*{{ gcp_project }}
                dataset_pattern:
                  allow:
                    - {{ profile_dataset }}
                  deny:
                    - ^temp_.*
                    - .*_temp$
                    - .*_temp_.*
                    - .*-temp.*
                    - .*temporary.*
        
                use_exported_bigquery_audit_metadata: true
                bigquery_audit_metadata_datasets:
                  - {{ gcp_project }}.bigquery_audit_log
                use_date_sharded_audit_log_tables: true
                upstream_lineage_in_report: true
        
                include_usage_statistics: true
        
                capture_table_label_as_tag: true
                capture_dataset_label_as_tag: true
                extract_column_lineage: true
                convert_urns_to_lowercase: true
        
                profiling:
                  enabled: "true"
                  profile_table_size_limit: null
                  profile_table_row_limit: null

                  use_sampling: false
                  partition_profiling_enabled: false
                  include_field_mean_value: false
                  include_field_median_value: false
                  include_field_sample_values: false
                  include_field_stddev_value: false

        
                stateful_ingestion:
                  enabled: true
                  state_provider:
                    type: "datahub"
                    config:
                      datahub_api:
                        server: "http://our-gms:8080"
        
            pipeline_name: {{ pipeline }}
            sink:
              type: "datahub-rest"
              config:
                server: "http://our-gms:8080"
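
For reference, this is roughly how the 2Gi / 2 CPU setting maps onto the container spec of the CronJob (a sketch using the standard Kubernetes resources fields; the requests/limits split is an assumption, the values are the ones listed above):

    # Sketch of the container entry in the CronJob's pod template
    - name: bigquery
      image: 'acryldata/datahub-ingestion:v0.13.3'
      imagePullPolicy: Always
      args: ["ingest", "-c", "/recipes/bq-recipe.yml"]
      resources:
        requests:
          memory: "2Gi"   # the job runs fine within this on 0.12.x
          cpu: "2"
        limits:
          memory: "2Gi"   # exceeding this is what produces the 137 (OOMKilled) exit code
          cpu: "2"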

Expected behavior
Not having an Out of Memory error in DataHub 0.13.3.


Environment:

  • OS: Google Cloud GKE/Kubernetes
  • Version: DataHub 0.13.x

Additional context
Log summary from a successful execution with DataHub 0.12.x:

Pipeline finished with at least 5 warnings; produced 33546 events in 3 hours, 26 minutes and 30.91 seconds.

Source (bigquery) report (excerpt):
 'num_total_lineage_entries': {'host': 108696},
 'num_skipped_lineage_entries_missing_data': {'host': 14177},
 'num_skipped_lineage_entries_not_allowed': {'host': 90366},
 'num_lineage_entries_sql_parser_failure': {'host': 147},
 'num_skipped_lineage_entries_other': {},
 'num_lineage_total_log_entries': {'*': 108696},
 'num_lineage_parsed_log_entries': {'*': 108696},
 'lineage_metadata_entries': {'*': 99},
 'num_usage_total_log_entries': {'*': 175145},
 'num_usage_parsed_log_entries': {'*': 122613},
 'sampled': '10 sampled of at most 2880 entries.'},
 'total_query_log_entries': 0,
 'audit_log_api_perf': {'get_exported_log_entries': '397.101 seconds', 'list_log_entries': None},
 'sampled': '10 sampled of at most 3059 entries.'},
 'partition_info': {},
 'lineage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'lineage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_lineage_ingestion_enabled': "default=True description='Enable stateful lineage ingestion. This will store lineage window timestamps after successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. ' extra={}",
 'usage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'usage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_usage_ingestion_enabled': True,
 'start_time': '2024-08-11 00:00:08.984329 (3 hours, 26 minutes and 30.91 seconds ago)',
 'running_time': '3 hours, 26 minutes and 30.91 seconds'}

Sink (datahub-rest) report:
{'total_records_written': 33546,
 'records_written_per_second': 2,
 'warnings': [],
 'failures': [],
 'start_time': '2024-08-11 00:00:03.217296 (3 hours, 26 minutes and 36.68 seconds ago)',
 'current_time': '2024-08-11 03:26:39.897462 (now)',
 'total_duration_in_seconds': 12396.68,
 'gms_version': 'v0.13.0',
 'pending_requests': 0}
vejeta added the bug (Bug report) label on Aug 12, 2024
vejeta (Contributor, Author) commented Aug 22, 2024

Same behaviour in 0.14.x (tested in 0.14.0.1, 0.14.0.2)

vejeta changed the title from "OOM (Out of Memory Errors) in DataHub 0.13.x when ingesting Bigquery metadata" to "OOM (Out of Memory Errors) in DataHub 0.13+ when ingesting Bigquery metadata" on Aug 22, 2024
hsheth2 (Collaborator) commented Sep 16, 2024

@vejeta could you generate a memory profile? See https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/
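
Per that guide, profiling is switched on via a flags section in the recipe; a minimal sketch (the generate_memory_profiles option is the one described in the linked docs; the output path is just an example and must point at a writable location in the pod):

    source:
      type: bigquery
      config:
        # ... existing BigQuery source config ...
        include_usage_statistics: true
    flags:
      generate_memory_profiles: "/tmp/datahub-memray"   # example output directory
    sink:
      type: "datahub-rest"
      config:
        server: "http://our-gms:8080"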

vejeta (Contributor, Author) commented Sep 18, 2024

Thanks, @hsheth2 for the suggestion! I will do that with the latest version (0.14.1)

vejeta (Contributor, Author) commented Sep 25, 2024

Just out of curiosity, to see if there is an easier path: is there a published Docker image with the memray dependency included?
After enabling the memory profiling in my recipe, I am getting:

ERROR    {datahub.entrypoints:218} - Command failed: No module named 'memray' 

If that doesn't exist, no worries; I will test it with our own Docker image with memray included.

vejeta closed this as completed on Sep 25, 2024
vejeta reopened this on Sep 25, 2024
pedro93 (Collaborator) commented Sep 25, 2024

memray is installed when you install the datahub debug sub-package:

"debug": list(debug_requirements),

Check these docs on how to use it: https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/#how-to-use (opt for the CLI option)
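
In practice that can be as simple as the following (a sketch; the debug extra is the one referenced above, and adding the bigquery extra for the source is an assumption):

    # Install the CLI with the debug extra, which pulls in memray
    pip install 'acryl-datahub[debug,bigquery]'
    # Then run the ingestion as usual; profiles land in the folder configured under flags.generate_memory_profiles
    datahub ingest -c /recipes/bq-recipe.yml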

vejeta (Contributor, Author) commented Oct 2, 2024

@hsheth2, @pedro93 I have generated the memory profiling file. It is 801 MB in size.

Should I provide the output of running memray over it somehow? Which output would be most useful?

hsheth2 (Collaborator) commented Oct 2, 2024

For starters, if you could run memray flamegraph file.bin to generate an html file, that'd be helpful. Hopefully that gives us enough info, but we might need the full file if there isn't enough in the flamegraph.
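
Roughly (file names here are placeholders; -o just sets the output HTML path):

    # Turn the memray capture into an HTML flamegraph
    memray flamegraph -o memray-flamegraph-bigquery.html file.bin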

Also - we have separately been working on a BigQuery lineage/usage v2 implementation (enabled with use_queries_v2). If the bottleneck is from that, we'll probably ask you to enable the new version, since it is a bit more optimized with regard to runtime and memory usage.
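
For reference, that toggle would sit under the source config in the recipe, e.g. (a sketch; only the flag name comes from the note above):

    source:
      type: bigquery
      config:
        # ... existing BigQuery source config ...
        use_queries_v2: true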

vejeta (Contributor, Author) commented Oct 2, 2024

@hsheth2 I see. Thanks a lot. This has been generated with 0.13.3.

I tried with 0.14.1, but our GMS is not on 0.14.x yet, so there are some incompatibilities when sending the metadata.

Hopefully, it is enough to locate the bottleneck.
If use_queries_v2 is for 0.14.x, we will need to perform an upgrade first, which will take a bit of time.

Attached is the flamegraph:

memray-flamegraph-bigquery-2024_10_01-20_23_13.html.gz

hsheth2 (Collaborator) commented Oct 2, 2024

@vejeta looking at the flamegraph, it looks like the OOM is being caused by our SQL parser.

If that's the case, use_queries_v2 probably won't help. We made a few tweaks to the SQL parser since 0.13.3 which might help, but I'm not too confident on that.

If you could run with datahub --debug ingest ..., the last ~1000 lines of the log should have some additional details about the exact SQL query that we're having issues with.
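
One way to capture that (the --debug invocation is the one above; file names are just examples):

    # Keep the full debug log, then extract the tail for sharing
    datahub --debug ingest -c /recipes/bq-recipe.yml 2>&1 | tee bigquery_job_debug.log
    tail -n 1000 bigquery_job_debug.log > bigquery_job_debug_tail.log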

vejeta (Contributor, Author) commented Oct 9, 2024

Thanks a lot @hsheth2, here are the last 4000 lines with --debug.

I had to "anonymize" the field names; hopefully it still gives a clue about what introduced the memory leak from 0.12.x to 0.13+.

bigquery_job_4000lines.txt.gz

edulodgify commented

Hi, we are facing the same issue. All the info is in ticket #11597, and I am also attaching the anonymized logs captured with debug mode:
acryl.log

hsheth2 (Collaborator) commented Oct 18, 2024

Now that #11432 and #11645 are both part of 0.14.1.3 - would you guys mind giving this another try?

vejeta (Contributor, Author) commented Oct 20, 2024

Thanks a lot @hsheth2 for the quick turnaround on this issue. I am going to work on migrating from 0.13.x to the latest 0.14.x so I can report back.

Will the tag 0.14.1.3 be published soon?

hsheth2 (Collaborator) commented Oct 22, 2024

That's the CLI version, which is published here https://pypi.org/project/acryl-datahub/0.14.1.3/
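
For anyone who wants to test against the same build, pinning the ingestion environment to that release looks like this (the bigquery extra is shown as an example for this source):

    pip install 'acryl-datahub[bigquery]==0.14.1.3'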
