datahub performance #11671

Open
pilipyukaaa opened this issue Oct 18, 2024 · 4 comments
Labels
bug Bug report

Comments

pilipyukaaa commented Oct 18, 2024

Hello,
I have a performance problem with the process that consumes messages from Kafka and pushes changes to Elasticsearch and Neo4j.
I added these environment variables to my GMS:

  extraEnvs:
    - name: SPRING_KAFKA_PROPERTIES_MAX_POLL_RECORDS
      value: '10'
    - name: SPRING_KAFKA_PROPERTIES_MAX_POLL_INTERVAL_MS
      value: '120000'
    - name: ES_BULK_REQUESTS_LIMIT
      value: '1500'
    - name: ES_BULK_FLUSH_PERIOD
      value: '2'
    - name: LOGGING_LEVEL_ORG_APACHE_KAFKA_CLIENTS_CONSUMER
      value: DEBUG
    - name: LOGGING_LEVEL_ORG_SPRINGFRAMEWORK_KAFKA
      value: DEBUG
    - name: ELASTICSEARCH_THREAD_COUNT
      value: '15'
    - name: ES_BULK_ENABLE_BATCH_DELETE
      value: 'true'
    - name: LOGGING_LEVEL_ORG_APACHE_KAFKA_CLIENTS_CONSUMER
      value: DEBUG
    - name: LOGGING_LEVEL_ORG_SPRINGFRAMEWORK_KAFKA
      value: DEBUG
2024-10-18 09:01:22,092 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:61 - Successfully fed bulk request 172. Number of events: 5 Took time ms: 3
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IncidentsSummaryHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,463 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:109 - Urn urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD) with aspect upstreamLineage received by Sibling Hook.
2024-10-18 09:01:40,467 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:244 - Associating urn:li:dataset:(urn:li:dataPlatform:dbt,_bdm.bdm_dim_opportunity_view_final,PROD) and urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD) as siblings.
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:137 - Successfully completed MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_opportunity_view_final,PROD)
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:82 - Got MCL event key: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD), topic: MetadataChangeLog_Versioned_v1, partition: 0, offset: 119678, value size: 143224, timestamp: 1729168196437
2024-10-18 09:01:40,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:106 - Invoking MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD), aspect name: upstreamLineage, entity type: dataset, change type: UPSERT
2024-10-18 09:01:40,474 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook UpdateIndicesHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:01:40,479 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: EtcUX9vACyZAw/dPG+Inzw==, operation type: UPDATE, index: system_metadata_service_v1
2024-10-18 09:02:04,472 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IncidentsSummaryHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,473 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:109 - Urn urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD) with aspect upstreamLineage received by Sibling Hook.
2024-10-18 09:02:04,476 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:244 - Associating urn:li:dataset:(urn:li:dataPlatform:dbt,_bdm.bdm_dim_request,PROD) and urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD) as siblings.
2024-10-18 09:02:04,481 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:02:04,481 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:137 - Successfully completed MCL hooks for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)

However, throughput is still very low. Can you help me find the bottleneck?
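One way to narrow this down from the log itself is to look for long pauses between consecutive lines on the same thread. The sketch below is only an illustration: the helper name `find_gaps` and the one-second threshold are assumptions, and the sample is three lines copied from the log above. On that sample it reports a pause of roughly 24 seconds after the `ESBulkProcessor` line, that is, while `UpdateIndicesHook` is still running. That is consistent with a slow write somewhere in that hook, but it does not by itself prove which backend is responsible.

```python
import re
from datetime import datetime

# Leading timestamp, thread name, level and message of a GMS log line.
# The optional "[" tolerates a stray leading bracket from a bad paste.
LINE_RE = re.compile(
    r"^\[?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[([^\]]+)\]\s+\w+\s+(.*)$"
)

# Three consecutive lines taken from the log excerpt in this issue.
SAMPLE = """\
2024-10-18 09:01:40,474 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook UpdateIndicesHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
2024-10-18 09:01:40,479 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: EtcUX9vACyZAw/dPG+Inzw==, operation type: UPDATE, index: system_metadata_service_v1
2024-10-18 09:02:04,472 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.MetadataChangeLogProcessor:119 - Invoking MCL hook IncidentsSummaryHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_bdm.bdm_dim_request,PROD)
"""


def find_gaps(lines, threshold_s=1.0):
    """Return (gap_seconds, thread, message_before_gap) tuples, largest gap first.

    A gap is the time between two consecutive log lines of the same thread.
    Only gaps longer than threshold_s are reported.
    """
    last_seen = {}  # thread name -> (timestamp, message) of its previous line
    gaps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a log line
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        thread, message = match.group(2), match.group(3)
        if thread in last_seen:
            prev_ts, prev_message = last_seen[thread]
            delta = (ts - prev_ts).total_seconds()
            if delta > threshold_s:
                gaps.append((delta, thread, prev_message))
        last_seen[thread] = (ts, message)
    return sorted(gaps, reverse=True)


if __name__ == "__main__":
    for seconds, thread, message in find_gaps(SAMPLE.splitlines()):
        print(f"{seconds:8.3f}s on {thread} after: {message[:80]}")
```

Running the same function over a full GMS log file (for example `find_gaps(open(path))`, where `path` is wherever the log is saved) would show whether the long pauses always follow the same hook.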

@pilipyukaaa pilipyukaaa added the bug Bug report label Oct 18, 2024
Contributor

deepgarg-visa commented Oct 20, 2024

Hi @pilipyukaaa, which version of DataHub are you using?

Neo4j is certainly a bottleneck here.
The following PRs improve Neo4j query performance. Check that your version includes these changes.

https://github.com/datahub-project/datahub/pull/10598/files
#10593

Also, create indexes for the entities in Neo4j if they do not exist already. They are not created by default.
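As a rough sketch of what creating those indexes could look like, the snippet below only generates Cypher statements; it does not connect to Neo4j. It assumes that graph nodes are labeled with the entity type (for example `dataset`) and carry a `urn` property, and that the Neo4j version supports `CREATE INDEX ... IF NOT EXISTS` (4.x or later). The label list and the index naming scheme are hypothetical, so check the labels actually present in your graph before running anything.

```python
# Hypothetical set of entity types; replace with the node labels in your graph.
ENTITY_LABELS = ["dataset", "dataJob", "dataFlow", "chart", "dashboard"]


def index_statements(labels, prop="urn"):
    """Build one idempotent CREATE INDEX statement per node label.

    Backticks quote the label and property so unusual names stay valid Cypher.
    """
    return [
        f"CREATE INDEX {label}_{prop}_idx IF NOT EXISTS "
        f"FOR (n:`{label}`) ON (n.`{prop}`)"
        for label in labels
    ]


if __name__ == "__main__":
    # Print the statements so they can be reviewed and pasted into cypher-shell.
    for statement in index_statements(ENTITY_LABELS):
        print(statement + ";")
```

Running `SHOW INDEXES` in Neo4j beforehand shows which indexes already exist.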


Daniellundin048 commented Oct 20, 2024 via email

Author

Hello @deepgarg-visa, I am using DataHub version 0.13.3.

Author

pilipyukaaa commented Oct 21, 2024

I updated DataHub to version 0.14.1 and performance is still poor.

2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,425 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:121 - Urn urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD) with aspect datasetKey received by Sibling Hook.
2024-10-21 13:24:50,433 [ThreadPoolTaskExecutor-1] INFO  c.l.m.k.h.s.SiblingAssociationHook:256 - Associating urn:li:dataset:(urn:li:dataPlatform:dbt,_dds.dist_5_dds_CRM_issues,PROD) and urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD) as siblings.
2024-10-21 13:24:50,438 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:97 - Successfully completed MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,439 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:69 - Invoking MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD), aspect name: siblings, entity type: dataset, change type: RESTATE
2024-10-21 13:24:50,439 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,439 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook UpdateIndicesHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,440 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: BXh3SoWBZKt7JlYWQUbs+w==, operation type: UPDATE, index: system_metadata_service_v1
2024-10-21 13:24:50,441 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Aclickhouse%2C_dds.dist_5_dds_crm_issues%2CPROD%29, operation type: UPDATE, index: datasetindex_v2
2024-10-21 13:24:50,473 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook IncidentsSummaryHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook IngestionSchedulerHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook EntityChangeEventGeneratorHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook SiblingAssociationHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:97 - Successfully completed MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:69 - Invoking MCL hooks for consumer: generic-mae-consumer-job-client urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD), aspect name: upstreamLineage, entity type: dataset, change type: RESTATE
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook FormAssignmentHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,474 [ThreadPoolTaskExecutor-1] INFO  c.l.metadata.kafka.MCLKafkaListener:79 - Invoking MCL hook UpdateIndicesHook for urn: urn:li:dataset:(urn:li:dataPlatform:clickhouse,_dds.dist_5_dds_crm_issues,PROD)
2024-10-21 13:24:50,478 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: YxtQxbiPZDnFzc4S31sl0A==, operation type: UPDATE, index: system_metadata_service_v1
2024-10-21 13:24:50,478 [ThreadPoolTaskExecutor-1] INFO  c.l.m.s.e.update.ESBulkProcessor:82 - Added request id: urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Aclickhouse%2C_dds.dist_5_dds_crm_issues%2CPROD%29, operation type: UPDATE, index: datasetindex_v2
2024-10-21 13:24:51,778 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener:61 - Successfully fed bulk request 198. Number of events: 10 Took time ms: 10

3 participants