Add dedicated seqr sync module to db layers #869

Draft: wants to merge 1 commit into base: dev

EddieLF (Contributor) commented Jul 12, 2024

Seqr Sync module

This PR adds a module that manages syncing between Metamist and seqr. The idea is to abstract each aspect of the seqr sync process so that the data synced to a seqr project can be closely tailored.

Metamist already has a seqr db layer. This module aims to replace that layer with a more robust and abstracted implementation, reusing existing code where possible. Much of the code in this module was also lifted from the sync_seqr.py script in /scripts.

To break down the module:

  • seqr_sync.py The main part of the module, which takes the transformed data and posts it to seqr. Any script used to sync data to seqr should instantiate the class defined here and call its sync_dataset methods.
  • data_fetchers.py Contains the classes MetamistFetcher and FileFetcher, which provide methods to get data from Metamist and from files respectively. The fetched data then needs to be transformed into seqr's expected formats before being loaded.
  • data_transformers.py Contains the SeqrTransformer class, which converts the data formats output by Metamist into the data formats expected by seqr, e.g. processing ped sex & affected values, processing HPO terms, formatting the es-index JSON post, etc.
  • config.py Holds the definitions and global variables that clutter the top of the sync_seqr.py Metamist script, moved here for cleaner access.
  • utils.py Contains helper methods needed by parts of the sync process, e.g. writing the SG - PID map to the bucket, diffing sequencing groups when loading a new es-index.
  • logging_config.py Contains the logging initialization in one place so it can be imported and used simply.
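
As a rough illustration of how the fetch, transform and sync pieces described above could fit together, here is a minimal sketch. It is not the code in this PR: `InMemoryFetcher`, `SketchTransformer`, `SketchSync`, `fetch_pedigree`, `transform_pedigree` and the `post` callback are hypothetical names, and the output keys (`individualId`, `sex`, `affected`) and letter codes are assumptions about what seqr accepts. Only the standard PED integer codes are taken as given.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Fetcher(Protocol):
    """Hypothetical interface that MetamistFetcher and FileFetcher could both satisfy."""

    def fetch_pedigree(self, project: str) -> list[dict]: ...


@dataclass
class InMemoryFetcher:
    """Stand-in fetcher so the sketch runs without Metamist or a file."""

    rows: list[dict]

    def fetch_pedigree(self, project: str) -> list[dict]:
        return list(self.rows)


class SketchTransformer:
    """Illustrative subset of the ped sex & affected processing."""

    # PED convention: sex 1 = male, 2 = female; any other value is unknown.
    SEX = {1: 'M', 2: 'F'}
    # PED convention: 1 = unaffected, 2 = affected; any other value is unknown.
    AFFECTED = {1: 'N', 2: 'A'}

    def transform_pedigree(self, rows: list[dict]) -> list[dict]:
        # Output keys and letter codes are assumed, not confirmed seqr fields.
        return [
            {
                'individualId': row['individual_id'],
                'sex': self.SEX.get(row.get('sex'), 'U'),
                'affected': self.AFFECTED.get(row.get('affected'), 'U'),
            }
            for row in rows
        ]


class SketchSync:
    """Wires a fetcher and a transformer to a function that posts to seqr."""

    def __init__(
        self,
        fetcher: Fetcher,
        transformer: SketchTransformer,
        post: Callable[[str, list[dict]], None],
    ):
        self.fetcher = fetcher
        self.transformer = transformer
        self.post = post

    def sync_dataset(self, project: str) -> int:
        rows = self.fetcher.fetch_pedigree(project)
        payload = self.transformer.transform_pedigree(rows)
        self.post(project, payload)
        return len(payload)


# Usage: record the "posted" payloads in a list instead of calling seqr.
posted: list[tuple[str, list[dict]]] = []
sync = SketchSync(
    InMemoryFetcher([
        {'individual_id': 'IND1', 'sex': 1, 'affected': 2},
        {'individual_id': 'IND2', 'sex': 0, 'affected': -9},
    ]),
    SketchTransformer(),
    lambda project, payload: posted.append((project, payload)),
)
count = sync.sync_dataset('demo-project')
```

Passing the fetcher and the post function in from outside is what would let a script swap Metamist for a file source, or swap the real seqr call for a dry run, without touching the transform step.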

Still TODO:

  • Formalise how and where exactly this module belongs in the Metamist repo.
    • Many of the methods added in this PR rely on GraphQL queries. However, the convention seems to be that db layers should not use the GQL API and should use the standard REST APIs instead. Is it OK to add a module that relies on GQL like this? Why / why not?
  • Integrate the existing seqr layer methods into this module - e.g. generate_seqr_auth_token, send_slack_notification
  • Implement simple and painless bulk syncing, i.e. select a number of projects & types to sync and execute all of them
  • Update existing sync UI to work with the new module

EddieLF marked this pull request as draft July 12, 2024 00:26