
Development: Pipeline

Ian Sillitoe edited this page Feb 4, 2019 · 1 revision

CATH-FunFam Pipeline

In theory, we should be able to run the GeMMA-FunFHMMER process from start to finish via a managed data pipeline.

Ideally, this pipeline will take into account developments in processing large superfamilies (MDA partitioning, iterative clustering, etc.), integrate external data (CSA, IBIS, FunSite), and output the expected files in standard formats.

Having all of this managed by a single code base (running from a single config file) should solve many problems with consistency, documentation and "what happens if someone gets run over by a bus?".

Current plan:

  • Define each stage of the pipeline (name, dependencies, input file(s), output file(s))
  • Generate boilerplate code (see Luigi)
  • Generate mock data for each stage
  • Test each stage (based on mock data)
  • Implement code at each stage to process live data
  • Run on an example set of small superfamilies
  • Run on an example set of large superfamilies (e.g. kinases)
  • Run on all superfamilies
  • Map old FunFams to new FunFams
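
The first step above — defining each stage's name, dependencies and files — could be captured as plain data before any Luigi boilerplate is generated, with a topological sort giving a valid execution order. A minimal sketch only; the stage and file names here are hypothetical, not the pipeline's real ones:

```python
from collections import deque

# Hypothetical stage registry: each stage declares its dependencies and
# its expected input/output files (all names illustrative).
STAGES = {
    "retrieve_gene3d":   {"deps": [],                                   "outputs": ["gene3d.tsv"]},
    "retrieve_go":       {"deps": [],                                   "outputs": ["go_annotations.tsv"]},
    "partition_mda":     {"deps": ["retrieve_gene3d"],                  "outputs": ["partitions.tsv"]},
    "starting_clusters": {"deps": ["partition_mda", "retrieve_go"],     "outputs": ["starting_clusters.fa"]},
    "gemma":             {"deps": ["starting_clusters"],                "outputs": ["tree.nwk"]},
    "funfhmmer":         {"deps": ["gemma"],                            "outputs": ["funfams.tsv"]},
}

def run_order(stages):
    """Return a valid execution order via topological sort (Kahn's algorithm)."""
    indegree = {name: len(s["deps"]) for name, s in stages.items()}
    dependents = {name: [] for name in stages}
    for name, s in stages.items():
        for dep in s["deps"]:
            dependents[dep].append(name)
    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in dependents[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(stages):
        raise ValueError("cycle in stage dependencies")
    return order
```

Keeping the registry as data means the same definitions can drive boilerplate generation (e.g. one Luigi `Task` per entry) and mock-data tests for each stage.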

Flow stages:

  1. Retrieve raw input data (Gene3D, UniProtKB-GO API)
    • Gene3D: uniprot_acc, domain_id, mda, sequence
    • UniProt API: uniprot_acc, go_acc, go_branch, go_evidence
  2. Partition sequences:
    • (If large superfamily?) split into initial rough clusters via unique MDA
  3. For each partition:
    • Form non-redundant clusters (CD-HIT at 90% sequence identity)
    • Generate starting clusters (filter out clusters without GO annotations)
    • Run GeMMA (build tree)
    • Run FunFHMMER (cut tree)
  4. All non-clustered sequences become new partitions (repeat step 3 until no new clusters are formed)
  5. Create output FunFam files
    • Alignments (representatives chosen by 'importance', or by sequence IDs, cf. Pfam)
    • Trees (for each alignment)
    • GO Summary (per superfamily, per funfam)
    • MDA Summary (per superfamily, per funfam)
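
The partition/cluster/repeat loop in stages 2–4 could be sketched as below. This is a stand-in only: `cluster_partition` fakes the real CD-HIT → GeMMA → FunFHMMER step by "clustering" any sequence that carries a GO annotation, and all function and field names are hypothetical:

```python
from collections import defaultdict

def rough_partitions(sequences):
    """Stage 2: group sequences into rough partitions by unique MDA."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[seq["mda"]].append(seq)
    return list(groups.values())

def cluster_partition(partition):
    """Stage 3 placeholder: the real pipeline would run CD-HIT, filter for
    GO annotations, then GeMMA + FunFHMMER. Here, any GO-annotated
    sequence counts as clustered."""
    clustered = [s for s in partition if s.get("go_terms")]
    unclustered = [s for s in partition if not s.get("go_terms")]
    return clustered, unclustered

def run_pipeline(sequences):
    """Stages 2-4: re-partition leftovers until a pass yields no new clusters."""
    funfams = []
    remaining = sequences
    leftovers = []
    while True:
        leftovers = []
        progressed = False
        for part in rough_partitions(remaining):
            clustered, unclustered = cluster_partition(part)
            if clustered:
                funfams.append(clustered)
                progressed = True
            leftovers.extend(unclustered)
        if not progressed or not leftovers:
            break  # stage 4 stopping condition: no new clusters formed
        remaining = leftovers
    return funfams, leftovers
```

The `while` loop makes the stage-4 termination condition explicit: each pass either produces at least one new cluster or the loop ends, so the pipeline cannot spin forever on un-clusterable sequences.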