
Development: Pipeline

Ian Sillitoe edited this page Feb 4, 2019 · 1 revision

CATH-FunFam Pipeline

In theory, we should be able to run the GeMMA-FunFHMMER process from start to finish via a managed data pipeline.

Ideally, this pipeline will take into account developments in processing large superfamilies (MDA partitioning, iterative clustering, etc.), integrate external data (CSA, IBIS, FunSite), and output the expected files in standard formats.

Having all of this managed by a single code base (running from a single config file) should solve many problems with consistency, documentation and "what happens if someone gets run over by a bus?".

Current plan:

  • Define each stage of the pipeline (name, dependencies, input file(s), output file(s))
  • Generate boilerplate code (see Luigi)
  • Generate mock data for each stage
  • Test each stage (based on mock data)
  • Implement code at each stage to process live data
  • Run on an example set of small superfamilies
  • Run on an example set of large superfamilies (e.g. kinases)
  • Run on all superfamilies
  • Map old FunFams to new FunFams
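
The first step above — defining each stage's name, dependencies and files — could be captured as plain data before any Luigi boilerplate is generated, with a topological sort giving a valid execution order. A minimal sketch only; the stage and file names here are hypothetical, not the pipeline's real ones:

```python
from collections import deque

# Hypothetical stage registry: each stage declares its dependencies and
# its expected input/output files (all names illustrative).
STAGES = {
    "retrieve_gene3d":   {"deps": [],                                   "outputs": ["gene3d.tsv"]},
    "retrieve_go":       {"deps": [],                                   "outputs": ["go_annotations.tsv"]},
    "partition_mda":     {"deps": ["retrieve_gene3d"],                  "outputs": ["partitions.tsv"]},
    "starting_clusters": {"deps": ["partition_mda", "retrieve_go"],     "outputs": ["starting_clusters.fa"]},
    "gemma":             {"deps": ["starting_clusters"],                "outputs": ["tree.nwk"]},
    "funfhmmer":         {"deps": ["gemma"],                            "outputs": ["funfams.tsv"]},
}

def run_order(stages):
    """Return a valid execution order via topological sort (Kahn's algorithm)."""
    indegree = {name: len(s["deps"]) for name, s in stages.items()}
    dependents = {name: [] for name in stages}
    for name, s in stages.items():
        for dep in s["deps"]:
            dependents[dep].append(name)
    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in dependents[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(stages):
        raise ValueError("cycle in stage dependencies")
    return order
```

Keeping the registry as data means the same definitions can drive boilerplate generation (e.g. one Luigi `Task` per entry) and mock-data tests for each stage.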

Flow stages:

  1. Retrieve raw input data (Gene3D, UniProtKB-GO API)
    • Gene3D: uniprot_acc, domain_id, mda, sequence
    • UniProt API: uniprot_acc, go_acc, go_branch, go_evidence
  2. Partition sequences:
    • (If large superfamily?) split into initial rough clusters via unique MDA
  3. For each partition:
    • Form non-redundant clusters (CD-HIT at 90% sequence identity)
    • Generate starting clusters (filter out clusters without GO annotations)
    • Run GeMMA (build tree)
    • Run FunFHMMER (cut tree)
  4. All non-clustered sequences become new partitions (repeat step 3 until no new clusters are formed)
  5. Create output FunFam files
    • Alignments (representatives chosen by 'importance', or by sequence IDs, cf. Pfam)
    • Trees (for each alignment)
    • GO Summary (per superfamily, per funfam)
    • MDA Summary (per superfamily, per funfam)
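
The partition/cluster/repeat loop in stages 2–4 could be sketched as below. This is a stand-in only: `cluster_partition` fakes the real CD-HIT → GeMMA → FunFHMMER step by "clustering" any sequence that carries a GO annotation, and all function and field names are hypothetical:

```python
from collections import defaultdict

def rough_partitions(sequences):
    """Stage 2: group sequences into rough partitions by unique MDA."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[seq["mda"]].append(seq)
    return list(groups.values())

def cluster_partition(partition):
    """Stage 3 placeholder: the real pipeline would run CD-HIT, filter for
    GO annotations, then GeMMA + FunFHMMER. Here, any GO-annotated
    sequence counts as clustered."""
    clustered = [s for s in partition if s.get("go_terms")]
    unclustered = [s for s in partition if not s.get("go_terms")]
    return clustered, unclustered

def run_pipeline(sequences):
    """Stages 2-4: re-partition leftovers until a pass yields no new clusters."""
    funfams = []
    remaining = sequences
    leftovers = []
    while True:
        leftovers = []
        progressed = False
        for part in rough_partitions(remaining):
            clustered, unclustered = cluster_partition(part)
            if clustered:
                funfams.append(clustered)
                progressed = True
            leftovers.extend(unclustered)
        if not progressed or not leftovers:
            break  # stage 4 stopping condition: no new clusters formed
        remaining = leftovers
    return funfams, leftovers
```

The `while` loop makes the stage-4 termination condition explicit: each pass either produces at least one new cluster or the loop ends, so the pipeline cannot spin forever on un-clusterable sequences.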