Tony Notes: Home

Tony E Lewis edited this page Mar 15, 2017 · 6 revisions

General

  • Are there simple tools for visualising the trees?
  • Attempt to make steps succeed/fail atomically (i.e. write each file to a temporary filename and rename it on success)
  • Make some attempt to avoid potential problems with concurrent attempts to do the same thing.
  • Include mechanism to start HPC work on biggest superfamilies first so that the longest running S/F is started ASAP (and so that the small jobs will neatly fill in idle slots towards the end)
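
A minimal Python sketch of two of the points above: writing a file under a temporary name and renaming it on success, and ordering superfamilies biggest-first before submission. The function names (`write_atomically`, `largest_first`) and the idea of a name-to-size dictionary are assumptions made for illustration, not part of the existing code.

```python
import os
import tempfile


def write_atomically(dest_path, data):
    """Write text to dest_path so the file appears complete or not at all."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temporary file in the destination directory so that the
    # final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp_")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        # os.replace overwrites any existing file at dest_path.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # On failure, remove the partial temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def largest_first(sizes_by_superfamily):
    """Return superfamily names ordered from biggest to smallest."""
    return sorted(sizes_by_superfamily, key=sizes_by_superfamily.get, reverse=True)
```

Because a file only ever appears under its final name once complete, a later step can treat the presence of the file as evidence that the earlier step succeeded.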

Generating data

  • As soon as benchmark superfamilies are agreed, submit the existing code on those superfamilies to the CS cluster.
  • Create code tools for submitting/checking/gathering/locally-performing lists of GeMMA computation tasks, independently of their provenance. Perhaps support freezing/thawing these lists.
  • Generate all alignments/profiles twice, once for each starting-cluster ordering (depth-first tree ordering because that matches previous results; numeric ordering because that allows equivalent groups' data to be reused, no matter the provenance)
  • Make all non-starting clusters' filenames be identified only by a hash of the starting clusters (with the ordering preserved), without reference to the name of the new node in the current set-up.
  • Immediately generate alignments/profiles for all nodes in all trees in Dave's directories and then calculate all the evalues for all those trees' pairwise merges. Then use these to make fixed up versions of the same trees.
  • For now, keep alignment files, because we may want to re-use alignments when merging clusters so as to not have to re-align the entire cluster from scratch.

Symmetry

Think carefully about how to handle symmetry. It would be good to avoid doubling the run-time by doing every comparison twice but don't want to over-complicate or miss results. Likely to be worth making symmetry-handling switchable so that it's easy to test results aren't being missed.

The design needs to handle the initial scan procedures and then the batch update procedures.
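
A small Python sketch of what switchable symmetry-handling could look like when generating the list of comparisons. The function name and the `exploit_symmetry` flag are hypothetical; the point is only that one switch selects between each unordered pair once and every ordered pair.

```python
from itertools import combinations, permutations


def comparison_pairs(node_ids, exploit_symmetry=True):
    """Return the (query, match) pairs to compare.

    With exploit_symmetry=True each unordered pair appears once;
    with exploit_symmetry=False both directions are listed, which
    doubles the work but lets the two modes be checked against each other.
    """
    if exploit_symmetry:
        return list(combinations(node_ids, 2))
    return list(permutations(node_ids, 2))
```

For three nodes this gives 3 comparisons with the switch on and 6 with it off, so running both modes on a small superfamily is a cheap way to test that no results are being missed.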

Batch-handling

The proposed new scheme for batch handling is a bit more complicated than initially described in the main doc. In principle, all ASAP merges (except the first) are putative until all the results are in. For example, the first merged cluster might turn out to score better against one of the mergees in any given pair than that pair's other mergee does, in which case the pair won't be merged.

For this reason, when scanning a batch, all the newly formed nodes should be scanned against not only the "after" cluster but also the two "before" clusters.
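
One possible reading of this, sketched in Python: for each newly formed node, the scan list holds the untouched clusters plus, for every other putative merge in the batch, both its merged ("after") node and its two mergees ("before"). The function name and the triple layout are assumptions for illustration only.

```python
def batch_scan_targets(merges, untouched):
    """Build the scan list for each new node in a batch.

    merges:    list of (left, right, merged) ID triples for putative merges
    untouched: IDs of clusters not involved in any merge in this batch
    """
    targets = {}
    for left, right, merged in merges:
        scan_list = list(untouched)
        for other_left, other_right, other_merged in merges:
            if other_merged == merged:
                continue  # do not scan a new node against its own merge
            # Include the "after" node and both "before" nodes, so results
            # exist whether or not the other merge ends up being committed.
            scan_list += [other_merged, other_left, other_right]
        targets[merged] = scan_list
    return targets
```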

Data-loading

Determine exactly what needs to be loaded into memory for each pass.

Compute Cluster Issues

Q. How to handle having to build all the alignments and profiles before any of the new batch's scans can be performed?

A. Use -hold_jid (or -hold_jid_ad for pairwise dependency between job-arrays — probably not relevant here).
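
A hedged Python sketch of how the dependency might be expressed when building the submission commands. It only assembles `qsub` argument lists (using `-N` to name a job and `-hold_jid` with a comma-separated list of job names); the script and job names are hypothetical and nothing is actually submitted.

```python
def qsub_command(script, job_name, hold_on=()):
    """Return a qsub argument list for script, optionally held on other jobs."""
    cmd = ["qsub", "-N", job_name]
    if hold_on:
        # The job stays queued until every job named here has finished.
        cmd += ["-hold_jid", ",".join(hold_on)]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Hypothetical batch: build alignments and profiles first, then scan.
    print(qsub_command("build_alignments.sh", "aln_batch1"))
    print(qsub_command("build_profiles.sh", "prof_batch1", hold_on=["aln_batch1"]))
    print(qsub_command("scan_batch.sh", "scan_batch1", hold_on=["prof_batch1"]))
```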

Files

For filenames:

  • when referring to a starting-cluster, use the starting-cluster's ID;
  • when referring to a higher-level node, use an MD5 of the member starting-cluster IDs (in the correct order). Create a term to describe this sort of ID and use the term consistently (including here).
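
A minimal Python sketch of this naming rule. The function name `node_id` and the newline separator used when joining IDs are assumptions; any separator that cannot occur inside an ID would do.

```python
import hashlib


def node_id(starting_cluster_ids):
    """Return the ID to use in filenames for a node.

    A starting cluster keeps its own ID; a higher-level node gets the MD5
    of its member starting-cluster IDs, with the given order preserved.
    """
    ids = [str(i) for i in starting_cluster_ids]
    if not ids:
        raise ValueError("a node must contain at least one starting cluster")
    if len(ids) == 1:
        return ids[0]
    # Join with a separator so that ["12", "3"] and ["1", "23"] differ.
    joined = "\n".join(ids)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

Because the order is preserved, the same members in a different order give a different ID, which keeps data from the two starting-cluster orderings separate.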

For each node, store an alignment, profile and a human-readable summary .txt file (describing the membership, the commands and the executables' versions). Use an ID as above.

For each scan, store the results in a file whose name contains two IDs: one for the query node (as above) and an MD5 of the sorted IDs (as above) of the nodes being scanned against. May as well always sort that list. No need for a .txt file describing the scanlist; the results file does that job itself.

This allows data to be re-used (only) where that can be done cleanly.
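
The scan-results naming could be sketched in Python as follows. The function names, the `.scan` suffix, the dot between the two IDs and the plain string sort are all assumptions for illustration.

```python
import hashlib


def md5_of_ids(ids):
    """MD5 hex digest of a list of IDs, joined with newlines."""
    return hashlib.md5("\n".join(ids).encode("utf-8")).hexdigest()


def scan_results_filename(query_node_id, scanned_ids, suffix=".scan"):
    """Name a scan-results file from the query ID and the scanned-against IDs."""
    # Always sort the scan list so the same set of nodes gives the same name.
    scanlist_id = md5_of_ids(sorted(str(i) for i in scanned_ids))
    return f"{query_node_id}.{scanlist_id}{suffix}"
```

With this scheme, an existing results file can be re-used exactly when both the query node and the full scan list match.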

Steps

  • How to write the partial progress? Will need list of merge-ops that have been committed and possibly a separate list of merge-ops that have been (or are being) computed. Or does that latter list add nothing over the checking for the presence of files that'll have to be done anyway?

Code architecture

  • Trawl have_a_play_around.pl for the sorts of data that will be required and design it in from the start.