cath.superpose ssaps files optimization ? #76

Open
tubiana opened this issue Aug 22, 2021 · 1 comment

tubiana commented Aug 22, 2021

Dear all,

I'm facing challenging alignments (several times 1000+ structures).
Since cath.superpose checks whether the ssaps files exist, I found a way to speed up alignments by re-executing the cath.superpose command with the files in random order in the arguments (code example below, in case it is useful to someone).

But here is my question: I realised that ssaps files are computed for both orderings of each pair:

(base) thibault@XXX [XXX]/ssaps $ ls -l | grep A1A4S6 | grep B1AVH7
-rw-r--r-- 1 thibault ansatt     3080 Aug 21 10:20 A1A4S6.pdbB1AVH7.pdb.list
-rw-r--r-- 1 thibault ansatt       62 Aug 21 10:20 A1A4S6.pdbB1AVH7.pdb.scores
-rw-r--r-- 1 thibault ansatt     3080 Aug 21 15:37 B1AVH7.pdbA1A4S6.pdb.list
-rw-r--r-- 1 thibault ansatt       62 Aug 21 15:37 B1AVH7.pdbA1A4S6.pdb.scores
(base) thibault@XXX [XXX]/ssaps $ cat A1A4S6.pdbB1AVH7.pdb.scores
A1A4S6.pdb  B1AVH7.pdb  108   99  85.49   97   89   15   3.34
(base) thibault@XXX [XXX]/ssaps $ cat B1AVH7.pdbA1A4S6.pdb.scores
B1AVH7.pdb  A1A4S6.pdb   99  108  85.49   97   89   15   3.34

In some cases, I can end up with more than 10 million files in the same folder...
Is there a particular reason to generate both orderings of every pair? Maybe cath.superpose could gain in efficiency and storage if only one set of files were generated for each pair.
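To gauge how much of an ssaps folder is duplicated in this way, the mirrored pairs can be counted from the `.scores` files. This is a minimal sketch, not part of cath-tools: the function name `count_mirrored_pairs` is hypothetical, and it assumes that the first two columns of every `.scores` file are the two structure names, as in the listing above.

```shell
#!/usr/bin/env bash
# Count pairs whose .scores file exists in both orders (A-B and B-A).
# Assumption: the first two whitespace-separated columns of each .scores
# file are the two structure names, as in the listing above.
count_mirrored_pairs() {
  local dir="$1"
  # find ... -exec avoids "argument list too long" with millions of files
  find "$dir" -maxdepth 1 -name '*.scores' -exec cat {} + |
    awk 'NF >= 2 {
      a = $1 ""; b = $2 ""                 # force string comparison
      key = (a < b) ? a " " b : b " " a    # order-independent key
      seen[key]++
    }
    END {
      n = 0
      for (k in seen) if (seen[k] > 1) n++
      print n
    }'
}
```

Calling `count_mirrored_pairs ssaps` prints the number of pairs that were stored in both orders, so roughly that many `.list`/`.scores` file sets would be redundant if the alignments really are symmetric.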

Wishing you a nice day 🙂
Best regards,
Thibault.


Code example for running cath.superpose with random file order

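export CATH_TOOLS_PDB_PATH="$WORKDIR"
# Build one --pdb-infile argument per structure, in shuffled order
pdb_args=()
while IFS= read -r pdb; do
  pdb_args+=(--pdb-infile "$pdb")
done < <(printf '%s\n' "$WORKDIR"/*.pdb | sort -R)
# echo "${pdb_args[@]}"
cath-superpose --do-the-ssaps ssaps --sup-to-pdb-files-dir output "${pdb_args[@]}"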

tonyelewis (Contributor) commented

Thank you for using cath-superpose and for giving us some of your feedback - much appreciated.

I'm not 100% clear about your point about things being sped up by randomising the order of the inputs. Is the point that you're using the --do-the-ssaps option of cath-superpose and you're running several of these at the same time? So you're using the randomisation as a way to parallelise the SSAPs that generate the alignments? In which case, it sounds like it would be valuable to you if there was an option to tell --do-the-ssaps to run n SSAP jobs in parallel. Is that correct?

In general, I think you're right that this area feels like it could be improved. We did enough work in this area to start generating good multiple structural alignments and to build something usable but we think we could do much better on the current trade-off between quality and computation time and on figuring out which SSAPs don't need to be performed.

However, for the issue you're talking about, I think we've already exploited the symmetry of only needing one alignment for each pair of structures: the code only runs and uses the SSAP for each pair in one order, with whichever structure was specified first on the command line coming first. So I suspect what's happening is that your randomisation also randomises the ordering it requires for each pair.
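The effect of reshuffling can be illustrated with a small sketch. It is not cath-superpose's actual logic: the function name `required_pairs` is hypothetical, and it assumes that every pair is needed, that each pair is used with the earlier-specified structure first, and that the file prefix is simply the two names concatenated, as in the ssaps listing earlier in the thread.

```shell
#!/usr/bin/env bash
# Print the pair prefixes implied by a given command-line order.
# Assumptions (not checked against the cath-tools source): every pair is
# needed, and each pair is named earlier-specified-first.
required_pairs() {
  local first other
  while [ "$#" -gt 1 ]; do
    first="$1"
    shift
    # Pair the current structure with every structure specified after it
    for other in "$@"; do
      printf '%s%s\n' "$first" "$other"
    done
  done
}

required_pairs a.pdb b.pdb c.pdb   # one run
required_pairs c.pdb b.pdb a.pdb   # the same structures, reshuffled
```

Under those assumptions, the two calls list entirely different prefixes (`a.pdbb.pdb` versus `b.pdba.pdb`, and so on), so a reshuffled run would find none of the first run's files for those pairs and would compute them again in the opposite order.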

Does that sound right? Does this reinforce the idea that you'd benefit from an in-built way to parallelise the --do-the-ssaps?

tonyelewis self-assigned this Aug 26, 2021