Trace Recalling is a novel method for deconvoluting double traces that result from simultaneously sequencing two DNA templates. Trace Recalling identifies up to two bases at each position of such a trace.

BrentLab/trace_recalling

TraceRecalling

Install

Before you run trace recalling you will need the basecaller phred installed on your system. It is freely available to academic users; see the website below to find out how to get and install phred on your system:

http://www.phrap.org/consed/consed.html#howToGet

Once that is done and you can run phred from the command line, you can install trace recalling. (You can actually install trace recalling without installing phred first, but it won't work when you try to run it.)

You will also need perl installed on your system, which shouldn't be a problem on most unix systems. If you don't have perl, go to www.cpan.org.
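A quick way to confirm both prerequisites before installing is to check that they are on your PATH. The sketch below is not part of trace recalling; the function name `missing_prerequisites` is hypothetical, and it assumes the executables are called `phred` and `perl`.

```python
import shutil


def missing_prerequisites(programs=("phred", "perl")):
    """Return the names of the given programs that are not found on PATH."""
    return [name for name in programs if shutil.which(name) is None]


missing = missing_prerequisites()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("phred and perl are both available.")
```

An empty result only shows that the programs can be found; it does not check that phred itself is configured correctly.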

I'm releasing trace recalling as an rpm, so installing it should be as easy as running the following as root:

rpm -Uvh trace_recalling-0.5.rpm

This will create a file called trace_recalling.pl in your /usr/bin directory and a trace_recalling subdirectory with some files needed for trace recalling.

Hopefully all the perl modules you will need are included, but if not, e-mail me.

Running

You can run trace recalling with no arguments to get the following usage statement:

trace_recalling.pl [--mode=single,<numeric threshold>] <trace file> <genomic sequence file>

<trace file> is the fully qualified path to a .scf formatted trace file

<genomic sequence file> is the fully qualified path to a sequence file in fasta format, used as the reference genomic sequence

<numeric threshold> is the peak area ratio threshold described in the paper. It is a cutoff on the area of the secondary peak relative to the primary peak. If the ratio of the two areas (area_secondary_peak / area_primary_peak) is less than the threshold, the secondary peak will be ignored. This filters out secondary peaks which are probably noise. 0.1 is probably a good value to use for this but you can experiment.
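The threshold rule can be written out as a small check. This is only an illustration of the rule as stated here, not the code trace recalling runs; the function name `keep_secondary_peak` and the example areas are made up.

```python
def keep_secondary_peak(primary_area, secondary_area, threshold=0.1):
    """Return True if the secondary peak passes the area ratio threshold.

    The peak is ignored (False) when
    secondary_area / primary_area is less than the threshold.
    """
    if primary_area <= 0:
        raise ValueError("primary_area must be positive")
    return (secondary_area / primary_area) >= threshold


# Made-up peak areas, purely for illustration.
print(keep_secondary_peak(1000.0, 50.0))   # ratio 0.05, below 0.1: ignored
print(keep_secondary_peak(1000.0, 400.0))  # ratio 0.40, above 0.1: kept
```

Raising the threshold discards more secondary peaks, and lowering it keeps more of them, noise included.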

In practice you should always run with the --mode flag set. Running without the --mode flag will run the 20-threshold analysis presented in the trace recalling paper, but this will not be useful for most applications. So a typical run of trace recalling will look something like:

trace_recalling.pl --mode=single,0.1 /home/tenney/foo.scf /home/tenney/bar.fa

This will create a new directory /home/tenney/foo.scf_dir in which 5 output files will be stored.
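If you want to pick up the results from a script, the naming convention in this example can be sketched as follows. The helper `expected_outputs` is hypothetical; it simply appends `_dir` to the trace file path and uses the five extensions described in the file list in this section.

```python
from pathlib import PurePosixPath

# Extensions of the five output files, taken from the file list in this README.
OUTPUT_EXTENSIONS = ("poly", "ambig", "first_align", "recall", "second_align")


def expected_outputs(trace_file):
    """Return the output paths expected for a given .scf trace file."""
    trace = PurePosixPath(trace_file)
    # The output directory sits next to the trace file: <trace file>_dir
    out_dir = trace.with_name(trace.name + "_dir")
    return [out_dir / f"{trace.name}.{ext}" for ext in OUTPUT_EXTENSIONS]


for path in expected_outputs("/home/tenney/foo.scf"):
    print(path)
```

This only builds the paths; it does not check that trace recalling actually produced the files.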

For the example command above, these would be:

  • foo.scf.poly:

This is the poly file containing primary and secondary peak information generated by phred. It is used by trace recalling. This file can usually be ignored.

  • foo.scf.ambig:

This is the ambiguity sequence file in fasta format. It contains information about primary and secondary peaks collapsed into a single sequence.

  • foo.scf.first_align:

This is the primary alignment file, the result of aligning the ambiguity sequence to the reference genomic sequence.

  • foo.scf.recall:

This is the recalled sequence file. It is the result of "subtracting" the reference genomic sequence from the aligned ambiguity sequence.

There is a reason some of the bases are uppercase and others are lowercase in the recalled sequence. Uppercase characters are bases that were single peaks in the trace and were recalled with the same base as in the ambiguity sequence. Lowercase characters are bases that were secondary peaks in the trace and would not have been called by phred running in default mode. So if you see long runs of lowercase letters here, that sequence came from a less abundant template.

  • foo.scf.second_align:

This is the result of realigning the recalled sequence to the reference genomic sequence. Unless you expect your recalled sequence to align near your primary sequence (in an alternate splice finding application for example) this may not be of much use.

The files you probably want to look at are the first and second align files and the recalled sequence.
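Since long runs of lowercase letters in the recalled sequence point to the less abundant template, it can help to locate them programmatically. The sketch below works on a plain sequence string; the function name `lowercase_runs` and the `min_length` cutoff are assumptions for illustration, and the example sequence is made up.

```python
import re


def lowercase_runs(sequence, min_length=10):
    """Find runs of lowercase bases at least min_length long.

    Returns a list of (start, end, bases) tuples, with 0-based start
    and exclusive end positions in the sequence string.
    """
    pattern = re.compile(r"[a-z]{%d,}" % min_length)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(sequence)]


# Made-up sequence, purely for illustration.
example = "ACGTacgtacgtaaGGcc"
print(lowercase_runs(example, min_length=5))
```

Pick `min_length` to suit your data; a short cutoff will also report isolated secondary-peak calls.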

If you are brave or foolhardy enough to run WITHOUT the --mode=single flag set, you will get the 20-threshold analysis performed in the paper. This will generate a LOT of output in your result directory. The only file you will probably want to look at in that mess has a .result_log extension. There is a line in that file that starts with "module_report:" which indicates which type of alternate splice (if any) was detected. The types of alternate splices reported here are the same as the ones discussed in the paper. The method for deriving these is also described in the paper (the regular expression matching part of the methods). If there is enough interest I will write a more detailed description of this part of the code and release a new readme file. Otherwise take a look at the paper. If it still does not make sense, please raise an issue.
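To pull the relevant line out of a .result_log file without reading the whole thing, something like the sketch below should do. The function name `module_reports` is hypothetical, and nothing is assumed about what follows the "module_report:" prefix; the text after it is returned as-is.

```python
def module_reports(lines):
    """Return the text after 'module_report:' for each line that starts with it."""
    prefix = "module_report:"
    return [line[len(prefix):].strip() for line in lines if line.startswith(prefix)]


# Placeholder lines, not real trace recalling output.
sample = [
    "some other log line\n",
    "module_report: placeholder text\n",
]
print(module_reports(sample))
```

To use it on a real file, pass an open file handle, for example `with open(path_to_result_log) as handle: print(module_reports(handle))`, where `path_to_result_log` is the .result_log file in your result directory.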
