COPYRIGHT:
==========
Copyright 2012 Lars Steinbrueck under the GPL.
See file LICENSE.txt for details.
The software includes other software written by third
parties. This has been distributed according to the
licenses provided by the respective authors (see below).
JAMA (http://math.nist.gov/javanumerics/jama/):
This software is a cooperative product of The MathWorks
and the National Institute of Standards and Technology (NIST)
which has been released to the public domain. Neither The
MathWorks nor NIST assumes any responsibility whatsoever for
its use by other parties, and makes no guarantees, expressed
or implied, about its quality, reliability, or any other
characteristic.
BVLS (http://people.sc.fsu.edu/~jburkardt/f_src/bvls/bvls.html):
Charles Lawson, Richard Hanson,
Solving Least Squares Problems,
Revised edition,
SIAM, 1995,
ISBN: 0898713560,
LC: QA275.L38.
GENERAL USAGE:
==============
The software is written in Java and, depending on the version,
includes additional C and Fortran code. The software is distributed
either as (i) a jar file or as (ii) Java, C and Fortran source code.
The first version solves the non-negative least-squares (NNLS)
problem using an implementation in Java. This version is already
compiled and can be used out of the box. The second version solves
the NNLS problem using the Fortran implementation by C. Lawson and
R. Hanson. We strongly recommend the second version,
as this implementation is orders of magnitude faster than the Java
implementation.
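For readers unfamiliar with NNLS, the following is a minimal sketch of the problem itself: find x >= 0 that minimizes ||A x - b||^2. It uses plain cyclic coordinate descent, which is NOT the Lawson and Hanson active-set algorithm shipped with this software; the class and method names are hypothetical and chosen for illustration only.

```java
import java.util.Arrays;

final class NnlsSketch {

    /**
     * Minimizes ||A x - b||^2 subject to x >= 0 by cyclic coordinate descent.
     * Illustrative only; not the Lawson-Hanson method used by the software.
     */
    static double[] solve(double[][] a, double[] b, int sweeps) {
        int rows = a.length;
        int cols = a[0].length;
        double[] x = new double[cols];
        // residual = b - A x; x starts at zero, so the residual starts at b.
        double[] residual = Arrays.copyOf(b, rows);

        for (int sweep = 0; sweep < sweeps; sweep++) {
            for (int j = 0; j < cols; j++) {
                double colNormSq = 0.0;
                double dot = 0.0;
                for (int i = 0; i < rows; i++) {
                    colNormSq += a[i][j] * a[i][j];
                    dot += a[i][j] * residual[i];
                }
                if (colNormSq == 0.0) {
                    continue; // an all-zero column cannot change the fit
                }
                // Unconstrained optimum for coordinate j, clipped at zero.
                double updated = Math.max(0.0, x[j] + dot / colNormSq);
                double delta = updated - x[j];
                for (int i = 0; i < rows; i++) {
                    residual[i] -= a[i][j] * delta;
                }
                x[j] = updated;
            }
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] a = {{1.0, 0.0}, {0.0, 1.0}};
        double[] b = {1.0, -1.0};
        // Ordinary least squares would give x = (1, -1); NNLS clips the second entry.
        System.out.println(Arrays.toString(solve(a, b, 100)));
    }
}
```

In this toy case the unconstrained solution has a negative component, so the non-negativity constraint forces that component to zero.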
INSTALLATION:
-------------
Version (i) needs no installation and can be used directly. To compile
version (ii), adjust the file 'makefile' such that the necessary
header files 'jni.h' and 'jni_md.h' can be included. After this adjustment,
simply type 'make' and the source code will be compiled into the
'bin/' directory.
RUN THE PROGRAM:
----------------
Version (i): java -jar AntigenicTreeTools.jar [options]
Version (ii): java -cp [path to software folder]/bin/:[path to software folder]/jar/Jama.jar -Djava.library.path=[path to software folder]/bin/ phyloDriver [options]
Options:
........
-h will print the help message below.
Use the following options: (options indicated with [] are optional)
TREE INPUT
[-f strategy -- infer intermediate sequences (AccTran/DelTran)]
[-g -- count gaps as changes (when ancestral states are reconstructed)]
[-i file -- input file with intermediate sequences in fasta format]
[-l file -- file with node linkage]
[-m file -- file with leaf node mapping]
-n file -- file with tree in newick format
-o name -- output name
[-p -- given tree is in phylip format (default is nexus)]
-t file -- input file with leaf sequences in fasta format
TREE MANIPULATION
[-col -- permit branch collapsing]
[-not list -- comma separated list of nodes to be pruned]
[-r name -- reroot tree at leaf 'name']
NNLS FIT
-ls file -- input matrix for least squares fit
[-d -- HI input matrix contains already log2 normalized distance values]
[-loo -- perform leave-one-out cross-validation (loo) for fit]
[-cv "x,y" -- do x-fold cross validation y times for fit?]
Output:
.......
The program will output three files:
[output name].leastSquares.distance The squared training and testing (if applied) error for
each element (HI titer / distance) and the total squared
and absolute error. Each line is composed of
distance label [tab] true value [tab] predicted value (training)
[tab] squared error (training) [tab] predicted value (testing,
if applied) [tab] squared error (testing, if applied)
[output name].leastSquares.mutationImpact Individual weights of each branch. Each line is composed of
branch ID [tab] weight [tab] mapped mutations
Positive branch IDs refer to up-weights, whereas negative
branch IDs refer to down-weights. 'NaN' indicates that no
weight could be inferred for that branch (e.g. if
no antiserum is present in the subtree, such that the
down-weight is not defined).
[output name].leastSquares.withMuts.tre Antigenic tree in nexus format with mutations and antigenic
weights mapped to each branch. Branch lengths are set to the
maximum of the respective up- or down-weight. The tree can be
easily viewed using FigTree (http://tree.bio.ed.ac.uk/software/figtree/).
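For downstream processing, a line of the '.leastSquares.mutationImpact' file can be read with a few lines of Java. The sketch below is not part of the software. It assumes that branch IDs are integers and treats the mapped mutations as an opaque string, since their exact notation is not specified here; the class name and the sample mutation 'K156E' are hypothetical.

```java
final class MutationImpactLine {
    final int branchId;     // sign encodes up-weight (positive) or down-weight (negative)
    final double weight;    // Double.NaN when no weight could be inferred
    final String mutations; // kept verbatim; the notation is not interpreted here

    private MutationImpactLine(int branchId, double weight, String mutations) {
        this.branchId = branchId;
        this.weight = weight;
        this.mutations = mutations;
    }

    boolean isUpWeight() {
        return branchId > 0;
    }

    boolean hasWeight() {
        return !Double.isNaN(weight);
    }

    /** Parses one line: branch ID [tab] weight [tab] mapped mutations. */
    static MutationImpactLine parse(String line) {
        // Limit -1 keeps a trailing empty mutations field.
        String[] fields = line.split("\t", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least two tab-separated fields: " + line);
        }
        int id = Integer.parseInt(fields[0].trim());
        // Double.parseDouble accepts the literal "NaN".
        double w = Double.parseDouble(fields[1].trim());
        String muts = fields.length > 2 ? fields[2].trim() : "";
        return new MutationImpactLine(id, w, muts);
    }

    public static void main(String[] args) {
        MutationImpactLine up = parse("12\t0.75\tK156E");
        MutationImpactLine down = parse("-12\tNaN\t");
        System.out.println(up.branchId + " up=" + up.isUpWeight() + " weight=" + up.weight);
        System.out.println(down.branchId + " up=" + down.isUpWeight() + " hasWeight=" + down.hasWeight());
    }
}
```

Checking hasWeight() before using a weight avoids propagating 'NaN' entries into later sums or plots.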
Options in detail:
..................
-col Collapse branches that are shorter than 1e-7. Results in multifurcating trees.
-cv Perform an x-fold cross-validation for the specified data. Parameter passed as 'x,n': x-fold
cross-validation independently repeated n times. Folds are built randomly.
-d Input matrix for least-squares optimization already contains log2 transformed distances.
-f Strategy for ancestral character state reconstruction. Choose AccTran (accelerated transformation, default)
or DelTran (delayed transformation).
-g Count gaps as changes during ancestral character state reconstruction. If not specified, gaps will be
treated as missing.
-i Alignment file containing the sequences of intermediate nodes in fasta format.
-l Linkage file to map ancestral sequences to intermediate nodes. The file is composed of pairs of nodes
following the scheme: from [tab] to [line break]. The ordering of links is defined by the newick tree
(parsing the newick string from left to right).
-ls Either the HI titer matrix or already log2 transformed distance values. In the second case the option
'-d' has to be used, too. For the HI titer matrix the titers between antigen i and antiserum j
will be transformed into log2 distances: d(i,j) = log2(max(H(j))) - log2(H(i,j)). The general input
format follows this specification:
First row: Sera names, tab separated, starting with a tab ([tab] name 1 [tab] name 2 [tab] ...)
Second row: Reference values for normalization (REF [tab] value for serum 1 [tab] value
for serum 2 [tab] ...). If log2 distances are provided set these values to 0.0.
Remaining rows: Input values (antigen name [tab] value for serum 1 [tab] value for serum 2 [tab] ...)
If a value for a specific serum is not present use '*' (in case of HI titers)
or 'NaN' (in case of log2 transformed distances).
-loo Perform leave-one-out cross-validation for the specified data. For each element of the input matrix
train a model (antigenic tree) using all other elements and predict the distance for the left out
element.
-m Mapping of additional information to leaf nodes. This file is adapted to the needs of influenza virus
strains and allows additional information to be passed to the program. Each line has to follow this scheme:
Node ID [tab] accession [tab] strain name [tab] serotype [tab] year of isolation [tab] host [tab] whole
identifier string [tab] exact date of isolation
In the current version of the program only columns one and three are used. The remaining information can
be skipped (left blank). If this option is specified, the strain names will be output at the leaves of
the tree rather than the node identifiers.
-n The newick tree either in nexus format or in phylip format. If the tree is supplied in phylip format
you have to specify the option '-p', too.
-not Remove the specified leaf nodes from the tree. IDs should be passed comma separated (ID1,ID2,...)
-o Output prefix used for output files.
-p Input tree is in phylip format. If not specified, the input tree is assumed to be in nexus format.
-r Reroot the tree at the specified leaf node.
-t Alignment file for leaf sequences in fasta format.
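The HI titer transformation given for '-ls' above, d(i,j) = log2(max(H(j))) - log2(H(i,j)), can be sketched as follows. This is an illustration, not the program's code. It assumes that max(H(j)) is the largest titer observed for antiserum j; the program may instead take this value from the REF row of the input matrix. Missing titers ('*') are represented as NaN, and all names are hypothetical.

```java
final class HiDistanceSketch {

    static double log2(double value) {
        return Math.log(value) / Math.log(2.0);
    }

    /** Maps the '*' placeholder for a missing HI titer to NaN. */
    static double parseTiter(String field) {
        String trimmed = field.trim();
        return trimmed.equals("*") ? Double.NaN : Double.parseDouble(trimmed);
    }

    /**
     * titers[i][j] is the HI titer of antigen i against antiserum j.
     * Returns d[i][j] = log2(max titer of serum j) - log2(titers[i][j]).
     * Missing titers (NaN) yield missing distances (NaN).
     */
    static double[][] toLog2Distances(double[][] titers) {
        int antigens = titers.length;
        int sera = titers[0].length;
        double[][] distances = new double[antigens][sera];
        for (int j = 0; j < sera; j++) {
            // Assumption: normalize by the column maximum, ignoring missing values.
            double max = Double.NaN;
            for (int i = 0; i < antigens; i++) {
                double h = titers[i][j];
                if (!Double.isNaN(h) && (Double.isNaN(max) || h > max)) {
                    max = h;
                }
            }
            for (int i = 0; i < antigens; i++) {
                distances[i][j] = log2(max) - log2(titers[i][j]);
            }
        }
        return distances;
    }

    public static void main(String[] args) {
        double[][] titers = {
            {parseTiter("1280"), parseTiter("40")},
            {parseTiter("160"), parseTiter("*")}
        };
        double[][] d = toLog2Distances(titers);
        for (double[] row : d) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

With these made-up titers, an eightfold drop from 1280 to 160 corresponds to a distance of about 3 log2 units, and the missing entry stays missing.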
Ancestral character state reconstruction:
.........................................
For the sake of simplicity we implemented a basic parsimony approach [1] for ancestral character state
reconstruction. Ties are resolved using either accelerated transformation ('-f AccTran') or delayed transformation
('-f DelTran'). However, the output of other ancestral character state reconstruction techniques can be used, too.
In this case do not specify the '-f' option. Instead provide the sequences of intermediate nodes ('-i') and
a linkage file ('-l') to specify where the sequences map in the tree.
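To illustrate the parsimony idea of [1], the sketch below performs the bottom-up pass of Fitch's algorithm for a single alignment column on a binary tree. It is a simplified illustration and not the code used by this software: the top-down pass that resolves ties (AccTran or DelTran) is omitted, and the class, the tree and the residues are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class FitchSketch {

    static final class Node {
        final Node left;
        final Node right;
        Set<Character> states; // candidate states for this node at one site

        /** Leaf carrying the observed character. */
        Node(char observed) {
            this.left = null;
            this.right = null;
            this.states = new HashSet<>();
            this.states.add(observed);
        }

        /** Internal node with two children. */
        Node(Node left, Node right) {
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    /**
     * Post-order pass: an internal node receives the intersection of its
     * children's state sets, or their union (plus one change) if the
     * intersection is empty. Returns the number of changes in the subtree.
     */
    static int bottomUp(Node node) {
        if (node.isLeaf()) {
            return 0;
        }
        int changes = bottomUp(node.left) + bottomUp(node.right);
        Set<Character> candidate = new HashSet<>(node.left.states);
        candidate.retainAll(node.right.states);
        if (candidate.isEmpty()) {
            candidate.addAll(node.left.states);
            candidate.addAll(node.right.states);
            changes++;
        }
        node.states = candidate;
        return changes;
    }

    public static void main(String[] args) {
        // Tree ((K,K),(E,K)) for one hypothetical amino acid site.
        Node root = new Node(
                new Node(new Node('K'), new Node('K')),
                new Node(new Node('E'), new Node('K')));
        int changes = bottomUp(root);
        System.out.println(changes + " change(s), root states " + root.states);
    }
}
```

Nodes whose state set still contains more than one character after this pass are exactly the ties that a strategy such as AccTran or DelTran has to resolve.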
Examples (called from within the software directory):
....................................................
(1) java -cp bin/:jar/Jama.jar -Djava.library.path=bin/ phyloDriver -n example_data/tree.phy -p -t example_data/aa.aln -f AccTran -m example_data/aa.map -col -r f0dp7 -ls example_data/HI_titers.txt -o WHO1988a
(2) java -cp bin/:jar/Jama.jar -Djava.library.path=bin/ phyloDriver -n example_data/tree.phy -p -t example_data/aa.aln -i example_data/aa.intermediate.aln -l example_data/aa.link -m example_data/aa.map -col -r f0dp7 -ls example_data/HI_distances.txt -d -o WHO1988b
(3) java -jar AntigenicTreeTools.jar -n example_data/tree.phy -p -t example_data/aa.aln -f AccTran -m example_data/aa.map -col -r f0dp7 -ls example_data/HI_titers.txt -loo -o WHO1988c
(4) java -cp bin/:jar/Jama.jar -Djava.library.path=bin/ phyloDriver -n example_data/tree.phy -p -t example_data/aa.aln -f Sankoff -m example_data/aa.map -col -r f0dp7 -ls example_data/HI_titers.txt -o WHO1988a -seed 4 -cost example_data/aa-cost.txt
These examples highlight the use of the different versions and parameters. Examples (1), (2) and (4) use the
Fortran library to solve the NNLS problem, whereas example (3) uses the Java library. All examples produce the
same output. However, they differ as follows:
- Example (1) infers the ancestral character states using the implemented parsimony approach and transforms the
HI titers into distances.
- Example (2) reads node linkage information, maps ancestral sequences that were inferred by a different
program and uses already log2-transformed distances.
- Example (3) is similar to example (1), but additionally computes the leave-one-out error.
- Example (4) uses the options '-f Sankoff', '-seed' and '-cost', which are not described in this file.
Example sequences were downloaded from the Influenza Virus Resource [2] and HI data were retrieved from [3].
References:
===========
[1] Fitch, W. (1971). Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool, 20 (4): 406-16.
[2] Bao, Y., P. Bolotov, D. Dernovoy, B. Kiryutin, L. Zaslavsky et al. (2008). The influenza virus resource at the National Center for Biotechnology Information. J Virol, 82 (2): 596-601.
[3] WHO (1988). Recommended composition of influenza virus vaccines for use in the 1988-1989 season. WHO Wkly Epidem Rec 63 (9): 57-60.