SARS-CoV-2 lineage importations and spread are reduced after nonpharmaceutical interventions in phylogeographic analyses
Goliaei S. (1,2), Foroughmand-Araabi M.H. (1,2), Roddy A. (1,2), Weber A. (3), Översti S. (3), Kühnert D. (3,4) McHardy A.C. (1,2,+)
- 1 Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany,
- 2 Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- 3 Transmission, Infection, Diversification and Evolution Group, Max-Planck Institute of Geoanthropology, Jena, Germany
- 4 German COVID Omics Initiative (deCOI), Bonn, Germany
- + Corresponding author
The computational workflow for inference of importation lineages of SARS-Cov-2 into Germany. The workflow was adapted from [https://github.com/COG-UK/UK-lineage-dynamics-analysis], for processing a 20-fold larger dataset (original workflow for analysis of first UK wave processed 50K samples and this analysis was handling 1.8M sequence samples).
Note that because of the GISAID terms of use genomic sequences cannot be shared in this (public) repository. Instead, in the metadata files (all the tsv files under the analyses/*/results folders), we kept only the ID of sequences from GISAID data. The GISAID Accession ID for the samples after subsamping and filterings are available in files results of in/out ratio subsampling with mash identicals, results of in/out ratio subsampling, results of in/out ratio subsampling with identicals, results of fixed bucket size subsampling with mash identicals, results of fixed bucket size subsampling, results of fixed bucket size subsampling with identicals, results of case-ratio subsampling with mash identicals, results of case-ratio subsampling, results of case-ratio subsampling with identicals.
The preprint of the manuscript is available at medRxiv.
File analyses/phylogenetic/run-2.sh
is the script containing all the steps.
The workflow includes some optional steps, some steps for testing the results. Some steps are very time-consuming, so one may want to skip those steps.
First, we create a new folder and inside it clone this github repository.
mkdir covid-mcmc
cd covid-mcmc
git clone git@github.com:hzi-bifo/covid-germany-mcmc.git
We assume that following data are located at:
- Raw sequences at
data/data/gisaid-20210602-raw.fa
. - Metadata file at
data/data/gisaid-20210602-metadata.tsv
. - (Optional) Sequence aligment file at
data/data/mmsa_20210622_masked.fa
Note that, these files should be downloaded from the GISAID website after signing up and accepting the agreement form, thus these files are not provided in the repository. The files could be find in the GISAID website following these paths: Raw sequences from GISAID -> EpiCov -> Downloads -> FASTA, metadata file from GISAID -> EpiCov -> Downloads -> metadata, sequence alignment file from GISAID -> EpiCov -> Downloads -> MSA masked.
mkdir -p data/data/
Download previous files and put them in the mentioned locations.
Install the following applications:
- Thorney BEAST.
- The Thorney BEAST package contains the beast application.
- The Thorney BEAST package also contains the treeannotator application.
- (Optional) mash application, for reproducing results with added identical sequences. Create a conda environment with the following libraries:
boost
andboostcpp
Note: Environment variablePATH
should contain above mentioned installed applications.
mkdir bin/
cd bin
wget https://github.com/beast-dev/beast-mcmc/releases/download/v1.10.5pre_thorney_v0.1.1/BEASTv1.10.5pre_thorney_0.1.1.zip
unzip -q BEASTv1.10.5pre_thorney_0.1.1.zip
PATH=$PWD/BEASTv1.10.5pre_thorney_0.1.1/bin:$PATH
cd ../
# You can check if beast works fine by running the following command
beast -version
# Creating a conda environment and installing packages
conda create --name covid-uk
conda activate covid-uk
conda install boost-cpp boost -c conda-forge
Applications from the current repository should be compiled and executable files produces. Code of some applications are not located in the current repository. So clone the repository https://github.com/hzi-bifo/covid-germany-mcmc-phylogeo-tools
and put it at ../../../phylogeo-tools
(relative to the working directory). Then you can check if the symlinked files are pointing to the current source codes (e.g. scripts/state.h
file relative to the working directory).
# It should be executed inside the covid-mcmc folder, we created first.
git clone git@github.com:hzi-bifo/covid-germany-mcmc-phylogeo-tools.git phylogeo-tools
git clone git@github.com:hzi-bifo/covid-germany-mcmc-one-tree-public.git one-tree
To create executables, on the folder scripts/
it is enough to run following command:
cd covid-germany-mcmc/analyses/phylogenetic/scripts/
make
cd ../../../../
Note that the makefile uses environment variable CONDA_PREFIX
based on which include directory $(CONDA_PREFIX)/include
and lib directory $(CONDA_PREFIX)/lib
are addressed.
To start the executions, we first change directory to the working folder as
cd huge-lineage-dynamics/analyses/phylogenetic/
- Environment variable
LD_LIBRARY_PATH
should point to a place with libboost_programoptions. - All the executions could be done from working directory
analyses/phylogenetic/
relative to the home of the github repository. - Following variables are reporesenting current data timestamps. So, please keep it as so.
DATE_TREE=20210602
DATE_METADATA=20210602
DATE_SEQ=20210602
- State should be "Germany". For other countries, some modifications should be made.
STATE=Germany
TWD
points to a place with temporary but large free space.
TWD=/tmp
SUB_TREES
are the important subtrees found after subsampling.
SUB_TREES="A B.1.1.7 B.1.1.519 B.1.1.70 B.1.1.317 B.1.177 B.1.160 B.1.221 B.1.36 B.1.258 B.1.351 C"
- Input (not in repository):
../../../data/data/gisaid-$DATE_SEQ-raw.fa
- Output (not in repository):
../../../data/data/gisaid-$DATE_SEQ-Wu1.fa
$TWD/gisaid-$DATE_SEQ/gisaid-$DATE_SEQ-raw.fa
../../../data/data/gisaid-$DATE_SEQ.fa
Note that the file../../../data/data/gisaid-$DATE_SEQ-raw.fa
should be edited to contain only sequencehCoV-19/Wuhan/WH04/2020
after execution of the first step.
#cleaning
grep -A1000 -F 'hCoV-19/Wuhan/WH04/2020' ../../../data/data/gisaid-$DATE_SEQ-raw.fa > ../../../data/data/gisaid-$DATE_SEQ-Wu1.fa
#edit then ..., label: hCoV-19/Wuhan/WH04/2020|EPI_ISL_406801|2020-01-05
DIR=`pwd`
mkdir $TWD/gisaid-$DATE_SEQ/
cd $TWD/gisaid-$DATE_SEQ/
ln -s $DIR/../../../data/data/gisaid-$DATE_SEQ-raw.fa .
bash $DIR/sarscov2phylo/clean_gisaid.sh -i gisaid-$DATE_SEQ-raw.fa -o gisaid-$DATE_SEQ.fa -t 30
mv gisaid-$DATE_SEQ.fa $DIR/../../../data/data/
cd $DIR
- Input:
../../../data/data/gisaid-$DATE_METADATA-metadata.tsv
../../../data/data/gisaid-$DATE_SEQ.fa
- Output:
results/pangolin-$DATE_TREE-r0.tree
results/gisaid-$DATE_TREE-metadata-sampled.tsv
../../data/phylogenetic/gisaid-$DATE_SEQ-sampled0.fa
../../data/phylogenetic/gisaid-$DATE_SEQ-sampled.fa
results/gisaid-$DATE_TREE-?-samples0.txt
(results/gisaid-20210602-B.1.1.7-samples0.txt, ...) (for different subtrees with their names substitutes of$SUB_TREES
character '?')results/gisaid-$DATE_TREE-?-sampled0.tree
(results/gisaid-$DATE_TREE-B.1.1.7-sampled0.tree, ...) (for different subtrees of$SUB_TREES
with their names substitutes character '?')
## SUBSAMPLING:
#V3 (CURRENT): build tree as pangolin tree:
./scripts/tree-build-as-pangolin --out results/pangolin-$DATE_TREE-r0.tree --metadata ../../../data/data/gisaid-$DATE_METADATA-metadata.tsv --seqs ../../../data/data/gisaid-$DATE_SEQ.fa -l Germany nonGermany
# we use following for filtering non-good samaple (e.g. sample without complete date)
./scripts/phylogeo_sankoff_general --in results/pangolin-$DATE_TREE-r0.tree --out results/gisaid-$DATE_TREE-$STATE.tree --metadata ../../../data/data/gisaid-$DATE_METADATA-metadata.tsv --location_label $STATE non$STATE --cond 2 "==" Germany
./scripts/sample-evenly --in results/gisaid-$DATE_TREE-$STATE.tree --out results/gisaid-$DATE_TREE-sampled.tree~ -l $STATE non$STATE --samples-out results/gisaid-$DATE_TREE-samples.txt~ --metadata ../../../data/data/gisaid-$DATE_METADATA-metadata.tsv --bucket-size 100 25 --special 5 $SUB_TREES --keep 1000 10000 $SUB_TREES
cp results/gisaid-$DATE_TREE-sampled.tree~ results/gisaid-$DATE_TREE-sampled.tree
cp results/gisaid-$DATE_TREE-samples.txt~ results/gisaid-$DATE_TREE-samples.txt
# We can check the statistics of the trees and subtrees with following commands. The SUB_TREES variable is calculated beased on following statistics.
./scripts/tree-stat --in results/gisaid-$DATE_TREE.tree -l $STATE non$STATE --large 1000 --depth 10
./scripts/tree-stat --in results/gisaid-$DATE_TREE-sampled.tree -l $STATE non$STATE --large 500 --depth 30
./scripts/extract-metadata.R results/gisaid-$DATE_TREE-samples.txt ../../../data/data/gisaid-$DATE_METADATA-metadata.tsv results/gisaid-$DATE_TREE-metadata-sampled.tsv
cut -f1 results/gisaid-$DATE_TREE-metadata-sampled.tsv | tail -n +2 > results/gisaid-$DATE_TREE-sample-names.txt
./scripts/alignment-filter ../../../data/data/gisaid-$DATE_SEQ.fa results/gisaid-$DATE_TREE-sample-names.txt > ../../data/phylogenetic/gisaid-$DATE_SEQ-sampled0.fa
grep '>' ../../data/phylogenetic/gisaid-$DATE_SEQ-sampled0.fa | sort | uniq -c | grep -v '^ *1 ' | sort -nr
./scripts/rename-alignment-ids results/gisaid-$DATE_TREE-metadata-sampled.tsv < ../../data/phylogenetic/gisaid-$DATE_SEQ-sampled0.fa > ../../data/phylogenetic/gisaid-$DATE_SEQ-sampled.fa
# Extracting data of the SUB_TREES
./scripts/partition-by-name --in results/gisaid-$DATE_TREE-sampled.tree --par $SUB_TREES --samples "results/gisaid-$DATE_TREE-?-samples0.txt" --trees "results/gisaid-$DATE_TREE-?-sampled0.tree" -l $STATE non$STATE --print-annotation false --print-internal-node-label false
- Input:
../../data/phylogenetic/gisaid-$DATE_SEQ-sampled.fa
results/gisaid-$DATE_TREE-$X-samples0.txt
results/gisaid-$DATE_TREE-$X-seq0.fa
../../../data/data/gisaid-$DATE_SEQ-Wu1.fa
- Output:
results/gisaid-$DATE_TREE-$X-1.tree
# Creating sequences for each sub-tree
for X in $SUB_TREES; do
./scripts/alignment-filter ../../data/phylogenetic/gisaid-$DATE_SEQ-sampled.fa results/gisaid-$DATE_TREE-$X-samples0.txt > results/gisaid-$DATE_TREE-$X-seq0.fa
cat ../../../data/data/gisaid-$DATE_SEQ-Wu1.fa >> results/gisaid-$DATE_TREE-$X-seq0.fa
done
Following commands creates the phylogenies via [sarscov2phylo https://github.com/roblanf/sarscov2phylo] pipeline. Execution of following commands are time consuming.
for X in $SUB_TREES; do
DIR=`pwd` ;
rm -rf results/fasttree/$DATE_TREE-$X/ ;
mkdir -p results/fasttree/$DATE_TREE-$X/ ;
ln -s $DIR/results/gisaid-$DATE_TREE-$X-seq0.fa results/fasttree/$DATE_TREE-$X/gisaid-$DATE_TREE-$X-seq0.fa ;
cp scripts/qrun-fasttree.sh results/fasttree/$DATE_TREE-$X/qrun.sh ;
cd results/fasttree/$DATE_TREE-$X/ ;
# Following command was made for submitting a job. To be checked in this version
bash qrun.sh $DATE_SEQ $DATE_TREE $X ;
cd - ;
done
- Input:
results/fasttree/$DATE_TREE-$X/ft_FBP.tree
results/gisaid-$DATE_TREE-$X-samples0.txt
results/gisaid-$DATE_TREE-metadata-sampled.tsv
data/X.fixedRootPrior.skygrid-template-thorney.xml
- Output:
results/beast/run/$X-$i/
for$X
in$SUB_TREES
and$i
in 31-35,results/beast/$X.fixedRootPrior.skygrid-$DATE_TREE.log
- Intermediate files:
results/beast/$X.fixedRootPrior.skygrid-$DATE_TREE.xml
Declaring initial clock rate for MCMCs.
declare -A SUB_TREE_CLOCK_RATE=(
[B.1.1.7]=0.00075
[B.1.1.317]=0.00075
[B.1.1.214]=0.00075
[B.1.160]=0.00075
[B.1.221]=0.00075
[B.1.36]=0.00075
[B.1.351]=0.00075
[P]=0.00075
[A]=0.00075
[C]=0.00075
[B.1.617.2]=0.00075
[B.1.1.519]=0.00075
[B.1.258]=0.00075
[B.1.1.70]=0.00075
[B.1.177]=0.00075
)
Creating XMLs for BEAST.
for X in $SUB_TREES; do
cp results/fasttree/$DATE_TREE-$X/ft_FBP.tree results/gisaid-$DATE_TREE-$X-1.tree ;
./scripts/extract-metadata.R results/gisaid-$DATE_TREE-$X-samples0.txt results/gisaid-$DATE_TREE-metadata-sampled.tsv results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv
# B.1.1.70
grep -v EPI_ISL_2333117 results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv > x; cat x > results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv
grep -v EPI_ISL_2333526 results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv > x; cat x > results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv
#B.1.1.177
grep -v EPI_ISL_2333528 results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv > x; cat x > results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv
# B.1.258
grep -v EPI_ISL_2333525 results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv > x; cat x > results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv
# B.1.1.7
grep -v EPI_ISL_2357883 results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv > x; cat x > results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv
./scripts/phylogeo_sankoff_general --in results/gisaid-$DATE_TREE-$X-1.tree --out results/gisaid-$DATE_TREE-$X-2.tree --metadata results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv --location_label $STATE non$STATE --cond 2 "==" Germany --print-annotation false --print-internal-node-label false --single-child false --samples results/gisaid-$DATE_TREE-$X-samples1.txt
./scripts/contract_short_branch --in results/gisaid-$DATE_TREE-$X-2.tree --out results/gisaid-$DATE_TREE-$X-cont.tree --short 5e-6 --location_label $STATE non$STATE --print-annotation false --print-internal-node-label false --contract_leaf_enabled false
./scripts/fill-template.py --template data/X.fixedRootPrior.skygrid-template-thorney.xml --name $X --sample results/gisaid-$DATE_TREE-$X-samples1.txt --tree results/gisaid-$DATE_TREE-$X-cont.tree --tree_init results/gisaid-$DATE_TREE-$X-2.tree --metadata results/gisaid-$DATE_TREE-metadata-sampled.tsv --out results/beast/$X.fixedRootPrior.skygrid-$DATE_TREE.xml --loc Germany --date $DATE_TREE --clock_rate ${SUB_TREE_CLOCK_RATE[$X]} --chain 300000000 # --cutoff 2.25 --grid_points $[ 225 * 52 / 100 ]
done
Executing BEAST MCMCs.
for X in $SUB_TREES; do
#current jobs number are 31..35, previously it was 1..5 for all and 6..20 for B.1.1.7
for i in {31..35}; do
# For job sbumission: qsub -cwd -N beast-$X-$i -M foroughmand@gmail.com -l h_vmem=10G,mem_free=10G,s_vmem=10G -pe smp 3 -o results/beast/run/out/$X-$i -e results/beast/run/out/$X-$i run-beast.sh run/$X-$i $DATE_TREE $X
# To be checked:
bash run-beast.sh run/$X-$i $DATE_TREE $X
done
done
Analysing the logs:
for X in $SUB_TREES; do
for i in {31..35}; do
loganalyser results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE.log
done
done
Combining logs:
#combining and making a log file for each subtree
for X in $SUB_TREES; do
logcombiner `
for i in {31..35}; do
echo results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE-fixed.log
done` results/beast/$X.fixedRootPrior.skygrid-$DATE_TREE.log
done
- Input:
results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE.trees
results/gisaid-$DATE_TREE-metadata-sampled.tsv
- Output:
results/beast/run/$X-$i/
for$X
in$SUB_TREES
and$i
in 31-32. - Intermediate files:
results/beast/$X-DTA-$DATE_TREE.xml
results/beast/$X.fixedRootPrior.skygrid-$DATE_TREE.trees
Resampling from sampled MCMC trees for next MCMC steps:
for X in $SUB_TREES; do
MCMC_FRP_LOG_FILES=""
RANGE=`seq 31 35`
for i in $RANGE; do
python scripts/resample.py --burnin 15000000 --rate 100000 --tree results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE.trees --out results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE-sampled.trees
python scripts/resample.py --burnin 15000000 --count 500 --tree results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE.trees --out results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE-sub500.trees
MCMC_FRP_LOG_FILES="$MCMC_FRP_LOG_FILES results/beast/run/$X-$i/$X.fixedRootPrior.skygrid-$DATE_TREE-sub500.trees"
done
logcombiner -trees $MCMC_FRP_LOG_FILES results/beast/$X.fixedRootPrior.skygrid-$DATE_TREE.trees
done
Generating XML files:
for X in $SUB_TREES; do
./scripts/fill-template.py --template data/X-DTA-template.xml --name $X --sample results/gisaid-$DATE_TREE-$X-samples1.txt --metadata results/gisaid-$DATE_TREE-metadata-sampled.tsv --out results/beast/$X-DTA-$DATE_TREE.xml --loc Germany --date $DATE_TREE
done
Executing BEAST MCMC:
for X in $SUB_TREES; do
for i in {31..32}; do
# Submission command: qsub -cwd -N DTA-$X-$i -l h_vmem=64G,mem_free=20G,s_vmem=20G -pe smp 5 -o results/beast/run/out/$X-$i -e results/beast/run/out/$X-$i run-beast-DTA.sh run/$X-$i $DATE_TREE $X
# To be checked
bash run/$X-$i $DATE_TREE $X
done
done
- Input:
results/beast/run/$X-$i/$X-DTA-$DATE_TREE.sub4500.trees
for$X
in$SUB_TREES
and$i
in 31-32results/beast/run/$X-$i/$X-DTA-$DATE_TREE.log
- Output:
results/beast/run/all/$X-DTA-$DATE_TREE.MCC.tree
for$X
in$SUB_TREES
Combining logs of DTA MCMC:
for X in $SUB_TREES; do
MCMC_DTA_LOG_FILES=""
MCMC_DTA_TREE_FILES=""
for i in {31..32}; do
python scripts/resample.py --burnin 500000 --rate 4500 --tree results/beast/run/$X-$i/$X-DTA-$DATE_TREE.trees --out results/beast/run/$X-$i/$X-DTA-$DATE_TREE.sub4500.trees
MCMC_DTA_LOG_FILES="$MCMC_DTA_LOG_FILES results/beast/run/$X-$i/$X-DTA-$DATE_TREE.log"
MCMC_DTA_TREE_FILES="$MCMC_DTA_TREE_FILES results/beast/run/$X-$i/$X-DTA-$DATE_TREE.sub4500.trees"
done
echo logcombiner -trees $MCMC_DTA_TREE_FILES results/beast/run/all/$X-DTA-$DATE_TREE.combined.trees
logcombiner -trees $MCMC_DTA_TREE_FILES results/beast/run/all/$X-DTA-$DATE_TREE.combined.trees
echo logcombiner $MCMC_DTA_LOG_FILES results/beast/run/all/$X-DTA-$DATE_TREE.combined.log
logcombiner $MCMC_DTA_LOG_FILES results/beast/run/all/$X-DTA-$DATE_TREE.combined.log
done
Running tree-annotator from BEAST command-line tools.
mkdir results/beast/run/all
for i in {1..1}; do
for X in $SUB_TREES; do
# Job submission command: qsub -cwd -N ta-$X-$i -l h_vmem=64G,mem_free=20G,s_vmem=20G -pe smp 2 -o results/beast/run/out/$X-$i -e results/beast/run/out/$X-$i run-treeannotator.sh $DATE_TREE $X $i
# To be checked
bash run-treeannotator.sh $DATE_TREE $X $i
done
done
(Optional) Compressing log files and sampled trees.
for X in $SUB_TREES; do
xz -f -k -v -T0 results/beast/run/all/$X-DTA-$DATE_TREE.combined.trees
xz -f -v -T0 results/beast/run/all/$X-DTA-$DATE_TREE.combined.log
done
For the sub-sampled sequences not the results are at ../results/beast/run/all/
(relative to the working directory).
The reports on the reports
folder could be executed and lineages for this tree could be generated and evaluated.
Extracting importation lineages and adding unsampled sequences from Germany to the importation lienages
- Input:
results/gisaid-$DATE_TREE-samples.txt
results/gisaid-$DATE_TREE-unsampled.txt
results/gisaid-$DATE_TREE-unsampled-subtree.txt
../../../data/data/gisaid-$DATE_METADATA-metadata.tsv
../../../data/data/mmsa_20210622_masked.fa
- Output:
results/beast/run/lin-ius/clusters_DTA_MCC_NA.tsv
results/beast/run/lin-ius/clusterSamples_DTA_MCC_NA.tsv
results/beast/run/lin-ius/clusters_DTA_MCC_0.5.tsv
results/beast/run/lin-ius/clusterSamples_DTA_MCC_0.5.tsv
- Intermediate files:
results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_NA.tsv
results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_NA.tsv
results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_singles.tsv
results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_0.5.tsv
results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_0.5.tsv
results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_singles_0.5.tsv
scripts/extract-unsampled --location_label $STATE non$STATE --in results/gisaid-$DATE_TREE-$STATE.tree --samples results/gisaid-$DATE_TREE-samples.txt --unsampled results/gisaid-$DATE_TREE-unsampled.txt $SUB_TREES --cond 2 "==" Germany --metadata ../../../data/data/gisaid-$DATE_METADATA-metadata.tsv
cat results/gisaid-$DATE_TREE-unsampled.txt | grep -v "NA$" > results/gisaid-$DATE_TREE-unsampled-subtree.txt
./scripts/extract-metadata.R results/gisaid-$DATE_TREE-unsampled-subtree.txt ../../../data/data/gisaid-$DATE_METADATA-metadata.tsv results/gisaid-$DATE_TREE-metadata-unsampled.tsv
cat results/gisaid-$DATE_TREE-metadata-sampled.tsv results/gisaid-$DATE_TREE-metadata-unsampled.tsv > results/gisaid-$DATE_TREE-metadata-sampled-unsampled.tsv
Creating alignment file for unsampled sequences from Germany.
cut -f3 results/gisaid-$DATE_TREE-metadata-unsampled.tsv | tail -n +2 > results/mmsa-$DATE_TREE-unsample-ids.txt
for X in $SUB_TREES; do cat results/gisaid-$DATE_TREE-$X-samples0.txt; done > results/gisaid-$DATE_TREE-all-samples0.txt
./scripts/alignment-filter ../../../data/data/mmsa_20210622_masked.fa <(cat results/mmsa-$DATE_TREE-unsample-ids.txt results/gisaid-$DATE_TREE-all-samples0.txt) > ../../data/phylogenetic/mmsa-$DATE_SEQ-sampled-unsampled0.fa
for X in $SUB_TREES; do
echo "Enriching tree $X"
cp results/beast/run/all/$X-DTA-$DATE_TREE.MCC.tree results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.nexus
./scripts/phylogeo_sankoff_general --in results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.nexus --out results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree --metadata results/gisaid-$DATE_TREE-metadata-unsampled.tsv --location_label \"$STATE\" \"non$STATE\" --cond 2 "==" Germany --nosankoff --print-allow-annotation false --single-child false --ilabel true --set-tip-location false
./scripts/phylogeo_sankoff_general --in results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.nexus --out results/beast/unsampled/$X-DTA-$DATE_TREE-rich.MCC.tree --metadata results/gisaid-$DATE_TREE-metadata-unsampled.tsv --location_label \"$STATE\" \"non$STATE\" --cond 2 "==" Germany --nosankoff --print-allow-annotation true --single-child false --ilabel true --set-tip-location false
./scripts/alignment-filter ../../data/phylogenetic/mmsa-$DATE_SEQ-sampled-unsampled0.fa <( grep "$X$" results/gisaid-$DATE_TREE-unsampled-subtree.txt | cut -f 1 ) > results/beast/unsampled/alignment-$DATE_TREE-$X-epa/query-mmsa.fa
./scripts/alignment-filter ../../data/phylogenetic/mmsa-$DATE_SEQ-sampled-unsampled0.fa results/gisaid-$DATE_TREE-$X-samples0.txt > results/beast/unsampled/alignment-$DATE_TREE-$X-epa/reference-mmsa.fa
done
Some initializations:
for X in $SUB_TREES; do
mv results/beast/unsampled/alignment-$DATE_TREE-$X-epa/reference.fasta results/beast/unsampled/alignment-$DATE_TREE-$X-epa/reference-mmsa.fasta
mv results/beast/unsampled/alignment-$DATE_TREE-$X-epa/query.fasta results/beast/unsampled/alignment-$DATE_TREE-$X-epa/query-mmsa.fasta
mv results/beast/unsampled/alignment-$DATE_TREE-$X-epa/reference-mmsa.fasta results/beast/unsampled/alignment-$DATE_TREE-$X-epa/reference-mmsa.fa
mv results/beast/unsampled/alignment-$DATE_TREE-$X-epa/query-mmsa.fasta results/beast/unsampled/alignment-$DATE_TREE-$X-epa/query-mmsa.fa
done
mkdir results/beast/run/lin-ius/ results/beast/run/lin-ius/out/
mkdir results/beast/run/lin-ius/aln/ results/beast/run/lin-ius/alnq/
for X in $SUB_TREES; do
echo "Tree $X "
#calculate distances between sequences
csplit -q --prefix=results/beast/run/lin-ius/aln/alignment-$DATE_TREE-$X-epa-reference- --suffix-format=%06d.fasta results/beast/unsampled/alignment-$DATE_TREE-$X-epa/reference-mmsa.fa '/^>/' '{*}'
rm results/beast/run/lin-ius/aln/alignment-$DATE_TREE-$X-epa-reference-000000.fasta
csplit -q --prefix=results/beast/run/lin-ius/alnq/alignment-$DATE_TREE-$X-epa-query- --suffix-format=%06d.fasta results/beast/unsampled/alignment-$DATE_TREE-$X-epa/query-mmsa.fa '/^>/' '{*}'
rm results/beast/run/lin-ius/alnq/alignment-$DATE_TREE-$X-epa-query-000000.fasta
ls -S results/beast/run/lin-ius/aln/ | grep "alignment-$DATE_TREE-$X-epa-reference-" | awk '{print "'results/beast/run/lin-ius/aln/'" $0}' | grep -v '\-000000.fasta' > results/beast/run/lin-ius/out/alignment-files-$X-reference.txt
ls -S results/beast/run/lin-ius/alnq/ | grep "alignment-$DATE_TREE-$X-epa-query-" | awk '{print "'results/beast/run/lin-ius/alnq/'" $0}' | grep -v '\-000000.fasta' > results/beast/run/lin-ius/out/alignment-files-$X-query.txt
mash sketch -o results/beast/run/lin-ius/alignment-$DATE_TREE-$X-epa-reference.fasta results/beast/run/lin-ius/aln/alignment-$DATE_TREE-$X-epa-reference-*.fasta 2>>e
mash sketch -o results/beast/run/lin-ius/alignment-$DATE_TREE-$X-epa-query.fasta -l results/beast/run/lin-ius/out/alignment-files-$X-query.txt 2>>e
time mash dist -v 0.05 -p 120 results/beast/run/lin-ius/alignment-$DATE_TREE-$X-epa-reference.fasta.msh results/beast/run/lin-ius/alignment-$DATE_TREE-$X-epa-query.fasta.msh | ~/bin/pigz-2.6/pigz -k -3 -p10 > results/beast/run/lin-ius/alignment-distance-$DATE_TREE-$X.txt.gz
zcat results/beast/run/lin-ius/alignment-distance-$DATE_TREE-$X.txt.gz | scripts/convert-filename-to-id | gzip > results/beast/run/lin-ius/alignment-distance-$DATE_TREE-$X-id.txt.gz
sed 's/Germany+nonGermany/nonGermany/g' results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.nexus > results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.2.nexus
scripts/lineage-importation-inject-unsampled --tree results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.2.nexus --dist <( zcat results/beast/run/lin-ius/alignment-distance-$DATE_TREE-$X-id.txt.gz ) -l \"$STATE\" \"non$STATE\" --out-folder results/beast/run/lin-ius/ --metadata results/gisaid-$DATE_TREE-metadata-sampled-unsampled.tsv --lin-prefix LIN-Germany-$X-"$DATE_TREE"_DTA_MCC_ --treefile results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.nexus --out-clusters results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_NA.tsv --out-cluster-samples results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_NA.tsv --out-clusters-single results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_singles.tsv --out-0.5 results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_0.5.tsv results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_0.5.tsv results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_singles_0.5.tsv
done
Creating output files (to be used by R scripts for generating the reports):
(
X=`echo $SUB_TREES | head -n1 | awk '{print $1;}'`
head -n 1 results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_NA.tsv
for X in $SUB_TREES; do
tail -n +2 results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_NA.tsv
done ) > results/beast/run/lin-ius/clusters_DTA_MCC_NA.tsv
(
X=`echo $SUB_TREES | head -n1 | awk '{print $1;}'`
head -n 1 results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_NA.tsv
for X in $SUB_TREES; do
tail -n +2 results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_NA.tsv
done ) > results/beast/run/lin-ius/clusterSamples_DTA_MCC_NA.tsv
(
X=`echo $SUB_TREES | head -n1 | awk '{print $1;}'`
head -n 1 results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_0.5.tsv
for X in $SUB_TREES; do
tail -n +2 results/beast/run/lin-ius/out/$X-clusters_DTA_MCC_0.5.tsv
done ) > results/beast/run/lin-ius/clusters_DTA_MCC_0.5.tsv
(
X=`echo $SUB_TREES | head -n1 | awk '{print $1;}'`
head -n 1 results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_0.5.tsv
for X in $SUB_TREES; do
tail -n +2 results/beast/run/lin-ius/out/$X-clusterSamples_DTA_MCC_0.5.tsv
done ) > results/beast/run/lin-ius/clusterSamples_DTA_MCC_0.5.tsv
The internal movement analysis is a Bayesian DTA method that assigns location, state of Germany, to internal nodes of each imported lineage. This is done on the main folders for each subsampling method, e.g. analyses/phylogenetic-test-snake-2/
, analyses/phylogenetic-test-subsampling-3/
, and analyses/phylogenetic-test-subsampling-5/
.
First, we remove nodes with unknown state.
for X in $SUB_TREES; do
./scripts/metadata-remove-na.R results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv results/dtamulti/gisaid-$DATE_TREE-$X-metadata-sampled-no-na.tsv
./scripts/phylogeo_sankoff_general --in results/beast/unsampled/$X-DTA-$DATE_TREE.MCC.tree.2.nexus --save-nexus --out results/dtamulti/$X-DTA-$DATE_TREE.MCC.tree.2.nexus --metadata results/gisaid-$DATE_TREE-$X-metadata-sampled.tsv --location_label \"$STATE\" \"non$STATE\" --cond 2 "==" Germany --print-annotation false --print-internal-node-label false --single-child false --nosankoff --remove-invalid-children true
done
for X in $SUB_TREES; do
cat results/dtamulti/gisaid-$DATE_TREE-$X-metadata-sampled-no-na.tsv
done > results/dtamulti/gisaid-$DATE_TREE-all-metadata-sampled-no-na.tsv
Then, we merge all the imported lineages to one tree with a root with long branches. This helps us in having one set of parameters for all the different lienages.
./scripts/merge-trees-for-multidta --trees `for X in $SUB_TREES; do echo results/dtamulti/$X-DTA-$DATE_TREE.MCC.tree.2.nexus; done` --root-date 1900 -l \"Germany\" \"nonGermany\" --metadata results/gisaid-$DATE_TREE-metadata-sampled.tsv --metadata-in-filter results/dtamulti/gisaid-$DATE_TREE-all-metadata-sampled-no-na.tsv --samples results/dtamulti/sampled.fixedRootPrior.skygrid-$DATE_TREE.samples --out results/dtamulti/sampled.fixedRootPrior.skygrid-$DATE_TREE.trees
Then, we fill the template and execute the MCMC via BEAST.
./scripts/fill-template.py --template data/X-DTAmulti-template.xml --name sampled --sample results/dtamulti/sampled.fixedRootPrior.skygrid-$DATE_TREE.samples --metadata results/gisaid-$DATE_TREE-metadata-sampled.tsv --out results/dtamulti/sampled-DTAmulti-$DATE_TREE.0.xml --loc Germany --date $DATE_TREE --loc_depth
sed 's/Baden-Württemberg/Baden-Wurttemberg/g' < results/dtamulti/sampled-DTAmulti-$DATE_TREE.0.xml > results/dtamulti/sampled-DTAmulti-$DATE_TREE.xml
for i in {31..32}; do
bash run-beast-DTAmulti.sh run/sampled-$i $DATE_TREE sampled
done
Then, we merge the trees from two independent runs and subsample them for tree annotator.
mkdir results/dtamulti/run/all results/dtamulti/run/lin-ius-3/
MCMC_DTA_LOG_FILES=""
MCMC_DTA_TREE_FILES=""
for i in {31..32}; do
srun python scripts/resample.py --burnin 500000 --rate 4500 --tree results/dtamulti/run/sampled-$i/sampled-DTA-$DATE_TREE.trees --out results/dtamulti/run/sampled-$i/sampled-DTA-$DATE_TREE.sub4500.trees
MCMC_DTA_LOG_FILES="$MCMC_DTA_LOG_FILES results/dtamulti/run/sampled-$i/sampled-DTA-$DATE_TREE.log"
MCMC_DTA_TREE_FILES="$MCMC_DTA_TREE_FILES results/dtamulti/run/sampled-$i/sampled-DTA-$DATE_TREE.sub4500.trees"
done
srun logcombiner -trees $MCMC_DTA_TREE_FILES results/dtamulti/run/all/sampled-DTA-$DATE_TREE.combined.trees
srun logcombiner $MCMC_DTA_LOG_FILES results/dtamulti/run/all/sampled-DTA-$DATE_TREE.combined.log
loganalyser results/dtamulti/run/all/sampled-DTA-$DATE_TREE.combined.log > results/dtamulti/run/all/sampled-DTA-$DATE_TREE.combined.log.analyzed
Running tree annotator that finds MCC tree.
srun --mem=100G --cpus-per-task=2 --qos verylong --time=24:00:00 ../../bin/beast/bin/treeannotator -type mcc results/dtamulti/run/all/sampled-DTA-$DATE_TREE.combined.trees results/dtamulti/run/all/sampled-DTA-$DATE_TREE.MCC.nexus
Finally, the lineages will be extracted for the R report generation scripts
srun ./scripts/lineage-importation-extract-multidta --tree results/dtamulti/run/all/sampled-DTA-$DATE_TREE.MCC.nexus --lineage-samples results/beast/run/lin-ius-3/clusterSamples_DTA_MCC_0.5.tsv --out results/dtamulti/run/lin-ius-3/clusterMovement_DTA_MCC_0.5.0.tsv -l \"Baden-Wurttemberg\" \"Bavaria\" \"Berlin\" \"Brandenburg\" \"Bremen\" \"Hamburg\" \"Hesse\" \"Lower Saxony\" \"Mecklenburg-Western Pomerania\" \"North Rhine-Westphalia\" \"Rhineland-Palatinate\" \"Saarland\" \"Saxony\" \"Saxony-Anhalt\" \"Schleswig-Holstein\" \"Thuringia\" --root-date 1900
sed 's/"//g' < results/dtamulti/run/lin-ius-3/clusterMovement_DTA_MCC_0.5.0.tsv > results/dtamulti/run/lin-ius-3/clusterMovement_DTA_MCC_0.5.tsv
- Input:
results/beast/run/lin-ius/clusters_DTA_MCC_NA.tsv
results/beast/run/lin-ius/clusterSamples_DTA_MCC_NA.tsv
results/beast/run/lin-ius/clusters_DTA_MCC_0.5.tsv
results/beast/run/lin-ius/clusterSamples_DTA_MCC_0.5.tsv
- Output:
reports-ius/importationSummaryMultiState.pdf
reports-ius/lineageBreakdownMultiState.pdf
reports-ius/lineageSummaryMultiState.pdf
The R reports (Rmd files) on the folder reports-ius/
(relative to working directory) could be executed then.
The reports use the files generated from the last step (importation lineage extraction and adding unsampled sequences).
We checked robustness of results under different sub-sampling methods.
- Fixed in/out ratio: In this method, a fixed bucket size for each week for Germany is assumed as well as in- and out- ratios. For each week, we performed following steps iteratively. First we picked in-ratio sequences from Germany and out-ratio sequences from non-Germany sequences. If any of these two sets are completely sampled, or number of sampled sequences from Germany reached to the fixed bucket size, the loop is ended. Results of the analysis based on this subsampling are available in
analyses/phylogenetic-test-subsampling-3/
for Germany bucket size of 100 and in-ratio and out-ratio both equal to 1.analyses/phylogenetic-test-subsampling-3-nounsampled/
without identicalsanalyses/phylogenetic-test-subsampling-3-ridentical/
with injected identicals considering non-'n' locus
- Fixed country bucket size (100-25): In this method a fixed size for samples from each country is assumed for each week. If sequences of a country for a week was less than the threshould, all the available sequences for that week is sampled. Results are available in
analyses/phylogenetic-test-subsampling-5/
for Germany bucket size value of 100 and all non-Germany countries bucket size of 25.analyses/phylogenetic-test-subsampling-5-nounsampled/
without identicalsanalyses/phylogenetic-test-subsampling-5-ridentical/
with injected identicals considering non-'n' locus
- Sampling proportional to the number of cases (main method): A fixed size for the number of samples for each week is assumed. Then, this value is divided to the countries on each week proportioanl to their number of cases. The proportion for each country is filled with the sequences, or with all the sequences if number of available sequences is less than this value. Results are available in
analyses/phylogenetic-test-snake-2/
for 10000 as the number of sequences for each week.analyses/phylogenetic-test-snake-2-nounsampled/
without identicalsanalyses/phylogenetic-test-snake-2-ridentical/
with injected identicals considering non-'n' locus
In the main part of the study we inferred the importation lineages based on the MCC tree afthe the DTA Bayesian step. This could also be done directly based on the posterior samples. Here, for each posterior sample (tree and the location annotation), the importation lineages are inferred. The same methodology also used for the inter-state movement analysis of the importation lineages. The results are available as follows:
- Importation lineages for the posterior samples of the Bayesian method:
- Clusters per posterior samples (compressed)
- Cluster samples per posterior samples (compressed)
- Cluster with single samples per posterior samples (compressed)
- Inter-state movement for the posterior sampels of the Bayesian method
- Inter-state movement clusters per posterior samples (compressed)
Following codes infers the importation lineage for each Bayesian posterior sampled tree.
rm -rf results/beast/run/lin-ius-3e/
#extract trees results/beast/run/all/$X-DTA-$DATE_TREE.combined.trees to results/beast/run/lin-ius-3e/sample-tree
mkdir -p results/beast/run/lin-ius-3e/out results/beast/run/lin-ius-3e/sample-tree results/beast/run/lin-ius-3e/log results/beast/run/lin-ius-3e/mcc-tree/ results/beast/run/lin-ius-3e/out-tree/
for X in $SUB_TREES; do
echo $X ...
python scripts/tree-separate.py results/beast/run/all/$X-DTA-$DATE_TREE.combined.trees results/beast/run/lin-ius-3e/sample-tree/$X-NAME.nexus
done
for TREE_FILE in results/beast/run/lin-ius-3e/sample-tree/*.nexus ; do
TREE_NAME=`echo $(basename $TREE_FILE) | sed 's/\.nexus//'`
echo $TREE_NAME $TREE_FILE
bash scripts/generate-effectiveness.sh $DATE_TREE $STATE $TREE_NAME
done
# results in results/beast/run/lin-ius-3e/out/$TREE_NAME-effectiveness.tsv
Inferred importation lineages are calculated for each tree in the following code.
TREE_NAME=A-1000440000
head -n 1 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusters_DTA_MCC_0.5.tsv | sed 's/$/\ttree/' > results/beast/run/lin-ius-3e/out/all-clusters_DTA_MCC_0.5.tsv
head -n 1 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusters_DTA_MCC_0.5.tsv | sed 's/$/\ttree/' > results/beast/run/lin-ius-3e/all-clusters_DTA_MCC_0.5.tsv
head -n 1 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusterSamples_DTA_MCC_0.5.tsv| sed 's/$/\ttree/' > results/beast/run/lin-ius-3e/out/all-clusterSamples_DTA_MCC_0.5.tsv
head -n 1 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusters_DTA_MCC_singles_0.5.tsv | sed 's/$/\ttree/' >> results/beast/run/lin-ius-3e/out/all-clusters_DTA_MCC_singles_0.5.tsv
for TREE_FILE in results/beast/run/lin-ius-3e/sample-tree/*.nexus ; do
TREE_NAME=`echo $(basename $TREE_FILE) | sed 's/\.nexus//'`
#echo $TREE_NAME $TREE_FILE
tail -n+2 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusters_DTA_MCC_0.5.tsv | sed 's/$/\t'$TREE_NAME'/' >> results/beast/run/lin-ius-3e/out/all-clusters_DTA_MCC_0.5.tsv
tail -n+2 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusters_DTA_MCC_0.5.tsv | sed 's/$/\t'$TREE_NAME'/' >> results/beast/run/lin-ius-3e/all-clusters_DTA_MCC_0.5.tsv
tail -n+2 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusterSamples_DTA_MCC_0.5.tsv | sed 's/$/\t'$TREE_NAME'/' >> results/beast/run/lin-ius-3e/out/all-clusterSamples_DTA_MCC_0.5.tsv
tail -n+2 results/beast/run/lin-ius-3e/out/$TREE_NAME-clusters_DTA_MCC_singles_0.5.tsv | sed 's/$/\t'$TREE_NAME'/' >> results/beast/run/lin-ius-3e/out/all-clusters_DTA_MCC_singles_0.5.tsv
done
The following Rscript summarizes the calculated effectiveness values for the trees and calculates the 95% HPDs.
library(dplyr)
library(tidyr)
library(bfp)
d <- read.table('results/beast/run/lin-ius-3e/all.tsv', sep='\t', header=TRUE, stringsAsFactor=FALSE)
colnames(d) <- c('name', 'N1', 'N2', 'N3', 'N4', 'N5', 'N6', 'N7', 'N8', 'N9', 'N10', 'N11', 'N12', 'tree')
d<- d %>% filter(name != 'name')
ds <- d %>% mutate(name = ifelse(name == " no NRW,BV", "no NRW,BV", name)) %>% pivot_longer(num_range("N", 1:12), names_to='npi', values_to='eff') %>% mutate(eff = as.numeric(eff)) %>% group_by(name, npi) %>% summarise(eff.avg = mean(eff), eff.sd=sd(eff), hpd.l = empiricalHpd(eff, level=0.95)[1], hpd.u=empiricalHpd(eff, level=0.95)[2]) %>% pivot_wider(names_from=npi, values_from=c(eff.avg, eff.sd, hpd.l, hpd.u)) %>% select(sort(tidyselect::peek_vars())) %>% relocate(name)
write.table(ds, 'results/beast/run/lin-ius-3e/summ.tsv', sep='\t', quote=FALSE, row.names=FALSE)
The inter-state movement of the importation lineages could be inferred based on the posterior samples produced with the Bayesian method.
rm -rf results/dtamulti/run/lin-ius-3e
mkdir -p results/dtamulti/run/lin-ius-3e/sample-tree/ results/dtamulti/run/lin-ius-3e/movement/ results/dtamulti/run/lin-ius-3e/log/
python scripts/tree-separate.py results/dtamulti/run/all/sampled-DTA-$DATE_TREE.combined.trees results/dtamulti/run/lin-ius-3e/sample-tree/tree-NAME.single.nexus
for TREE_FILE in results/dtamulti/run/lin-ius-3e/sample-tree/*.single.nexus ; do
TREE_NAME=`echo $(basename $TREE_FILE) | sed 's/\.single\.nexus//'`
echo $TREE_NAME $TREE_FILE
bash run-2-movement.sh $TREE_NAME
done
TREE_NAME=tree-997416000
head -n 1 results/dtamulti/run/lin-ius-3e/movement/$TREE_NAME.clusterMovement_DTA_MCC_0.5.0.tsv | sed 's/$/\ttree/' > results/dtamulti/run/lin-ius-3e/all.clusterMovement_DTA_MCC_0.5.0.tsv
for TREE_FILE in results/dtamulti/run/lin-ius-3e/sample-tree/*.single.nexus ; do
TREE_NAME=`echo $(basename $TREE_FILE) | sed 's/\.single\.nexus//'`
tail -n+2 results/dtamulti/run/lin-ius-3e/movement/$TREE_NAME.clusterMovement_DTA_MCC_0.5.0.tsv | sed 's/$/\t'$TREE_NAME'/' >> results/dtamulti/run/lin-ius-3e/all.clusterMovement_DTA_MCC_0.5.0.tsv
done