This repository is made for eBook entitled “Bioinformatics Recipes for Plant Genomics: Data, Code, and Workflows” in Bio-101
This is an example workflow to analyze associations of environmental variables with SNP data using BayPass.
-
Running environment:
- The workflow was constructed based on Linux CentOS 7.7.1908 with the compiler gfortran (GCC) 4.8.5.
-
Required software and versions:
To download and compile BayPass
cd program/
wget http://www1.montpellier.inra.fr/CBGP/software/baypass/files/baypass_2.3.tar.gz
tar -zxvf baypass_2.3.tar.gz
cd baypass_2.3/sources/
make clean all FC=gfortran
# make clean all FC=ifort # if you use the Intel ifort Fortran compiler
g_baypass -help # check
# i_baypass -help # if you use the Intel ifort Fortran compiler
cd ..
The R packages can be installed simply using install.packages()
in the R or Rstudio environment:
install.packages("tidyverse")
install.packages("corrplot")
install.packages("mvtnorm")
install.packages("qqman")
-
VCF The first input file
input/GSD_RAD.vcf.gz
is a Variant Call Format (VCF) file that contains the genotypes at all markers for all samples, which will be used to generate the genotyping data file for BayPass. VCF is commonly used to represent variation calls. It contains meta-information lines (beginning with##
string), a header line (beginning with#
), and then data lines. Each data line contains information about a locus in the reference genome, including the position and genotype information on samples for each position. Specification of the VCF format can be found here or here. You can get this file for your samples from most variant calling pipelines with high-throughput sequencing technology. Here The file has been compressed to save space and therefore has the.gz
suffix. You can view the file byzcat input/GSD_RAD.vcf.gz | less -S
. -
Population information The file
input/pop_info.tsv
contains the popultion assignment for each sample in the VCF, which will be used in generating the genotyping data file. This is a tab-delimited file consisting of two columns: the first column contains the sample names, and the second one contains the population names. The first ten rows of the file:
1003 1003
1009 1003
1010 1003
1011 1003
1013 1003
1034 1033
1035 1033
1039 1033
1042 1033
1043 1033
Samples 1003, 1009, 1010, 1011 and 1013 are from population 1003, and 1034 , 1035, 1039, 1042 and 1043 are from population 1033. Each population has 5 individuals.
- Environmental dataset
The environmental dataset
input/GSD_env.tsv
is a tab-delimited file that records environmental data for each population. The first column gives population names that are consistent with those ininput/pop_info.tsv
. Each of the following columns contains values of an environmental variable. The first five rows of the example data:
Population coverage NO3-N
970 0.635670404 90
1003 0.39769942 6.855737705
1033 0.22551341 8.267213115
1063 0.6198923 43.35245902
zcat input/GSD_RAD.vcf.gz | perl script/vcf2baypass.pl input/pop_info.tsv cache/GSD_RAD.baypass
This step will generate 3 files in cache/
with the GSD_RAD.baypass
prefix. The main file cache/GSD_RAD.baypass.txt
is the one required by BayPass.
Two other files cache/GSD_RAD.baypass.pop
and cache/GSD_RAD.baypass.snp
record the order of populations and locations of SNPs, respectively, and will be used in the following steps.
Check if the numbers of populations and SNPs are correct:
wc -l cache/GSD_RAD.baypass.pop # number of populations, which should be 20 in this case
wc -l cache/GSD_RAD.baypass.txt # number of SNPs, which should be 8383
wc -l cache/GSD_RAD.baypass.txt # number of SNPs as well
cat input/GSD_env.tsv | perl script/env2baypass.pl cache/GSD_RAD.baypass.pop cache/GSD_env.baypass
The output covariate data file cache/GSD_env.baypass.txt
contains the values of the environmental variables for each population in the format required by BayPass.
The associated file cache/GSD_env.baypass.cov
records the names of the covariates, which will be used for plotting.
npop=$(wc -l cache/GSD_RAD.baypass.pop | cut -d " " -f1)
program/baypass_2.3/sources/g_baypass -npop $npop -gfile cache/GSD_RAD.baypass.txt -outprefix cache/GSD_RAD_core -nthreads 40
The parameter -npop
specifies the number of population. In this case, the number is counted automatically from the output in previous step.
The parameter -nthreads
gives the number of threads to be used for parallel computations. The example data take less than 10min with 40 threads.
The -outprefix
gives the prefix for output files. Multiple files will be generated by this command, but the *_mat_omega.out
is what we need for the next step.
The covariance matrix can be visualized using corrplot.R
:
Rscript script/corrplot.R
This will generate output/GSD_RAD_core_mat_omega.png
.
program/baypass_2.3/sources/g_baypass -npop 20 -gfile cache/GSD_RAD.baypass.txt -efile cache/GSD_env.baypass.txt -scalecov -omegafile cache/GSD_RAD_core_mat_omega.out -outprefix cache/GSD_RAD_standard -nthreads 40
The -scalecov
option asks the program to scale each covariable in the covariate data file.
The scaled values of the covariates can be found in cache/GSD_RAD_standard_covariate.std
.
This command takes less than 10min with 40 threads.
Rscript script/POD.R
This will generate the file cache/GSD_RAD_POD.baypass.txt
that has the same foramt as the genotyping data file.
You can check the number of SNPs wc -l cache/GSD_RAD_POD.baypass.txt
, which should be 1000 in the simulated POD sample.
Then run BayPass as in Step 4 using the 1,000 simulated SNPs:
program/baypass_2.3/sources/g_baypass -npop 20 -gfile cache/GSD_RAD_POD.baypass.txt -efile cache/GSD_env.baypass.txt -scalecov -omegafile cache/GSD_RAD_core_mat_omega.out -outprefix cache/GSD_RAD_POD_standard -nthreads 40
Rscript script/BayPass_plot.R
It is a free and open source software, licensed under GPLv3.