GitHub - hagax8/arabidopsis_viz: Clustering of 1,135 Arabidopsis thaliana genomes from 1001 Genomes Project (https://1001genomes.org/)

Clustering of Arabidopsis genomes using t-SNE and GTM (Generative Topographic Mapping)

Requirements

The following python packages are required:

ugtm
sklearn
altair
pycountry
matplotlib
h5py
numpy
pandas

Files in directory

country_convert.py: Python script, converts country codes to country names
worldmap_arabidopsis_admixed.py: Python script, creates interactive GTM and t-SNE maps with Admixed groups
worldmap_arabidopsis_admixed.py: Python script, creates interactive GTM and t-SNE maps with countries
h5_arabidopsis.py: Python script, selects filtered genotypes from h5py imputed SNP matrix
runGTM.py: runs GTM (using ugtm package) or t-SNE (using sklearn)
output_viz: directory, contains html visualizations of worldmap_arabidopsis_admixed.py and worldmap_arabidopsis_admixed.py
data: directory, contains csv data files
data/1001G_gen.csv: csv file, selected pruned genotypes from 1001G Project data (MAF 0.05)
data/1001G_snpids.csv: csv file, selected pruned SNPs from 1001G Project data (MAF 0.05)
data/dataframe_1001G.csv: csv file, 1001G project dataframe with corresponding t-SNE and GTM coordinates

Create plink files from 1001 Genomes Project

Download 1001genomes_snp-short-indel_only_ACGTN.vcf.gz, the 1135 Arabidopsis thaliana genomes from the 1001 Genomes Project.
Prune SNPs + apply genotyping rate and MAF filters:

plink --snps-only --geno 0.2 \
--vcf 1001genomes_snp-short-indel_only_ACGTN.vcf.gz \
--out 1001G --make-bed

plink --bfile 1001G --indep-pairwise 100 10 0.1 \
--maf 0.05 --out 1001G_MAF0.05 \
--set-missing-var-ids @:# --make-bed

Create csv file for pruned SNPs:

Download the imputed SNP matrix (h5py file) from the 1001 Genomes Project 1001_SNP_MATRIX.tar.gz
Python script to generate csv from pruned SNPs and the SNP matrix:

python h5_arabidopsis.py

Output genotype csv file = 1001G_gen.csv
Output SNP ids = 1001G_snpids.csv
Final number of pruned SNPs in 1001G_gen.csv = 28775

Generate GTM, t-SNE and PCA coordinates for the 1135 genomes:

Our runGTM.py script requires python packages ugtm (GTM implementation) and sklearn (t-SNE implementation).

GTM (with "estimated priors" option for imbalanced classes), output = out_gtm_matmeans.csv:

python runGTM.py --model GTM --data 1001G_gen.csv \
--labels countrynames.csv --labeltype discrete \
--out out_tsne \
--pca --n_components 20 --grid_size 25 \
--rbf_grid_size 5 \
--random_state 8 \
--verbose --prior estimated

t-SNE, output = out_tsne.csv:

python runGTM.py --model t-SNE \
--data 1001G_gen.csv --labels continents.csv --labeltype discrete \
--out out --pca --n_components 20 \
--random_state 8 --verbose

PCA, output = out_pca.csv:

python runGTM.py --model PCA --data 1001G_gen.csv \
--labels country_groups --labeltype discrete \
--out out --pca  --random_state 8

Script to convert country codes to country names and continents:

python country_convert.py the_codes_to_convert

Final pandas dataframe

File name: dataframe_1001G.csv
Variables:

accession
name
CS number
country code
latitude
longitude
collector
seq by
continent
country
admixed group
site
GTM axis 1
GTM axis 2
t-SNE axis 1
t-SNE axis 2
Principal component 1
Principal component 2

Generate interactive visualizations with world map:

t-SNE and GTM maps coloured by admixed group (result in output_viz folder):

python worldmap_arabidopsis_admixed.py data/dataframe_1001G.csv outputname

t-SNE and GTM maps coloured by countries (result in output_viz folder):

python worldmap_arabidopsis_countries.py data/dataframe_1001G.csv outputname

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Clustering of Arabidopsis genomes using t-SNE and GTM (Generative Topographic Mapping)

Requirements

Files in directory

Create plink files from 1001 Genomes Project

Create csv file for pruned SNPs:

Generate GTM, t-SNE and PCA coordinates for the 1135 genomes:

Script to convert country codes to country names and continents:

Final pandas dataframe

Generate interactive visualizations with world map:

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
output_viz		output_viz
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
country_convert.py		country_convert.py
h5_arabidopsis.py		h5_arabidopsis.py
runGTM.py		runGTM.py
worldmap_arabidopsis_admixed.py		worldmap_arabidopsis_admixed.py
worldmap_arabidopsis_countries.py		worldmap_arabidopsis_countries.py

License

hagax8/arabidopsis_viz

Folders and files

Latest commit

History

Repository files navigation

Clustering of Arabidopsis genomes using t-SNE and GTM (Generative Topographic Mapping)

Requirements

Files in directory

Create plink files from 1001 Genomes Project

Create csv file for pruned SNPs:

Generate GTM, t-SNE and PCA coordinates for the 1135 genomes:

Script to convert country codes to country names and continents:

Final pandas dataframe

Generate interactive visualizations with world map:

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages