This document summarizes the two main components of the bioinformatics analysis used to generate and parse data for the paper "Deep diversification of an AAV capsid protein by machine learning". For machine learning models see this.
A processed version of the data is available in the data folder (it should look similar to what the processing pipeline outputs). For additional annotation (e.g. model scores) and training data, browse through these datasets. For raw sequencing data, see NCBI. Additional metadata and artifacts to reproduce the results can be found in this Dropbox link (too big to host on GitHub, and NCBI did not support this directory structure).
- Synthesis pipeline
  - Step 1: Assembles the nucleotide sequence for the corresponding protein sequence variants such that it can be generated and processed with the desired cloning strategy.
  - Step 2: Tests the dataframe produced by Step 1 to ensure that the library has the intended composition, while additionally testing that the correct RE sites are in each sequence. This step produces the files that are sent to Agilent for synthesis.
  - Step 3: Simulates the cloning process in silico, to ensure that the library can be successfully produced with the set of primers, plasmid backbone, and other molecular parameters.
- Parsing pipeline
  - Step 1: Merge fastq files using PEAR.
  - Step 2: Count the number of variants across sequencing files.
  - Step 3: Compute selection scores based on the raw count files.
Details below.
Takes the AA sequences designed by ML and produces nucleotide sequences to be printed for synthesis, such that they are compatible with our cloning strategy.
Pandas
Numpy
Biopython
PyDNA
editdistance
Assembles the nucleotide sequence for the corresponding protein sequence variants such that it can be generated and processed with the desired cloning strategy.
Note: We included barcodes in our original design but never used them as identifiers for variants.
Barcode designs:
- barcodes16-1.txt: from John A. Hawkins et al. PNAS 2018 https://www.pnas.org/content/115/27/E6217 (not used for analysis)
- or, if barcodes are already chosen, c1barcodes16-1_app_BsrBI.txt: a selected group of barcodes compatible with our cloning strategy
Designed Variants:
- chip1_GAS_nredundant.csv: the ML designed variants
- backfill_random_doubles.csv: random doubles to backfill the chip if there is room
- singles.csv: set of all single mutations to the WT
Primer files:
- skpp15-forward.fasta: forward primers
- skpp15-reverse.fasta: reverse primers
Outputs:
- chip_df.csv: contains the library sequences
- [Optional] c1barcodes16-1_app_BsrBI.txt: the selected barcodes
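As a rough illustration of what assembling a nucleotide sequence for a protein variant involves, here is a minimal sketch. It assumes a fixed one-codon-per-amino-acid table and simple flanking primer sites; the table `CODON_FOR_AA`, the function names, and the flank arguments are hypothetical and are not taken from the pipeline, which also has to deal with barcodes and RE sites.

```python
# Hypothetical codon choice per amino acid (one valid codon each).
# The codon usage of the actual pipeline is not specified in this document.
CODON_FOR_AA = {
    "A": "GCC", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTC",
    "G": "GGC", "H": "CAC", "I": "ATC", "K": "AAG", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCC", "Q": "CAG", "R": "AGA",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAC",
}


def back_translate(protein):
    """Map each amino acid to its chosen codon and join the codons."""
    return "".join(CODON_FOR_AA[aa] for aa in protein.upper())


def assemble_oligo(protein, left_flank, right_flank):
    """Wrap the coding sequence in flanking sequences (e.g. primer sites)."""
    return left_flank + back_translate(protein) + right_flank


# Toy example with made-up flanks.
example_oligo = assemble_oligo("MKW", "AAAA", "TTTT")
```

A real design step would likely also screen the result for unwanted restriction sites and re-pick codons where needed; that part is omitted here.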
Tests the dataframe produced by Step 1 to ensure that the library has the intended composition, while additionally testing that the correct RE sites are in each sequence. This step produces the files that are sent to Agilent for synthesis.
Inputs:
- chip_df.csv: contains the library sequences

Outputs:
- chip_for_agilent.txt: this is what is sent to Agilent
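The kind of checks described for this step can be sketched as follows. This is a simplified illustration, not the pipeline's test suite: the row layout (`name`, `sequence`), the function names, and the use of EcoRI's `GAATTC` as a placeholder recognition site are all assumptions, since this document does not list the enzymes' recognition sequences or the columns of chip_df.csv.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_site(seq, site):
    """Count occurrences of a recognition site on either strand.

    A palindromic site equals its own reverse complement, so the set
    below holds one motif and the site is not double counted.
    """
    total = 0
    for motif in {site, reverse_complement(site)}:
        start = seq.find(motif)
        while start != -1:
            total += 1
            start = seq.find(motif, start + 1)
    return total


def check_library(rows, site, expected_sites=1):
    """Return a list of human-readable problems found in the library.

    rows: iterable of dicts with hypothetical keys 'name' and 'sequence'.
    """
    problems, seen = [], set()
    for row in rows:
        seq = row["sequence"].upper()
        if seq in seen:
            problems.append(f"{row['name']}: duplicate sequence")
        seen.add(seq)
        n = count_site(seq, site)
        if n != expected_sites:
            problems.append(f"{row['name']}: {n} sites, expected {expected_sites}")
    return problems


# Toy library: v2 lacks the site, v3 duplicates v1.
toy_rows = [
    {"name": "v1", "sequence": "AAGAATTCAA"},
    {"name": "v2", "sequence": "AAAAAA"},
    {"name": "v3", "sequence": "AAGAATTCAA"},
]
toy_problems = check_library(toy_rows, "GAATTC")
```

An empty list from `check_library` would indicate that every sequence is unique and carries the expected number of sites.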
Simulates the cloning process in silico, to ensure that the library can be successfully produced with the set of primers, plasmid backbone, and other molecular parameters.
Primer files:
- skpp15-forward.fasta: forward primer
- skpp15-reverse.fasta: reverse primer
- chip_df.csv: contains the library sequences
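One part of an in silico cloning check is confirming that a primer pair amplifies each library sequence. The sketch below assumes exact primer matches and ignores the plasmid backbone, mismatches, and melting temperatures; `simulate_pcr` and its arguments are hypothetical names, and the real step may rely on PyDNA instead.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pcr(template, fwd_primer, rev_primer):
    """Return the amplicon (top strand) or None if a primer does not anneal.

    Both primers are given 5'->3'. The forward primer must match the
    template directly; the reverse primer must match its reverse
    complement, downstream of the forward primer.
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer.upper())
    start = template.find(fwd)
    if start == -1:
        return None
    end = template.find(rev_site, start + len(fwd))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


# Toy template: 2 nt of padding, forward site, insert, reverse site, padding.
toy_template = "TT" + "ACGTAC" + "GGGCCC" + "TTGACA" + "AA"
toy_amplicon = simulate_pcr(toy_template, "ACGTAC", "TGTCAA")
```

Running this over every row of the library and collecting the sequences that return `None` gives a quick list of variants that would fail amplification under these simplified assumptions.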
Takes the fastq nucleotide sequences from experimental sequencing runs, maps them back to the original AA sequences, and computes selection scores. We performed two sequencing runs, so Steps 1 and 2 should be run on both sets before combining them in Step 3.
PEAR
Pandas
Biopython
Merge fastq files using PEAR.
Inputs:
- fastq files in experimental run folder: contains all the fastq files
- manifest file for samples: contains the mapping between file names and the relevant samples

Outputs:
- merged files in Parsed_data/merged: merged fastq files
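A minimal sketch of how the merge step could be driven from Python is shown below. It only builds the PEAR command lines, using PEAR's `-f` (forward reads), `-r` (reverse reads), and `-o` (output prefix) options. The manifest layout (sample name, forward file, reverse file) and the function names are assumptions, since the manifest format is not described here.

```python
def pear_command(forward_fastq, reverse_fastq, output_prefix):
    """Build the argument list for one PEAR merge."""
    return ["pear", "-f", forward_fastq, "-r", reverse_fastq, "-o", output_prefix]


def commands_from_manifest(manifest, out_dir="Parsed_data/merged"):
    """Map each sample to its PEAR command.

    manifest: iterable of (sample, forward_fastq, reverse_fastq) tuples;
    this layout is hypothetical.
    """
    return {
        sample: pear_command(fwd, rev, f"{out_dir}/{sample}")
        for sample, fwd, rev in manifest
    }


# Toy manifest with made-up file names.
example_manifest = [("sample_A", "sample_A_R1.fastq", "sample_A_R2.fastq")]
commands = commands_from_manifest(example_manifest)
```

Each command can then be executed with `subprocess.run(cmd, check=True)`; PEAR writes the merged reads to `<output_prefix>.assembled.fastq`.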
Count the number of variants across sequencing files.
Inputs:
- merged files in Parsed_data/merged: merged fastq files
- designed_variants.csv: set of designed AAs and corresponding coding nucleotides

Outputs:
- files in Parsed_data/library: merged fastq files
- raw_counts_raw_counts_NextSeq_run<run_num>.csv: raw counts
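The counting step can be illustrated with a small sketch that tallies reads matching designed sequences. It assumes four-line FASTQ records and exact, full-length matches between a merged read and a designed nucleotide sequence; the actual pipeline may trim primers or match differently, and the names `count_variants` and `nt_to_aa` are hypothetical.

```python
import io
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip().upper()


def count_variants(handle, nt_to_aa):
    """Count reads per designed AA sequence.

    nt_to_aa: dict mapping designed nucleotide sequences to AA sequences
    (for example, built from designed_variants.csv). Reads that match
    no designed sequence are ignored.
    """
    counts = Counter()
    for seq in read_fastq_sequences(handle):
        aa = nt_to_aa.get(seq)
        if aa is not None:
            counts[aa] += 1
    return counts


# Toy input: two reads of a designed variant and one unmatched read.
toy_fastq = io.StringIO(
    "@r1\nATGAAG\n+\nIIIIII\n"
    "@r2\nATGAAG\n+\nIIIIII\n"
    "@r3\nCCCCCC\n+\nIIIIII\n"
)
toy_counts = count_variants(toy_fastq, {"ATGAAG": "MK"})
```

In practice the handle would come from `open(...)` on a merged FASTQ file, and the resulting counter would be written out as one column of the raw counts table.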
Compute selection scores based on the count files.
Inputs:
- raw_counts_raw_counts_NextSeq_run1.csv: raw counts from run1 sequencing
- raw_counts_raw_counts_NextSeq_run2.csv: raw counts from run2 sequencing (3x)
- chip_df.csv: [this is the output of the synthesis pipeline] set of designed AAs and corresponding coding nucleotides

Outputs:
- library_w_selection_scores.csv: computed selection scores for the libraries together.
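This document does not state how the selection score is defined. As an illustration only, the sketch below uses a common choice: the log2 ratio of a variant's frequency after selection to its frequency before selection, with a pseudocount to avoid division by zero. The formula, the pseudocount, and the function name are assumptions and may differ from what the pipeline computes.

```python
import math


def selection_scores(pre_counts, post_counts, pseudocount=1.0):
    """Return log2(post frequency / pre frequency) per variant.

    pre_counts / post_counts: dicts mapping variant -> raw read count
    before and after selection. The pseudocount is added to every
    variant in both conditions (assumed scheme, not from the source).
    """
    variants = set(pre_counts) | set(post_counts)
    pre_total = sum(pre_counts.values()) + pseudocount * len(variants)
    post_total = sum(post_counts.values()) + pseudocount * len(variants)
    scores = {}
    for v in variants:
        pre_freq = (pre_counts.get(v, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(v, 0) + pseudocount) / post_total
        scores[v] = math.log2(post_freq / pre_freq)
    return scores


# Toy example without a pseudocount: B drops from 1/2 to 1/4 of the reads.
toy_scores = selection_scores({"A": 10, "B": 10}, {"A": 30, "B": 10}, pseudocount=0.0)
```

Positive scores indicate variants enriched by selection and negative scores indicate depleted ones, under this assumed definition.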