Convolutional Neural Networks for Genome Accessibility Classification in T-cells and Lymphoblasts.

This project determines the locations of accessible regions in a biosample's chromatin.
The cell types used in testing were T-cells and lymphoblasts.
The assays used the ATAC-seq method for accessibility profiling.
Given an ATAC-seq sequence from a T-cell or lymphoblast sample, a user can feed it to the trained model, which will attempt to classify it as coming from an accessible or inaccessible region of chromatin.
Email me for the reference genome files if needed: zach7307@gmail.com

Testing was done in conjunction with a peer at SIUe under the supervision of Dr. Manas Jyoti Das.

Using CNNs in conjunction with ATAC-seq data is especially interesting because ATAC-seq requires less input material than other sequencing techniques. To me, this makes it a powerful tool for retrieving as much data as possible from rare cell types, where the data output would typically be limited. This matters because CNNs typically require a large dataset to be useful.

Data preprocessing

BED files are collected from a set of biosamples.
Each BED file is parsed for the coordinate ranges of the features where the chromatin is accessible.
The mean length of these regions is computed for each BED file.
The BED files are passed to bedtools getfasta, which returns the sequences of the accessible regions.
The hg38 reference genome is used for this step.
The accessible-region sequences are then chopped to the mean region length of their BED file (i.e. the mean sequence length).
The sequences are then stored in files corresponding to their original BED file.
The accessible regions are then used to determine the inaccessible regions of the chromatin.
These inaccessible regions are passed to bedtools getfasta to retrieve their sequences.
All inaccessible-region sequences are chopped to the mean length of the accessible regions; this mean length is specific to each original BED file.
The data is then chopped again to the mean of the per-file mean lengths across all files.
The data is then labeled (0 for negative, i.e. inaccessible; 1 for positive, i.e. accessible).
The data is then shuffled and stored in one large file in preparation for the CNN.
The data is one-hot encoded for use in the convolutional neural network.
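
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in src: the function names are hypothetical, the inaccessible regions are assumed to be the gaps between accessible intervals on a chromosome, leftover bases shorter than the chunk length are assumed to be dropped, and bases other than A, C, G, T (such as N) are assumed to encode as all zeros.

```python
import random
from statistics import mean

ALPHABET = "ACGT"  # assumed base order for the one-hot columns


def accessible_mean_length(regions):
    """Mean length of (start, end) intervals, truncated to an int."""
    return int(mean(end - start for start, end in regions))


def complement(regions, chrom_length):
    """Gaps between accessible intervals (assumed definition of inaccessible)."""
    gaps, cursor = [], 0
    for start, end in sorted(regions):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


def chop(sequence, length):
    """Split into consecutive chunks of `length`; the short remainder is dropped."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def one_hot(sequence):
    """One row of four 0/1 values per base; unknown bases become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in sequence.upper()]


def build_dataset(accessible_seqs, inaccessible_seqs, length, seed=0):
    """Chop, label (1 = accessible, 0 = inaccessible), encode, and shuffle."""
    examples = [(one_hot(c), 1) for s in accessible_seqs for c in chop(s, length)]
    examples += [(one_hot(c), 0) for s in inaccessible_seqs for c in chop(s, length)]
    random.Random(seed).shuffle(examples)
    return examples


# Toy input, not real hg38 coordinates or sequences.
regions = [(2, 6), (10, 16)]
length = accessible_mean_length(regions)
gaps = complement(regions, chrom_length=20)
dataset = build_dataset(["ACGTACGTAC"], ["TTTTTGGGGG"], length)
```

In the actual pipeline the sequences come from bedtools getfasta against hg38; the toy strings here only stand in for that output.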

The data from both cell types was passed into NiN (Network in Network) and AlexNet networks, both written in PyTorch.
The trained NiN models reached an accuracy of 91-94%.
The AlexNet models reached an accuracy of 96-97%.
I made a slight modification to the NiN network for the lymphoblast data by adding a sigmoid activation before the output layer, which increased accuracy from 91% to 94% and gave a precision of 96%. For such a small change, this was a notable improvement.
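
For readers unfamiliar with the two metrics quoted above, here is a small sketch of how accuracy and precision are computed from binary labels (1 for accessible, 0 for inaccessible). The function names are hypothetical and the labels are made-up toy values, not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true, y_pred):
    """Fraction of predicted-accessible sequences that are truly accessible."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), precision(y_true, y_pred))
```

Accuracy counts every correct call, while precision only looks at the sequences the model labeled accessible, so the two can differ when the errors are unevenly split between classes.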

NOTE: This project must be run on a Linux machine. Required steps:

Run source setup_env.sh (use Python 3.9 when prompted and ensure the environment is activated after running the script)
Run ./installs.sh
Run ./get_beds.sh and enter the path of the text file containing links to bed files.
Run ./prep.sh
Open genome_conv.ipynb in Jupyter and run the cells. Make sure the environment is activated and the kernel is set to Python 3.9.
