Note: This repository and its contents support the coursework of the INM701 module at City, University of London.
The code output in this repository was produced with the following environment:
- Library versions: numpy == 1.24.3, sklearn == 1.2.2, tensorflow == 2.13.0
- GPU for TensorFlow: NVIDIA P100 16 GB VRAM (Kaggle GPU)
Protein structure prediction is an important area of bioinformatics and, more specifically, structural biology. By predicting the 3D structure of a protein from its amino acid sequence, researchers can infer the biological function of proteins and how they interact with one another. In recent years, powerful tools such as DeepMind's AlphaFold have predicted over 200 million protein structures in an effort to tackle the protein folding problem. We investigate the use of simple machine learning algorithms to predict the secondary structure of proteins and compare their performance against each other. We also determine how oversampling affects the predictive performance of our algorithms.
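As a minimal sketch of the oversampling idea, the snippet below upsamples minority classes to the majority-class count with `sklearn.utils.resample`. The feature matrix, labels, and class proportions are illustrative placeholders, not taken from the actual dataset:

```python
import numpy as np
from sklearn.utils import resample

# Illustrative imbalanced labels: secondary-structure classes H (helix),
# E (strand), C (coil), with C heavily over-represented.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array(["C"] * 70 + ["H"] * 20 + ["E"] * 10)

# Upsample every class to the majority-class count (70 here).
majority = np.unique(y, return_counts=True)[1].max()
X_parts, y_parts = [], []
for label in np.unique(y):
    mask = y == label
    X_up, y_up = resample(X[mask], y[mask], replace=True,
                          n_samples=majority, random_state=0)
    X_parts.append(X_up)
    y_parts.append(y_up)

X_bal = np.vstack(X_parts)        # balanced features
y_bal = np.concatenate(y_parts)   # balanced labels, 70 per class
```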
Specifically, we look at the following classification algorithms:
- Non-parametric $k$-nearest neighbours
- Non-parametric random forests
- Parametric neural networks
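Since $k$-nearest neighbours has no dedicated section below, here is a minimal scikit-learn sketch of it; the features and labels are synthetic stand-ins, and the repository's notebooks define the real ones:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for per-residue features and 3-class structure labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 3, size=300)  # e.g. helix / strand / coil

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Classify each test point by a majority vote of its 5 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
```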
The dataset used is a subset of the RCSB Protein Data Bank, obtained jointly using various programs by @alfrandom and @kirkdco. The dataset can be found here. After some processing, we filtered the dataset by selecting proteins with sequence length in the range [16, 100]. The filtered dataset containing the protein sequences for this study can be found here.
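The length filter described above can be sketched as follows; the sequence strings are invented examples, and the actual dataset files are the ones linked above:

```python
# Keep only proteins whose sequence length falls in [16, 100].
sequences = [
    "ACDEFGHIKLMNPQRS",  # length 16  -> kept
    "ACDEFGHIKLMNPQ",    # length 14  -> dropped
    "A" * 100,           # length 100 -> kept
    "A" * 101,           # length 101 -> dropped
]

filtered = [s for s in sequences if 16 <= len(s) <= 100]
```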
Random forests are a non-parametric ensemble method. The algorithm works by constructing a multitude of decision trees during the training phase and aggregating their predictions. We construct models based on random forests, and the related work can be found in the random-forests folder.
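A minimal random-forest sketch with scikit-learn (the features and labels are synthetic placeholders; the repository's notebooks use the real data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 3, size=200)

# An ensemble of decision trees: each tree is grown on a bootstrap
# sample, with a random subset of features considered at each split,
# and the forest predicts by majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
preds = rf.predict(X)
```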
Artificial neural networks are parametric machine learning algorithms that try to mimic the way information is processed in the brain. The idea is to model biological neurons as nodes with weighted connections between them. These weights are optimised using methods such as gradient descent with backpropagation. We construct models based on neural networks, and the related work can be found in the neural-networks folder.
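The repository's networks are built with TensorFlow, but the weight-optimisation idea can be sketched more compactly with scikit-learn's `MLPClassifier`, which trains a feed-forward network by gradient descent with backpropagation. The data below is synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 3, size=200)

# A small feed-forward network with one hidden layer of 16 units;
# the connection weights are fitted by the Adam variant of gradient
# descent, with gradients computed via backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1)
mlp.fit(X, y)
proba = mlp.predict_proba(X)  # one probability per class, rows sum to 1
```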