Masters Thesis: Transfer Learning for Phenotype Prediction from Small Gene Expression Data Sets


dmohorcic/masters_thesis_TLPPSGEDS

This repository contains the code for my master's thesis (published in the Repository of the University of Ljubljana), along with all trained models, the data sets used, and figures showing the results.

Abstract

Recent advances in biotechnology have enabled researchers to collect huge amounts of data, such as gene expression profiles from patients, which provide a foundation for personalized medicine. Such an approach requires machine learning; however, a significant limitation of many medical studies is their small sample size: typically only a few hundred patients with tens of thousands of features. In this thesis, we addressed this issue by combining multiple small gene expression data sets into a larger one, regardless of the study type, and training deep learning models capable of producing informative gene expression encodings. We used transfer learning to predict the phenotypes on unseen data sets based on the created encodings. We experimented with two model architectures: autoencoders and multi-task models. Although training multi-task models proved challenging, they achieved higher average results on test data sets than autoencoders but never surpassed the results of logistic regression. An examination of the encodings revealed that autoencoders maintained the original data structure, whereas the multi-task models mixed samples from different studies; both, however, showed that the gene expression profile can be reduced to a few informative markers.
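The transfer-learning idea in the abstract can be illustrated with a small, self-contained sketch. It is not the thesis code: the data are synthetic, a linear autoencoder trained with plain gradient descent stands in for the deep models, and all names (`train_linear_autoencoder`, `fit_logistic_regression`, `pooled`, `target_X`) are hypothetical. It only shows the flow of learning an encoder on pooled data and then fitting a simple classifier on the encodings of a small, unseen data set.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_autoencoder(X, latent_dim, epochs=200, lr=0.01):
    """Fit X -> Z -> X_hat by gradient descent on the mean squared error.

    A linear stand-in for a deep autoencoder; returns the encoder weights.
    """
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))
    W_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))
    for _ in range(epochs):
        Z = X @ W_enc                       # encodings
        err = Z @ W_dec - X                 # reconstruction error
        grad_dec = Z.T @ err / len(X)
        grad_enc = X.T @ (err @ W_dec.T) / len(X)
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc

def fit_logistic_regression(Z, y, epochs=500, lr=0.1):
    """Binary logistic regression on encodings Z, trained by gradient descent."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w -= lr * Z.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Synthetic stand-ins: a large pooled matrix and a small unseen data set.
pooled = rng.normal(size=(300, 50))        # many samples from combined studies
target_X = rng.normal(size=(40, 50))       # small unseen study
target_y = (target_X[:, 0] > 0).astype(float)  # made-up binary phenotype

# 1) Learn the encoder on the pooled data only.
W_enc = train_linear_autoencoder(pooled, latent_dim=8)

# 2) Freeze it and encode the unseen data set.
encodings = target_X @ W_enc

# 3) Predict the phenotype from the encodings.
w, b = fit_logistic_regression(encodings, target_y)
pred = (encodings @ w + b > 0).astype(float)
accuracy = float(np.mean(pred == target_y))
```

In this sketch the accuracy is measured on the same samples used to fit the classifier, so it says nothing about generalization; a real evaluation would use held-out data or cross-validation.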

About this repository

The folder data contains all the Gene Expression Omnibus (GEO) data sets that we used. The folder figures contains all figures, further split into data, methods, and results. The folder models contains all trained models for the autoencoder and multi-task architectures. Its first-level folders indicate the latent layer size, and the subfolders inside each hold the 10 models from different random training runs. The folder src contains the code for data download and parsing, model implementation, training, testing, and drawing the final figures. The folder thesis contains the master's thesis along with the presentation and two dissertations.
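As a rough illustration of the models layout described above, the following sketch walks a directory tree of the form models/<latent size>/<run>/ and collects the run folders per latent size. The exact folder names in this repository are not stated here, so the names used in the demonstration (`16`, `64`, `run_0`, ...) are hypothetical, and the helper `list_trained_models` is not part of src.

```python
from pathlib import Path
import tempfile

def list_trained_models(models_dir):
    """Map each first-level folder (latent layer size) to its sorted run subfolders."""
    models_dir = Path(models_dir)
    layout = {}
    for size_dir in sorted(p for p in models_dir.iterdir() if p.is_dir()):
        layout[size_dir.name] = sorted(
            run.name for run in size_dir.iterdir() if run.is_dir()
        )
    return layout

# Demonstration on a throwaway tree with hypothetical folder names.
with tempfile.TemporaryDirectory() as tmp:
    for size in ("16", "64"):
        for run in range(3):
            (Path(tmp) / size / f"run_{run}").mkdir(parents=True)
    demo_layout = list_trained_models(tmp)
```

Pointing `list_trained_models` at the actual models folder should give a quick overview of which latent sizes and runs are present, assuming the layout matches the description.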
