A tool for extracting MicroDNA from WGS sequencing data.

liszt-c/MicroDNA_Hook

MicroDNA Hook

Folder structure

-|datasets-    |eccDNA      #Place eccDNA sequences here, .fa
 |             |
 |             |otherDNA    #Place other DNA sequences here, .fa
 |
-|cnvkit_do-   |out         #CNVkit output files
 |             |
 |             |fa          #Intermediate and result files
 |
-|identify-    |fatsa_to_identify          #Default folder for the short sequence identification method
 |             |
 |             |long_segment_to_identify   #Default folder for the long sequence identification method
 |
-|other_model- |transformer2.py   #Transformer model
 |             |
 |             |resnet50.py
 |
-|ROC          #ROC output and plotting folder
 |
-|run          #Storage folder for intermediate training parameters
 |
-|save         #Model storage folder

Files
-|preprocessing- |SamtoolBash.sh   #Bash script that provides the samtools interface
 |               |
 |               |count*.py        #Reads the Excel file and calls SamtoolBash.sh to cut out the eccDNA sequences
 |               |
 |               |cout_other*.py   #Reads the Excel file and calls SamtoolBash.sh to cut out other DNA sequences
 |
-|dataprocess*.py       #Reads the DNA sequences and converts each sequence to a matrix
 |
-|dataloader*.py        #Reads the converted matrices and builds the PyTorch datasets and dataloaders; 20% of the data is used as the test set
 |
-|ResNet_Attention.py   #Residual convolution model with attention mechanism
 |
-|ResAttention.py       #Network model with interspersed attention mechanisms
 |
-|train*.py             #Training and testing
 |
-|run*.py               #User invocation interface
 |
-|cnvkit_run.py         #Identification of eccDNA based on copy number variation
 |
-|verification.py       #Verification of accuracy
 |
 |
Note: * indicates that there are multiple files or version numbers for different models.
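
The roles of dataprocess*.py (sequence to matrix) and dataloader*.py (20% test split) listed above can be illustrated with a minimal, dependency-free sketch. The encoding actually used by the repository is not documented here, so the one-hot scheme, the A/C/G/T channel order, and the function names `one_hot` and `split_train_test` are assumptions made for illustration only.

```python
import random

# Assumed channel order; the repository's actual encoding may differ.
BASES = "ACGT"


def one_hot(seq):
    """Return a len(seq) x 4 matrix; unknown bases such as N become all-zero rows."""
    index = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of samples and hold out test_fraction of them as the test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    print(one_hot("ACGN"))
    train, test = split_train_test(range(10))
    print(len(train), len(test))
```

In the real pipeline the matrices would presumably be wrapped in PyTorch datasets; the sketch only shows the shape of the two steps.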

Environment Configuration

Function Overview

WGS data preprocessing (count*.py, SamtoolBash.sh)
Model training (dataprocess.py, dataloader*.py, train_attention.py)
Model validation (verification.py)
Model invocation API (run.py)
Extraction of MicroDNA based on CNVs (cnvkit_run.py)

Configuration Commands

A CUDA version newer than 11.7 is required; otherwise, use a PyTorch version that matches your CUDA installation. If your Python version is not compatible with the pinned packages, you may need to delete the torch entries from requirements.txt.

conda create -n YOUR_ENV_NAME python   #skip if the environment already exists
conda activate YOUR_ENV_NAME           #or: source activate YOUR_ENV_NAME
pip install -r requirements.txt

Model Usage

Identifying MicroDNA from Custom Long Segments

run.py command examples

#Rapid identification of all .fa sequence files in the ./identify/long_segment_to_identify folder

conda activate pytorch
cd YOUR_DIR_PATH
python run.py --pattern long_segment 

python run.py input parameters

--pattern (required), specifies the running mode; accepted values: short_sequence, long_segment
--model (optional), specifies the model name; model files are located in the './save/' folder
--file_path (optional), specifies the directory containing files; if not specified, default folder's files will be used
--manual_input (optional), manually input sequences
--limit (optional), sensitivity threshold, default is 0.9

#run.py requires the model files to be present in ./save/
#Trained model parameters are stored in the save folder; a specific model can also be selected on the command line with --model
#The default location for .fa files is ./identify, which contains two folders; a different location can be given with --file_path
#./identify/long_segment_to_identify - Place .fa files here to have them read and searched for MicroDNA automatically; results are saved in this folder
#./identify/fasta_to_identify - Folder for the result files of the short sequence recognition mode
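
To make the documented options concrete, here is a small argparse sketch that mirrors them. It is a hypothetical stand-in, not the parser used in run.py: only the flag names, the two --pattern values, and the 0.9 default for --limit come from the list above, while the help strings and the `None` defaults are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the documented run.py options."""
    parser = argparse.ArgumentParser(description="Sketch of the run.py command line")
    parser.add_argument("--pattern", required=True,
                        choices=["short_sequence", "long_segment"],
                        help="running mode")
    parser.add_argument("--model", default=None,
                        help="model name, looked up in ./save/")
    parser.add_argument("--file_path", default=None,
                        help="directory with input files; default folder if omitted")
    parser.add_argument("--manual_input", default=None,
                        help="sequence typed in by hand")
    parser.add_argument("--limit", type=float, default=0.9,
                        help="sensitivity threshold")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--pattern", "long_segment"])
    print(args.pattern, args.limit)  # prints: long_segment 0.9
```

Declaring the two modes as `choices` makes a mistyped mode fail immediately with a usage message instead of failing later inside the model code.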

Model Training

Data Preprocessing

The training data is based on the NCBI GEO series GSE68644 and GSE124470:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68644
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124470
Download the GSE68644_RAW files and extract the eccDNA sequence positions
Use count*.py to obtain the eccDNA sequences
Use cout_other*.py to obtain the otherDNA sequences

#Reference sequences: GSE68644 requires the hg19 reference sequence and GSE124470 requires the hg38 reference sequence; place them in the runtime path
Details on what to modify can be found in the corresponding .py files
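
The cutting step above amounts to pulling coordinate ranges out of a reference FASTA. As a rough sketch, the helper below builds a `samtools faidx` command for a list of regions. Whether SamtoolBash.sh uses `faidx` in this way is an assumption, as are the function names and the example coordinates; samtools regions are 1-based and inclusive, so coordinates taken from a 0-based source would need adjusting first.

```python
def region_string(chrom, start, end):
    """Format a 1-based inclusive region the way samtools expects, e.g. chr1:100-200."""
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {chrom}:{start}-{end}")
    return f"{chrom}:{start}-{end}"


def faidx_command(reference, regions):
    """Build (but do not run) a samtools faidx call that extracts the given regions."""
    return ["samtools", "faidx", reference] + [region_string(*r) for r in regions]


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    cmd = faidx_command("hg19.fa", [("chr1", 100, 200), ("chr2", 5000, 5300)])
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once samtools is installed and the reference has been placed in the runtime path.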

Training Commands

conda activate pytorch
cd YOUR_DIR_PATH
python train*.py

#train*.py imports dataloader*.py, dataprocess*.py, and the model files
#Place the training data in the datasets directory

Analyzing training results using TensorBoard in the ./run folder

tensorboard --logdir=// --port 8130

Verifying results with verification.py

verification.py is derived from train.py; place the test data in the datasets directory
Run the Python file directly

python verification.py

Drawing ROC curves using ROC_draw.py

After training, the ROC data is located in the ROC folder
Run the Python file directly

python ROC_draw.py
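
For readers unfamiliar with how an ROC curve is derived from model scores, the following dependency-free sketch computes the curve points and the area under it. It is not the code in ROC_draw.py, and the format of the files in the ROC folder is not documented here; the function names and the toy labels and scores are assumptions.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    toy_labels = [1, 1, 0, 0]          # 1 = eccDNA, 0 = other DNA (toy data)
    toy_scores = [0.9, 0.8, 0.3, 0.1]
    print(auc(roc_points(toy_labels, toy_scores)))  # prints: 1.0
```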

MicroDNA Identification Based on CNVs Regions

Required Data

WGS sequencing data files (either SRA files, .sra, or FASTQ files, .fq)
The hg19 reference sequence hg19.fa (from NCBI)
The alignment reference file hg19_cnvkit_filtered_ref.cnn provided by CNVkit
Place these files in the MicroDNA_Hook\cnvkit_do folder

Tip: CNVkit may fail when invoked on Windows 11 or under Windows Subsystem for Linux; refer to the CNVkit usage instructions in that case. Once CNVkit has finished, place the resulting result.call.cns file in MicroDNA_Hook\cnvkit_do\out and continue by running cnvkit_run.py.
If you start from FASTQ data, you need to install bowtie2. If you start from SRA data, you need to install the SRA tools.

Command to Identify MicroDNA in CNVs Regions

python cnvkit_run.py

--model (optional), specifies the model name; model files are located in './save/', default is module.pth.
--run (optional), specifies the version to be called, default is run_v11.2.py.
--limit (optional), sensitivity threshold, default is 0.95
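
As an illustration of what reading result.call.cns can look like, the sketch below parses a tab-separated CNVkit call file and keeps the segments whose integer copy number differs from 2. How cnvkit_run.py actually selects its regions is not stated in this document, so the filter rule, the `neutral_cn` parameter, and the function name are assumptions; the column names follow CNVkit's .cns layout.

```python
import csv
import io


def altered_segments(cns_text, neutral_cn=2):
    """Return (chromosome, start, end, cn) for segments whose copy number is not neutral."""
    reader = csv.DictReader(io.StringIO(cns_text), delimiter="\t")
    segments = []
    for row in reader:
        cn = int(row["cn"])
        if cn != neutral_cn:
            segments.append((row["chromosome"], int(row["start"]), int(row["end"]), cn))
    return segments


if __name__ == "__main__":
    # Toy content for illustration; real files are produced by CNVkit.
    example = (
        "chromosome\tstart\tend\tgene\tlog2\tcn\n"
        "chr1\t10000\t50000\t-\t0.02\t2\n"
        "chr1\t50000\t90000\t-\t0.95\t4\n"
        "chr2\t1000\t8000\t-\t-1.10\t1\n"
    )
    for segment in altered_segments(example):
        print(segment)
```

To use it on a real file, read the file contents first, for example `altered_segments(open(path).read())`.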

Merging Results for Further Analysis

python merge_file.py

#File description: merge_file.py uses glob to traverse all .txt files in cnvkit_do\fa and writes the merged result into the cnvkit_do\connect folder
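
The merge step described above can be sketched as follows. This is not the code of merge_file.py: the function name, the sorted processing order, and the output file name `merged.txt` are assumptions, and only the glob-over-.txt behaviour and the two folder names come from the description.

```python
import glob
import os


def merge_txt_files(src_dir, out_path):
    """Concatenate every .txt file in src_dir into out_path; return the number of files merged."""
    paths = sorted(glob.glob(os.path.join(src_dir, "*.txt")))
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
            out.write(text)
            # Keep records from different files on separate lines.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(paths)


if __name__ == "__main__":
    count = merge_txt_files(os.path.join("cnvkit_do", "fa"),
                            os.path.join("cnvkit_do", "connect", "merged.txt"))
    print(f"merged {count} files")
```

Writing the output into a separate folder, as the description states, keeps the merged file from being picked up by the glob on a later run.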

Model Description

  1. ResNet_Attention.py # Residual convolutional model with attention mechanism
    Used in the undergraduate thesis "eccDNA Identification based on Deep Learning"; works with module.pth.
  2. ResAttention.py # Network model with interleaved attention mechanism
    Performs better and should be used with 6.pth; modify the model import in run.py accordingly.
  3. Transformer models have been tested but are not suitable for this task due to either insufficient data volume or low information density per sample, as detailed in the paper.
