-|datasets- |eccDNA #Place eccDNA sequence files (.fa) here
| |
| |otherDNA #Place other DNA sequence files (.fa) here
|
-|cnvkit_do- |out #CNVkit output files
| |
| |fa #Intermediate and result files
|
-|identify- |fasta_to_identify #Default folder for the short sequence identification method
| |
| |long_segment_to_identify #Default folder for the long sequence identification method
|
-|other_model- |transformer2.py #transformer model
| |
| |resnet50.py
|
-|ROC #ROC output and plotting folder
|
-|run #Training intermediate parameter storage folder
|
-|save #Model storage folder
### File
-|preprocessing- |SamtoolBash.sh #Bash script that provides the samtools interface
| |
| |count*.py #Reads the Excel file and calls SamtoolBash.sh to cut the eccDNA sequences
| |
| |cout_other*.py #Reads the Excel file and calls SamtoolBash.sh to cut other DNA sequences
|
-|dataprocess*.py #Reads the DNA sequences and converts each sequence to a matrix
|
-|dataloader*.py #Reads the converted matrices and builds the PyTorch datasets and dataloaders; 20% of the data is used as the test set
|
-|ResNet_Attention.py #Residual convolution model with attention mechanism
|
-|ResAttention.py #Network models of interspersed attention mechanisms
|
-|train*.py #Train and test
|
-|run*.py #User invocation interface
|
-|cnvkit_run.py #Methods for identification of eccDNA based on copy number variation
|
-|verification.py #Verification of accuracy
|
PS: * means there are multiple files or version numbers for different models
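dataprocess*.py is described above as converting each DNA sequence to a matrix, but the encoding itself is not documented here. A common choice is one-hot encoding. The sketch below assumes that choice; the function name `one_hot_encode`, the A/C/G/T column order and the fixed `length` padding are illustrative and not taken from the repository.

```python
# Illustrative sketch only: the repository's actual encoding may differ.
BASES = "ACGT"  # assumed column order


def one_hot_encode(seq, length):
    """Return a length x 4 matrix (list of lists) for a DNA sequence.

    Unknown bases such as N become all-zero rows; sequences are
    truncated or zero-padded to `length` rows.
    """
    matrix = []
    for base in seq.upper()[:length]:
        row = [0] * len(BASES)
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        matrix.append(row)
    # Zero-pad short sequences so every matrix has the same shape.
    while len(matrix) < length:
        matrix.append([0] * len(BASES))
    return matrix


if __name__ == "__main__":
    for row in one_hot_encode("ACGTN", 6):
        print(row)
```

A fixed number of rows is used here because batched model input generally needs matrices of equal shape.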
WGS data preprocessing (count*.py, SamtoolBash.sh)
Model training (dataprocess.py, dataloader*.py, train_attention.py)
Model validation (verification.py)
Model invocation API (run.py)
Extraction of MicroDNA based on CNVs (cnvkit_run.py)
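In the model training step, dataloader*.py holds out 20% of the data as the test set. The sketch below shows one way such a split can be made reproducibly with the standard library alone. The function name `split_indices` and the `seed` parameter are hypothetical; the actual code builds its datasets with PyTorch and may split differently (for example with `torch.utils.data.random_split`).

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    train_idx, test_idx = split_indices(10)
    print(len(train_idx), len(test_idx))
```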
CUDA version > 11.7, or use another version of PyTorch. If the Python versions are not compatible, you may need to delete the torch entries from requirements.txt.
conda/source activate/create YOUR_ENV_NAME
pip install -r requirements.txt
#Rapid identification of all .fa sequence files in the ./identify/long_segment_to_identify folder
conda activate pytorch
cd YOUR_DIR_PATH
python run.py --pattern long_segment
--pattern (required), specifies the running mode; accepted values: short_sequence, long_segment
--model (optional), specifies the model name; model files are located in the './save/' folder
--file_path (optional), specifies the directory containing files; if not specified, default folder's files will be used
--manual_input (optional), manually input sequences
--limit (optional), sensitivity threshold, default is 0.9
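The options listed above can be mirrored with `argparse`. This is a minimal sketch of such a parser, not the parser run.py actually uses. The function name `build_parser`, the `None` defaults and the assumption that `--manual_input` takes a single string are all illustrative.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented for run.py."""
    parser = argparse.ArgumentParser(description="Identify eccDNA sequences (sketch)")
    parser.add_argument("--pattern", required=True,
                        choices=["short_sequence", "long_segment"],
                        help="running mode")
    parser.add_argument("--model", default=None,
                        help="model file name, looked up in ./save/")
    parser.add_argument("--file_path", default=None,
                        help="directory of input files; default folder if omitted")
    parser.add_argument("--manual_input", default=None,
                        help="sequence entered by hand (assumed to be one string)")
    parser.add_argument("--limit", type=float, default=0.9,
                        help="sensitivity threshold")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--pattern", "long_segment"])
    print(args.pattern, args.limit)
```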
#run.py requires model files to be present in ./save/
#Trained model parameters are stored in the save folder and can also be specified via command line
#Default location for fa files is ./identify with two folders; this can be customized
#./identify/long_segment_to_identify - Place fa files here to automatically read and identify MicroDNA; results are saved in this folder
#./identify/fasta_to_identify - Folder for short sequence recognition mode result files
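Reading every .fa file in one of the folders above can be sketched with the standard library as follows. The helper names `read_fasta` and `read_folder` are hypothetical, and the parsing in run.py may differ.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a dict of {header: sequence}."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                # Sequence lines are joined later into one string.
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def read_folder(folder):
    """Read every .fa file in `folder`, keyed by file name."""
    return {p.name: read_fasta(p) for p in sorted(Path(folder).glob("*.fa"))}


if __name__ == "__main__":
    print(read_folder("./identify/long_segment_to_identify"))
```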
Based on NCBI GEO Series GSE68644 and GSE124470
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68644
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124470
Download GSE68644_RAW files and extract eccDNA sequence positions
Use count*.py to obtain eccDNA sequences
Use cout_other*.py to obtain otherDNA sequences
#Reference sequences: GSE68644 requires the hg19 reference sequence and GSE124470 requires the hg38 reference sequence; place them in the runtime path
Modification details can be found within the corresponding py files
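The cutting step can be pictured as turning each coordinate row into a samtools region. The sketch below assumes that the sequences are extracted with `samtools faidx` and that the rows are (chromosome, start, end) with 1-based inclusive coordinates. SamtoolBash.sh may call samtools differently, and the function names and example coordinates are made up. The code only builds the command and does not run it.

```python
def region_string(chrom, start, end):
    """Format a samtools-style region, e.g. chr1:100-200."""
    if start < 1 or end < start:
        raise ValueError(f"invalid interval {chrom}:{start}-{end}")
    return f"{chrom}:{start}-{end}"


def faidx_command(reference, rows):
    """Build one `samtools faidx` argument list covering all rows."""
    return ["samtools", "faidx", reference] + [region_string(*row) for row in rows]


if __name__ == "__main__":
    rows = [("chr1", 100, 200), ("chr2", 5000, 5300)]  # made-up coordinates
    print(" ".join(faidx_command("hg19.fa", rows)))
```

The resulting list can be passed to `subprocess.run` once samtools and an indexed reference are available.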
conda activate pytorch
cd dir
python train*.py
#train.py needs to invoke dataloader*.py, dataprocess*.py, and model files
#Training data is placed in the datasets directory
tensorboard --logdir=// --port 8130
Developed based on train.py; place test data in the datasets directory
Run the python file directly
python verification.py
Post-training ROC data is located in the ROC folder
Run the python file directly
python ROC_draw.py
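For orientation, ROC points are derived from true labels and predicted scores by sweeping a decision threshold. The sketch below shows that computation in plain Python. The format of the files in the ROC folder and the internals of ROC_draw.py are not documented here, so the function names and inputs are assumptions.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    pts = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
    print(pts, auc(pts))
```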
Place the following in the MicroDNA_Hook\cnvkit\cnvkit_do folder: the WGS sequencing data files (SRA files .sra or fastq files .fq), the hg19 reference sequence hg19.fa (from NCBI), and the alignment reference file hg19_cnvkit_filtered_ref.cnn (provided by CNVkit).
Tip: CNVkit may fail when invoked on Windows 11 or under Windows Subsystem for Linux; refer to the CNVkit usage instructions. After the calculation, place the resulting result.call.cns file in MicroDNA_Hook\cnvkit_do\out and then run cnvkit_run.py.
If you start with fastq data, you need to install bowtie2. If you start with sra data, you need to install sra tools.
python cnvkit_run.py
--model (optional), specifies the model name; model files are located in './save/', default is module.pth.
--run (optional), specifies the version to be called, default is run_v11.2.py.
--limit (optional), sensitivity threshold, default is 0.95
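One plausible reading of the `--limit` threshold is that a candidate is reported only when its predicted probability reaches the limit. The sketch below illustrates that reading. The function name, the (name, probability) input format and the use of `>=` rather than `>` are assumptions, not confirmed behaviour of cnvkit_run.py.

```python
def filter_by_limit(predictions, limit=0.95):
    """Keep (name, probability) pairs whose probability reaches `limit`."""
    return [(name, prob) for name, prob in predictions if prob >= limit]


if __name__ == "__main__":
    candidates = [("region_a", 0.99), ("region_b", 0.50)]  # made-up values
    print(filter_by_limit(candidates))
```

Under this reading, raising the limit trades recall for fewer false positives.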
python merge_file.py
#File Description: glob traverses all .txt files in cnvkit_do\fa, writes into cnvkit_do\connect folder
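The glob traversal described above can be sketched as follows. It assumes the .txt files are concatenated into a single output file; the function name `merge_txt_files` and the output name `merged.txt` are hypothetical, and merge_file.py may write its results differently.

```python
import glob
import os


def merge_txt_files(src_dir, dst_dir, out_name="merged.txt"):
    """Concatenate every .txt file in src_dir into dst_dir/out_name."""
    os.makedirs(dst_dir, exist_ok=True)
    out_path = os.path.join(dst_dir, out_name)
    paths = sorted(glob.glob(os.path.join(src_dir, "*.txt")))
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                text = handle.read()
            out.write(text)
            # Keep records from different files on separate lines.
            if text and not text.endswith("\n"):
                out.write("\n")
    return out_path


if __name__ == "__main__":
    print(merge_txt_files(os.path.join("cnvkit_do", "fa"),
                          os.path.join("cnvkit_do", "connect")))
```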
- ResNet_Attention.py # Residual convolutional model with attention mechanism
  Used in the undergraduate thesis "eccDNA Identification based on Deep Learning"; works with module.pth.
- ResAttention.py # Network model with interleaved attention mechanism
  Performs better and should be used with 6.pth; modify import code in run.py accordingly.
- Transformer models have been tested but are not suitable for this task due to either insufficient data volume or low information density per sample, as detailed in the paper.