BEstimate

BEstimate is a Python module that systematically identifies guide RNA (gRNA) targetable sites across given sequences for given Base Editors, the functional and clinical effects of the potential edits on the resulting proteins, and the off-target consequences of the identified gRNA sequences. It provides in silico analysis of sequences to identify the positions that Base Editors can edit, together with their features, before experiments begin.

Requirements

  • Python 3.8
  • pandas 1.1.3
  • argparse 1.4
  • biopython 1.78
  • requests 2.28.1

Alternatively, if you have conda, you can use the BEstimate environment directly. Follow the steps below:

  • git clone https://github.com/CansuDincer/BEstimate.git
  • cd BEstimate
  • conda-env create -n bestimate -f=bestimate.yml
  • conda activate bestimate
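If you install the packages manually instead of using the conda environment, a quick check of the installed versions can catch mismatches early. The following is a minimal sketch, not part of BEstimate: the `check_requirements` helper is hypothetical, and it assumes that a pin such as `1.4` is also satisfied by `1.4.0`.

```python
from importlib import metadata

# Versions pinned in the Requirements list above.
PINNED = {
    "pandas": "1.1.3",
    "argparse": "1.4",
    "biopython": "1.78",
    "requests": "2.28.1",
}


def check_requirements(pinned=PINNED):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Assumption: "1.4.0" is accepted for a pin of "1.4".
        matches = installed is not None and (
            installed == wanted or installed.startswith(wanted + ".")
        )
        if not matches:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_requirements().items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result means every pinned package was found at the expected version.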

Program requirement for off-target analysis

BEstimate uses the mrsfast algorithm for genome alignment in off-target analysis. If you would like to find off targets, install it as follows:

  • conda install -c bioconda mrsfast

Inputs

Sequence options:

-gene GENE NAME (Mandatory input): HUGO symbol of the gene of interest

-assembly GENOME ASSEMBLY (Mandatory input - hg19/GRCh38): The genome assembly to use for the genomic coordinates of interest

-transcript ENSEMBL TRANSCRIPT ID (Optional input): The transcript of interest, used to filter the results. If it is not provided, BEstimate first tries the MANE Select transcript; if none is available, it uses the Ensembl canonical transcript.

-uniprot UNIPROT ID (Optional input): The UniProt ID of interest, used to annotate the results. If it is not provided, the UniProt ID derived from VEP is used.

-mutation MUTATION (Optional input - default= None): Use this when the gene of interest carries a mutation that must be integrated into the sequence before gRNAs are designed. The mutation should be written as chromosome:g.positionREF>ALT, that is, the chromosome, the genomic location, the edited nucleotide and the new nucleotide, e.g. 3:g.179218303G>A.

-mutation_file MUTATION FILE (Optional input - default= None): Use this when more than one mutation must be integrated into the sequence before gRNAs are designed. Each mutation should be written as chromosome:g.positionREF>ALT (chromosome, genomic location, edited nucleotide, new nucleotide), e.g. 3:g.179218303G>A, and the file should contain one mutation per row.
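To make the expected mutation notation concrete, here is a minimal sketch of a validator for strings such as 3:g.179218303G>A. It is not BEstimate's own parser: the `Mutation` tuple, the `parse_mutation` helper and the accepted chromosome names (1-22, X, Y, MT) are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    chromosome: str
    position: int
    ref: str  # nucleotide that is edited
    alt: str  # new nucleotide


# Assumed grammar: <chromosome>:g.<position><ref>><alt>
_MUTATION_RE = re.compile(
    r"^(?P<chrom>[0-9]{1,2}|X|Y|MT):g\.(?P<pos>[0-9]+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_mutation(text):
    """Parse one mutation string, raising ValueError if it is malformed."""
    match = _MUTATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in chromosome:g.positionREF>ALT form: {text!r}")
    if match["ref"] == match["alt"]:
        raise ValueError(f"edited and new nucleotide are identical: {text!r}")
    return Mutation(match["chrom"], int(match["pos"]), match["ref"], match["alt"])


def parse_mutation_lines(lines):
    """Parse a mutation file given as an iterable of rows, skipping blank rows."""
    return [parse_mutation(line) for line in lines if line.strip()]
```

For a file intended for -mutation_file, `parse_mutation_lines(open(path))` would report the first malformed row before a long run is started.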

-flank FLANKING SEQUENCE (Optional input - True/False - default = False): Boolean input specifying whether to retrieve the sequences flanking each gRNA on its 3' and 5' sides.

-flank3 3' FLANKING SEQUENCE LENGTH (Optional input - default= 7): If the -flank input is provided, -flank3 specifies the number of nucleotides in the 3' flanking region.

-flank5 5' FLANKING SEQUENCE LENGTH (Optional input - default= 11): If the -flank input is provided, -flank5 specifies the number of nucleotides in the 5' flanking region.

Base Editor options:

-pamseq PAM SEQUENCE (Mandatory input - default = NGG): The sequence preference of the Cas9 protein.

-pamwin PAM INDICES (Mandatory input - default = 21-23): The indices of the PAM sequence, counting the first nucleotide of the protospacer sequence as position 1.

-actwin ACTIVITY WINDOW INDICES (Mandatory input - default = 4-8): The indices of the activity window, i.e. the part of the protospacer sequence that is searched for editable nucleotides.

-protolen PROTOSPACER SEQUENCE LENGTH (Mandatory input - default = 20): The length of the protospacer sequence.

-edit EDIT BASE (Mandatory input - default = C): The nucleotide which Base Editor can edit.

-edit_to EDITED BASE (Mandatory input - default = T): The nucleotide which Base Editor can alter the EDIT BASE into.
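The way the Base Editor options fit together can be sketched as a simple scan: slide a window of protospacer plus PAM along the sequence, keep the windows whose PAM matches -pamseq, and report the -edit bases that fall inside -actwin. This is a simplified illustration and not BEstimate's implementation: `find_editable_sites` is a hypothetical helper, it scans the forward strand only, and it supports only a few IUPAC codes.

```python
import re

# Small IUPAC subset, enough for PAMs such as NGG, NGN or NG.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "R": "[AG]", "Y": "[CT]"}


def find_editable_sites(sequence, pamseq="NGG", pamwin=(21, 23),
                        actwin=(4, 8), protolen=20, edit="C"):
    """Return (start, protospacer, pam, editable_positions) tuples.

    `start` is the 0-based offset in `sequence`; `editable_positions` are
    1-based indices within the protospacer, as in the option descriptions.
    """
    sequence = sequence.upper()
    pam_re = re.compile("".join(IUPAC[base] for base in pamseq))
    total = pamwin[1]  # window length: protospacer followed by PAM
    sites = []
    for start in range(len(sequence) - total + 1):
        window = sequence[start:start + total]
        pam = window[pamwin[0] - 1:pamwin[1]]
        if not pam_re.fullmatch(pam):
            continue
        protospacer = window[:protolen]
        activity = protospacer[actwin[0] - 1:actwin[1]]
        positions = [actwin[0] + i for i, base in enumerate(activity)
                     if base == edit]
        if positions:
            sites.append((start, protospacer, pam, positions))
    return sites
```

With the defaults, a 23-nucleotide window is examined at each offset: positions 1-20 form the protospacer, positions 21-23 must match NGG, and any C at positions 4-8 is reported as editable.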

Annotation options:

-vep ENSEMBL VEP and PROTEIN ANALYSIS (Optional input - True/False - default = False): Boolean input for VEP and protein analysis. When it is True, the editable sites are analysed with the VEP API from Ensembl and the Proteins API from Uniprot for their functional consequences on proteins. In addition to post-translational modification and domain information, the output reports whether the resulting edit lies in an interface region of the corresponding protein, using Interactome Insider.

-ot OFF TARGET (Optional input - True/False - default= False): The boolean input for identification of off targets.

-mm MISMATCH (Optional input - default= 4): If -ot is provided, the maximum number of mismatches allowed in off-target analysis.

-genome GENOME (Optional input - default= Homo_sapiens_GRCh38_dna_sm_all_chromosomes): If -ot is provided, the name of the genome file in ./BEstimate/offtargets/genome/.
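What the -mm limit bounds can be illustrated with a plain position-by-position comparison. This sketch is not how mrsfast aligns reads against a genome; it only shows the mismatch count that the limit is applied to, assuming mismatches are substitutions between sequences of equal length. Both helper names are hypothetical.

```python
def count_mismatches(grna, candidate):
    """Count positions at which two equal-length sequences differ."""
    if len(grna) != len(candidate):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(grna.upper(), candidate.upper()))


def within_mismatch_limit(grna, candidate, mm=4):
    """True if `candidate` would count as an off target under limit `mm`."""
    return count_mismatches(grna, candidate) <= mm
```

Lowering the limit, as in the -mm 3 example further down, narrows the set of candidate sites that are reported.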

Output options:

-o OUTPUT PATH (Optional input - default = working directory): The path where the output files will be written.

-ofile OUTPUT INITIALS (Mandatory input): The prefix of the output file names, placed before "_crispr_df.csv", "_edit_df.csv" or "_hgvs_df.csv", "_vep_df.csv", "_protein_df.csv", "_summary_df.csv".
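The file names that a run may produce follow from -o and -ofile. The sketch below simply joins the two with the suffixes listed above; `expected_outputs` is a hypothetical helper, and which of these files a given run actually writes is assumed to depend on the options used (for example -vep).

```python
from pathlib import Path

# Suffixes as listed in the -ofile description.
SUFFIXES = [
    "_crispr_df.csv",
    "_edit_df.csv",
    "_hgvs_df.csv",
    "_vep_df.csv",
    "_protein_df.csv",
    "_summary_df.csv",
]


def expected_outputs(output_path, ofile):
    """Return the candidate output paths for a given -o and -ofile."""
    return [Path(output_path) / f"{ofile}{suffix}" for suffix in SUFFIXES]


if __name__ == "__main__":
    for path in expected_outputs("./output/", "BRCA1_CBE_NGG"):
        print(path, "(found)" if path.exists() else "(missing)")
```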

Examples

python3 BEstimate.py -gene BRCA1 -assembly GRCh38 -pamseq NGG -pamwin 21-23 -actwin 4-8 -protolen 20 -edit C -edit_to T -o ./output/ -ofile BRCA1_CBE_NGG

The same analysis can be run for a different PAM by changing only -pamseq, for example to NGN.

Warning: Make sure that the length of the PAM sequence matches the length of -pamwin. Here, NGN is consistent with 21-23 (3 nucleotides). For a two-nucleotide PAM, use -pamseq NG together with -pamwin 21-22.
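The consistency rule in the warning can be checked before a run. This is a minimal sketch with hypothetical helpers (`parse_window`, `check_pam`), assuming windows are written as inclusive, 1-based `start-end` ranges as in the option descriptions.

```python
def parse_window(text):
    """Turn a window such as '21-23' into an inclusive (start, end) pair."""
    start, end = (int(part) for part in text.split("-"))
    if start < 1 or end < start:
        raise ValueError(f"invalid window: {text!r}")
    return start, end


def check_pam(pamseq, pamwin):
    """Raise ValueError unless len(pamseq) equals the width of pamwin."""
    start, end = parse_window(pamwin)
    width = end - start + 1
    if len(pamseq) != width:
        raise ValueError(
            f"-pamseq {pamseq} has {len(pamseq)} nucleotides "
            f"but -pamwin {pamwin} spans {width}"
        )
    return start, end
```

Here `check_pam("NGN", "21-23")` and `check_pam("NG", "21-22")` pass, while `check_pam("NG", "21-23")` raises an error.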

If you would like to run for a specific transcript and run the protein analysis:

python3 BEstimate.py -gene BRAF -assembly GRCh38 -transcript ENST00000646891 -edit C -edit_to T -vep -o ./ -ofile BRAF_CBE_NGG

If you would like to run with a specific point mutation, with NGN PAM and with VEP and protein analysis:

python3 BEstimate.py -gene PIK3CA -assembly GRCh38 -pamseq NGN -pamwin 21-23 -actwin 4-8 -protolen 20 -mutation '3:g.179218303G>A' -edit A -edit_to G -vep -ofile ./PIK3CA_NGN_ABE_mE545K -o ./output/

If you would like to see the off targets for the BRAF gene:

python3 BEstimate.py -gene BRAF -assembly GRCh38 -pamseq NGN -edit A -edit_to G -vep -ot -mm 3 -o ./output/ -ofile BRAF_ABE_NGN
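To run the same analysis for several genes, the command lines above can be generated from Python. This is a sketch under stated assumptions: `build_command` and `run_genes` are hypothetical helpers, the script is assumed to be started from the cloned BEstimate directory, and only options documented above are used. By default it only builds the commands; pass `dry_run=False` to execute them.

```python
import subprocess
import sys


def build_command(gene, ofile, assembly="GRCh38", extra=()):
    """Build the argument list for one BEstimate run."""
    return [sys.executable, "BEstimate.py", "-gene", gene,
            "-assembly", assembly, "-ofile", ofile, *extra]


def run_genes(genes, output_dir="./output/", dry_run=True):
    """Build (and optionally run) a C-to-T, NGG analysis for each gene."""
    commands = [
        build_command(gene, f"{gene}_CBE_NGG",
                      extra=["-edit", "C", "-edit_to", "T", "-o", output_dir])
        for gene in genes
    ]
    if not dry_run:
        for command in commands:
            # check=True stops the batch at the first failing run.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in run_genes(["BRCA1", "BRAF"]):
        print(" ".join(command))
```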

Contact

BEstimate was developed by Cansu Dincer, Dr Matthew Coelho and Dr Mathew Garnett from the Garnett Group at the Wellcome Sanger Institute.

For any problems or feedback on BEstimate, you can contact here.

License

BEstimate, a Python module that systematically identifies guide RNA (gRNA) on and off target sites across given sequences for given Base Editors, and functional and clinical effects of the potential edits on the resulting proteins.

Copyright (c) 2020-2023 Genome Research Ltd.

Author: Cansu Dincer cd7@sanger.ac.uk

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Further Disclaimer

For policies regarding the underlying data, please also refer to:
