BEstimate is a Python module that systematically identifies guide RNA (gRNA) targetable sites across given sequences for given Base Editors, the functional and clinical effects of the potential edits on the resulting proteins, and the off-target consequences of the identified gRNA sequences. It provides in silico analysis of sequences to identify positions that Base Editors can edit, along with their features, before experiments begin.
- Python 3.8
- pandas 1.1.3
- argparse 1.4
- biopython 1.78
- requests 2.28.1
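A minimal sketch for checking an existing Python environment against the versions listed above. The helper name `check_dependencies` is hypothetical and not part of BEstimate; it relies only on the standard-library `importlib.metadata` module (available from Python 3.8). argparse is skipped because it ships with Python 3.8.

```python
from importlib import metadata

# Versions copied from the list above; argparse is part of the standard
# library in Python 3.8, so it is not checked here.
PINNED = {"pandas": "1.1.3", "biopython": "1.78", "requests": "2.28.1"}


def check_dependencies(pinned=PINNED):
    """Return {package: (wanted_version, installed_version_or_None)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in check_dependencies().items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name}: wanted {wanted}, found {installed} ({status})")
```

A differing version is not necessarily a problem; the sketch only reports what is installed.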
Alternatively, if you have conda, you can use the BEstimate environment directly:
git clone https://github.com/CansuDincer/BEstimate.git
cd BEstimate
conda env create -n bestimate -f bestimate.yml
conda activate bestimate
BEstimate uses the mrsFAST algorithm for genome alignment in the off-target analysis. If you would like to find off targets, install it as follows:
conda install -c bioconda mrsfast
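Before running the off-target analysis, it can help to confirm that the aligner is reachable from the active environment. The sketch below assumes the installed executable is named `mrsfast`; the helper name `find_executable` is hypothetical and not part of BEstimate.

```python
import shutil


def find_executable(name="mrsfast"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    # The executable name is an assumption; adjust it if your install differs.
    print(path or "mrsfast not found on PATH; the -ot analysis may fail")
```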
-gene GENE NAME (Mandatory input): Hugo symbol of the gene of interest
-assembly GENOME ASSEMBLY (Mandatory input - hg19/GRCh38): The genome assembly used for the genomic coordinates.
-transcript ENSEMBL TRANSCRIPT ID (Optional input): The transcript of interest for filtering the results. If it is not provided, BEstimate first tries the MANE Select transcript and, if that is not available, uses the Ensembl canonical transcript.
-uniprot UNIPROT ID (Optional input): The UniProt ID of interest for annotating the results; otherwise the VEP-derived UniProt ID is used.
-mutation MUTATION (Optional input - default= None): A mutation in the gene of interest to integrate into the sequence, so that gRNAs are designed against the mutated sequence. The mutation should be written as <chromosome>:g.<genomic_location><edited_nucleotide>><new_nucleotide>, e.g. 3:g.179218303G>A.
-mutation_file MUTATION FILE (Optional input - default= None): A file of mutations to use when more than one mutation must be integrated into the sequence before designing gRNAs. Each mutation should be written as <chromosome>:g.<genomic_location><edited_nucleotide>><new_nucleotide>, e.g. 3:g.179218303G>A, and the file should have one mutation per row.
-flank FLANKING SEQUENCE (Optional input - True/False - default = False): The boolean input specifying whether to retrieve the sequences flanking the gRNA sequences on the 3' and 5' sides.
-flank3 3' FLANKING SEQUENCE LENGTH (Optional input - default= 7): If -flank is provided, -flank3 specifies the number of nucleotides in the 3' flanking region.
-flank5 5' FLANKING SEQUENCE LENGTH (Optional input - default= 11): If -flank is provided, -flank5 specifies the number of nucleotides in the 5' flanking region.
-pamseq PAM SEQUENCE (Mandatory input - default = NGG): The PAM sequence preference of the Cas9 protein.
-pamwin PAM INDICES (Mandatory input - default = 21-23): The indices of the PAM sequence, counting the first nucleotide of the protospacer sequence as 1.
-actwin ACTIVITY WINDOW INDICES (Mandatory input - default = 4-8): The indices of the activity window on the protospacer sequence, within which editable nucleotides are searched.
-protolen PROTOSPACER SEQUENCE LENGTH (Mandatory input - default = 20): The length of the protospacer sequence.
-edit EDIT BASE (Mandatory input - default = C): The nucleotide that the Base Editor can edit.
-edit_to EDITED BASE (Mandatory input - default = T): The nucleotide into which the Base Editor converts the EDIT BASE.
-vep ENSEMBL VEP and PROTEIN ANALYSIS (Optional input - True/False - default = False): The boolean input for VEP and protein analysis. When it is True, the editable sites are analysed with the VEP API from Ensembl and the Proteins API from UniProt for their functional consequences on proteins. In addition to post-translational modification and domain information, Interactome Insider is used to report whether the resulting edit lies in an interface region of the corresponding protein.
-ot OFF TARGET (Optional input - True/False - default= False): The boolean input for identification of off targets.
-mm MISMATCH (Optional input - default= 4): If -ot is provided, the maximum number of mismatches allowed in the off-target analysis.
-genome GENOME (Optional input - default= Homo_sapiens_GRCh38_dna_sm_all_chromosomes): If -ot is provided, the name of the genome file in ./BEstimate/offtargets/genome/.
-o OUTPUT PATH (Optional input - default = working directory): The output path where the files will be written.
-ofile OUTPUT INITIALS (Mandatory input): The prefix of the output file names, placed before "_crispr_df.csv", "_edit_df.csv" or "_hgvs_df.csv", "_vep_df.csv", "_protein_df.csv", "_summary_df.csv".
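To illustrate how -pamseq, -pamwin, -actwin, -protolen and -edit relate to each other, here is a minimal sketch that scans one strand of a sequence for protospacers followed by a matching PAM and reports editable positions inside the activity window. It is not BEstimate's implementation: the function names are hypothetical, only N is treated as a wildcard in the PAM, and the reverse strand is ignored.

```python
def parse_window(text):
    """Turn a 1-based inclusive window such as '4-8' into (4, 8)."""
    start, end = (int(part) for part in text.split("-"))
    return start, end


def pam_matches(pam_pattern, candidate):
    """Compare a PAM pattern to a sequence, treating N as any nucleotide."""
    return len(pam_pattern) == len(candidate) and all(
        p == "N" or p == c for p, c in zip(pam_pattern, candidate)
    )


def find_sites(sequence, pamseq="NGG", pamwin="21-23", actwin="4-8",
               protolen=20, edit="C"):
    """Yield (offset, protospacer, pam, editable_positions) for one strand."""
    pam_start, pam_end = parse_window(pamwin)
    act_start, act_end = parse_window(actwin)
    for offset in range(len(sequence) - pam_end + 1):
        window = sequence[offset:offset + pam_end]
        protospacer = window[:protolen]
        pam = window[pam_start - 1:pam_end]
        if not pam_matches(pamseq, pam):
            continue
        # 1-based protospacer positions in the activity window that carry
        # the editable base.
        editable = [pos for pos in range(act_start, act_end + 1)
                    if protospacer[pos - 1] == edit]
        if editable:
            yield offset, protospacer, pam, editable


# A 20-nt protospacer with a C at position 4, followed by a TGG PAM.
example = "AAAC" + "A" * 16 + "TGG"
for offset, protospacer, pam, editable in find_sites(example):
    print(offset, protospacer, pam, editable)
```

With the defaults, positions 1-20 are the protospacer, 21-23 the PAM, and only a C at positions 4-8 counts as editable.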
python3 BEstimate.py -gene BRCA1 -assembly GRCh38 -pamseq NGG -pamwin 21-23 -actwin 4-8 -protolen 20 -edit C -edit_to T -o ./output/ -ofile BRCA1_CBE_NGG
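As a rough guide to where results from the command above would land, the sketch below builds the candidate output paths from the -o and -ofile values and the suffixes named in the -ofile description. Which files are actually written depends on the options used (for example -vep), so treat the list as an assumption; `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Suffixes quoted in the -ofile description. Not every run necessarily
# produces every file; that depends on the options used.
SUFFIXES = ["_crispr_df.csv", "_edit_df.csv", "_hgvs_df.csv",
            "_vep_df.csv", "_protein_df.csv", "_summary_df.csv"]


def expected_outputs(output_dir, ofile):
    """Return candidate output paths for a given -o and -ofile pair."""
    return [Path(output_dir) / f"{ofile}{suffix}" for suffix in SUFFIXES]


for path in expected_outputs("./output/", "BRCA1_CBE_NGG"):
    print(path, "exists" if path.exists() else "missing")
```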
The same analysis can be run for a different PAM by changing only -pamseq, e.g. -pamseq NGN.
Warning: Make sure the PAM sequence is concordant with the length of -pamwin. Here, NGN is concordant with 21-23 (3 nucleotides). To use an NG PAM instead, set -pamseq NG together with -pamwin 21-22.
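The concordance rule in the warning above can be checked before launching a run. This is a small sketch with a hypothetical helper name, `check_pam_window`; it only compares the PAM length with the size of the 1-based inclusive window.

```python
def check_pam_window(pamseq, pamwin):
    """Return the window length, or raise ValueError if it differs from len(pamseq)."""
    start, end = (int(part) for part in pamwin.split("-"))
    window_length = end - start + 1  # the window is inclusive on both ends
    if len(pamseq) != window_length:
        raise ValueError(
            f"-pamseq {pamseq} has {len(pamseq)} nucleotides, "
            f"but -pamwin {pamwin} spans {window_length}"
        )
    return window_length


check_pam_window("NGN", "21-23")  # concordant: 3 nucleotides
check_pam_window("NG", "21-22")   # concordant: 2 nucleotides
```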
If you would like to run for a specific transcript and run the protein analysis:
python3 BEstimate.py -gene BRAF -assembly GRCh38 -transcript ENST00000646891 -edit C -edit_to T -vep -o ./ -ofile BRAF_CBE_NGG
If you would like to run with a specific point mutation, with NGN PAM and with VEP and protein analysis:
python3 BEstimate.py -gene PIK3CA -assembly GRCh38 -pamseq NGN -pamwin 21-23 -actwin 4-8 -protolen 20 -mutation '3:g.179218303G>A' -edit A -edit_to G -vep -ofile ./PIK3CA_NGN_ABE_mE545K -o ./output/
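The mutation string passed above follows the chromosome:g.position, reference nucleotide, ">", new nucleotide pattern. A minimal sketch of validating such strings before a run is shown below; the names `Mutation`, `parse_mutation` and `read_mutation_file` are hypothetical, and the pattern accepts any chromosome label without a colon or whitespace, which is an assumption.

```python
import re
from typing import NamedTuple


class Mutation(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str


# Assumed pattern: <chromosome>:g.<position><ref>><alt>, single nucleotides only.
_PATTERN = re.compile(
    r"^(?P<chrom>[^:\s]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_mutation(text):
    """Parse a string such as '3:g.179218303G>A' into a Mutation."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected mutation format: {text!r}")
    return Mutation(match["chrom"], int(match["pos"]), match["ref"], match["alt"])


def read_mutation_file(path):
    """Read one mutation per row, as described for -mutation_file."""
    with open(path) as handle:
        return [parse_mutation(line) for line in handle if line.strip()]


print(parse_mutation("3:g.179218303G>A"))
```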
If you would like to see the off targets for a gene (here, BRAF):
python3 BEstimate.py -gene BRAF -assembly GRCh38 -pamseq NGN -edit A -edit_to G -vep -ot -mm 3 -o ./output/ -ofile BRAF_ABE_NGN
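To make the meaning of -mm concrete: a candidate genomic site counts as a potential off target when it differs from the gRNA at no more than the given number of positions. The sketch below only illustrates that threshold with a substitution count; the genome-wide search itself is done by mrsFAST, and the function names here are hypothetical.

```python
def count_mismatches(guide, candidate):
    """Count positions where two equal-length sequences differ."""
    if len(guide) != len(candidate):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(guide, candidate) if a != b)


def within_mismatch_limit(guide, candidate, max_mismatches=4):
    """Return True if the candidate is within the allowed number of mismatches."""
    return count_mismatches(guide, candidate) <= max_mismatches


# With -mm 3, a site differing at one position is kept and one differing
# at four positions is not.
print(within_mismatch_limit("ACGTACGT", "ACGTACGA", max_mismatches=3))
print(within_mismatch_limit("ACGTACGT", "TGCAACGT", max_mismatches=3))
```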
BEstimate is the product of Cansu Dincer, Dr Matthew Coelho and Dr Mathew Garnett from the Garnett Group at the Wellcome Sanger Institute.
For any problems or feedback on BEstimate, you can contact here.
BEstimate, a Python module that systematically identifies guide RNA (gRNA) on and off target sites across given sequences for given Base Editors, and functional and clinical effects of the potential edits on the resulting proteins.
Copyright (c) 2020-2023 Genome Research Ltd.
Author: Cansu Dincer cd7@sanger.ac.uk
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
For policies regarding the underlying data, please also refer to: