Robert J. Gifford edited this page Jun 23, 2024 · 7 revisions

Screening genomes in silico

Sequence similarity search tools, such as the Basic Local Alignment Search Tool (BLAST), are essential for biological sequence analysis. They detect regions of local similarity between molecular sequences, which supports two broad uses. First, they can characterize a locus in detail by identifying the coordinates of specific sequence features at the protein or nucleic acid level (e.g., conserved protein motifs, oligonucleotide primer sites). Second, they serve as a 'search engine' for retrieving similar sequences from databases, where similarity may indicate an evolutionary relationship. This second function is particularly useful for comparative and evolutionary studies, especially given the rapid accumulation of sequence data.
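As a rough illustration of how feature coordinates can be recovered from a similarity search, the sketch below parses one line of BLAST+ tabular output (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The `Hit` type and `parse_hit` function are hypothetical names chosen for this example, and the sketch assumes the default column set; it is not part of DIGS.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record holding a subset of one tabular BLAST hit."""
    query_id: str
    subject_id: str
    percent_identity: float
    subject_start: int   # always the lower coordinate on the subject
    subject_end: int     # always the higher coordinate on the subject
    strand: str          # "+" or "-", inferred from the coordinate order
    evalue: float


def parse_hit(line: str) -> Hit:
    """Parse one line of default 12-column tabular output (assumed layout)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 tab-separated columns, got {len(cols)}")
    sstart, send = int(cols[8]), int(cols[9])
    # For nucleotide subjects, a match on the reverse strand is reported
    # with sstart greater than send.
    strand = "+" if sstart <= send else "-"
    return Hit(
        query_id=cols[0],
        subject_id=cols[1],
        percent_identity=float(cols[2]),
        subject_start=min(sstart, send),
        subject_end=max(sstart, send),
        strand=strand,
        evalue=float(cols[10]),
    )
```

Normalizing the subject coordinates and keeping the strand as a separate field makes it easier to compare or merge hits later, whichever orientation they were found in.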

The basic functions of BLAST can be expanded into comprehensive investigative strategies for comparative analysis of genes and genomes. This might involve using different combinations of probe sequences and target databases or integrating BLAST searches with other sequence analysis methods (e.g., phylogenetic or statistical analysis).

BLAST-based approaches are particularly useful for investigating genomic features that are poorly annotated in public databases, such as small RNAs, pseudogenes, transposable elements, highly duplicated gene families, and endogenous viral elements (EVEs). More broadly, BLAST searches can underpin heuristic in silico investigations, where the overall strategy is loosely defined and requires multiple iterations of trial and error, using new information from each iteration to refine the approach.

While systematic BLAST screens of genome databases are crucial for many comparative genomics investigations, efficiently implementing these procedures and integrating them into bioinformatics workflows can be technically challenging.

The Database-Integrated Genome-Screening (DIGS) tool aims to provide a robust and extensible framework for systematic, BLAST-based in silico screens of molecular sequence databases and for interrogating the resulting data.

Input Data Components

  1. Target Database (TDb): A collection of whole genome sequence or transcriptome assemblies serving as the target for similarity searches.
  2. Query Sequences (Probes): Input sequences for similarity searches of the Target Database.
  3. Reference Sequence Library (RSL): A set of sequences representing the genetic diversity associated with the genome feature(s) under investigation.
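The three components above can be pictured as a simple data structure. The sketch below is a minimal illustration only: `ScreenInputs` and `build_search_commands` are hypothetical names, not DIGS's own configuration format or API. It also assumes protein probes searched against nucleotide assemblies, which is why it builds BLAST+ `tblastn` commands, and it only constructs the command lines rather than running them.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ScreenInputs:
    """Hypothetical container for the three input data components."""
    target_databases: tuple[Path, ...]  # TDb: BLAST-formatted assemblies
    probes: Path                        # FASTA file of query (probe) sequences
    reference_library: Path             # FASTA file holding the RSL


def build_search_commands(inputs: ScreenInputs,
                          evalue: float = 1e-5) -> list[list[str]]:
    """Build one tblastn command per target database, without executing it.

    Assumes protein probes and nucleotide targets; the e-value threshold
    is an arbitrary illustrative default.
    """
    commands = []
    for db in inputs.target_databases:
        commands.append([
            "tblastn",
            "-query", str(inputs.probes),
            "-db", str(db),
            "-evalue", str(evalue),
            "-outfmt", "6",  # tabular output
        ])
    return commands
```

Each returned list could be passed to `subprocess.run` on a system where BLAST+ is installed. The reference library is carried along in the structure but not used by this search step.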