Skip to content
Viginesh Vaibhav edited this page Jun 18, 2019 · 1 revision

SiLiCO

Installation

Highly recommended that you create a new Conda environment for using SiLiCO, since its dependencies require Python version 2.7.16

Libraries required for installation:

  1. numpy
  2. pybedtools*
  3. natsort
  • NOTE: pybedtools requires another package called 'pysam'. However, SiLiCO requires pysam version 0.8.4, while that would make it incompatible with the current version of pybedtools. This incompatibility makes it difficult to install SiLiCO with the required dependencies. To work around this, use the following steps:
  1. pybedtools uses the BEDtools package, which can be installed by running the following command: conda install -c bioconda bedtools

If that doesn't work, run the following commands one after the other:

  1. If you run into some errors due to missing header files, find the required apt-get command to install that particular library on your version of Linux/MacOS. Some examples (which may be of use for you) are:
  • sudo apt-get install -y liblzma-dev
  • sudo apt-get install libbz2-dev
  • sudo apt install zlib1g-dev
  1. Install pybedtools with the following command: pip2 install pybedtools

  2. Then, uninstall the currently installed pysam package with the following command: pip2 uninstall pysam

  3. After that, install pysam version 0.8.4 using the following command: pip2 install --no-cache-dir pysam=0.8.4 (Here, the --no-cache-dir flag will ensure that pip doesn't use a cached version of pysam to install it, as the cached version may not be version 0.8.4)

  4. Having installed everything above, download the github repository for SiLiCO, unzip it and navigate to its directory (cd SiLiCO-master)

  5. Run the following commands from the SiLiCO directory:

  • python2 setup.py build
  • python2 setup.py install

You should now have SiLiCO successfully installed and ready to go.

Basic commands

INPUT

General Syntax

python2 SiLiCO.py -i </path/to/genome> [-o </path/to/outDir > -m < mean read length > -s < standard dev of read lengths > -c < coverage > --trials=< trials >]

Only thing required is the input genome. All other flags are optional. Default output directory is the current folder.

Input Genome File Formatting (IMPORTANT)

SiLiCO requires the input genomes to be fasta files, and they should be a chromosomal assembly with header lines in the style >chr1, >chr2, etc. Any other header lines will cause SiLiCO to ignore the genome sequence below that header.

In addition to the above, you will have to remove '\r' and other unexpected newline characters in the input FASTA file as follows:

  • tr -d '\r' < Input Genome FASTA> < Temporary FASTA > => command to remove the '\r' from the input FASTA file and store the output in a temporary FASTA file.
  • mv < Temporary FASTA > < Input Genome FASTA> => command to move the temporary FASTA file back into the original input FASTA file

File I/O flags

  • -i, --infile=< str >, REQ => Input genome fasta file. See README for formatting requirments**.
  • -o, --output=< str >, OPT: => Output directory for results. Default = Current directory

Distribution Parameters

  • -m, --mean_read_length=< int >, OPT => Mean read length for in-silico read generation. Default = 10000 bp
  • -s, --standard_dev=< int >, OPT => Standard deviation of in-silico reads. Default = 2050
  • -c, --coverage=< int >, OPT => Desired genome coverage of in-silico sequencing. Default = 8
  • --trials=< int >, OPT => Number of trials. Default = 1

Modes

  • -f, --fasta, OPT => FASTA Mode. When present, converts bed files to FASTA sequences using the provided reference genome.
  • -n, --nanopore, OPT => Generate Oxford Nanopore data. Calculates a gamma distribution.
  • -p, --pacbio, OPT => Generate PacBio data. Calculates a log normal distribution. Default mode if none specified.

Documentation/Help flags

  • -h, --help => Display this message.
  • --version => What version of SiLiCO are you using?
  • --contact => Report a bug or get more help.
  • --citation => View the citation for SiLiCO.

OUTPUT

SiLiCO will generate .bed files for each trial you run on the input genome. By default, it will run one trial, and generate one .bed output file. Each .bed file will contain start and end positions of the input genome's subsequences, as the simulated nanopore reads for that genome. The best way to parse through the generated bed files is to use the -f flag when running SiLiCONE. But before you do that, please make sure to remove '\r' and other unexpected newlines in the input FASTA file. You can do so as follows:

  • tr -d '\r' < Input Genome FASTA> < Temporary FASTA > => command to remove the '\r' from the input FASTA file and store the output in a temporary FASTA file.
  • mv < Temporary FASTA > < Input Genome FASTA> => command to move the temporary FASTA file back into the original input FASTA file

Another good way to automatically parse through the .bed files and get the required (simulated) reads is to make use of the getfasta command from the bedtools package.

Usage: bedtools getfasta [OPTIONS] -fi < input FASTA > -bed <BED/GFF/VCF> -fo < output FASTA >

Other options:

  • -name => Use the “name” column in the BED file for the FASTA headers in the output FASTA file.
  • -tab => Report extract sequences in a tab-delimited format instead of in FASTA format.
  • -bedOut => Report extract sequences in a tab-delimited BED format instead of in FASTA format.
  • -s => Force strandedness. If the feature occupies the antisense strand, the sequence will be reverse complemented. Default: strand information is ignored.
  • -split => Given BED12 input, extract and concatenate the sequences from the BED “blocks” (e.g., exons)

NOTE

  • The -fo flag is optional, and by default, the output is given to stdout. The best way to store the output from getfasta is to use -fo and provide an output FASTA file name.
  • This command will have to be run for every .bed file individually. They cannot be batched into a single command all at once.

Example (Mycobacterium Tuberculosis Genome)

  1. Download the Mycobacterium Tuberculosis genome from here: wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/585/GCF_000008585.1_ASM858v1/GCF_000008585.1_ASM858v1_genomic.fna.gz

  2. Copy the contents of the file to a new file, and save the new file as a .fasta file (e.g. Mycobacterium_tuberculosis_cdc1551.fasta)

  3. Ensure that the genome is a chromosomal assembly, with headers of the style >chr1, >ch2, etc. If not, convert the header lines to this style. Ensure that >chr1, >chr2, etc. are all in lowercase.

  4. There may be issues with the newline convention used in the FASTA files (especially on Windows). By default, Windows adds a '\r' character to the end of a line to denote newline. However, this will be misinterpreted by the getfasta command and will throw errors, such as, "WARNING. chromosome (chr1) was not found in the FASTA file. Skipping". To avoid this, we need to remove the '\r' from the input FASTA file as follows:

    • tr -d '\r' <Mycobacterium_tuberculosis_cdc1551.fasta> tmp.fa => command to remove the '\r' from the FASTA file and store the output in a temporary FASTA file tmp.fa
    • mv tmp.fa Mycobacterium_tuberculosis_cdc1551.fasta => command to move the temporary FASTA file back into the original input FASTA file
  5. Create a new directory for your output folder: mkdir silico_output

  6. Now, run the following command: python2 SiLiCO.py -i Mycobacterium_tuberculosis_cdc1551.fasta -o silico_output -f

  7. The above command should generate an output file simulated_read_positions_trial_0.bed.gz in the silico_output directory. Unzip simulated_read_positions_trial_0.bed.gz and open the extracted simulated_read_positions_trial_0.bed file. It's output may look like this:

chr1 2463060 2472124 trial_0_sim_read_0 chr1 632692 643493 trial_0_sim_read_1 chr1 1564695 1575124 trial_0_sim_read_2 chr1 1372638 1382310 trial_0_sim_read_3 ............ and so on.

  1. Another example command that could be tried for the above genome: python2 SiLiCO.py -i Mycobacterium_tuberculosis_cdc1551.fasta -o silico_output -m 3000 -s 100 -c 20 --trials=10 -f

Here, we set the mean read length to 3000 base pairs, standard deviation to 100 base pairs, coverage to 20 and run SiLiCO for 10 trials. This will produce 10 output files simulated_read_positions_trial_0.bed.gz to simulated_read_positions_trial_9.bed.gz, with .bed files for each trial (from 0 to 9, for 10 trials in total). The output files with the simulated reads will be produced in the same folder as the .bed files.

Issues With SiLiCO

The -f/--fasta flag is supposed to generate a fasta files with all the reads as subsequences of the .bed files directly. However, this may not work as expected, especially on Windows systems, because of issues with unexpected newline characters (such as '\r') being added to the input FASTA file by the operating system. The workaround for this issue is to use getfasta from bedtools after removing the '\r' from the input FASTA file (as described in the previous sections).

Bioinformatics tools

These are a growing collection of manuals for commonly used bioinformatics tools.

How to use

Just go to the page for the tool you are trying to use, and scroll through the page to download and install. That simple. The goal is to add extra documentation for using these tools, in addition to what is already supplied by the manual pages for the programs.

Clone this wiki locally