iMARGI-Docker

iMARGI-Docker distributes the iMARGI sequencing data processing pipeline

1. Description

In situ MARGI (iMARGI) is a sequencing technique that determines, genome-wide, the potential genomic interaction loci of chromatin-associated RNAs (caRNAs). To minimize variation in data processing and improve the reproducibility of analyses, we developed a complete pipeline that standardizes the data processing steps. iMARGI-Docker is a Docker image built to make running that pipeline more convenient.

This repo hosts the iMARGI-Docker source code with a brief introduction. For more details on analyzing iMARGI data with iMARGI-Docker, please read our comprehensive online documentation.

We hope every user can run the iMARGI pipeline with iMARGI-Docker. However, some old machines or operating systems might not support Docker. In that case, you have to configure all the tools used by the pipeline and run it locally. This is only possible on Linux or macOS, and it requires solid experience in Linux system configuration. Please read the installation dependencies section of the iMARGI pipeline documentation for details.

If you encounter any problems, please open an issue in this GitHub repo.

2. Repository Contents

  • src: source code, such as the Dockerfile of iMARGI-Docker
  • data: a small chunk of data for testing
  • docs: source files of the documentation

3. Installation Guide

3.1. Hardware Requirements

There are no specific high-performance hardware requirements for running iMARGI-Docker. However, iMARGI generates a huge amount of sequencing data, usually more than 300 million read pairs, so a high-performance computer will save you a lot of time. In general, a faster multi-core CPU, more memory, and more hard drive storage will benefit you a lot. We suggest the following specs:

  • CPU: At least a dual-core CPU. More CPU cores will speed up the processing.

  • RAM: 16 GB. The requirement depends on the size of the reference genome. For the human genome, BWA requires at least 8 GB of free memory, so the machine needs more than 8 GB in total, which in practice usually means 16 GB. Running out of memory will cause an error.

  • Hard drive storage: Depends on your data. Typically at least 160 GB of free space is required for 300M 2x100 read pairs. Fast storage, such as an SSD, is better.
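As a rough pre-flight check on a Linux host, the suggested specs can be compared against what the machine reports. This is a minimal sketch and not part of iMARGI-Docker: the function names are made up for illustration, it assumes nproc, df, awk and a /proc/meminfo with a MemAvailable field, and it checks available memory against the 8 GB that BWA needs rather than the 16 GB total.

```shell
# Hypothetical helper functions, not shipped with iMARGI-Docker.
# Assumes a Linux host with nproc, df, awk and /proc/meminfo (MemAvailable).

# Convert a size in KiB to whole GiB (integer division, rounds down).
kib_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Print PASS or WARN for a measured value against a minimum.
# $1 = label, $2 = measured value, $3 = minimum
report() {
    if [ "$2" -ge "$3" ]; then
        echo "PASS $1: $2 (minimum $3)"
    else
        echo "WARN $1: $2 (minimum $3)"
    fi
}

# Compare this machine with the suggested specs.
# $1 = directory that will hold the data (defaults to the current one)
preflight() {
    dir="${1:-.}"
    cores=$(nproc)
    mem_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    disk_kib=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    report "CPU cores" "$cores" 2
    report "Available RAM (GiB)" "$(kib_to_gib "$mem_kib")" 8
    report "Free disk (GiB)" "$(kib_to_gib "$disk_kib")" 160
}
```

Running preflight ~/imargi_example prints one PASS or WARN line per resource. A WARN does not necessarily mean the run will fail, for example when your data set is smaller than 300M read pairs.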

3.2. Software Requirements

iMARGI-Docker only requires Docker installed on your computer. You can use Docker Community Edition (CE).

Although Docker supports all the mainstream operating systems, such as Linux, Windows and macOS, we strongly recommend using Linux, because it is much easier to set up and its filesystem handles large files better. You can install Docker CE with only two commands on well-supported 64-bit Linux distributions, including Ubuntu, Debian, Fedora, and CentOS.

Keep in mind that all the example command lines here and in the documentation are based on a Linux system (Ubuntu). Most of the time, the operations on macOS are the same as on Linux, as it is also a Unix system. However, if you are using Windows, some command lines need to be modified. In addition, Docker needs extra configuration on Windows and macOS.

3.2.1. Docker Installation

First of all, check whether Docker is already installed on your system. On Linux, run the command docker -v in a terminal. If the output shows the Docker version, such as Docker version 18.09.5, build e8ff056, Docker is installed. On macOS and Windows, check your Application / Program list for Docker Desktop or Docker Toolbox.
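If you want the check to give a clear message when Docker is absent, it can be wrapped as follows. This is only a sketch; docker_status is a name invented for this example.

```shell
# Hypothetical wrapper around "docker -v"; not part of iMARGI-Docker.
# Prints the Docker version if the docker command is on PATH,
# otherwise prints a short hint.
docker_status() {
    if command -v docker >/dev/null 2>&1; then
        docker -v
    else
        echo "Docker not found on PATH"
    fi
}

docker_status
```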

Here are some essential instructions for installing Docker on different systems. Installing Docker on Linux is the easiest.

If you are using macOS or Windows, see the Technical Notes on installing Docker on different systems.

3.2.2. Start Docker service

After installation, you need to start the Docker service (Docker daemon).

On some Linux systems, such as Ubuntu, the Docker service might start automatically after installation. You can check by running the hello-world test container with the command below. If the Docker service has been started, the output will tell you that "your installation appears to be working correctly".

# test Docker service
docker run --rm hello-world

If the service hasn't been started, start it with the command for your Linux distribution, then test again.

  • Ubuntu, Debian, Fedora: sudo service docker start

  • CentOS: sudo systemctl start docker

For macOS and Windows users, you need to start the Docker Desktop or Docker Toolbox application.
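On Linux, the choice between the two start commands listed above can be scripted. The sketch below is an illustration only: docker_start_cmd is a made-up name, and it assumes the distribution is identified by the ID field of /etc/os-release (ubuntu, debian, fedora, centos).

```shell
# Hypothetical helper: print the start command listed above for a distribution.
# $1 = distribution ID as found in /etc/os-release (e.g. ubuntu, centos)
docker_start_cmd() {
    case "$1" in
        ubuntu|debian|fedora) echo "sudo service docker start" ;;
        centos)               echo "sudo systemctl start docker" ;;
        *)                    echo "unknown distribution: $1" >&2
                              return 1 ;;
    esac
}
```

For example, . /etc/os-release && docker_start_cmd "$ID" prints the command for the current machine. It only prints the command, so you can review it before running it.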

3.2.3. Docker settings (macOS or Windows)

On macOS and Windows, Docker limits CPU and memory; the defaults are 1 CPU core and 2 GB of memory. That is far below what BWA needs for the human genome and will cause an error, so the memory limit must be raised to more than 8 GB. If you have 4 CPU cores, it is also better to raise the CPU limit.

Here are simple instructions for changing the settings.

  • If you are using Docker Desktop for Windows or macOS, right-click the Docker icon (whale) in the task bar, then go to Settings -> Advanced to change the memory and CPU limits. More details can be found in the official Docker docs: Get started with Docker for Windows and Get started with Docker Desktop for Mac.

  • If you are using Docker Toolbox for Windows or macOS, which uses VirtualBox as its backend, open VirtualBox, stop the default VM, select it, click Settings, and make the changes you want.

Docker on Linux has no such limits, so no changes are needed.
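To confirm what the Docker daemon actually has available after changing the settings, you can query docker info. The sketch below assumes the NCPU and MemTotal fields of docker info --format; the function names are invented for this example.

```shell
# Hypothetical helpers; not part of iMARGI-Docker.

# Succeed when the given byte count is more than 8 GiB.
# $1 = memory size in bytes
mem_is_enough() {
    [ "$1" -gt $(( 8 * 1024 * 1024 * 1024 )) ]
}

# Report the CPU count and memory visible to the Docker daemon.
check_docker_limits() {
    cpus=$(docker info --format '{{.NCPU}}')
    mem=$(docker info --format '{{.MemTotal}}')
    echo "Docker sees $cpus CPU(s) and $mem bytes of memory"
    if mem_is_enough "$mem"; then
        echo "memory limit OK for BWA on the human genome"
    else
        echo "memory limit too low: raise it above 8 GB"
    fi
}
```

Run check_docker_limits once the Docker service is up.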

3.3. iMARGI-Docker Installation

We recommend pulling the iMARGI-Docker image from Docker Hub. You can also rebuild it on your own machine with the source files in the src folder.

If you cannot use Docker, please read the installation section of the iMARGI pipeline documentation for alternative instructions.

3.3.1. Pull from Docker Hub

Once Docker is installed, it's easy to install iMARGI-Docker by pulling it from Docker Hub. The latest iMARGI-Docker image on Docker Hub is based on the most recent stable release. Installation takes about 10 seconds, depending on your network speed.

docker pull zhonglab/imargi

3.3.2. Build with Dockerfile

Instead of pulling from Docker Hub, you can also build the iMARGI-Docker image on your own computer. We provide all the source code for building iMARGI-Docker in the src folder, including the Dockerfile and all the script tools. You can download the most recent stable release or git clone the master branch, then modify the source and rebuild your own iMARGI-Docker image. The build takes several minutes, depending on your computer's performance and network speed. Currently, the stable release is v1.0, which is the master branch.
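A build from source might look like the sketch below. It rests on assumptions: the repository URL is derived from the repo name Zhong-Lab-UCSD/iMARGI-Docker, the Dockerfile is taken to sit directly in src as described above, and the function name and the image tag passed to it are placeholders of our own.

```shell
# Hypothetical helper: clone the repository and build the image from src.
# $1 = repository URL, $2 = tag for the locally built image
build_imargi() {
    git clone "$1" iMARGI-Docker &&
    docker build -t "$2" iMARGI-Docker/src
}

# Example call (not executed here; needs network access and a running Docker service):
# build_imargi https://github.com/Zhong-Lab-UCSD/iMARGI-Docker.git imargi-local
```

If you build the image under your own tag, use that tag in place of zhonglab/imargi in later docker run commands.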

4. Software Testing Demo

To test whether you have successfully deployed iMARGI-Docker, follow the instructions below to do a demo test run.

4.1. Testing Data

4.1.1. iMARGI sequencing data (paired FASTQ)

As real iMARGI sequencing data sets are always very large, we randomly extracted a small chunk of real data for software testing. The data can be downloaded from the following links.

4.1.2. Reference genome data (FASTA)

You also need to download a human genome reference FASTA file. We use the reference genome used by the 4D Nucleome and ENCODE projects.

The FASTA file of the reference genome is too large for us to host in the GitHub repo. You can download it using the link:

It needs to be decompressed with the gunzip -d or gzip -d command on Linux/macOS. On Windows, you can use 7-Zip or other software to decompress the .gz file. You can also use the gunzip tool delivered in iMARGI-Docker.
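On Linux or macOS, the decompression step can be wrapped in a small function that refuses files without a .gz suffix. This is a sketch with an invented function name; the compressed file name in the usage note is an assumption based on the decompressed name used later in this README.

```shell
# Hypothetical helper: decompress a .gz file in place with gunzip.
# The .gz file is replaced by the decompressed file, whose path is printed.
# $1 = path to the compressed file
decompress_gz() {
    case "$1" in
        *.gz) ;;
        *) echo "not a .gz file: $1" >&2
           return 1 ;;
    esac
    gunzip -d "$1" && echo "${1%.gz}"
}
```

For example, decompress_gz ref/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz would leave the .fasta file in the ref directory.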

4.1.3. bwa index data

As building the bwa index takes a long time (more than 1 hour), we suggest downloading our pre-built index files for the reference genome. Please download the following gzip-compressed bwa_index folder and decompress it (tar zxvf) on your machine.

We assume that you put the data and reference files in the following directory structure.

~/imargi_example
    ├── data
    │   ├── sample_R1.fastq.gz
    │   └── sample_R2.fastq.gz
    ├── output
    └── ref
        ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
        └── bwa_index
            ├── bwa_index_hg38.amb
            ├── bwa_index_hg38.ann
            ├── bwa_index_hg38.bwt
            ├── bwa_index_hg38.pac
            └── bwa_index_hg38.sa
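The empty skeleton of this layout can be created up front, before moving the downloaded files into place. A minimal sketch, with make_layout as an invented name; the bwa_index folder is not created because it comes from extracting the downloaded archive inside ref.

```shell
# Hypothetical helper: create the data, output and ref directories.
# $1 = base directory (defaults to ~/imargi_example)
make_layout() {
    base="${1:-$HOME/imargi_example}"
    mkdir -p "$base/data" "$base/output" "$base/ref"
    echo "$base"
}
```

After running make_layout, put the two FASTQ files in data and the FASTA file and bwa_index folder in ref.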

4.2. Testing Command

One command line runs the whole pipeline on the testing data.

cd ~/imargi_example

# replace "-u 1043" with your own UID, see the tips below
# replace "-v ~/imargi_example:/imargi" with your working directory if not ~/imargi_example

docker run --rm -t -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi \
    imargi_wrapper.sh \
    -r hg38 \
    -N test_sample \
    -t 4 \
    -g ./ref/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
    -i ./ref/bwa_index/bwa_index_hg38 \
    -1 ./data/sample_R1.fastq.gz \
    -2 ./data/sample_R2.fastq.gz \
    -o ./output

Tips:

  • --rm: By default, a container’s file system persists even after the container exits, so container file systems can really pile up. The --rm option automatically cleans up the container after it exits.

  • -t: Allocate a pseudo-TTY. With -t, you can use Ctrl + c to stop the run in the terminal. Without -t, you have to use docker ps -a to find the container ID of your run and then use docker stop <container_id> to stop it.

  • -u 1043: Run the container with your own Linux UID (use the id command to check your UID) to avoid file/directory permission problems.

  • -v ~/imargi_example:/imargi: Mounts the ~/imargi_example directory on your host machine to the workspace of the running Docker container. The path must be a full path. The example was run on a Linux computer. If you run it on a Windows computer, the path is a little different. For example, the Windows path D:\test\imargi_example needs to be rewritten as /d/test/imargi_example, so the -v argument becomes -v /d/test/imargi_example:/imargi. When you run it on Windows, a window might pop up asking you to confirm that you want to share the folder.

  • The command line is long, so \ is used to split it across multiple lines in the example. This is the Linux/macOS style. On Windows, you need to replace \ with ^.

  • -i: Building the BWA index takes a lot of time, so we used the pre-built index files with the -i argument. If you don't supply BWA index files, imargi_wrapper.sh will generate them automatically from the reference genome sequence supplied with the -g parameter. Building the BWA index needs a large amount of memory, hence the 16 GB requirement. Other arguments accept pre-generated files, such as -R for the restriction fragment BED file (the automatically generated file is named AluI_frags.bed.gz) and -c for the chromsize file. See more details in the command line API section of the documentation.
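The -u and -v tips can be folded into small helpers so that the UID is looked up with id -u instead of being typed by hand, and a Windows path is rewritten in the style described above. This is a sketch; both function names are invented for illustration.

```shell
# Hypothetical helpers; not part of iMARGI-Docker.

# Run a command in the iMARGI-Docker container as the current user.
# $1 = absolute path of the host working directory, remaining arguments = command
run_imargi() {
    workdir="$1"
    shift
    docker run --rm -t -u "$(id -u)" -v "$workdir":/imargi zhonglab/imargi "$@"
}

# Rewrite a Windows path such as D:\test\imargi_example
# in the /d/test/imargi_example form expected by -v.
win_to_docker_path() {
    drive=$(printf '%s' "${1%%:*}" | tr '[:upper:]' '[:lower:]')
    rest=$(printf '%s' "${1#*:}" | tr '\\' '/')
    printf '/%s%s\n' "$drive" "$rest"
}
```

With these, the test run becomes run_imargi ~/imargi_example imargi_wrapper.sh followed by the same -r, -N, -t, -g, -i, -1, -2 and -o arguments as in the command above.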

4.3. Testing Results

4.3.1. Running Time Profile

It took about 10 minutes to run the pipeline on our computer (with the -i bwa index argument).

| Step | Time | Speed-up suggestion |
| --- | --- | --- |
| Generating chromosome size file | 10 sec | It's fast, but you can also supply it with -c if you've generated it before. |
| Generating bwa index (skipped) | 75 min | Supply with -i if you've pre-built index files. |
| Generating restriction fragment file | 4 min | Supply with -R if you've already created it before. |
| Cleaning | 10 sec | It's fast and not parallelized. |
| bwa mapping | 2 min | More CPU cores with -t. |
| Interaction pair parsing | 1 min | More CPU cores with -t. |

4.3.2. Expected Result files

The output files are in the folder assigned with the -o argument. The final output .pairs format file for further analysis is final_test_sample.pairs.gz. Intermediate output files of each step are in the clean_fastq, bwa_output, and parse_temp sub-directories of the output directory. The generated chromosome size file, bwa index folder and restriction fragment BED file are all in the ref directory, alongside the reference genome FASTA file. There is also a simple stats file, pipelineStats_test_sample.log, which reports the sequencing mapping QC result (passed or failed), the total number of processed read pairs, BWA mapping stats, and the number of valid RNA-DNA interactions in the final .pairs.gz file. For more details, please check the output file descriptions in the documentation.
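For a quick look at the final file, you can count its records yourself. The sketch below relies on the .pairs convention that header lines start with #; count_pairs is an invented name, and the count is only a sanity check, not a replacement for the stats file.

```shell
# Hypothetical helper: count the data lines (non-header lines) of a gzipped .pairs file.
# $1 = path to a .pairs.gz file
count_pairs() {
    gzip -dc "$1" | grep -vc '^#'
}
```

For example, count_pairs output/final_test_sample.pairs.gz prints the number of interaction records in the final file.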

Here is the final directory structure after completing the pipeline.

~/imargi_example/
    ├── data
    │   ├── sample_R1.fastq.gz
    │   └── sample_R2.fastq.gz
    ├── ref
    │   ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
    │   ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai
    │   ├── chromsize.hg38.txt
    │   ├── AluI_frags.bed.gz
    │   ├── AluI_frags.bed.gz.tbi
    │   └── bwa_index
    │       ├── bwa_index_hg38.amb
    │       ├── bwa_index_hg38.ann
    │       ├── bwa_index_hg38.bwt
    │       ├── bwa_index_hg38.pac
    │       └── bwa_index_hg38.sa
    └── output
        ├── bwa_output
        │   ├── bwa_log_test_sample.txt
        │   └── test_sample.bam
        ├── clean_fastq
        │   ├── clean_test_sample_R1.fastq.gz
        │   └── clean_test_sample_R2.fastq.gz
        ├── parse_temp
        │   ├── dedup_test_sample.pairs.gz
        │   ├── drop_test_sample.pairs.gz
        │   ├── duplication_test_sample.pairs.gz
        │   ├── sorted_all_test_sample.pairs.gz
        │   ├── stats_dedup_test_sample.txt
        │   ├── stats_final_test_sample.txt
        │   └── unmapped_test_sample.pairs.gz
        ├── final_test_sample.pairs.gz
        └── pipelineStats_test_sample.log
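To confirm that a run produced the main files shown in this tree, a small check such as the following can help. It is a sketch only: check_outputs is an invented name, and it assumes the file names follow the pattern above with test_sample replaced by the name given to -N.

```shell
# Hypothetical helper: report expected result files that are missing.
# $1 = output directory (the -o argument), $2 = sample name (the -N argument)
# Prints one "missing: <path>" line per absent file and returns 1 if any is absent.
check_outputs() {
    out="$1"
    name="$2"
    status=0
    for f in \
        "final_${name}.pairs.gz" \
        "pipelineStats_${name}.log" \
        "bwa_output/${name}.bam" \
        "clean_fastq/clean_${name}_R1.fastq.gz" \
        "clean_fastq/clean_${name}_R2.fastq.gz"
    do
        if [ ! -e "$out/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}
```

For the demo run, check_outputs ~/imargi_example/output test_sample prints nothing and returns 0 when all five files exist.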

License

iMARGI-Docker source code is licensed under the BSD 2 license.