http://www.gnu.org/licenses/agpl-3.0.en.html
Short read connector enables the comparison of two read sets B and Q. For each read from Q it provides either:
* The number of occurrences of each k-mer of the read in the set B (SRC_counter), or
* A list of reads from B that share enough k-mers with (a window of) the tested read from Q (SRC_linker).
Citation Camille Marchet, Antoine Limasset, Lucie Bittner, Pierre Peterlongo. A resource-frugal probabilistic dictionary and applications in (meta)genomics. 2016.
CMake 2.6+; see http://www.cmake.org/cmake/resources/software.html
c++ compiler; compilation was tested with gcc and g++ version>=4.5 (Linux) and clang version>=4.1 (Mac OSX).
# get a local copy of source code
git clone --recursive https://github.com/GATB/rconnector.git
# compile the code and run a simple test on your computer
cd gatb-rconnector
sh INSTALL
Binary releases for Linux and Mac OSX are provided within the "Releases" tab on the Github/rconnector web page.
Run a simple test looking for reads from data/c2.fasta.gz that share at least 20 kmers (k=25) with data/c1.fasta.gz. Kmers indexed from data/c1.fasta.gz are those occurring at least 2 times.
sh short_read_connector.sh -b data/c1.fasta.gz -q data/fof.txt
Calling SRC_linker between read sets bank and query:
sh short_read_connector.sh -b bank -q query
-c: use short_read_connector_counter (SRC_counter)
-r: with this option (incompatible with SRC_counter), no detail about pairs of similar reads is output. Only the ids of query reads similar to at least one read from the bank are output.
-p prefix. All out files will start with this prefix. Default="short_read_connector_res"
-g: with this option, if a file of solid kmers exists with the same prefix name and the same k value, then it is re-used and not re-computed.
-k value. Set the length of used kmers. Must fit the compiled value. Default=31
-f value. Fingerprint size. Size of the key associated to each indexed value, limiting false positives. Default=12
-G value. gamma value. MPHF expert users parameter - Default=2
-a: kmer abundance min (kmers from the bank seen fewer times than this value are not indexed). Default=2
-s: Minimal percentage of shared kmer span for considering 2 reads as similar. The kmer span is the number of bases from the query read covered by a kmer shared with the target read. If a read of length 80 has a kmer span of 60 with another read from the bank (of unknown size), then the percentage of shared kmer span is 75%. If at least one window (of size "windows_size") contains at least kmer_threshold percent of positions covered by shared kmers, the read couple is conserved.
-w: size of the window. If the window size is zero (default value), then the full read is considered.
-t: number of threads used. Default=0
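To make the kmer-span definition used by -s concrete, here is a minimal shell sketch that recomputes the example given for that option (a read of length 80 with a span of 60). The function name kmer_span_percent and its argument order are hypothetical. This is not the tool's code: it ignores the -w window option and assumes the 0-based start positions of the shared kmers on the query read are already known.

```shell
# Hypothetical helper: prints "<kmer span> <percentage of the read covered>".
# $1 = query read length, $2 = k, remaining arguments = 0-based start
# positions (on the query read) of the kmers shared with the target read.
kmer_span_percent() {
    read_len=$1
    k=$2
    shift 2
    printf '%s\n' "$@" | LC_ALL=C awk -v len="$read_len" -v k="$k" '
        NF {
            # mark every base covered by the kmer starting at position $1
            for (i = $1; i < $1 + k && i < len; i++) covered[i] = 1
        }
        END {
            span = 0
            for (i in covered) span++      # overlapping kmers count each base once
            printf "%d %.6f\n", span, 100 * span / len
        }'
}

# 30 consecutive shared 31-mers cover bases 0..59 of an 80-base read.
kmer_span_percent 80 31 $(seq 0 29)
# prints: 60 75.000000
```

With -s 75 this read pair would sit exactly on the threshold; whether the comparison is strict or inclusive is not stated in this document.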
Command:
sh short_read_connector.sh -b data/c1.fasta.gz -q data/fof.txt -c
First two lines of the output file:
#query_read_id mean median min max number of shared 31mers with banq read set data/c1.fasta.gz
0 3.614286 4 2 5
The first line is the file header. The second line can be decomposed as:
* 0: id of the query read (from the read set contained in fof.txt)
* 3.614286: mean number of occurrences of its k-mers (here with k=31) in the read set data/c1.fasta.gz
* 4: median number of occurrences of its k-mers (here with k=31) in the read set data/c1.fasta.gz
* 2: minimal number of occurrences of a k-mer from read 0 in the read set data/c1.fasta.gz
* 5: maximal number of occurrences of a k-mer from read 0 in the read set data/c1.fasta.gz
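The columns described above are whitespace-separated, so they can be post-processed with standard tools. As an illustration, the sketch below keeps the ids of query reads whose mean k-mer count reaches a threshold. The function name filter_by_mean is hypothetical, and the data is read from standard input because the exact name of the result file depends on the -p prefix.

```shell
# Hypothetical helper: print the ids of query reads whose mean number of
# k-mer occurrences in the bank is at least $1.
filter_by_mean() {
    LC_ALL=C awk -v min_mean="$1" '
        /^#/ { next }                          # skip the header line
        $2 + 0 >= min_mean + 0 { print $1 }    # column 2 is the mean
    '
}

# Applied to the two output lines shown above:
printf '%s\n' \
    '#query_read_id mean median min max number of shared 31mers with banq read set data/c1.fasta.gz' \
    '0 3.614286 4 2 5' | filter_by_mean 3
# prints: 0
```

The same pattern works for the median, min or max by testing column 3, 4 or 5 instead.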
Command:
sh short_read_connector.sh -b data/c1.fasta.gz -q data/fof.txt
First two lines of the output file:
#query_read_id [target_read_id-kmer_span (k=31)-kmer_span query percentage]* or U (unvalid read, containing not only ACGT characters or low complexity read)
1:676-93-93.000000 809-89-89.000000
The first line is the file header. The second line can be decomposed as:
* 1: id of the query read
* 676-93-93.000000: a target read and its pieces of information:
  * 676: id of the targeted read
  * 93: kmer span (number of positions of read 1 that are covered by at least one solid kmer present in read 676)
  * 93.000000: kmer-span percentage with respect to the length of read 1 (here 100)
* 809-89-89.000000: a second targeted read and its pieces of information (and so on).
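The line layout described above can be flattened into one row per (query, target) pair, which is often easier to sort or filter. The sketch below assumes the format is exactly as in the example (query id, a colon, then space-separated target-span-percentage triplets). The function name linker_to_table is hypothetical. Lines without a colon, and entries that are not a three-part triplet (such as the U marker, whose exact layout is not shown here), are skipped.

```shell
# Hypothetical helper: turn SRC_linker output into tab-separated rows
#   query_id  target_id  kmer_span  kmer_span_percentage
linker_to_table() {
    awk '
        /^#/ { next }                              # skip the header line
        {
            if (split($0, parts, ":") < 2) next    # no target list on this line
            query = parts[1]
            m = split(parts[2], targets, " ")      # one entry per target read
            for (i = 1; i <= m; i++) {
                if (split(targets[i], f, "-") != 3) continue
                printf "%s\t%s\t%s\t%s\n", query, f[1], f[2], f[3]
            }
        }'
}

# Applied to the two output lines shown above:
printf '%s\n' \
    '#query_read_id [target_read_id-kmer_span (k=31)-kmer_span query percentage]* or U' \
    '1:676-93-93.000000 809-89-89.000000' | linker_to_table
# prints (tab-separated):
# 1	676	93	93.000000
# 1	809	89	89.000000
```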
Note that with the -r option, only the id of each query read similar to at least one read from the bank is output. In this example the output would be limited to
#query_read_id
1
We use the file of files format. The input read sets are provided using a file of file(s). The file of file(s) contains on each line a read file or another file of file(s). Let's look at a few usual cases (indented entries list the lines of each file):
* Case1: I've a unique read set composed of a unique read file (reads.fq.gz).
  * fof.txt:
    * reads.fq.gz
* Case2: I've a unique read set composed of a couple of read files (reads_R1.fq.gz and reads_R2.fq.gz). This may be the case with paired-end sequencing.
  * fof.txt:
    * fof_reads.txt
  * with fof_reads.txt:
    * reads_R1.fq.gz
    * reads_R2.fq.gz
* Case3: I've two read sets each composed of a unique read file: reads1.fq.gz and reads2.fq.gz.
  * fof.txt:
    * reads1.fq.gz
    * reads2.fq.gz
* Case4: I've two read sets each composed of two read files: reads1_R1.fq.gz and reads1_R2.fq.gz, and reads2_R1.fq.gz and reads2_R2.fq.gz.
  * fof.txt:
    * fof_reads1.txt
    * fof_reads2.txt
  * with fof_reads1.txt:
    * reads1_R1.fq.gz
    * reads1_R2.fq.gz
  * with fof_reads2.txt:
    * reads2_R1.fq.gz
    * reads2_R2.fq.gz
* and so on...
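As an illustration of Case4, the three files can be written with a few printf calls. This is only a sketch: the function name make_fof_case4 is hypothetical, and the read files themselves are assumed to exist already.

```shell
# Hypothetical helper: write the Case4 files of files into directory $1.
make_fof_case4() {
    dir=$1
    # one file of files per read set, listing its two read files
    printf '%s\n' reads1_R1.fq.gz reads1_R2.fq.gz > "$dir/fof_reads1.txt"
    printf '%s\n' reads2_R1.fq.gz reads2_R2.fq.gz > "$dir/fof_reads2.txt"
    # top-level file of files: one line per read set
    printf '%s\n' fof_reads1.txt fof_reads2.txt > "$dir/fof.txt"
}

workdir=$(mktemp -d)
make_fof_case4 "$workdir"
cat "$workdir/fof.txt"
# prints:
# fof_reads1.txt
# fof_reads2.txt
```

Whether relative paths inside a file of files are resolved against the current directory or against the location of the file is not stated here, so absolute paths may be the safer choice.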
Contact: Pierre Peterlongo: pierre.peterlongo@inria.fr