FindNeighbour is a server application for investigating bacterial relatedness. Accessible via RESTful webservices, FindNeighbour maintains an in-memory distance matrix on the server for a sequence collection, which is automatically cached to disc. It supports incremental addition of samples and, for a given sample, allows queries identifying similar sequences with millisecond response times.
The inputs to the service are strings containing DNA sequence information, typically generated by mapping and basecalling, followed by storage in FASTA or other formats. The service can be queried with strings containing DNA sequence information and a single nucleotide polymorphism (SNP) threshold; it returns a list of similar samples. The software is designed for, and has been extensively tested with, mapped data from bacterial genome sequencing.
FindNeighbour comprises several components.
- Python code, built on web.py, handles API calls. The component doing this is webservice-server.py.
- A C++ daemon which is called by webservice-server.py. This code is findNeighbour.cpp.
Tools are also provided to launch one or more instances of the webservice.
findNeighbour.cpp can be compiled and run on Linux (using gcc). We have not tested it on Windows, but expect it would work.
It uses the C++ Standard Library 14. OpenMP is required for parallelisation.
Server memory requirements depend on the number of samples stored, the amount of variation between them, and their length.
We have tested it with bacterial genomes. Approximate server memory requirements are 2 GB for 400 samples, and 20 GB for 4,000 samples, of mapped M. tuberculosis data (4.4 million bases/genome).
Hosted on an Ubuntu Linux server with 128 GB of RAM, using 16 threads, the server takes about 1 second to add a sample to a collection of 1,000 M. tuberculosis samples. Addition time scales linearly with the number of samples in the sequence store.
Queries requesting samples similar to a sample return with ~ 50 msec response times.
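As a rough planning aid, the two memory figures quoted above can be turned into a back-of-envelope estimate. The sketch below assumes memory grows linearly between and beyond those two points, which is an assumption rather than a measured property: actual usage also depends on sequence length and on the amount of variation between samples. The function name `estimate_memory_gb` is hypothetical.

```python
# Reference points quoted in the text for mapped M. tuberculosis data:
# (number of samples, approximate server memory in GB).
REFERENCE_POINTS = [(400, 2.0), (4000, 20.0)]


def estimate_memory_gb(n_samples, points=REFERENCE_POINTS):
    """Linear estimate through the two quoted points (an assumption)."""
    (n0, gb0), (n1, gb1) = points
    gb_per_sample = (gb1 - gb0) / (n1 - n0)
    return gb0 + gb_per_sample * (n_samples - n0)


if __name__ == "__main__":
    for n in (400, 1000, 4000):
        print(n, "samples: about", round(estimate_memory_gb(n), 1), "GB")
```

Both quoted figures correspond to roughly 5 MB per sample, so the estimate should only be trusted for data resembling the M. tuberculosis collection described here.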
First, check that the system has the gcc compiler and the OpenMP library. The examples below are for Linux. Compilation has been tested on Windows with DevC++ and Visual Studio 15.
Linux compilation instructions:
1. Check the OpenMP library: echo | cpp -fopenmp -dM | grep -i open
2. Check the gcc compiler: gcc --version ## get compiler version
3. Install zlib (linked with -lz): sudo apt-get install zlib1g-dev
Install gcc: sudo apt-get install gcc-4.2
Install openmp: apt-get install libgomp1
http://openmp.org/wp/openmp-compilers/ https://huseyincakir.wordpress.com/2009/11/05/installing-openmp-in-linux-debian/
Install web.py library http://webpy.org/install
First, compile the C++ component.
make clean
make
or
Internally, the make file does this:
g++ -std=c++11 -fopenmp -O3 findNeighbour.cpp -lz -o findNeighbour
We do not recommend that you do this. You can skip this step and go to step 3.
To start the daemon, do one of:
./findNeighbour
./findNeighbour -t 8
./findNeighbour --threads 8 --name /path/to/writable/directory
- --threads sets the number of threads to use when processing samples.
- --name sets the location where the daemon will store files. It must be writable.

By default, the value of threads is 8, recovery is 0, and name is the current working directory.
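When the daemon is started from a script, the options above can be assembled programmatically. The following is a minimal Python 3 sketch that uses only the `--threads` and `--name` flags shown above; the helper names `build_daemon_command` and `launch_daemon` are hypothetical and are not part of FindNeighbour.

```python
import os
import subprocess


def build_daemon_command(binary="./findNeighbour", threads=8, name=None):
    """Return the argument list for starting the daemon."""
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    cmd = [binary, "--threads", str(threads)]
    if name is not None:
        cmd += ["--name", name]
    return cmd


def launch_daemon(binary="./findNeighbour", threads=8, name=None):
    """Start the daemon as a child process (defined but not called here)."""
    # The storage location must be writable, so check before launching.
    if name is not None and not os.access(name, os.W_OK):
        raise PermissionError(f"{name} is not a writable directory")
    return subprocess.Popen(build_daemon_command(binary, threads, name))


if __name__ == "__main__":
    print(build_daemon_command(threads=8, name="/path/to/writable/directory"))
```

Building the argument list separately from launching makes it possible to log or inspect the exact command before any process is started.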
The findNeighbour daemon will now be running. It accepts several commands, including the following:
Command | Possible responses |
---|---|
INSERT id_sample fasta_sample | Err or OK |
GETVALUE IDS id_sample threshold | Err, or a list containing ids of samples within threshold snps of id_sample: ['id_sample1',..,'id_sampleN'] |
GETVALUE SNP id_sample threshold | Err, or a list containing pairs of samples including id_sample, and their pairwise distances: [['id_sample1',snp],..,['id_sampleN',snpN]] |
GETALLVALUES IDS threshold | Err, or a list of all ids in the store: ['id_sample1',..,'id_sampleN'] |
GETALLVALUES SNP threshold | Err, or a list containing all pairs of samples and their pairwise distances: [['id_sample1',snp],..,['id_sampleN',snpN]] |
BACKUP | Err or OK |
EXIT | Exits |
# insert four sequences into the server
INSERT 1 ACCTGNCCTG
INSERT 2 ACAAGNCTCG
INSERT 3 ACCTGNNNAG
INSERT 4 ANANTNNNGG
# get pairs of samples, which include id 1, and have pairwise distance with id 1 <= 10 SNP
GETVALUE SNP 1 10
# get ids of samples which have pairwise distance with id 1 <= 10 SNP
GETVALUE IDS 1 10
# get all pairs of samples with SNP distance <= 10
GETALLVALUES SNP 10
# get all the ids which have neighbours with SNP distances <=10
GETALLVALUES IDS 10
# save the contents
BACKUP
# exit
EXIT
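To make the meaning of the threshold concrete, the sketch below computes pairwise SNP distances for the four example sequences in plain Python 3. It assumes that a position is ignored whenever either sequence has an N there; this is an illustrative assumption, and the daemon's actual handling of uncertain bases may differ. The function names `snp_distance` and `neighbours` are hypothetical.

```python
def snp_distance(seq_a, seq_b, ignore="N"):
    """Count differing positions, skipping any position where either is N."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignore and b not in ignore
    )


def neighbours(sample_id, threshold, seqs):
    """Return [id, distance] pairs within `threshold` SNPs of `sample_id`."""
    ref = seqs[sample_id]
    result = []
    for other_id, seq in seqs.items():
        if other_id == sample_id:
            continue
        dist = snp_distance(ref, seq)
        if dist <= threshold:
            result.append([other_id, dist])
    return result


# The four sequences inserted in the example above.
SEQS = {
    "1": "ACCTGNCCTG",
    "2": "ACAAGNCTCG",
    "3": "ACCTGNNNAG",
    "4": "ANANTNNNGG",
}

if __name__ == "__main__":
    print(neighbours("1", 10, SEQS))
```

Under this assumption, sample 1 is 4 SNPs from sample 2, 1 SNP from sample 3, and 3 SNPs from sample 4, so a threshold of 10 returns all three neighbours while a threshold of 3 excludes sample 2.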
python webservice-server.py ip port path_to_store_files
Example:
python webservice-server.py localhost 8185 R00000039
On the client:
python webservice-client.py # this will run some queries against the server
# example use of the FindNeighbour web server.
# these commands are found in webservice-client.py
import xmlrpclib
client=xmlrpclib.ServerProxy("http://localhost:8185") # or wherever your server is running
# insert four sequences, each comprising 10 nucleotides
print client.insert('1','ACCTGNCCTG')
print client.insert('2','ACAAGNCTCG')
print client.insert('3','ACCTGNNNAG')
print client.insert('4','ANANTNNNGG')
# query the server service
print client.query_get_value_ids('1','5')
print client.query_get_value_snp('1','5')
print client.query_get_values_ids('5')
print client.query_get_values_snp('5')
# force save all results
print client.save()
# stop the service
# print client.exit()
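The client commands above are written for Python 2 (`xmlrpclib` and the `print` statement). A Python 3 equivalent might look like the sketch below, which uses the standard-library `xmlrpc.client` module and the same remote method names (`insert`, `query_get_value_snp`). The wrapper functions `connect`, `insert_all` and `neighbours_within` are hypothetical, and it is an assumption that the server behaves identically when called from a Python 3 client.

```python
import xmlrpc.client


def connect(url="http://localhost:8185"):
    """Create a proxy; no network traffic occurs until a method is called."""
    return xmlrpc.client.ServerProxy(url)


def insert_all(client, sequences):
    """Insert each (id, sequence) pair and collect the server responses."""
    return {sid: client.insert(sid, seq) for sid, seq in sequences.items()}


def neighbours_within(client, sample_id, threshold):
    """Query neighbours of one sample; the threshold is sent as a string,
    as in the original client example."""
    return client.query_get_value_snp(sample_id, str(threshold))


# Example use, assuming a server is running on localhost:8185:
#   client = connect()
#   print(insert_all(client, {"1": "ACCTGNCCTG", "2": "ACAAGNCTCG"}))
#   print(neighbours_within(client, "1", 5))
```

Because the wrappers take the client as an argument, they can be exercised with a stand-in object when no server is available.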
This completes the process for launching a single server. Various scripts are provided as examples of how to programmatically launch multiple services, for the purpose of demonstrating the sharding functionality we describe in the paper.
For example:
- push_samples: recovers fasta files and loads them into a findNeighbour instance
- create_fn_branches.py: makes multiple instances of servers
- webservice-populate-branches.py: loads samples into various branches, depending on their classification (which is computed by external scripts)
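The idea of loading samples into branches by classification can be illustrated with a short Python 3 sketch. It is not taken from the scripts listed above: the function `group_by_branch`, the label names and the fallback branch `unclassified` are all hypothetical, and the classification itself is assumed to have been computed elsewhere.

```python
from collections import defaultdict


def group_by_branch(classification, default_branch="unclassified"):
    """Group sample ids by classification label.

    `classification` maps sample id -> label (or None if unknown).
    Returns a dict mapping each label to the list of sample ids assigned
    to it, in insertion order.
    """
    branches = defaultdict(list)
    for sample_id, label in classification.items():
        branches[label or default_branch].append(sample_id)
    return dict(branches)


# Hypothetical classification of four samples.
EXAMPLE = {"s1": "lineage1", "s2": "lineage2", "s3": "lineage1", "s4": None}

if __name__ == "__main__":
    for branch, sample_ids in group_by_branch(EXAMPLE).items():
        print(branch, sample_ids)
```

Each resulting group could then be inserted into the server instance that holds that branch, so that queries for a sample only need to consult the branch matching its classification.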