Abstract

FindNeighbour is a server application for investigating bacterial relatedness. Accessible via RESTful webservices, FindNeighbour maintains an in-memory distance matrix on the server for a sequence collection, which is automatically cached to disc. It supports incremental addition of samples and, for a given sample, allows queries identifying similar sequences with millisecond response times.

The inputs to the service are strings containing DNA sequence information, typically generated by mapping and basecalling, followed by storage in FASTA or other formats. The service can be queried with a string containing DNA sequence information and a single nucleotide polymorphism (SNP) threshold; it returns a list of similar samples. The software is designed for, and has been extensively tested with, mapped data from bacterial genome sequencing.
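
To make the idea of an SNP threshold concrete, here is a minimal Python sketch of a pairwise SNP distance between two mapped sequences of equal length. It assumes that positions where either sequence has an uncalled base (N) are skipped; this README does not state how findNeighbour treats such positions, and the function name snp_distance is hypothetical.

```python
def snp_distance(seq_a: str, seq_b: str, uncalled: str = "N") -> int:
    """Count the positions at which two equal-length sequences differ.

    Assumption (not taken from findNeighbour): a position is ignored
    when either sequence has an uncalled base there.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a != uncalled and b != uncalled
    )


if __name__ == "__main__":
    # Two of the 10-nucleotide example sequences used later in this README.
    print(snp_distance("ACCTGNCCTG", "ACAAGNCTCG"))
```

Under this definition, a query with threshold 5 would return every stored sample whose distance to the query sample is at most 5.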

Requirements

FindNeighbour comprises several components.

Python code, built on web.py, which handles API calls. This component is webservice-server.py.
A C++ daemon, which is called by webservice-server.py. This code is findNeighbour.cpp.

Tools are also provided to launch one or more instances of the webservice.

findNeighbour.cpp can be compiled and run on Linux (using gcc). We have not tested it on Windows, but expect it would work. It uses the C++ Standard Library 14. OpenMP is required for parallelisation.
Server memory requirements depend on the number of samples stored, the amount of variation between them, and their length. We have tested it with bacterial genomes. Approximate server memory requirements are 2 GB for 400 samples and 20 GB for 4,000 samples of mapped M. tuberculosis data (4.4 million bases/genome).
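
The two figures above both work out at roughly 5 MB per sample. The sketch below extrapolates linearly from them. It is only a rough planning aid: it assumes M. tuberculosis-like data and a similar amount of variation between samples, and the function name estimate_memory_gb is hypothetical.

```python
# Reference points quoted in this README for mapped M. tuberculosis data.
REFERENCE_SAMPLES = 400
REFERENCE_MEMORY_GB = 2.0


def estimate_memory_gb(n_samples: int) -> float:
    """Linearly extrapolate server memory use from the 400-sample figure.

    Assumption: memory grows in proportion to the number of samples,
    which matches the two quoted data points but is not guaranteed.
    """
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    gb_per_sample = REFERENCE_MEMORY_GB / REFERENCE_SAMPLES  # about 5 MB
    return n_samples * gb_per_sample


if __name__ == "__main__":
    for n in (400, 4000, 10000):
        print(n, "samples: about", round(estimate_memory_gb(n), 1), "GB")
```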

Performance

Hosted on an Ubuntu Linux server with 128 GB of RAM, using 16 threads, the server takes about 1 second to add a sample to a collection of 1,000 M. tuberculosis samples. Addition time scales linearly with the number of samples in the sequence store.
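
As a worked illustration of what linear scaling implies for bulk loading, the sketch below estimates the time to add one sample and the total time to build a collection from empty. It assumes that the addition time is directly proportional to the store size (1 second at 1,000 samples, 0 at an empty store) on comparable hardware; both function names are hypothetical.

```python
SECONDS_PER_ADD_AT_1000 = 1.0  # figure quoted above for 16 threads


def add_time_seconds(n_in_store: int) -> float:
    """Estimated time to add one sample to a store of n_in_store samples."""
    return SECONDS_PER_ADD_AT_1000 * n_in_store / 1000


def total_load_time_seconds(n_samples: int) -> float:
    """Estimated time to insert n_samples one by one into an empty store.

    The k-th insertion (k = 0, 1, ...) sees k samples already stored,
    so the total grows roughly with the square of the collection size.
    """
    return sum(add_time_seconds(k) for k in range(n_samples))


if __name__ == "__main__":
    for n in (1000, 4000):
        print(n, "samples: about", round(total_load_time_seconds(n) / 60, 1), "minutes")
```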

Queries requesting samples similar to a given sample return in about 50 ms.

Setting up the server

0 Prepare the server

First, check that the system has the gcc compiler and the OpenMP library. The examples below are for Linux. Compilation has been tested on Windows with DevC++ and Visual Studio 15.

Linux compilation instructions:
1- Check the OpenMP library: echo | cpp -fopenmp -dM | grep -i open
2- Check the gcc compiler: gcc --version ## get compiler version
3- Install zlib (needed for the -lz link option): sudo apt-get install zlib1g-dev

Install gcc: sudo apt-get install gcc-4.2

Install openmp: apt-get install libgomp1

http://openmp.org/wp/openmp-compilers/
https://huseyincakir.wordpress.com/2009/11/05/installing-openmp-in-linux-debian/

Install the web.py library: http://webpy.org/install
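
The command-line checks above can be partly automated. The following Python sketch reports which build tools are missing from PATH. It is only a convenience: the tool list is an assumption based on the steps above, it does not verify that the OpenMP or zlib development files are installed, and the name missing_tools is hypothetical.

```python
import shutil

# Tools needed by the compile step; this list is an assumption based on
# the instructions in this README.
REQUIRED_TOOLS = ("gcc", "g++", "make")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```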

1 Compile the application

First, compile the C++ component.

make clean
make

Alternatively, run the compiler directly. Internally, the makefile does this:

g++ -std=c++11 -fopenmp -O3 findNeighbour.cpp -lz -o findNeighbour

2 Optionally, you can interact directly with the findNeighbour daemon.

We do not recommend that you do this. You can skip this step and go to step 3.

To start the daemon, do one of:

./findNeighbour

./findNeighbour -t 8

./findNeighbour --threads 8 --name /path/to/writable/directory

--threads sets the number of threads used when processing samples.
--name sets the directory in which the daemon stores its files. It must be writable. By default, threads is 8, recovery is 0, and name is the current working directory.
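
If you want to start the daemon from a script rather than by hand, the options above can be assembled in Python as sketched below. The helper names daemon_command and is_writable_dir are hypothetical; only the --threads and --name options and their defaults come from this README.

```python
import os


def is_writable_dir(path: str) -> bool:
    """Return True if `path` is an existing directory we can write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


def daemon_command(threads: int = 8, name: str = None,
                   executable: str = "./findNeighbour"):
    """Build the argument list for launching the findNeighbour daemon.

    Defaults mirror those documented above: 8 threads, and the current
    working directory as the storage location.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if name is None:
        name = os.getcwd()
    return [executable, "--threads", str(threads), "--name", name]


if __name__ == "__main__":
    print(" ".join(daemon_command(threads=8)))
```

The resulting list can be passed to subprocess.Popen once is_writable_dir confirms that the storage directory is usable.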

The findNeighbour daemon will now be running. It accepts several commands, including the following:

Command	Possible responses
INSERT id_sample fasta_sample	Err or OK
GETVALUE IDS id_sample threshold	Err, or a list containing ids of samples within threshold SNPs of id_sample: ['id_sample1',..,'id_sampleN']
GETVALUE SNP id_sample threshold	Err, or a list containing pairs of samples including id_sample, and their pairwise distances: [['id_sample1',snp1],..,['id_sampleN',snpN]]
GETALLVALUES IDS threshold	Err, or a list of all ids in the store: ['id_sample1',..,'id_sampleN']
GETALLVALUES SNP threshold	Err, or a list containing all pairs of samples and their pairwise distances: [['id_sample1',snp1],..,['id_sampleN',snpN]]
BACKUP	Err or OK
EXIT	Exits
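
A caller needs to tell the three kinds of response in the table apart. The Python sketch below does so under the assumption that responses arrive as text in exactly the forms shown above (OK, a message starting with Err, or a Python-style list literal). The actual wire format has not been verified here, and parse_response is a hypothetical name.

```python
import ast


def parse_response(text: str):
    """Interpret one daemon response given as text.

    Returns True for "OK", raises RuntimeError for an "Err" response,
    and otherwise parses the text as a list literal such as
    [['id_sample1', 3], ['id_sample2', 7]].
    """
    text = text.strip()
    if text == "OK":
        return True
    if text.startswith("Err"):
        raise RuntimeError("daemon reported an error: " + text)
    # literal_eval only accepts literals, so it is safe on untrusted text.
    return ast.literal_eval(text)


if __name__ == "__main__":
    print(parse_response("[['2', 4], ['3', 1]]"))
```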

Examples of use

# insert four sequences into the server
INSERT 1 ACCTGNCCTG
INSERT 2 ACAAGNCTCG
INSERT 3 ACCTGNNNAG
INSERT 4 ANANTNNNGG

# get pairs of samples which include id 1 and have pairwise distance to id 1 <= 10 SNP
GETVALUE SNP 1 10

# get ids of samples which have pairwise distance to id 1 <= 10 SNP
GETVALUE IDS 1 10

# get all pairs of samples with SNP distance <= 10
GETALLVALUES SNP 10

# get all the ids which have neighbours with SNP distances <=10
GETALLVALUES IDS 10

# save the contents
BACKUP

# exit
EXIT
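
Since sequences are typically stored as FASTA, INSERT commands like those above can be generated from a FASTA file. The Python sketch below is one way to do this. It assumes that the sample id is the first word of each FASTA header and that the daemon expects the whole sequence on one line; fasta_to_insert_commands is a hypothetical helper, not part of findNeighbour.

```python
def fasta_to_insert_commands(fasta_text: str):
    """Turn FASTA-formatted text into a list of INSERT command strings."""
    commands = []
    sample_id = None
    chunks = []

    def flush():
        # Emit the record collected so far, if there is one.
        if sample_id is not None:
            commands.append("INSERT {} {}".format(sample_id, "".join(chunks)))

    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header_words = line[1:].split()
            if not header_words:
                raise ValueError("FASTA header without a sample id")
            sample_id = header_words[0]
            chunks = []
        elif sample_id is not None:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    flush()
    return commands


if __name__ == "__main__":
    example = ">1 first sample\nACCTG\nNCCTG\n>2\nACAAGNCTCG\n"
    for command in fasta_to_insert_commands(example):
        print(command)
```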

3 Start the findNeighbour web service

Server:

python webservice-server.py ip port path_to_store_files

Example:

python webservice-server.py localhost 8185 R00000039

On the client:

python webservice-client.py # this will run some queries against the server

Client

# example use of the FindNeighbour web server.
# these commands are found in webservice-client.py
import xmlrpclib

client=xmlrpclib.ServerProxy("http://localhost:8185")  # or wherever your server is running

# insert four sequences, each comprising 10 nucleotides
print client.insert('1','ACCTGNCCTG')
print client.insert('2','ACAAGNCTCG')
print client.insert('3','ACCTGNNNAG')
print client.insert('4','ANANTNNNGG')

# query the server service
print client.query_get_value_ids('1','5')
print client.query_get_value_snp('1','5')
print client.query_get_values_ids('5')
print client.query_get_values_snp('5')

# force save all results
print client.save()

# stop the service
# print client.exit()
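
The client above uses Python 2 syntax (xmlrpclib and print statements). A rough Python 3 equivalent is sketched below, assuming the server and its method names are unchanged. run_demo is a hypothetical helper that accepts any client object, so it can be tried without a running server.

```python
import xmlrpc.client  # Python 3 name of the xmlrpclib module


def run_demo(client):
    """Insert four 10-nucleotide sequences and query neighbours of '1'."""
    sequences = {
        "1": "ACCTGNCCTG",
        "2": "ACAAGNCTCG",
        "3": "ACCTGNNNAG",
        "4": "ANANTNNNGG",
    }
    results = {}
    for sample_id, sequence in sequences.items():
        results["insert " + sample_id] = client.insert(sample_id, sequence)
    # Threshold of 5 SNPs, passed as a string as in the original client.
    results["ids within 5 SNPs of 1"] = client.query_get_value_ids("1", "5")
    results["distances within 5 SNPs of 1"] = client.query_get_value_snp("1", "5")
    return results


# Usage against a running server (address as in the example above):
#   client = xmlrpc.client.ServerProxy("http://localhost:8185")
#   for label, value in run_demo(client).items():
#       print(label, value)
```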

This completes the process for launching a single server. Various scripts are also provided as examples of how to launch multiple services programmatically, for the purpose of demonstrating the sharding functionality we describe in the paper.

For example:
push_samples.py: recovers FASTA files and loads them into a findNeighbour instance.
create_fn_branches.py: makes multiple instances of servers.
webservice-populate-branches.py: loads samples into various branches, depending on their classification (which is computed by external scripts).
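
To illustrate what launching multiple services involves, the Python sketch below builds one launch command per shard, using the command form shown in step 3. The consecutive port numbering, the directory names, and the helper name shard_commands are all assumptions made for illustration; the provided create_fn_branches.py may work differently.

```python
def shard_commands(store_dirs, host="localhost", first_port=8185,
                   script="webservice-server.py"):
    """Build one server launch command per storage directory.

    Each shard gets its own port, counting up from first_port, and its
    own directory for stored files.
    """
    return [
        ["python", script, host, str(first_port + index), store_dir]
        for index, store_dir in enumerate(store_dirs)
    ]


if __name__ == "__main__":
    # Hypothetical branch directories, one per shard.
    for command in shard_commands(["branch_a", "branch_b", "branch_c"]):
        print(" ".join(command))
```

Each command list could then be started with subprocess.Popen, and samples routed to the matching port according to their classification.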

