-
Notifications
You must be signed in to change notification settings - Fork 22
Spreadsheet validation
This page is for data submitters. It has instructions for how to validate an import spreadsheet and data on disk.
There are two options: install the entire clockwork codebase, or just a minimal install of a script to only do the validation. If you do not need the whole codebase, then the minimal install is recommended.
This option is recommended if you only want to run the validation, and none of the pipelines.
Get the code:
git clone --recursive https://github.com/iqbal-lab-org/clockwork_validate_spreadsheet.git
cd clockwork_validate_spreadsheet
Then install:
python3 setup.py install
Alternatively, if that command gave errors with permissions, install into your home directory:
python3 setup.py install --user
You should now have a script installed called
clockwork_validate_spreadsheet
.
Important: the validation script is part of the main clockwork code. However, you do not need to follow the main installation instructions for installing everything if you only want to run the validation. Instead, use the recommended method below.
If you only want to run the validation, please see
the recommended method above. Alternatively, if you
want to use the Singularity container instead, then
replace each call to clockwork_validate_spreadsheet
below, with
singularity exec clockwork_container.img validate_spreadsheet
(replace clockwork_container.img
with the name of your container file).
You should now have a script installed called
clockwork_validate_spreadsheet
.
If the installation was
successful, then you should be able to get the help
message:
$ clockwork_validate_spreadsheet -h
usage: clockwork_validate_spreadsheet [options] <in.xlsx> <data_dir> <errors_out.txt>
Validates import spreadsheet, and checks files in the sheet are on disk, in
the directory data_dir. Writes errors to errors_out.txt -- this will be empty
if no errors found
positional arguments:
infile Name of input spreadsheet file
data_dir Name of directory containing FASTQ files
outfile Name of output file containing any errors found
optional arguments:
-h, --help show this help message and exit
--threads INT Number of files to calculate md5 sum in parallel. Set to zero
to skip md5 checks [1]
--version show program's version number and exit
All of the FASTQ files referred to in the spreadsheet should be in the same directory (or subdirectories of that directory). For example:
my_data/sample.1_1.fq.gz
my_data/sample.1_2.fq.gz
my_data/foo/sample.2_1.fq.gz
my_data/foo/sample.2_1.fq.gz
The FASTQ files are all in the root of my_data/
, or in
a subdirectory called foo/
.
The filenames used in the spreadsheet should be
sample.1_1.fq.gz
sample.1_2.fq.gz
foo/sample.2_1.fq.gz
foo/sample.2_1.fq.gz
and the validation script is pointed at the
directory my_data/
.
Assuming the files are in a directory called my_data
(like
in the previous section), and the spreadsheet is called
import.xlsx
, run:
clockwork_validate_spreadsheet import.xlsx my_data/ output.txt
This will check the consistency of the spreadsheet and the
files on disk, writing any errors found to the file
output.txt
. If no errors were found then output.txt
will be empty.
The script will check that the md5 sums of all the files
on disk are the same as the md5 sums in the spreadsheet.
This can be very time consuming. There is an option
--threads
to
run the md5 calculation in parallel, or to skip
the md5 check completely. We recommend that you:
-
run the validation without checking the md5 sums, until there are no more errors, using the option
--threads 0
. -
run the validation including the md5 sum checks, using the option
--threads 1
,--threads 2
, or possibly a bigger number. But beware that setting this number too high will not give any benefit because there is a limit to reading files from disk in parallel.
Most messages in the output file take the tab-delimited form (possibly without the column_name):
error_type column_name value row_number_of_input_spreadsheet
The types of error are as follows.
-
File_not_found
. Example:File_not_found foo.1.fq.gz 10
This means the file
foo.1.fq.gz
on row 10 was not found on disk. -
md5_error
. Example:md5_error foo.1.fq.gz abcde 12345
This means that the file
foo.1.fq.gz
has md5 sum '12345', but the spreadsheet said 'abcde'. -
Empty_field
. Example:Empty_field subject_id 20
This means that the subject_id in row 20 was empty.
-
Non_unique
. Example:Non_unique reads_file_1 foo.fq.gz 2,3
This means that the filename
foo.fq.gz
was given forreads_file_1
in rows 2 and 3 of the spreadsheet. Several other columns in the spreadsheet are similarly checked. -
Filename_in_both_columns
. Example:Filename_in_both_columns foo.fq
This means that the filename appears somewhere in the column
reads_file_1
and inreads_file_2
. -
md5_in_both_columns
. This is similar toFilename_in_both_columns
, except for the columnsreads_file_1_md5
,reads_file_2_md5
. -
Not_an_integer
. Example:Not_an_integer sequence_replicate_number x 42
This means that the
sequence_replicate_number
given in row 42 was "x", but it should be an integer. -
Integer_out_of_range
. Example:Integer_out_of_range sequence_replicate_number -1 42
This means that the
sequence_replicate_number
given in row 42 was out of the allowed range. In this case it should always be a positive number. -
Unknown_instrument_model
. Example:Unknown_instrument_model HiSeq50 111
This means that the instrument model in row 111 was not recognised.