Skip to content

Spreadsheet validation

Martin Hunt edited this page Jun 15, 2022 · 9 revisions

Spreadsheet validation

This page is for data submitters. It has instructions for how to validate an import spreadsheet and data on disk.

Install the validation script

There are two options: install the entire clockwork codebase, or just a minimal install of a script to only do the validation. If you do not need the whole codebase, then the minimal install is recommended.

Install method 1: only install the validation

This option is recommended if you only want to run the validation, and none of the pipelines.

Get the code:

git clone --recursive https://github.com/iqbal-lab-org/clockwork_validate_spreadsheet.git
cd clockwork_validate_spreadsheet

Then install:

python3 setup.py install

Alternatively, if that command gave errors with permissions, install into your home directory:

python3 setup.py install --user

You should now have a script installed called clockwork_validate_spreadsheet.

Important: the validation script is part of the main clockwork code. However, you do not need to follow the main installation instructions for installing everything if you only want to run the validation. Instead, use the recommended method below.

Install method 2: Singularity container (not recommended)

If you only want to run the validation, please see the recommended method above. Alternatively, if you want to use the Singularity container instead, then replace each call to clockwork_validate_spreadsheet below, with singularity exec clockwork_container.img validate_spreadsheet (replace clockwork_container.img with the name of your container file).

Check the installation

You should now have a script installed called clockwork_validate_spreadsheet. If the installation was successful, then you should be able to get the help message:

$ clockwork_validate_spreadsheet -h
usage: clockwork_validate_spreadsheet [options] <in.xlsx> <data_dir> <errors_out.txt>

Validates import spreadsheet, and checks files in the sheet are on disk, in
the directory data_dir. Writes errors to errors_out.txt -- this will be empty
if no errors found

positional arguments:
  infile         Name of input spreadsheet file
  data_dir       Name of directory containing FASTQ files
  outfile        Name of output file containing any errors found

optional arguments:
  -h, --help     show this help message and exit
  --threads INT  Number of files to calculate md5 sum in parallel. Set to zero
                 to skip md5 checks [1]
  --version      show program's version number and exit

A note on filenames

All of the FASTQ files referred to in the spreadsheet should be in the same directory (or subdirectories of that directory). For example:

my_data/sample.1_1.fq.gz
my_data/sample.1_2.fq.gz
my_data/foo/sample.2_1.fq.gz
my_data/foo/sample.2_1.fq.gz

The FASTQ files are all in the root of my_data/, or in a subdirectory called foo/.

The filenames used in the spreadsheet should be

sample.1_1.fq.gz
sample.1_2.fq.gz
foo/sample.2_1.fq.gz
foo/sample.2_1.fq.gz

and the validation script is pointed at the directory my_data/.

Validating the spreadsheet and data

Assuming the files are in a directory called my_data (like in the previous section), and the spreadsheet is called import.xlsx, run:

clockwork_validate_spreadsheet import.xlsx my_data/ output.txt

This will check the consistency of the spreadsheet and the files on disk, writing any errors found to the file output.txt. If no errors were found then output.txt will be empty.

Saving time

The script will check that the md5 sums of all the files on disk are the same as the md5 sums in the spreadsheet. This can be very time consuming. There is an option --threads to run the md5 calculation in parallel, or to skip the md5 check completely. We recommend that you:

  1. run the validation without checking the md5 sums, until there are no more errors, using the option --threads 0.

  2. run the validation including the md5 sum checks, using the option --threads 1, --threads 2, or possibly a bigger number. But beware that setting this number too high will not give any benefit because there is a limit to reading files from disk in parallel.

Error messages

Most messages in the output file take the tab-delimited form (possibly without the column_name):

error_type   column_name   value   row_number_of_input_spreadsheet

The types of error are as follows.

  • File_not_found. Example:

    File_not_found   foo.1.fq.gz  10
    

    This means the file foo.1.fq.gz on row 10 was not found on disk.

  • md5_error. Example:

    md5_error   foo.1.fq.gz   abcde   12345
    

    This means that the file foo.1.fq.gz has md5 sum '12345', but the spreadsheet said 'abcde'.

  • Empty_field. Example:

    Empty_field   subject_id   20
    

    This means that the subject_id in row 20 was empty.

  • Non_unique. Example:

    Non_unique   reads_file_1   foo.fq.gz   2,3
    

    This means that the filename foo.fq.gz was given for reads_file_1 in rows 2 and 3 of the spreadsheet. Several other columns in the spreadsheet are similarly checked.

  • Filename_in_both_columns. Example:

    Filename_in_both_columns   foo.fq
    

    This means that the filename appears somewhere in the column reads_file_1 and in reads_file_2.

  • md5_in_both_columns. This is similar to Filename_in_both_columns, except for the columns reads_file_1_md5, reads_file_2_md5.

  • Not_an_integer. Example:

    Not_an_integer    sequence_replicate_number   x    42
    

    This means that the sequence_replicate_number given in row 42 was "x", but it should be an integer.

  • Integer_out_of_range. Example:

    Integer_out_of_range    sequence_replicate_number   -1    42
    

    This means that the sequence_replicate_number given in row 42 was out of the allowed range. In this case it should always be a positive number.

  • Unknown_instrument_model. Example:

    Unknown_instrument_model    HiSeq50   111
    

    This means that the instrument model in row 111 was not recognised.