We provides a utility to merge a large number of VCF files (possibly too many to open at once) incrementally, that only use almost as much memory as one merged line takes.
- All input VCFs are positionally sorted, and the values for the FILTER column of each position are the same for all samples.
- All input VCFs have the same headers and the same number of positions.
- All input VCFs have the FORMAT column.
Since the FILTER value for each sample is different, we omit (set to .
) the FILTER column in the merged result, and append the original FILTER value for each sample in their call data. Its format is described by the FT
field in the FORMAT column.
For example: These two input lines:
NC_000962.3 11 . A C . MIN_GCP . GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE 0/0:6:6,0:73.54:0.74
and
NC_000962.3 11 . A C . MIN_DP;MIN_GCP . GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE 0/0:3:3,0:36.98:0.01
will produce this output line:
NC_000962.3 11 . A C . . . GT:DP:COV:GT_CONF:GT_CONF_PERCENTILE:FT 0/0:6:6,0:73.54:0.74:MIN_GCP 0/0:3:3,0:36.98:0.01:MIN_DP;MIN_GCP
You can use the utility as either:
from contextlib import ExitStack
from ivcfmerge import ivcfmerge
filenames = [...] # List/iterator of relative/absolute paths to input files
output_path = '...' # Where to write the merged VCF to
with ExitStack() as stack:
files = map(lambda fname: stack.enter_context(open(fname)), filenames)
with open(output_path) as outfile:
ivcfmerge(files, outfile)
from ivcfmerge import ivcfmerge_batch
filenames = [...] # List/iterator of relative/absolute paths to input files
output_path = '...' # Where to write the merged VCF to
batch_size = 1000 # How many files to open and merge at once
ivcfmerge_batch(filenames, output_path, batch_size)
That has at least as much space as that occupied by the input files to store intermediate results, in the batch processing version.
...
temp_dir = '...' # for example, a directory on a mounted disk like /mnt/big_disk/tmp or /media/big_disk/tmp
ivcfmerge_batch(filenames, output_path, batch_size, temp_dir)
# Prepare a file of paths to input VCF files
> cat input_paths.txt
1.vcf
2.vcf
...
> python3 ivcfmerge.py input_paths.txt path/to/output/file
# Prepare a file of paths to input VCF files
> cat input_paths.txt
1.vcf
2.vcf
...
> python3 ivcfmerge_batch.py --batch-size 1000 input_paths.txt path/to/output/file
That has at least as much space as that occupied by the input files to store intermediate results, in the batch processing version.
...
> python3 ivcfmerge_batch.py --batch-size 1000 --temp-dir /path/to/tmp/dir input_paths.txt path/to/output/file
pip3 install .
ivcfmerge -h
ivcfmerge_batch -h
All CLI arguments & options are the same as described in 4.2, i.e. just replace python3 ivcfmerge.py
with ivcfmerge
, similarly for the batch version.
Indicates how many files to open and merge each batch, for the batch processing version.
The default value for this parameter is 1000.
- Total memory usage will not exceed
batch_size * size(one line of input VCFs)
. - Batch size equals the number of files the utility will open at once.
- Bigger batch size will reduce the total time taken, but requires more memory and file handles from the OS.
For the batch processing version, the utility needs to store the intermediate results somewhere with as much space as the total space occupied by the input files.
By default, the choice is left to the tempfile library. On Unix/Linux, this is usually /tmp
.
pip3 install -r requirements/dev.txt
pip3 install -e .
pytest