VectorizedKmers
VectorizedKmers.jl is a Julia package for fast $k$-mer counting of biological sequences. The core idea is that $k$-mers with an alphabet size $A$ are essentially integers in base $A$, and can be used as indices in a vector of size $A^k$ to count the corresponding $k$-mers.
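To make the indexing idea concrete, here is a minimal sketch in plain Julia that does not use the package at all. The helpers encode_base and naive_kmer_counts are hypothetical names chosen for illustration, and the A = 0, C = 1, G = 2, T = 3 encoding is an assumption rather than something the package prescribes.

```julia
# Hypothetical helper (not part of VectorizedKmers): map a DNA base to 0..3.
function encode_base(c::Char)
    idx = findfirst(==(c), "ACGT")
    idx === nothing && throw(ArgumentError("unexpected base $c"))
    return idx - 1
end

# Count every k-mer of `seq` in a vector of length 4^k.
function naive_kmer_counts(seq::AbstractString, k::Int)
    A = 4                       # alphabet size
    counts = zeros(Int, A^k)
    kmer = 0
    for (i, c) in enumerate(seq)
        # Slide the window: append the new base, drop the oldest one.
        kmer = (kmer * A + encode_base(c)) % A^k
        # The k-mer value is 0-based; Julia vectors are 1-based.
        i >= k && (counts[kmer + 1] += 1)
    end
    return counts
end

counts = naive_kmer_counts("ACGTAC", 2)
```

In this example the 2-mer AC is the base-4 integer 1, so both of its occurrences are tallied in counts[2].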
The KmerCount type is a wrapper for AbstractVector, which means that these vector k-mer counts are not limited to Julia's Base.Vector type; other kinds of vectors can be used as well, such as CUDA.CuVector, SparseArrays.SparseVector or even matrix views. To efficiently group k-mer counts together, the KmerCountVector stores them in a matrix as rows or columns. It can wrap any AbstractMatrix, such as Matrix or CuMatrix, and accessing its elements by index returns a KmerCount wrapped around a view of a row or column of the original matrix.
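The view-based access described above can be sketched with Base alone. This is only an illustration of how a column view shares memory with its parent matrix; it is not the package's implementation, and the matrix dimensions are made up for the example.

```julia
# Four hypothetical count vectors of length 16 (A = 4, K = 2), stored one per column.
M = zeros(Int, 16, 4)

# A view of the third column: an AbstractVector that shares memory with M.
col = view(M, :, 3)

# Incrementing through the view updates the parent matrix; nothing is copied.
col[2] += 1

updated = M[2, 3]
```

Because the view is itself an AbstractVector, it is the kind of object a vector wrapper can hold without copying the data out of the matrix.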
This data structure can be used to quickly approximate distances between sequences. Most notably, the squared Euclidean distance was used to estimate edit distance in this paper. The dot product has also proven to be a useful metric for comparing correlation between sequences.
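As a rough illustration of the two measures mentioned above, the following sketch computes them on two small count vectors. The vectors x and y are made-up values, not real sequence data, and only Base and the LinearAlgebra standard library are used.

```julia
using LinearAlgebra  # provides dot

# Two hypothetical count vectors over the same k-mer index space.
x = [2, 0, 1, 1]
y = [1, 1, 1, 0]

# Squared Euclidean distance: sum of squared element-wise differences.
sq_euclidean = sum(abs2, x .- y)

# Dot product: large when the same k-mers are frequent in both vectors.
similarity = dot(x, y)
```

Both operations are simple element-wise reductions, so they apply to any pair of equal-length count vectors.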
Installation
You can install VectorizedKmers in your environment from the Julia REPL by entering pkg mode with ] and then running:
add VectorizedKmers
VectorizedKmers.AbstractKmerCount
VectorizedKmers.AbstractKmerCountVector
VectorizedKmers.KmerCount
VectorizedKmers.KmerCountColumns
VectorizedKmers.KmerCountRows
VectorizedKmers.count_kmers
VectorizedKmers.count_kmers!
VectorizedKmers.AbstractKmerCount - Type
AbstractKmerCount{A, K, T <: Real, V <: AbstractVector{T}}
Abstract type for K-mer counts. A is the alphabet size, K is the K-mer size, and T is the element type of the underlying counts field, which in turn has type V.
VectorizedKmers.AbstractKmerCountVector - Type
AbstractKmerCountVector{A, K, T <: Real, M <: AbstractMatrix{T}} <: AbstractVector{KmerCount{A, K, T, V} where {V <: AbstractVector{T}}}
Yup... that's indeed an abomination of a type. A container for K-mer counts, where K-mer counts are stored together as rows or columns in a matrix. A is the alphabet size, K is the K-mer size, T is the element type of the counts, and M is the type of the matrix in which the K-mer counts are stored.
VectorizedKmers.KmerCount - Type
KmerCount{A, K, T, V} <: AbstractKmerCount{A, K, T, V}
A concrete type for K-mer counts with vector type V and element type T.
VectorizedKmers.KmerCountColumns - Type
KmerCountColumns{A, K, T, M} <: AbstractKmerCountVector{A, K, T, M}
A container for K-mer counts, where K-mer counts are stored together as columns in a matrix. This is more efficient than storing K-mer counts as rows in a matrix, since the elements in a column are contiguous in memory for column-major arrays such as Julia's Matrix.
VectorizedKmers.KmerCountRows - Type
KmerCountRows{A, K, T, M} <: AbstractKmerCountVector{A, K, T, M}
A container for K-mer counts, where K-mer counts are stored together as rows in a matrix. This is not as efficient as storing K-mer counts as columns in a matrix, since the elements in a row are not contiguous in memory for column-major arrays such as Julia's Matrix.
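The memory-layout argument for preferring columns over rows can be checked directly on a plain Julia Matrix. This sketch uses only Base; the 16-by-4 shape is an arbitrary choice for illustration.

```julia
# A dense Julia Matrix is stored column by column (column-major order).
M = zeros(Int, 16, 4)

# Stride between consecutive elements of a column view: 1 (contiguous).
col_stride = strides(view(M, :, 1))

# Stride between consecutive elements of a row view: 16 (one full column apart).
row_stride = strides(view(M, 1, :))
```

A stride of 1 means that iterating over a column touches adjacent memory, whereas iterating over a row jumps by the column length on every step.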
VectorizedKmers.count_kmers! - Method
count_kmers!(kmer_count, kmers; reset=true)
Mutate the counts vector in kmer_count by counting K-mers in kmers. The K-mers in kmers must be represented as integers between 0 and length(kmer_count) - 1.
If reset is true, the counts vector will be zeroed before counting.
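The documented behaviour can be approximated by the following sketch. The function count_into! is a hypothetical stand-in written for this example; it operates on a bare vector and should not be read as the package's actual implementation.

```julia
# Hypothetical stand-in for the behaviour described above (not the package's code).
function count_into!(counts::AbstractVector, kmers; reset::Bool=true)
    # Optionally clear any previous tallies.
    reset && fill!(counts, zero(eltype(counts)))
    for kmer in kmers
        # k-mers are 0-based integers; Julia indices are 1-based.
        counts[kmer + 1] += 1
    end
    return counts
end

counts = zeros(Int, 16)
count_into!(counts, [1, 6, 1])          # starts from zero
count_into!(counts, [6]; reset=false)   # accumulates on top of the previous counts
```

Passing reset=false is what lets counts from several batches of k-mers accumulate in the same vector.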
VectorizedKmers.count_kmers - Method
count_kmers(KmerCount{A, K, T}, kmers; zeros_func=zeros)
Create a new A^K sized vector using zeros_func and count the K-mers in kmers. The K-mers in kmers must be represented as integers between 0 and A^K - 1.
If reset is true, the counts vector will be zeroed before counting.
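One way to picture the role of zeros_func is as an allocation hook: the caller decides what kind of zero-filled vector is created. The sketch below uses a hypothetical function count_new on bare vectors, with spzeros from the SparseArrays standard library as an alternative allocator; it is an assumption-laden illustration, not the package's code.

```julia
using SparseArrays  # provides spzeros and SparseVector

# Hypothetical stand-in: allocate with a caller-supplied constructor, then count.
function count_new(::Type{T}, A::Int, K::Int, kmers; zeros_func=zeros) where {T}
    counts = zeros_func(T, A^K)     # any function with the signature (T, n) works here
    for kmer in kmers
        counts[kmer + 1] += 1       # 0-based k-mer value to 1-based index
    end
    return counts
end

dense_counts = count_new(Int, 4, 2, [1, 6, 1])
sparse_counts = count_new(Int, 4, 2, [1, 6, 1]; zeros_func=spzeros)
```

The two results hold the same counts but differ in storage, which is the flexibility that a user-supplied allocator provides.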