From 95bc4e4d54ec506df50fcc2ec6d76be3107ce60d Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Sun, 24 Mar 2024 16:55:22 +0000 Subject: [PATCH] build based on 5bb8db9 --- dev/.documenter-siteinfo.json | 2 +- dev/index.html | 2 +- dev/kmer_int_repr/index.html | 2 +- dev/objects.inv | Bin 494 -> 537 bytes dev/references/index.html | 2 +- dev/search_index.js | 2 +- 6 files changed, 5 insertions(+), 5 deletions(-) diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index d518445..1011b9e 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-03-24T02:13:23","documenter_version":"1.3.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-03-24T16:55:19","documenter_version":"1.3.0"}} \ No newline at end of file diff --git a/dev/index.html b/dev/index.html index e762b8e..7999e8c 100644 --- a/dev/index.html +++ b/dev/index.html @@ -26,4 +26,4 @@ ⋮ 0 1 - 0

Limitations

The main downside of counting $K$-mers this way is that the arrays grow exponentially with respect to $K$. The 31-mer array of a DNA sequence would have a length of $4^{31} = 4,611,686,018,427,387,904$, which is equivalent to four exbibytes of memory if the values are stored as 8-bit integers, and that is not feasible. Allocating that much memory is not only costly in space, it can also take a substantial amount of time. This method of counting $K$-mers therefore works best for lower values of $K$.

+ 0

Limitations

The main downside of counting $K$-mers this way is that the arrays grow exponentially with respect to $K$. The 31-mer array of a DNA sequence would have a length of $4^{31} = 4,611,686,018,427,387,904$, which is equivalent to four exbibytes of memory if the values are stored as 8-bit integers, and that is not feasible. Allocating that much memory is not only costly in space, it can also take a substantial amount of time. This method of counting $K$-mers therefore works best for lower values of $K$.
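
The growth described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of VectorizedKmers.jl: the helper names `kmer_array_length` and `kmer_array_bytes` are hypothetical, and the calculation assumes a dense array with one element per possible $K$-mer.

```julia
# Hypothetical helpers for illustration; they are not provided by VectorizedKmers.jl.

# Number of entries in a dense K-mer count array for alphabet size N.
# BigInt arithmetic avoids overflow for large K.
kmer_array_length(N, K) = big(N)^K

# Memory footprint in bytes when each count is stored with element type T.
kmer_array_bytes(N, K, T) = kmer_array_length(N, K) * sizeof(T)

# Print the footprint of DNA (N = 4) count arrays with 8-bit counters.
for K in (4, 8, 12, 16, 31)
    println("K = ", K, ": ", kmer_array_bytes(4, K, UInt8), " bytes")
end
```

With 8-bit counters, each additional base multiplies the footprint by four, so the practical upper bound on $K$ is set by available memory rather than by the integer width.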

diff --git a/dev/kmer_int_repr/index.html b/dev/kmer_int_repr/index.html index 249527d..d2244d3 100644 --- a/dev/kmer_int_repr/index.html +++ b/dev/kmer_int_repr/index.html @@ -1,2 +1,2 @@ -Integer representation of k-mers · VectorizedKmers.jl

Integer representation of K-mers

This package relies on representing K-mers as integers for indexing.

For DNA, each non-ambiguous nucleotide is assigned a number between 0 and 3:

Nucleotide  Base-4  Base-2
A           0       00
C           1       01
G           2       10
T           3       11

Any ordering works, but this is the one used by BioSequences.jl. It also has some nice properties, like being in alphabetical order, and that XOR-ing a base with 3 gives you its complement.

We could theoretically convert any DNA sequence to an integer, but 64-bit unsigned integers limit us to 32-mers.

Consider the DNA sequence GATTACA. If we convert it to an integer using the table above, we get $2033010_4 = 10001111000100_2 = 9156_{10}$, so the integer value of GATTACA is 9156. Since Julia uses 1-based indexing, we would add 1 to this value to get the index for the value in a vector associated with GATTACA.

+Integer representation of k-mers · VectorizedKmers.jl

Integer representation of K-mers

This package relies on representing K-mers as integers for indexing.

For DNA, each non-ambiguous nucleotide is assigned a number between 0 and 3:

Nucleotide  Base-4  Base-2
A           0       00
C           1       01
G           2       10
T           3       11

Any ordering works, but this is the one used by BioSequences.jl. It also has some nice properties, like being in alphabetical order, and that XOR-ing a base with 3 gives you its complement.
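
The complement property can be illustrated with a short sketch. The dictionaries and function names below are hypothetical and operate on plain `Char` values rather than BioSequences.jl types; they only demonstrate the bit trick under the encoding in the table above.

```julia
# Illustrative names only; not part of VectorizedKmers.jl or BioSequences.jl.
const BASE_TO_CODE = Dict('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3)
const CODE_TO_BASE = Dict(code => base for (base, code) in BASE_TO_CODE)

# XOR with 3 (binary 11) flips both bits: 00 <-> 11 (A <-> T) and 01 <-> 10 (C <-> G).
complement_base(base::Char) = CODE_TO_BASE[xor(BASE_TO_CODE[base], 3)]

# Complement every base of a sequence given as a plain string.
complement_string(seq::AbstractString) = map(complement_base, seq)
```

For instance, `complement_string("GATTACA")` gives `"CTAATGT"` under this encoding.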

We could theoretically convert any DNA sequence to an integer, but 64-bit unsigned integers limit us to 32-mers.

Consider the DNA sequence GATTACA. If we convert it to an integer using the table above, we get $2033010_4 = 10001111000100_2 = 9156_{10}$, so the integer value of GATTACA is 9156. Since Julia uses 1-based indexing, we would add 1 to this value to get the index for the value in a vector associated with GATTACA.
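
The GATTACA conversion can be reproduced with a small sketch. The function `kmer_to_int` and the lookup table are hypothetical helpers written for this example. They work on plain strings and are not the package's internal implementation.

```julia
# Hypothetical helper for illustration; not the package's own encoding routine.
const NUCLEOTIDE_CODE = Dict('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3)

function kmer_to_int(kmer::AbstractString)
    # Each base needs 2 bits, so a UInt64 holds at most 32 bases.
    length(kmer) <= 32 || throw(ArgumentError("at most 32 bases fit in a UInt64"))
    value = zero(UInt64)
    for base in kmer
        # Shift left by two bits (multiply by 4), then append the 2-bit code.
        value = (value << 2) | UInt64(NUCLEOTIDE_CODE[base])
    end
    return value
end

value = kmer_to_int("GATTACA")
println(string(value, base=4))   # base-4 digits of the integer
println(string(value, base=2))   # base-2 digits of the integer
println(Int(value) + 1)          # 1-based index into a count vector
```

The shift-and-or loop is equivalent to evaluating the base-4 digits with Horner's rule, which is why the base-4 string of the result spells out the per-base codes.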

diff --git a/dev/objects.inv b/dev/objects.inv index e5e1bec015dde105345aa920d6086503357a3fba..1c462bdc0bb10b14eb9b42a29a055c7890d283ed 100644 GIT binary patch delta 414 zcmV;P0b%~`1DOPng@3J7O;5r=5WV+TjM3g$`8tpgE)`8AHN=q0m6UdnNr1hJTKT=H_b5~#x<7Wj@F z@20^oUwj9HS=^9(gS5Gsr%P zaMu%>?8seSZNX-%dC`uG=F6UFM(*E%bxz6UoT7lrco&LSSp2_j6g^5~1Z^7j|2xe1 I20~%5#IJzMfB*mh delta 370 zcmV-&0ge8d1nvWng@1ig%WlFj5WMFrwrX?3<#j-cxTLgFQL3myv{w|8O-V^?{8O502%y0gHmq1MM+eG*KR+{_vni~nd@89 zJNCgzl`Z!)lFZ3_Q10AjrzN(O-Z67~;1JpaJBObQ^$yu`YAw)<37T5=I#ts1RY@kA zlaqFH$)LzbtF|}CGjnX~wrWZNblNL;l^q$>*HW|fv2E#|sJCB3k0#u6zj)pAKX<8Y Qy3;y!+~F0?9}LhYmJd&~ga7~l diff --git a/dev/references/index.html b/dev/references/index.html index fcae7b2..c6af41f 100644 --- a/dev/references/index.html +++ b/dev/references/index.html @@ -1,2 +1,2 @@ -API Reference · VectorizedKmers.jl

API Reference

VectorizedKmers.KmerArray (Type)
KmerArray{N, K, T <: Real, A <: AbstractArray{T, K}} <: StaticArray{NTuple{K, N}, T, K}
  • N is the alphabet size
  • K is the K-mer size
  • T is the element type
  • A is the array type
source
VectorizedKmers.count_kmers! (Method)
count_kmers!(kmer_array, sequence; reset=true)

Requires method axis_index(::KmerArray{N}, ::eltype(sequence)) where N to be defined

source
+API Reference · VectorizedKmers.jl
diff --git a/dev/search_index.js b/dev/search_index.js index 9b0f9d6..c4c5d33 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"references/#API-Reference","page":"API Reference","title":"API Reference","text":"","category":"section"},{"location":"references/","page":"API Reference","title":"API Reference","text":"Modules = [VectorizedKmers]","category":"page"},{"location":"references/#VectorizedKmers.KmerArray","page":"API Reference","title":"VectorizedKmers.KmerArray","text":"KmerArray{N, K, T <: Real, A <: AbstractArray{T, K}} <: StaticArray{NTuple{K, N}, T, K}\n\nN is the alphabet size\nK is the K-mer size\nT is the element type\nA is the array type\n\n\n\n\n\n","category":"type"},{"location":"references/#VectorizedKmers.count_kmers","page":"API Reference","title":"VectorizedKmers.count_kmers","text":"count_kmers(sequence, K, T=Int, zeros=zeros; N=default_alphabet_size(eltype(sequence)))\n\n\n\n\n\n","category":"function"},{"location":"references/#VectorizedKmers.count_kmers!-Union{Tuple{K}, Tuple{N}, Tuple{KmerArray{N, K, T, A} where {T<:Real, A<:AbstractArray{T, K}}, Any}} where {N, K}","page":"API Reference","title":"VectorizedKmers.count_kmers!","text":"count_kmers!(kmer_array, sequence; reset=true)\n\nRequires method axis_index(::KmerArray{N}, ::eltype(sequence)) where N to be defined\n\n\n\n\n\n","category":"method"},{"location":"kmer_int_repr/#Integer-representation-of-K-mers","page":"Integer representation of k-mers","title":"Integer representation of K-mers","text":"","category":"section"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"This package relies on representing K-mers as integers for indexing.","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"For DNA, each non-ambiguous nucleotide is assigned a number 
between 0 and 3:","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"Nucleotide Base-4 Base-2\nA 0 00\nC 1 01\nG 2 10\nT 3 11","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"Any ordering works, but this is the one used by BioSequences.jl. It also has some nice properties, like being in alphabetical order, and that XOR-ing a base with 3 gives you its complement.","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"We could theoretically convert any DNA sequence to an integer, but 64-bit unsigned integers limit us to 32-mers.","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"Consider the DNA sequence GATTACA. If we convert it to an integer using the table above, we get 2033010_4 = 10001111000100_2 = 9156_10, so the integer value of GATTACA is 9156. Since Julia uses 1-based indexing, we would add 1 to this value to get the index for the value in a vector associated with GATTACA.","category":"page"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = VectorizedKmers\nDocTestSetup = quote\n using VectorizedKmers\nend","category":"page"},{"location":"#VectorizedKmers","page":"Home","title":"VectorizedKmers","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"(Image: Latest Release) (Image: MIT license) (Image: Documentation) (Image: Documentation) (Image: Status) (Image: Coverage)","category":"page"},{"location":"","page":"Home","title":"Home","text":"VectorizedKmers.jl is a Julia package primarily designed for fast K-mer counting of biological sequences. 
The core idea is that K-mers with an alphabet size of N are essentially integers in base N, and can be used as indices in a vector of size N^K to count the corresponding K-mers.","category":"page"},{"location":"","page":"Home","title":"Home","text":"This data structure can be used to quickly approximate distances between sequences. Notably, the squared Euclidean distance was used to approximate edit distance in this paper. The dot product has also proven to be a useful metric for comparing correlation between sequences.","category":"page"},{"location":"#Examples","page":"Home","title":"Examples","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"julia> using VectorizedKmers, BioSequences\n\njulia> kmer_array = count_kmers(dna\"ACCGGGTTTT\", 1)\nKmerArray{4, 1, Int64, Vector{Int64}} with size (4,)\n\njulia> kmer_array |> values\n4-element Vector{Int64}:\n 1\n 2\n 3\n 4\n\njulia> count_kmers(dna\"AATT\", 2) |> values # 2-mers of AATT\n4×4 Matrix{Int64}:\n 1 0 0 0\n 0 0 0 0\n 0 0 0 0\n 1 0 0 1\n\njulia> count_kmers(aa\"AY\", 1) |> values\n20-element Vector{Int64}:\n 1\n 0\n 0\n ⋮\n 0\n 1\n 0","category":"page"},{"location":"#Limitations","page":"Home","title":"Limitations","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The main downside of counting K-mers this way is that the arrays grow exponentially with respect to K. The 31-mer array of a DNA sequence would have a length of 4^31 = 4611686018427387904, which is equivalent to four exbibytes of memory, if the values are stored with 8-bit integers — which is just not feasible, really. Not only does allocating a lot of memory take up a lot of memory, but it can also take a substantial amount of time! 
This method of counting K-mers therefore works best for lower K-values.","category":"page"}] +[{"location":"references/#API-Reference","page":"API Reference","title":"API Reference","text":"","category":"section"},{"location":"references/","page":"API Reference","title":"API Reference","text":"Modules = [VectorizedKmers]","category":"page"},{"location":"references/#VectorizedKmers.KmerArray","page":"API Reference","title":"VectorizedKmers.KmerArray","text":"KmerArray{N, K, T <: Real, A <: AbstractArray{T, K}} <: StaticArray{NTuple{K, N}, T, K}\n\nN is the alphabet size\nK is the K-mer size\nT is the element type\nA is the array type\n\n\n\n\n\n","category":"type"},{"location":"references/#VectorizedKmers.count_kmers!-Union{Tuple{K}, Tuple{N}, Tuple{KmerArray{N, K, T, A} where {T<:Real, A<:AbstractArray{T, K}}, Any}} where {N, K}","page":"API Reference","title":"VectorizedKmers.count_kmers!","text":"count_kmers!(kmer_array, sequence; reset=true)\n\nRequires method axis_index(::KmerArray{N}, ::eltype(sequence)) where N to be defined\n\n\n\n\n\n","category":"method"},{"location":"references/#VectorizedKmers.count_kmers-Union{Tuple{K}, Tuple{N}, Tuple{Any, Val{N}, Val{K}}, Tuple{Any, Val{N}, Val{K}, Type{<:Real}}, Tuple{Any, Val{N}, Val{K}, Type{<:Real}, Any}} where {N, K}","page":"API Reference","title":"VectorizedKmers.count_kmers","text":"count_kmers(sequence, [N,] K, T=Int, zeros=zeros)\n\n\n\n\n\n","category":"method"},{"location":"kmer_int_repr/#Integer-representation-of-K-mers","page":"Integer representation of k-mers","title":"Integer representation of K-mers","text":"","category":"section"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"This package relies on representing K-mers as integers for indexing.","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"For DNA, each non-ambiguous nucleotide is 
assigned a number between 0 and 3:","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"Nucleotide Base-4 Base-2\nA 0 00\nC 1 01\nG 2 10\nT 3 11","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"Any ordering works, but this is the one used by BioSequences.jl. It also has some nice properties, like being in alphabetical order, and that XOR-ing a base with 3 gives you its complement.","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"We could theoretically convert any DNA sequence to an integer, but 64-bit unsigned integers limit us to 32-mers.","category":"page"},{"location":"kmer_int_repr/","page":"Integer representation of k-mers","title":"Integer representation of k-mers","text":"Consider the DNA sequence GATTACA. If we convert it to an integer using the table above, we get 2033010_4 = 10001111000100_2 = 9156_10, so the integer value of GATTACA is 9156. Since Julia uses 1-based indexing, we would add 1 to this value to get the index for the value in a vector associated with GATTACA.","category":"page"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = VectorizedKmers\nDocTestSetup = quote\n using VectorizedKmers\nend","category":"page"},{"location":"#VectorizedKmers","page":"Home","title":"VectorizedKmers","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"(Image: Latest Release) (Image: MIT license) (Image: Documentation) (Image: Documentation) (Image: Status) (Image: Coverage)","category":"page"},{"location":"","page":"Home","title":"Home","text":"VectorizedKmers.jl is a Julia package primarily designed for fast K-mer counting of biological sequences. 
The core idea is that K-mers with an alphabet size of N are essentially integers in base N, and can be used as indices in a vector of size N^K to count the corresponding K-mers.","category":"page"},{"location":"","page":"Home","title":"Home","text":"This data structure can be used to quickly approximate distances between sequences. Notably, the squared Euclidean distance was used to approximate edit distance in this paper. The dot product has also proven to be a useful metric for comparing correlation between sequences.","category":"page"},{"location":"#Examples","page":"Home","title":"Examples","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"julia> using VectorizedKmers, BioSequences\n\njulia> kmer_array = count_kmers(dna\"ACCGGGTTTT\", 1)\nKmerArray{4, 1, Int64, Vector{Int64}} with size (4,)\n\njulia> kmer_array |> values\n4-element Vector{Int64}:\n 1\n 2\n 3\n 4\n\njulia> count_kmers(dna\"AATT\", 2) |> values # 2-mers of AATT\n4×4 Matrix{Int64}:\n 1 0 0 0\n 0 0 0 0\n 0 0 0 0\n 1 0 0 1\n\njulia> count_kmers(aa\"AY\", 1) |> values\n20-element Vector{Int64}:\n 1\n 0\n 0\n ⋮\n 0\n 1\n 0","category":"page"},{"location":"#Limitations","page":"Home","title":"Limitations","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The main downside of counting K-mers this way is that the arrays grow exponentially with respect to K. The 31-mer array of a DNA sequence would have a length of 4^31 = 4611686018427387904, which is equivalent to four exbibytes of memory, if the values are stored with 8-bit integers — which is just not feasible, really. Not only does allocating a lot of memory take up a lot of memory, but it can also take a substantial amount of time! This method of counting K-mers therefore works best for lower K-values.","category":"page"}] }