rm Protein type, Chain -> ProteinChain

MurrellGroup · AntonOresten · Dec 14, 2023 · 8b046f7
1 parent be6cb39
commit 8b046f7
Showing 10 changed files with 55 additions and 91 deletions.
diff --git a/README.md b/README.md
@@ -20,24 +20,24 @@ Pkg.add("Backboner")
 
 ## Types and functions
 
-The `Protein` type wraps a vector of `Chain`s, which in turn wraps the `Backbone{4}` type (4, because it stores the positions of 4 atoms per residue: N, CA, C, O). The `Backbone{N}` type has the `N` type parameter in order to remain flexible. It allows one pass only the N, CA, and C atoms of a backbone, such that the O atom positions can added in using the `add_oxygens` function.
+Proteins are represented as vectors of `ProteinChain`s, which in turn wrap the `Backbone{3}` type to store the coordinates of N, Ca, and C atoms.
 
-The secondary structure of an entire chain is described by a `Vector{Char}`, where '-' stands for coil/loop, 'H' for helix, and 'E' for strand. For assignment of secondary structure, this package uses the [AssigningSecondaryStructure.jl](https://github.com/MurrellGroup/AssigningSecondaryStructure.jl) package, which implements a simplified version of the DSSP algorithm.
+The secondary structures of a chain are described by a `Vector{Char}`, where '-' stands for coil/loop, 'H' for helix, and 'E' for strand. For assignment of secondary structure, this package uses the [AssigningSecondaryStructure.jl](https://github.com/MurrellGroup/AssigningSecondaryStructure.jl) package, which implements a simplified version of the DSSP algorithm.
 
-Protein backbones can be loaded from a PDB file using the `pdb_to_protein` function, which returns a `Protein` instance. Inversely, a `Protein` instance can be written to a PDB file using the `protein_to_pdb` function.
+Proteins can be loaded from a PDB file using the `pdb_to_protein` function, which returns a `Vector{ProteinChain}` instance. Inversely, a `Vector{ProteinChain}` instance can be written to a PDB file using the `protein_to_pdb` function.
 
 ## Example
 
 ```julia
 julia> using Backboner
 
 julia> protein = pdb_to_protein("test/data/1ZAK.pdb")
-2-element Protein{Float32}:
- Chain A with 220 residues
- Chain B with 220 residues
+2-element Vector{ProteinChain}:
+ ProteinChain A with 220 residues
+ ProteinChain B with 220 residues
 
 julia> chain = protein["A"]
-Chain A with 220 residues
+ProteinChain A with 220 residues
 
 julia> chain.backbone
 3×4×220 Backbone{4, Float32}:

diff --git a/docs/src/index.md b/docs/src/index.md
@@ -33,25 +33,23 @@ The `Protein` type wraps a vector of `Chain`s.
 julia> using Backboner
 
 julia> protein = pdb_to_protein("test/data/1ZAK.pdb")
-2-element Protein{Float32}:
- Chain A with 220 residues
- Chain B with 220 residues
+2-element Vector{ProteinChain}:
+ ProteinChain A with 220 residues
+ ProteinChain B with 220 residues
 
 julia> chain = protein["A"] # chains can be accessed by name
-Chain A with 220 residues
+ProteinChain A with 220 residues
 
 julia> protein["A"] == protein[1] # numeric indexing also works
 true
 
-julia> new_protein = Protein([protein["A"]]) # create a new protein with a single chain
-1-element Protein{Float32}:
- Chain A with 220 residues
+julia> new_protein = [protein["A"]] # create a new protein with a single chain
+1-element Vector{ProteinChain}:
+ ProteinChain A with 220 residues
 
 julia> protein_to_pdb(new_protein, "test/data/1ZAK_A.pdb");
 ```
 
-The `Chain` type wraps the `Backbone{4}` type (4, because it stores the positions of 4 atoms per residue: N, CA, C, O).
-
 ## API Reference
 
 ```@autodocs

diff --git a/docs/src/oxygen.md b/docs/src/oxygen.md
@@ -6,12 +6,12 @@ The `Backbone` type has a type parameter `N` to represent the number of atoms pe
 julia> using Backboner
 
 julia> protein = pdb_to_protein("test/data/1ZAK.pdb")
-2-element Protein{Float32}:
- Chain A with 220 residues
- Chain B with 220 residues
+2-element Vector{ProteinChain}:
+ ProteinChain A with 220 residues
+ ProteinChain B with 220 residues
 
 julia> chain = protein["A"]
-Chain A with 220 residues
+ProteinChain A with 220 residues
 
 julia> backbone4 = chain.backbone
 3×4×220 Backbone{4, Float32}:

diff --git a/docs/src/types.md b/docs/src/types.md
@@ -6,23 +6,17 @@ The `Backbone` type is designed to efficiently store and manipulate the three-di
 
 `Backbone{N, T}` is a wrapper around a 3xNxL array, where:
 - **3** are the three spatial dimensions for the coordinates.
-- **N** is the number of atoms per residue.
+- **N** is the number of atoms in the backbone per residue.
 - **L** is the number of residues in the backbone.
 - **T** is the element type of the coordinate array.
 
 ## Chain
 
-A `Chain` represents a protein chain, and holds an identifier (usually a single letter), backbone atom coordinates, the amino acid sequence, and secondary structure information.
+A `ProteinChain` represents a protein chain, and holds an identifier (usually a single letter), backbone atom coordinates, the amino acid sequence, and secondary structure information.
 
+- `id`: A string identifier for the chain.
 - `backbone`: An instance of `Backbone{4}`, storing the coordinates of backbone atoms.
 - `aavector`: A vector for storing the amino acid sequence.
 - `ssvector`: A vector for storing the secondary structure.
 
-The `Chain` type is designed to provide a comprehensive and consistent representation of a protein chain, ensuring that the backbone coordinates align with the corresponding amino acid sequences and secondary structures.
-
-## Protein
-
-The `Protein` type holds multiple `Chain` instances, representing complete protein structures.
-
-- Stores a collection of `Chain` objects.
-- Includes a dictionary for quick access to chains via their identifiers.
+The `ProteinChain` type is designed to provide a comprehensive and consistent representation of a protein chain, ensuring that the backbone coordinates align with the corresponding amino acid sequences and secondary structures.
diff --git a/src/assign.jl b/src/assign.jl
@@ -7,7 +7,7 @@ import AssigningSecondaryStructure: assign_secondary_structure!, assign_secondar
 
 Uses a simplified version of DSSP to fill the secondary structure vector of each chain with '-' (coil/loop), 'H' (helix), and 'E' (strand).
 """
-function assign_secondary_structure!(protein::Protein)
+function assign_secondary_structure!(protein::Vector{ProteinChain})
     ss_vectors = assign_secondary_structure([chain.backbone.coords for chain in protein])
     for (chain, ssvector) in zip(protein, ss_vectors)
         @assert length(chain.ssvector) == length(ssvector)
@@ -21,7 +21,7 @@ end
 
 Returns a new protein with secondary structure assigned.
 """
-function assign_secondary_structure(protein::Protein)
+function assign_secondary_structure(protein::Vector{ProteinChain})
     new_protein = deepcopy(protein)
     assign_secondary_structure!(new_protein)
     return new_protein

diff --git a/src/chain.jl b/src/chain.jl
@@ -1,17 +1,17 @@
-export Chain
+export ProteinChain
 
 """
-    Chain <: AbstractVector{Residue}
+    ProteinChain <: AbstractVector{Residue}
 
 A chain has an identifier (usually a single letter) and holds the backbone atom coordinates, amino acid sequence, and secondary structures of a protein chain. 
 """
-struct Chain <: AbstractVector{Residue}
+struct ProteinChain <: AbstractVector{Residue}
     id::AbstractString
     backbone::Backbone{4}
     aavector::Vector{Char}
     ssvector::Vector{Char}
 
-    function Chain(
+    function ProteinChain(
         id::AbstractString,
         backbone::Backbone{N};
         aavector::Vector{Char} = fill('G', length(backbone)),
@@ -26,15 +26,15 @@ struct Chain <: AbstractVector{Residue}
         return new(id, backbone, aavector, ssvector)
     end
 
-    Chain(backbone::Backbone; kwargs...) = Chain("_", backbone; kwargs...) 
+    ProteinChain(backbone::Backbone; kwargs...) = ProteinChain("_", backbone; kwargs...) 
 end
 
-@inline Base.:(==)(chain1::Chain, chain2::Chain) = chain1.id == chain2.id && chain1.backbone == chain2.backbone && chain1.ssvector == chain2.ssvector
-@inline Base.length(chain::Chain) = length(chain.backbone)
-@inline Base.size(chain::Chain) = (length(chain),)
-@inline Base.getindex(chain::Chain, i::Integer) = Residue(i, chain.backbone, chain.aavector[i], chain.ssvector[i])
+@inline Base.:(==)(chain1::ProteinChain, chain2::ProteinChain) = chain1.id == chain2.id && chain1.backbone == chain2.backbone && chain1.ssvector == chain2.ssvector
+@inline Base.length(chain::ProteinChain) = length(chain.backbone)
+@inline Base.size(chain::ProteinChain) = (length(chain),)
+@inline Base.getindex(chain::ProteinChain, i::Integer) = Residue(i, chain.backbone, chain.aavector[i], chain.ssvector[i])
 
-Base.summary(chain::Chain) = "Chain $(chain.id) with $(length(chain)) residue$(length(chain) == 1 ? "" : "s")"
-Base.show(io::IO, chain::Chain) = print(io, summary(chain))
+Base.summary(chain::ProteinChain) = "ProteinChain $(chain.id) with $(length(chain)) residue$(length(chain) == 1 ? "" : "s")"
+Base.show(io::IO, chain::ProteinChain) = print(io, summary(chain))
 
-has_assigned_ss(chain::Chain) = has_assigned_ss(chain.ssvector)
+has_assigned_ss(chain::ProteinChain) = has_assigned_ss(chain.ssvector)
diff --git a/src/io.jl b/src/io.jl
@@ -37,29 +37,28 @@ function Backbone(atoms::Vector{PDBTools.Atom})
     return Backbone(coords)
 end
 
-function Chain(atoms::Vector{PDBTools.Atom})
+function ProteinChain(atoms::Vector{PDBTools.Atom})
     id = PDBTools.chain(atoms[1])
     @assert allequal(PDBTools.chain.(atoms)) "atoms must be from the same chain"
     backbone = Backbone(atoms)
     aavector = [get(ONE_LETTER_AA_CODES, atom.resname, 'X') for atom in atoms if atom.name == "CA"]
-    return Chain(id, backbone, aavector=aavector)
-end
-
-function Protein(atoms::Vector{PDBTools.Atom})
-    filter!(a -> a.name in ["N", "CA", "C", "O"], atoms)
-    ids = PDBTools.chain.(atoms)
-    chains = [Chain(atoms[ids .== id]) for id in unique(ids)]
-    return Protein(chains)
+    return ProteinChain(id, backbone, aavector=aavector)
 end
 
 """
     pdb_to_protein(filename::String)
 
 Assumes that each residue starts with four atoms: N, CA, C, O.
 """
-pdb_to_protein(filename::String) = Protein(PDBTools.readPDB(filename))
+function pdb_to_protein(filename::String)
+    atoms = PDBTools.readPDB(filename)
+    filter!(a -> a.name in ["N", "CA", "C", "O"], atoms)
+    ids = PDBTools.chain.(atoms)
+    chains = [ProteinChain(atoms[ids .== id]) for id in unique(ids)]
+    return chains
+end
 
-function protein_to_pdb(protein::Protein, filename, header=:auto, footer=:auto)
+function protein_to_pdb(protein::Vector{ProteinChain}, filename, header=:auto, footer=:auto)
     atoms = PDBTools.Atom[]
     index = 0
     residue_index = 0
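
Note on the `pdb_to_protein` change above: the function now filters the atom list down to the four backbone atom names and then groups the remainder by chain ID, in order of first appearance. The following is a minimal sketch of that grouping step only. `ToyAtom` and `group_backbone_atoms` are hypothetical names that stand in for `PDBTools.Atom` and the body of `pdb_to_protein`; they are not part of Backboner or PDBTools. The sketch uses the non-mutating `filter`, whereas the commit uses `filter!` on a freshly read vector.

```julia
# Hypothetical stand-in for PDBTools.Atom, keeping only the fields used here.
struct ToyAtom
    name::String
    chain::String
end

# Keep backbone atoms, then split them into one vector per chain ID.
# `unique` preserves first-appearance order, so chains come out in file order.
function group_backbone_atoms(atoms::Vector{ToyAtom})
    backbone_atoms = filter(a -> a.name in ("N", "CA", "C", "O"), atoms)
    ids = [a.chain for a in backbone_atoms]
    return [backbone_atoms[ids .== id] for id in unique(ids)]
end

atoms = [
    ToyAtom("N", "A"), ToyAtom("CA", "A"), ToyAtom("C", "A"), ToyAtom("O", "A"),
    ToyAtom("CB", "A"),  # side-chain atom, dropped by the filter
    ToyAtom("N", "B"), ToyAtom("CA", "B"), ToyAtom("C", "B"), ToyAtom("O", "B"),
]

groups = group_backbone_atoms(atoms)
```

Each element of `groups` plays the role of the atom vector that the commit passes to `ProteinChain(atoms[ids .== id])`.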

diff --git a/src/protein.jl b/src/protein.jl
@@ -1,28 +1,3 @@
-export Protein
+@inline Base.getindex(protein::AbstractVector{ProteinChain}, id::AbstractString) = protein[findfirst(c -> c.id == id, protein)]
 
-"""
-    Protein <: AbstractVector{Chain}
-
-A wrapper for a vector of chains.
-Chains can be accessed by index or by ID.
-"""
-struct Protein <: AbstractVector{Chain}
-    chains::Vector{Chain}
-    id_dict::Dict{AbstractString, Chain}
-
-    function Protein(chains::Vector{Chain})
-        @assert length(unique([chain.id for chain in chains])) == length(chains)
-        id_dict = Dict{AbstractString, Chain}(chain.id => chain for chain in chains)
-        return new(chains, id_dict)
-    end
-end
-
-@inline Base.:(==)(protein1::Protein, protein2::Protein) = protein1.chains == protein2.chains
-@inline Base.size(protein::Protein) = size(protein.chains)
-@inline Base.length(protein::Protein) = length(protein.chains)
-@inline Base.getindex(protein::Protein, i) = protein.chains[i]
-@inline Base.getindex(protein::Protein, id::AbstractString) = protein.id_dict[String(id)]
-
-Base.summary(protein::Protein) = "Protein with $(length(protein)) chain$(length(protein) == 1 ? "" : "s")"
-
-has_assigned_ss(protein::Protein) = all(has_assigned_ss, protein.chains)
+has_assigned_ss(protein::AbstractVector{ProteinChain}) = all(has_assigned_ss, protein)
diff --git a/test/chain.jl b/test/chain.jl
@@ -4,20 +4,20 @@
 
         coords = randn(3, 4, 5)
         backbone = Backbone(coords)
-        chain = Chain("A", backbone)
+        chain = ProteinChain("A", backbone)
         @test chain.id == "A"
         @test chain.backbone.coords == coords
         @test chain.aavector == fill('G', length(chain))
         @test chain.ssvector == fill(' ', length(chain))
         @test !has_assigned_ss(chain)
         @test length(chain) == 5
         @test size(chain) == (5,)
-        @test Chain(remove_column(backbone, 4)).backbone == add_oxygens(remove_column(backbone, 4))
-        @test Chain(backbone).id == "_"
+        @test ProteinChain(remove_column(backbone, 4)).backbone == add_oxygens(remove_column(backbone, 4))
+        @test ProteinChain(backbone).id == "_"
 
         @test chain[1] == Residue(1, backbone, 'G', ' ')
 
-        @test summary(chain) == "Chain A with 5 residues"
+        @test summary(chain) == "ProteinChain A with 5 residues"
 
         io = IOBuffer()
         show(io, chain)

diff --git a/test/protein.jl b/test/protein.jl
@@ -1,14 +1,12 @@
 @testset "protein.jl" begin
 
     @testset "Protein" begin
-        A = Chain("A", Backbone(randn(3, 4, 5)))
-        B = Chain("B", Backbone(randn(3, 4, 6)))
-        protein = Protein([A, B])
+        A = ProteinChain("A", Backbone(randn(3, 4, 5)))
+        B = ProteinChain("B", Backbone(randn(3, 4, 6)))
+        protein = [A, B]
         @test protein[1] == protein["A"] == A
         @test protein[2] == protein["B"] == B
-        @test length(protein) == 2
         @test length.(protein) == [5, 6]
-        @test summary(protein) == "Protein with 2 chains"
         @test !has_assigned_ss(protein)
     end
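
Note on the `src/protein.jl` change above: with the `Protein` wrapper and its `id_dict` gone, lookup by chain ID becomes a linear scan over the vector, and the uniqueness assertion on chain IDs from the old constructor is no longer enforced. The sketch below illustrates the same indexing pattern on a hypothetical `ToyChain` type (not part of Backboner). Unlike the one-liner in the commit, it throws an explicit `KeyError` for a missing ID; in the commit's version a missing ID would make `findfirst` return `nothing`, which is then used as an index and should fail with a less descriptive error.

```julia
# Hypothetical stand-in for ProteinChain, reduced to the field used for lookup.
struct ToyChain
    id::String
end

# String indexing on a vector of chains: linear scan, first match wins.
function Base.getindex(chains::AbstractVector{ToyChain}, id::AbstractString)
    i = findfirst(c -> c.id == id, chains)
    i === nothing && throw(KeyError(id))  # explicit error for unknown IDs
    return chains[i]
end

chains = [ToyChain("A"), ToyChain("B")]
chain_b = chains["B"]   # lookup by ID
chain_1 = chains[1]     # ordinary integer indexing still works
```

Because the scan stops at the first match, duplicate IDs would silently shadow later chains, so callers that relied on the old uniqueness check may want to validate IDs themselves.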