This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

merged sequence-annotation feature into master
That's 3 +1s. Thanks all!
reece committed Apr 7, 2016
2 parents ffc7fb3 + f4edaa6 commit a12e3fd
Showing 9 changed files with 1,157 additions and 3 deletions.
8 changes: 8 additions & 0 deletions doc/source/api/index.rst
@@ -43,6 +43,14 @@ system for reads and variants.
.. toctree::
references

Sequence Annotations
@@@@@@@@@@@@@@@@@@@@

Sequence annotations describe genomic features such as genes and exons,
using terms from an established sequence ontology.

.. toctree::
sequence_annotations

Metadata
@@@@@@@@
2 changes: 1 addition & 1 deletion doc/source/api/references.rst
@@ -4,7 +4,7 @@
References API
!!!!!!!!!!!!!!

See `References schema <../schemas/refernces.html>`_ for a detailed reference.
See `References schema <../schemas/references.html>`_ for a detailed reference.


References Data Model
44 changes: 44 additions & 0 deletions doc/source/api/sequence_annotations.rst
@@ -0,0 +1,44 @@
.. _sequence_annotations:

************************
Sequence Annotations API
************************
For the Sequence Annotation schema definitions, see `Sequence Annotation schema <../schemas/sequenceAnnotations.html>`_.


------------------------
Feature Based Hierarchy
------------------------
The central object of the GA4GH Sequence Annotation API is a Feature. The Feature describes an interval of interest on some reference(s). It has a span from a start position to a stop position as well as descriptive data. A Feature can have a parent Feature, and can have an ordered array of child Features, which enables the construction of more complex representations in a hierarchical way.

For example, a single gene Feature may be parent to several different transcript Features. The specific exons for each transcript would have that transcript Feature as parent. The same physical exon may occur as part of two different transcript Features, but in our notation, it would be
encoded as two separate exon Features, each with a different parent, both occupying the same genomic coordinates. This structure can also extend to annotating CDS, binding sites or any other sub-gene level features.
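
The gene, transcript and exon example above can be sketched in Python. This is only an illustration: the ``Feature`` class, its field names and the ``add_child`` helper below are hypothetical stand-ins, not the record definitions from the GA4GH schema, and the identifiers and coordinates are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical stand-in for a Feature record (not the GA4GH schema)."""
    id: str
    feature_type: str
    start: int
    end: int
    parent_id: Optional[str] = None
    child_ids: List[str] = field(default_factory=list)


def add_child(parent: Feature, child: Feature) -> None:
    """Link child to parent; children keep the order in which they are added."""
    child.parent_id = parent.id
    parent.child_ids.append(child.id)


# One gene Feature is parent to two transcript Features.
gene = Feature("gene1", "gene", 100, 900)
tx_a = Feature("txA", "transcript", 100, 900)
tx_b = Feature("txB", "transcript", 100, 700)
add_child(gene, tx_a)
add_child(gene, tx_b)

# The same physical exon is encoded twice, once per transcript,
# with identical coordinates but distinct identifiers and parents.
exon_a = Feature("exon1.txA", "exon", 100, 200)
exon_b = Feature("exon1.txB", "exon", 100, 200)
add_child(tx_a, exon_a)
add_child(tx_b, exon_b)
```

Each Feature here has at most one parent, so walking ``parent_id`` from any exon leads to exactly one transcript and then to the gene.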


------------------------------
The Sequence Annotation Schema
------------------------------

This model is similar to that used by the standard `GFF3`_ file format.

.. _GFF3: http://sequenceontology.org/resources/gff3.html

The main differences concern the deprecation and replacement of discontinuous features, the replacement
of multi-parent features with multiple copies of that feature, and the ability to impose an explicit order on child features.

In the first case, a CDS composed of multiple regions is sometimes encoded as multiple rows of a GFF3 file, each with the same feature ID. This is translated in our hierarchy into a single CDS Feature with an ordered set of CDS_region Feature children, each corresponding to a single row of the original record.
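
One way to picture this translation is the sketch below. It assumes GFF3 rows have already been parsed into plain dictionaries. The ``merge_discontinuous`` function, the ``<id>.region<n>`` naming of the children, and the choice to give the parent the overall span of its regions are all assumptions made for illustration, not part of the schema.

```python
from collections import defaultdict
from typing import Dict, List


def merge_discontinuous(rows: List[Dict]) -> List[Dict]:
    """Group rows sharing an ID into one feature with ordered region children."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["id"]].append(row)

    features = []
    for feature_id, parts in grouped.items():
        if len(parts) == 1:
            # A continuous feature needs no region children.
            features.append({**parts[0], "children": []})
            continue
        base_type = parts[0]["type"]
        # One child per original row, kept in row order.
        children = [
            {
                "id": f"{feature_id}.region{i}",  # hypothetical naming scheme
                "type": f"{base_type}_region",
                "start": part["start"],
                "end": part["end"],
                "parent": feature_id,
            }
            for i, part in enumerate(parts, start=1)
        ]
        features.append({
            "id": feature_id,
            "type": base_type,
            # Assumption: the parent spans all of its regions.
            "start": min(part["start"] for part in parts),
            "end": max(part["end"] for part in parts),
            "children": children,
        })
    return features


# Three GFF3-like rows that share the same feature ID (made-up coordinates).
rows = [
    {"id": "cds1", "type": "CDS", "start": 100, "end": 200},
    {"id": "cds1", "type": "CDS", "start": 300, "end": 400},
    {"id": "cds1", "type": "CDS", "start": 500, "end": 600},
]
features = merge_discontinuous(rows)
```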

In the second case, as explained above, features with multiple parents in a GFF3 record are simply replicated and assigned a new identifier as many times as needed to ensure a unique parent for every feature.
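
A minimal sketch of this replication step follows, again assuming rows parsed into dictionaries with a ``parents`` list. The ``split_multi_parent`` function and the ``<id>.<parent>`` scheme for new identifiers are hypothetical; the text above only requires that each copy receives a new identifier.

```python
from typing import Dict, List


def split_multi_parent(record: Dict) -> List[Dict]:
    """Return one copy of the record per parent, each with a single parent."""
    parents = record.get("parents", [])
    if len(parents) <= 1:
        # Zero or one parent: nothing to replicate.
        return [dict(record)]
    copies = []
    for parent in parents:
        copy = dict(record)
        copy["id"] = f"{record['id']}.{parent}"  # hypothetical ID scheme
        copy["parents"] = [parent]
        copies.append(copy)
    return copies


# An exon shared by two transcripts in the GFF3 record (made-up values).
exon = {"id": "exon2", "type": "exon", "start": 300, "end": 400,
        "parents": ["txA", "txB"]}
copies = split_multi_parent(exon)
```

Both copies keep the original coordinates, so they still describe the same physical exon.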

In the final case, an explicit mechanism is provided for ordering child Features. Most of the time this ordering is trivially derived from the genomic coordinate ordering of the children, but in some biologically important cases this order can differ, such as in non-canonical splicing of exons into transcripts (also known as back splicing; see below).
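
The difference between coordinate order and an explicit child order can be shown with a small sketch. The exon names, coordinates and the ``follows_coordinate_order`` helper are hypothetical; the 1-3-2-4 ordering is the back-splicing example used in the RNA considerations of this document.

```python
from typing import Dict, List, Tuple

# Hypothetical exon coordinates (start, end) on a single reference.
EXON_COORDS: Dict[str, Tuple[int, int]] = {
    "exon1": (100, 200),
    "exon2": (300, 400),
    "exon3": (500, 600),
    "exon4": (700, 800),
}


def follows_coordinate_order(child_ids: List[str],
                             coords: Dict[str, Tuple[int, int]]) -> bool:
    """True if the explicit child order matches ascending start coordinates."""
    starts = [coords[child_id][0] for child_id in child_ids]
    return starts == sorted(starts)


# Canonical splicing: the explicit order equals the coordinate order.
canonical = ["exon1", "exon2", "exon3", "exon4"]

# Back splicing: the transcript uses the exons in the order 1-3-2-4,
# which cannot be recovered by sorting on coordinates alone.
back_spliced = ["exon1", "exon3", "exon2", "exon4"]
```

Storing the child list explicitly, rather than sorting on demand, is what lets the second ordering be represented at all.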

A FeatureSet is simply a collection of features from the same source. An implementer may, for example, choose to gather all Features from the same GFF3 file into a common FeatureSet.


--------------------------------------
Annotation Design - RNA Considerations
--------------------------------------

Read data derived from RNA samples can differ from genomic read data due to the presence of non-genomic sequences. An example would be a read that spans a splice junction. It describes a contiguous read sequence, but a discontinuous genomic region due to the missing intron. Feature level read assignment is further complicated by the existence of multiple splice isoforms. A read that can be definitely assigned to a particular feature (an exon in this case) may still not be definitely assigned to a particular transcript if multiple transcripts share that exon. The annotation API needs to be able to report assignment at the feature level as well as aggregate assignment at the transcript or even the whole gene level if assignment is not more specific than that.

Splicing (other post-transcriptional modifications?) can occur with varying degrees of complexity. A ‘typical’ splice will result in a mature transcript with exons in positional (numerical) order in a head-to-tail orientation. Back splicing (tail-to-head) can result in transcripts with the exon order changed (1-3-2-4 instead of 1-2-3-4) and even circular RNA. The exon order in a transcript as well as the orientation of the splice should be discoverable via the API. In a more general case, the API should allow child features to have an ordered relationship.
4 changes: 3 additions & 1 deletion doc/source/schemas/index.rst
@@ -15,4 +15,6 @@ Schemas
variantmethods
variants
alleleAnnotationmethods
alleleAnnotations
alleleAnnotations
sequenceAnnotations
sequenceAnnotationmethods
