From 0e73d2e20f688b899eae12661a2c6295887eec9b Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Fri, 22 Jan 2016 15:08:29 -0800 Subject: [PATCH 01/13] Re-added sequenceAnnotations, sequenceAnnotationMethods and OntologyTerm back into metadata. --- doc/source/api/index.rst | 8 + doc/source/api/references.rst | 2 +- doc/source/api/sequence_annotations.rst | 36 ++ doc/source/schemas/metadata.rst | 20 + doc/source/schemas/metadatamethods.rst | 20 + doc/source/schemas/readmethods.rst | 20 + doc/source/schemas/reads.rst | 20 + .../schemas/sequenceAnnotationmethods.rst | 364 ++++++++++++++++++ doc/source/schemas/sequenceAnnotations.rst | 296 ++++++++++++++ src/main/resources/avro/metadata.avdl | 26 ++ .../avro/sequenceAnnotationmethods.avdl | 115 ++++++ .../resources/avro/sequenceAnnotations.avdl | 124 ++++++ 12 files changed, 1050 insertions(+), 1 deletion(-) create mode 100644 doc/source/api/sequence_annotations.rst create mode 100644 doc/source/schemas/sequenceAnnotationmethods.rst create mode 100644 doc/source/schemas/sequenceAnnotations.rst create mode 100644 src/main/resources/avro/sequenceAnnotationmethods.avdl create mode 100644 src/main/resources/avro/sequenceAnnotations.avdl diff --git a/doc/source/api/index.rst b/doc/source/api/index.rst index d1b5d6b3..444c5c78 100644 --- a/doc/source/api/index.rst +++ b/doc/source/api/index.rst @@ -43,6 +43,14 @@ system for reads and variants. .. toctree:: references +Sequence Annotations +@@@@@@@@@@@@@@@@@@@@ + +Sequence annotations describe genomic features such as genes and exons, +using terms from an established sequence ontology. + +.. toctree:: + sequence_annotations Metadata @@@@@@@@ diff --git a/doc/source/api/references.rst b/doc/source/api/references.rst index a646b3e8..3fb82053 100644 --- a/doc/source/api/references.rst +++ b/doc/source/api/references.rst @@ -4,7 +4,7 @@ References API !!!!!!!!!!!!!! -See `References schema <../schemas/refernces.html>`_ for a detailed reference. 
+See `References schema <../schemas/references.html>`_ for a detailed reference. References Data Model diff --git a/doc/source/api/sequence_annotations.rst b/doc/source/api/sequence_annotations.rst new file mode 100644 index 00000000..327b170a --- /dev/null +++ b/doc/source/api/sequence_annotations.rst @@ -0,0 +1,36 @@ +.. _sequence_annotations: + +************************ +Sequence Annotations API +************************ +For the Sequence Annotation schema definitions, see `Sequence Annotation schema <../schemas/sequenceAnnotations.html>`_ + + +------------------------ +Feature Based Hierarchy +------------------------ +The central object of the GA4GH Sequence Annotation API is a Feature. The Feature describes an interval of interest on some reference(s). It has a span from a start position to a stop position as well as descriptive data. A Feature can have one or more parent Features, which enables the construction of more complex representations in a hierarchical way. + +For example, a top-level Feature may be a single Gene. The different transcripts would have the gene Feature as parent. Similarly, the specific exons for each transcript would have both gene and transcript as parent. This structure can also extend to annotating CDS, binding sites or any other sub-gene level features. + +This model is very similar to that used by `GFF3`_. + +.. _GFF3: http://sequenceontology.org/resources/gff3.html + +A FeatureSet is simply a collection of features from the same source. An implementer may, for example, choose to gather +all Features from the same GFF3 file into a common FeatureSet. 
+ +------------------------------ +The Sequence Annotation Schema +------------------------------ +TODO: insert an example annotation translation from GFF3 to GA4GH + +--------------------------------------- +Annotation Design - RNA Considerations +--------------------------------------- + +Read data derived from RNA samples can differ from genomic read data due to the presence of non-genomic sequences. An example would be a read that spans a splice junction. It describes a contiguous sequence of bases, but a discontinuous genomic region due to the missing intron. Feature level read assignment is further complicated by the existence of multiple splice isoforms. A read that can be definitely assigned to a particular feature (an exon in this case) may still not be definitely assigned to a particular transcript if multiple transcripts share that exon. The annotation API needs to be able to report assignment at the feature level as well as aggregate assignment at the transcript or even the whole gene level if assignment is not more specific than that. + +Splicing (other post-transcriptional modifications?) can occur with varying degrees of complexity. A ‘typical’ splice will result in a mature transcript with exons in positional (numerical) order in a head-to-tail orientation. Back splicing (tail-to-head) can result in transcripts with the exon order reversed (1-3-2-4 instead of 1-2-3-4) and even circular RNA. The exon order in a transcript as well as the orientation of the splice should be discoverable via the API. In a more general case, the API should allow child features to have an ordered relationship. + +The annotation API also needs to be flexible enough to handle multiple references in the same gene or transcript. This is needed to cover the cases of fusion genes or inter-chromosomal translocations. 
diff --git a/doc/source/schemas/metadata.rst b/doc/source/schemas/metadata.rst index 9864f3df..ae5b3d5f 100644 --- a/doc/source/schemas/metadata.rst +++ b/doc/source/schemas/metadata.rst @@ -104,6 +104,26 @@ This protocol defines metadata used in the other GA4GH protocols. A structure for an instance of a CIGAR operation. `FIXME: This belongs under Reads (only readAlignment refers to this)` +.. avro:record:: OntologyTerm + + :field ontologySourceName: + ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type ontologySourceName: null|string + :field ontologySourceID: + ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + :type ontologySourceID: null|string + :field ontologySourceVersion: + ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type ontologySourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + .. avro:record:: Experiment :field id: diff --git a/doc/source/schemas/metadatamethods.rst b/doc/source/schemas/metadatamethods.rst index 703c69cb..706ef8eb 100644 --- a/doc/source/schemas/metadatamethods.rst +++ b/doc/source/schemas/metadatamethods.rst @@ -125,6 +125,26 @@ Gets a `Dataset` by ID. A structure for an instance of a CIGAR operation. `FIXME: This belongs under Reads (only readAlignment refers to this)` +.. avro:record:: OntologyTerm + + :field ontologySourceName: + ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type ontologySourceName: null|string + :field ontologySourceID: + ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. 
http://purl.obolibrary.org/obo/hp.obo + :type ontologySourceID: null|string + :field ontologySourceVersion: + ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type ontologySourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + .. avro:record:: Experiment :field id: diff --git a/doc/source/schemas/readmethods.rst b/doc/source/schemas/readmethods.rst index 13714aa2..8a5927e8 100644 --- a/doc/source/schemas/readmethods.rst +++ b/doc/source/schemas/readmethods.rst @@ -163,6 +163,26 @@ Gets a `org.ga4gh.models.ReadGroup` by ID. A general exception type. +.. avro:record:: OntologyTerm + + :field ontologySourceName: + ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type ontologySourceName: null|string + :field ontologySourceID: + ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + :type ontologySourceID: null|string + :field ontologySourceVersion: + ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type ontologySourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + .. avro:record:: Experiment :field id: diff --git a/doc/source/schemas/reads.rst b/doc/source/schemas/reads.rst index 9201f867..8cfa7782 100644 --- a/doc/source/schemas/reads.rst +++ b/doc/source/schemas/reads.rst @@ -106,6 +106,26 @@ See {TODO: LINK TO READS OVERVIEW} for more information. 
A structure for an instance of a CIGAR operation. `FIXME: This belongs under Reads (only readAlignment refers to this)` +.. avro:record:: OntologyTerm + + :field ontologySourceName: + ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type ontologySourceName: null|string + :field ontologySourceID: + ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + :type ontologySourceID: null|string + :field ontologySourceVersion: + ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type ontologySourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + .. avro:record:: Experiment :field id: diff --git a/doc/source/schemas/sequenceAnnotationmethods.rst b/doc/source/schemas/sequenceAnnotationmethods.rst new file mode 100644 index 00000000..780a1cf6 --- /dev/null +++ b/doc/source/schemas/sequenceAnnotationmethods.rst @@ -0,0 +1,364 @@ +SequenceAnnotationMethods +************************* + + .. function:: getFeature(id) + + :param id: string: The ID of the `Feature`. + :return type: org.ga4gh.models.Feature + :throws: GAException + +Gets a `org.ga4gh.models.Feature` by ID. + `GET /features/{id}` will return a JSON version of `Feature`. + + .. function:: searchFeatures(request) + + :param request: SearchFeaturesRequest: This request maps to the body of `POST /features/search` as JSON. + :return type: SearchFeaturesResponse + :throws: GAException + +Gets a list of `Feature` matching the search criteria. + + `POST /features/search` must accept a JSON version of + `SearchFeaturesRequest` as the post body and will return a JSON version of + `SearchFeaturesResponse`. + +.. 
avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associated with some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The positive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. `ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. 
This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. 
+ :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:error:: GAException + + A general exception type. + +.. avro:record:: OntologyTerm + + :field ontologySourceName: + ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type ontologySourceName: null|string + :field ontologySourceID: + ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + :type ontologySourceID: null|string + :field ontologySourceVersion: + ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type ontologySourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + +.. avro:record:: Experiment + + :field id: + The experiment UUID. This is globally unique. + :type id: string + :field name: + The name of the experiment. + :type name: null|string + :field description: + A description of the experiment. + :type description: null|string + :field recordCreateTime: + The time at which this record was created. + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) + :type recordCreateTime: string + :field recordUpdateTime: + The time at which this record was last updated. 
+ Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) + :type recordUpdateTime: string + :field runTime: + The time at which this experiment was performed. + Granularity here is variable (e.g. date only). + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) + :type runTime: null|string + :field molecule: + The molecule examined in this experiment. (e.g. genomics DNA, total RNA) + :type molecule: null|string + :field strategy: + The experiment technique or strategy applied to the sample. + (e.g. whole genome sequencing, RNA-seq, RIP-seq) + :type strategy: null|string + :field selection: + The method used to enrich the target. (e.g. immunoprecipitation, size + fractionation, MNase digestion) + :type selection: null|string + :field library: + The name of the library used as part of this experiment. + :type library: null|string + :field libraryLayout: + The configuration of sequenced reads. (e.g. Single or Paired) + :type libraryLayout: null|string + :field instrumentModel: + The instrument model used as part of this experiment. + This maps to sequencing technology in BAM. + :type instrumentModel: null|string + :field instrumentDataFile: + The data file generated by the instrument. + TODO: This isn't actually a file is it? + Should this be `instrumentData` instead? + :type instrumentDataFile: null|string + :field sequencingCenter: + The sequencing center used as part of this experiment. + :type sequencingCenter: null|string + :field platformUnit: + The platform unit used as part of this experiment. This is a flowcell-barcode + or slide unique identifier. + :type platformUnit: null|string + :field info: + A map of additional experiment information. + :type info: map> + + An experimental preparation of a sample. + +.. avro:record:: Dataset + + :field id: + The dataset's id, locally unique to the server instance. + :type id: string + :field name: + The name of the dataset. 
+ :type name: null|string + :field description: + Additional, human-readable information on the dataset. + :type description: null|string + + A Dataset is a collection of related data of multiple types. + Data providers decide how to group data into datasets. + See [Metadata API](../api/metadata.html) for a more detailed discussion. + +.. avro:record:: Attributes + + :field vals: + :type vals: map> + + Type defining a collection of attributes associated with various protocol + records. Each attribute is a name that maps to an array of one or more + values. Values can be strings, external identifiers, or ontology terms. + Values should be split into the array elements instead of using a separator + syntax that needs to parsed. + +.. avro:record:: Feature + + :field id: + Id of this annotation node. + :type id: string + :field parentIds: + Ids of the parents of this annotation node. + :type parentIds: array + :field featureSetId: + Identifier for the containing feature set. + :type featureSetId: string + :field referenceName: + The reference on which this feature occurs. + (e.g. `chr20` or `X`) + :type referenceName: string + :field start: + The start position at which this feature occurs (0-based). + This corresponds to the first base of the string of reference bases. + Genomic positions are non-negative integers less than reference length. + Features spanning the join of circular genomes are represented as + two features one on each side of the join (position 0). + :type start: long + :field end: + The end position (exclusive), resulting in [start, end) closed-open interval. + This is typically calculated by `start + referenceBases.length`. + :type end: long + :field featureType: + Feature that is annotated by this region. Normally, this will be a term in + the Sequence Ontology. + :type featureType: OntologyTerm + :field attributes: + Name/value attributes of the annotation. 
Attribute names follow the GFF3 + naming convention of reserved names starting with an upper-case + character, and user-defined names starting with lower-case. Most GFF3 + pre-defined attributes apply, the exceptions are ID and Parent, which are + defined as fields. Additionally, the following attributes are added: + * Score - the GFF3 score column + * Phase - the GFF3 phase column for CDS features. + :type attributes: Attributes + + Node in the annotation graph that annotates a contiguous region of a + sequence. + +.. avro:record:: FeatureSet + + :field id: + The ID of this annotation set. + :type id: string + :field datasetId: + The ID of the dataset this annotation set belongs to. + :type datasetId: null|string + :field referenceSetId: + The ID of the reference set which defines the coordinate-space for this + set of annotations. + :type referenceSetId: null|string + :field name: + The display name for this annotation set. + :type name: null|string + :field sourceURI: + The source URI describing the file from which this annotation set was + generated, if any. + :type sourceURI: null|string + :field attributes: + Set of additional attributes + :type attributes: Attributes + +.. avro:record:: SearchFeaturesRequest + + :field featureSetId: + The annotation set to search within. Either `featureSetId` or + `parentId` must be non-empty. + :type featureSetId: null|string + :field parentId: + Restricts the search to direct children of the given parent `feature` + ID. Either `featureSetId` or `parentId` must be non-empty. + :type parentId: null|string + :field referenceName: + Only return features on the reference with this name. One of this + field or `referenceId` is required. (case-sensitive, exact match) + :type referenceName: null|string + :field referenceId: + Only return features on the reference with this ID. One of this field or + `referenceName` is required. + :type referenceId: null|string + :field start: + Required. 
The beginning of the window (0-based, inclusive) for which + overlapping features should be returned. Genomic positions are + non-negative integers less than reference length. Requests spanning the + join of circular genomes are represented as two requests one on each side + of the join (position 0). + :type start: long + :field end: + Required. The end of the window (0-based, exclusive) for which overlapping + features should be returned. + :type end: long + :field features: + If specified, this query matches only annotations which match one of the + provided feature types. + :type features: array + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /features/search` as JSON. + +.. avro:record:: SearchFeaturesResponse + + :field features: + The list of matching annotations, sorted by start position. Annotations which + share a start position are returned in a deterministic order. + :type features: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /features/search` expressed as JSON. 
+ diff --git a/doc/source/schemas/sequenceAnnotations.rst b/doc/source/schemas/sequenceAnnotations.rst new file mode 100644 index 00000000..a382cf4a --- /dev/null +++ b/doc/source/schemas/sequenceAnnotations.rst @@ -0,0 +1,296 @@ +SequenceAnnotations +******************* + +This protocol defines annotations on GA4GH genomic sequences. It includes two +types of annotations: continuous and discrete hierarchical. + +The discrete hierarchical annotations are derived from the Sequence Ontology +(SO) and GFF3 work + + http://www.sequenceontology.org/gff3.shtml + +The goal is to be able to store annotations using the GFF3 and SO conceptual +model, although there is not necessarily a one-to-one mapping in Avro records +to GFF3 records. + +The minimum requirement is to be able to accurately represent the current +state-of-the-art annotation data and the full SO model. Feature is the +core generic record which corresponds to a GFF3 record. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associated with some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The positive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. 
`ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. 
This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:record:: OntologyTerm + + :field ontologySourceName: + ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type ontologySourceName: null|string + :field ontologySourceID: + ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. 
http://purl.obolibrary.org/obo/hp.obo + :type ontologySourceID: null|string + :field ontologySourceVersion: + ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type ontologySourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + +.. avro:record:: Experiment + + :field id: + The experiment UUID. This is globally unique. + :type id: string + :field name: + The name of the experiment. + :type name: null|string + :field description: + A description of the experiment. + :type description: null|string + :field recordCreateTime: + The time at which this record was created. + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) + :type recordCreateTime: string + :field recordUpdateTime: + The time at which this record was last updated. + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) + :type recordUpdateTime: string + :field runTime: + The time at which this experiment was performed. + Granularity here is variable (e.g. date only). + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) + :type runTime: null|string + :field molecule: + The molecule examined in this experiment. (e.g. genomics DNA, total RNA) + :type molecule: null|string + :field strategy: + The experiment technique or strategy applied to the sample. + (e.g. whole genome sequencing, RNA-seq, RIP-seq) + :type strategy: null|string + :field selection: + The method used to enrich the target. (e.g. immunoprecipitation, size + fractionation, MNase digestion) + :type selection: null|string + :field library: + The name of the library used as part of this experiment. + :type library: null|string + :field libraryLayout: + The configuration of sequenced reads. (e.g. 
Single or Paired) + :type libraryLayout: null|string + :field instrumentModel: + The instrument model used as part of this experiment. + This maps to sequencing technology in BAM. + :type instrumentModel: null|string + :field instrumentDataFile: + The data file generated by the instrument. + TODO: This isn't actually a file is it? + Should this be `instrumentData` instead? + :type instrumentDataFile: null|string + :field sequencingCenter: + The sequencing center used as part of this experiment. + :type sequencingCenter: null|string + :field platformUnit: + The platform unit used as part of this experiment. This is a flowcell-barcode + or slide unique identifier. + :type platformUnit: null|string + :field info: + A map of additional experiment information. + :type info: map> + + An experimental preparation of a sample. + +.. avro:record:: Dataset + + :field id: + The dataset's id, locally unique to the server instance. + :type id: string + :field name: + The name of the dataset. + :type name: null|string + :field description: + Additional, human-readable information on the dataset. + :type description: null|string + + A Dataset is a collection of related data of multiple types. + Data providers decide how to group data into datasets. + See [Metadata API](../api/metadata.html) for a more detailed discussion. + +.. avro:record:: Attributes + + :field vals: + :type vals: map> + + Type defining a collection of attributes associated with various protocol + records. Each attribute is a name that maps to an array of one or more + values. Values can be strings, external identifiers, or ontology terms. + Values should be split into the array elements instead of using a separator + syntax that needs to parsed. + +.. avro:record:: Feature + + :field id: + Id of this annotation node. + :type id: string + :field parentIds: + Ids of the parents of this annotation node. + :type parentIds: array + :field featureSetId: + Identifier for the containing feature set. 
+ :type featureSetId: string + :field referenceName: + The reference on which this feature occurs. + (e.g. `chr20` or `X`) + :type referenceName: string + :field start: + The start position at which this feature occurs (0-based). + This corresponds to the first base of the string of reference bases. + Genomic positions are non-negative integers less than reference length. + Features spanning the join of circular genomes are represented as + two features one on each side of the join (position 0). + :type start: long + :field end: + The end position (exclusive), resulting in [start, end) closed-open interval. + This is typically calculated by `start + referenceBases.length`. + :type end: long + :field featureType: + Feature that is annotated by this region. Normally, this will be a term in + the Sequence Ontology. + :type featureType: OntologyTerm + :field attributes: + Name/value attributes of the annotation. Attribute names follow the GFF3 + naming convention of reserved names starting with an upper cases + character, and user-define names start with lower-case. Most GFF3 + pre-defined attributes apply, the exceptions are ID and Parent, which are + defined as fields. Additional, the following attributes are added: + * Score - the GFF3 score column + * Phase - the GFF3 phase column for CDS features. + :type attributes: Attributes + + Node in the annotation graph that annotates a contiguous region of a + sequence. + +.. avro:record:: FeatureSet + + :field id: + The ID of this annotation set. + :type id: string + :field datasetId: + The ID of the dataset this annotation set belongs to. + :type datasetId: null|string + :field referenceSetId: + The ID of the reference set which defines the coordinate-space for this + set of annotations. + :type referenceSetId: null|string + :field name: + The display name for this annotation set. + :type name: null|string + :field sourceURI: + The source URI describing the file from which this annotation set was + generated, if any. 
+ :type sourceURI: null|string + :field attributes: + Set of additional attributes + :type attributes: Attributes + diff --git a/src/main/resources/avro/metadata.avdl b/src/main/resources/avro/metadata.avdl index 60db1ba5..b9f01533 100644 --- a/src/main/resources/avro/metadata.avdl +++ b/src/main/resources/avro/metadata.avdl @@ -8,6 +8,32 @@ protocol Metadata { import idl "common.avdl"; +/** + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + */ + record OntologyTerm { + /** + ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + */ + union { null, string } ontologySourceName = null; + + /** + ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + */ + union { null, string } ontologySourceID = null; + + /** + ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + */ + union { null, string } ontologySourceVersion = null; + } + /** An experimental preparation of a sample. */ diff --git a/src/main/resources/avro/sequenceAnnotationmethods.avdl b/src/main/resources/avro/sequenceAnnotationmethods.avdl new file mode 100644 index 00000000..f96911cc --- /dev/null +++ b/src/main/resources/avro/sequenceAnnotationmethods.avdl @@ -0,0 +1,115 @@ +@namespace("org.ga4gh.methods") + +protocol SequenceAnnotationMethods { + + import idl "common.avdl"; + import idl "methods.avdl"; + import idl "sequenceAnnotations.avdl"; + + /** + This request maps to the body of `POST /features/search` as JSON. + */ + record SearchFeaturesRequest { + /** + The annotation set to search within. Either `featureSetId` or + `parentId` must be non-empty. 
+ */ + union { null, string } featureSetId; + + /** + Restricts the search to direct children of the given parent `feature` + ID. Either `featureSetId` or `parentId` must be non-empty. + */ + union { null, string } parentId; + + /** + Only return features with on the reference with this name. One of this + field or `referenceId` is required. (case-sensitive, exact match) + */ + union { null, string } referenceName = null; + + /** + Only return feature on the reference with this ID. One of this field or + `referenceName` is required. + */ + union { null, string } referenceId = null; + + /** + Required. The beginning of the window (0-based, inclusive) for which + overlapping features should be returned. Genomic positions are + non-negative integers less than reference length. Requests spanning the + join of circular genomes are represented as two requests one on each side + of the join (position 0). + */ + long start; + + /** + Required. The end of the window (0-based, exclusive) for which overlapping + features should be returned. + */ + long end; + + // TODO: Fix this field. Clarify semantics around how OntologyTerm + // matching works, or change the Ontology term field on the feature. + /** + If specified, this query matches only annotations which match one of the + provided feature types. + */ + array features = []; + + /** + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + */ + union { null, int } pageSize = null; + + /** + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + */ + union { null, string } pageToken = null; + } + + /** This is the response from `POST /features/search` expressed as JSON. */ + record SearchFeaturesResponse { + /** + The list of matching annotations, sorted by start position. 
Annotations which + share a start position are returned in a deterministic order. + */ + array features = []; + + /** + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + */ + union { null, string } nextPageToken = null; + } + + /** + Gets a list of `Feature` matching the search criteria. + + `POST /features/search` must accept a JSON version of + `SearchFeaturesRequest` as the post body and will return a JSON version of + `SearchFeaturesResponse`. + */ + SearchFeaturesResponse searchFeatures( + /** This request maps to the body of `POST /features/search` as JSON. */ + SearchFeaturesRequest request) throws GAException; + + + /**************** /features/{id} *******************/ + /** + Gets a `org.ga4gh.models.Feature` by ID. + `GET /features/{id}` will return a JSON version of `Feature`. + */ + org.ga4gh.models.Feature getFeature( + /** + The ID of the `Feature`. + */ + string id) throws GAException; + +} + +// TODO: AnnotationSet methods. diff --git a/src/main/resources/avro/sequenceAnnotations.avdl b/src/main/resources/avro/sequenceAnnotations.avdl new file mode 100644 index 00000000..d5cb2cf0 --- /dev/null +++ b/src/main/resources/avro/sequenceAnnotations.avdl @@ -0,0 +1,124 @@ +@namespace("org.ga4gh.models") +/** +This protocol defines annotations on GA4GH genomic sequences It includes two +types of annotations: continuous and discrete hierarchical. + +The discrete hierarchical annotations are derived from the Sequence Ontology +(SO) and GFF3 work + + http://www.sequenceontology.org/gff3.shtml + +The goal is to be able to store annotations using the GFF3 and SO conceptual +model, although there is not necessarly a one-to-one mapping in Avro records +to GFF3 records. 
+ +The minimum requirement is to be able to accurately represent the current +state of the art annotation data and the full SO model. Feature is the +core generic record which corresponds to the a GFF3 record. +*/ +protocol SequenceAnnotations { + + import idl "common.avdl"; + import idl "metadata.avdl"; + + /** + Type defining a collection of attributes associated with various protocol + records. Each attribute is a name that maps to an array of one or more + values. Values can be strings, external identifiers, or ontology terms. + Values should be split into the array elements instead of using a separator + syntax that needs to parsed. + */ + // TODO: how are multiple instances of a given attribute vs multiple values + // for an attribute distinguished + record Attributes { + map<array<union { string, ExternalIdentifier, OntologyTerm }>> vals = {}; + } + + /** + Node in the annotation graph that annotates a contiguous region of a + sequence. + */ + record Feature { + /** + Id of this annotation node. + */ + string id; + + /** + Ids of the parents of this annotation node. + */ + array<string> parentIds; + + /** + Identifier for the containing feature set. + */ + string featureSetId; + + /** + The reference on which this feature occurs. + (e.g. `chr20` or `X`) + */ + string referenceName; + + /** + The start position at which this feature occurs (0-based). + This corresponds to the first base of the string of reference bases. + Genomic positions are non-negative integers less than reference length. + Features spanning the join of circular genomes are represented as + two features one on each side of the join (position 0). + */ + long start = 0; + + /** + The end position (exclusive), resulting in [start, end) closed-open interval. + This is typically calculated by `start + referenceBases.length`. + */ + long end; + + /** + Feature that is annotated by this region. Normally, this will be a term in + the Sequence Ontology. + */ + OntologyTerm featureType; + + /** + Name/value attributes of the annotation.
Attribute names follow the GFF3 + naming convention of reserved names starting with an upper cases + character, and user-define names start with lower-case. Most GFF3 + pre-defined attributes apply, the exceptions are ID and Parent, which are + defined as fields. Additional, the following attributes are added: + * Score - the GFF3 score column + * Phase - the GFF3 phase column for CDS features. + */ + Attributes attributes; + } + + /* + A set of sequence features annotations + */ + record FeatureSet { + /** The ID of this annotation set. */ + string id; + + /** The ID of the dataset this annotation set belongs to. */ + union { null, string } datasetId = null; + + /** + The ID of the reference set which defines the coordinate-space for this + set of annotations. + */ + union { null, string } referenceSetId; + + /** The display name for this annotation set. */ + union { null, string } name = null; + + /** + The source URI describing the file from which this annotation set was + generated, if any. + */ + union { null, string } sourceURI = null; + + /** Set of additional attributes */ + Attributes attributes; + } +} From 6fe416673d7070e52c5d84dc597160fa243b2ba0 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Thu, 28 Jan 2016 08:57:40 -0800 Subject: [PATCH 02/13] Updated OntologyTerm with latest definition from metadata group. 
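Reader's note on this patch: the OntologyTerm change can be sketched as a small conversion from the PATCH 01 field names to the PATCH 02 field names. This is an illustrative Python sketch, not code from the repository. `upgrade_ontology_term` and `OLD_TO_NEW` are hypothetical names, and pairing each old field with a new one is an assumption based on the matching field descriptions in the two definitions.

```python
# Hypothetical helper, not part of the GA4GH schemas or any GA4GH library.
# Field names on both sides come from the OntologyTerm records in PATCH 01
# (old) and PATCH 02 (new); the old-to-new pairing is an assumption based on
# the matching field descriptions.
OLD_TO_NEW = {
    "ontologySourceID": "id",
    "ontologySourceName": "sourceName",
    "ontologySourceVersion": "sourceVersion",
}


def upgrade_ontology_term(old):
    """Return a dict shaped like the PATCH 02 OntologyTerm record."""
    if old.get("ontologySourceID") is None:
        # `id` is a required string in the new record, while the old
        # ontologySourceID was nullable, so there is nothing safe to emit.
        raise ValueError("ontologySourceID is required to populate `id`")
    # `term` is new in PATCH 02 and has no old counterpart, so it keeps its
    # schema default of null (None here).
    new = {"id": None, "term": None, "sourceName": None, "sourceVersion": None}
    for old_key, new_key in OLD_TO_NEW.items():
        new[new_key] = old.get(old_key)
    return new
```

The sketch raises on a missing source identifier rather than inventing one, because `id` changes from an optional field to a required one in this patch.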
--- doc/source/schemas/index.rst | 4 ++ doc/source/schemas/metadata.rst | 36 +++++++----- doc/source/schemas/metadatamethods.rst | 36 +++++++----- doc/source/schemas/readmethods.rst | 36 +++++++----- doc/source/schemas/reads.rst | 36 +++++++----- .../schemas/sequenceAnnotationmethods.rst | 36 +++++++----- doc/source/schemas/sequenceAnnotations.rst | 36 +++++++----- src/main/resources/avro/metadata.avdl | 56 +++++++++++-------- 8 files changed, 162 insertions(+), 114 deletions(-) diff --git a/doc/source/schemas/index.rst b/doc/source/schemas/index.rst index b634cbdc..0e42a1b3 100644 --- a/doc/source/schemas/index.rst +++ b/doc/source/schemas/index.rst @@ -4,6 +4,7 @@ Schemas .. toctree:: common metadata + metadatamethods methods readmethods reads @@ -11,3 +12,6 @@ Schemas references variantmethods variants + sequenceAnnotations + sequenceAnnotationmethods + diff --git a/doc/source/schemas/metadata.rst b/doc/source/schemas/metadata.rst index ae5b3d5f..413c5f74 100644 --- a/doc/source/schemas/metadata.rst +++ b/doc/source/schemas/metadata.rst @@ -106,23 +106,29 @@ This protocol defines metadata used in the other GA4GH protocols. .. avro:record:: OntologyTerm - :field ontologySourceName: - ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type ontologySourceName: null|string - :field ontologySourceID: - ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - :type ontologySourceID: null|string - :field ontologySourceVersion: - ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type ontologySourceVersion: null|string + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. 
http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. + :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) + 'polydactyly' from HPO) .. avro:record:: Experiment diff --git a/doc/source/schemas/metadatamethods.rst b/doc/source/schemas/metadatamethods.rst index 706ef8eb..536a9895 100644 --- a/doc/source/schemas/metadatamethods.rst +++ b/doc/source/schemas/metadatamethods.rst @@ -127,23 +127,29 @@ Gets a `Dataset` by ID. .. avro:record:: OntologyTerm - :field ontologySourceName: - ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type ontologySourceName: null|string - :field ontologySourceID: - ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - :type ontologySourceID: null|string - :field ontologySourceVersion: - ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. 
- :type ontologySourceVersion: null|string + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. + :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) + 'polydactyly' from HPO) .. avro:record:: Experiment diff --git a/doc/source/schemas/readmethods.rst b/doc/source/schemas/readmethods.rst index 8a5927e8..1a8687ee 100644 --- a/doc/source/schemas/readmethods.rst +++ b/doc/source/schemas/readmethods.rst @@ -165,23 +165,29 @@ Gets a `org.ga4gh.models.ReadGroup` by ID. .. avro:record:: OntologyTerm - :field ontologySourceName: - ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type ontologySourceName: null|string - :field ontologySourceID: - ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - :type ontologySourceID: null|string - :field ontologySourceVersion: - ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. 
- There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type ontologySourceVersion: null|string + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. + :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) + 'polydactyly' from HPO) .. avro:record:: Experiment diff --git a/doc/source/schemas/reads.rst b/doc/source/schemas/reads.rst index 8cfa7782..4d674e5b 100644 --- a/doc/source/schemas/reads.rst +++ b/doc/source/schemas/reads.rst @@ -108,23 +108,29 @@ See {TODO: LINK TO READS OVERVIEW} for more information. .. avro:record:: OntologyTerm - :field ontologySourceName: - ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type ontologySourceName: null|string - :field ontologySourceID: - ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. 
http://purl.obolibrary.org/obo/hp.obo - :type ontologySourceID: null|string - :field ontologySourceVersion: - ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type ontologySourceVersion: null|string + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. + :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) + 'polydactyly' from HPO) .. avro:record:: Experiment diff --git a/doc/source/schemas/sequenceAnnotationmethods.rst b/doc/source/schemas/sequenceAnnotationmethods.rst index 780a1cf6..ab0d18e1 100644 --- a/doc/source/schemas/sequenceAnnotationmethods.rst +++ b/doc/source/schemas/sequenceAnnotationmethods.rst @@ -129,23 +129,29 @@ Gets a list of `Feature` matching the search criteria. .. avro:record:: OntologyTerm - :field ontologySourceName: - ontology source name - the name of ontology from which the term is obtained - e.g. 
'Human Phenotype Ontology' - :type ontologySourceName: null|string - :field ontologySourceID: - ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - :type ontologySourceID: null|string - :field ontologySourceVersion: - ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type ontologySourceVersion: null|string + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. + :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) + 'polydactyly' from HPO) .. avro:record:: Experiment diff --git a/doc/source/schemas/sequenceAnnotations.rst b/doc/source/schemas/sequenceAnnotations.rst index a382cf4a..5813ba5a 100644 --- a/doc/source/schemas/sequenceAnnotations.rst +++ b/doc/source/schemas/sequenceAnnotations.rst @@ -120,23 +120,29 @@ core generic record which corresponds to the a GFF3 record. .. 
avro:record:: OntologyTerm - :field ontologySourceName: - ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type ontologySourceName: null|string - :field ontologySourceID: - ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - :type ontologySourceID: null|string - :field ontologySourceVersion: - ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type ontologySourceVersion: null|string + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. + :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) + 'polydactyly' from HPO) .. 
avro:record:: Experiment diff --git a/src/main/resources/avro/metadata.avdl b/src/main/resources/avro/metadata.avdl index b9f01533..f106f700 100644 --- a/src/main/resources/avro/metadata.avdl +++ b/src/main/resources/avro/metadata.avdl @@ -9,30 +9,38 @@ protocol Metadata { import idl "common.avdl"; /** - An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) - */ - record OntologyTerm { - /** - ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - */ - union { null, string } ontologySourceName = null; - - /** - ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - */ - union { null, string } ontologySourceID = null; - - /** - ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - */ - union { null, string } ontologySourceVersion = null; - } +An ontology term describing an attribute. (e.g. the phenotype attribute +'polydactyly' from HPO) +*/ +record OntologyTerm { + /** + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + */ + string id; + + /** + Ontology term - the representation the id is pointing to. + */ + union { null, string } term = null; + + /** + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + */ + union { null, string } sourceName = null; + + /** + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. 
+ There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + */ + union { null, string } sourceVersion = null; +} /** An experimental preparation of a sample. From 5305261b9980b346dafc3008512db9eeeb6f2a06 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Fri, 29 Jan 2016 10:26:24 -0800 Subject: [PATCH 03/13] changed schemas per discussion in PR. --- .gitignore | 5 +- doc/source/schemas/common.rst | 107 ---- doc/source/schemas/metadata.rst | 211 ------- doc/source/schemas/metadatamethods.rst | 263 -------- doc/source/schemas/methods.rst | 7 - doc/source/schemas/readmethods.rst | 580 ------------------ doc/source/schemas/reads.rst | 433 ------------- doc/source/schemas/referencemethods.rst | 379 ------------ doc/source/schemas/references.rst | 199 ------ .../schemas/sequenceAnnotationmethods.rst | 370 ----------- doc/source/schemas/sequenceAnnotations.rst | 302 --------- doc/source/schemas/variantmethods.rst | 475 -------------- doc/source/schemas/variants.rst | 297 --------- .../avro/sequenceAnnotationmethods.avdl | 78 ++- .../resources/avro/sequenceAnnotations.avdl | 5 + 15 files changed, 73 insertions(+), 3638 deletions(-) delete mode 100644 doc/source/schemas/common.rst delete mode 100644 doc/source/schemas/metadata.rst delete mode 100644 doc/source/schemas/metadatamethods.rst delete mode 100644 doc/source/schemas/methods.rst delete mode 100644 doc/source/schemas/readmethods.rst delete mode 100644 doc/source/schemas/reads.rst delete mode 100644 doc/source/schemas/referencemethods.rst delete mode 100644 doc/source/schemas/references.rst delete mode 100644 doc/source/schemas/sequenceAnnotationmethods.rst delete mode 100644 doc/source/schemas/sequenceAnnotations.rst delete mode 100644 doc/source/schemas/variantmethods.rst delete mode 100644 doc/source/schemas/variants.rst diff --git a/.gitignore b/.gitignore index fded6d94..0a94a474 100644 --- a/.gitignore +++ b/.gitignore @@ -1,8 +1,11 @@ 
*.py[cod] target *~ -#* + doc/source/schemas/*.avpr +doc/source/schemas/*.rst +!doc/source/schemas/index.rst + build #********** windows template********** diff --git a/doc/source/schemas/common.rst b/doc/source/schemas/common.rst deleted file mode 100644 index 99605783..00000000 --- a/doc/source/schemas/common.rst +++ /dev/null @@ -1,107 +0,0 @@ -Common -****** - -This file defines common types used in other parts of the schema. -There are no directly associated methods. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. 
The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. 
- * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - diff --git a/doc/source/schemas/metadata.rst b/doc/source/schemas/metadata.rst deleted file mode 100644 index 413c5f74..00000000 --- a/doc/source/schemas/metadata.rst +++ /dev/null @@ -1,211 +0,0 @@ -Metadata -******** - -This protocol defines metadata used in the other GA4GH protocols. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. 
- :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. 
- * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. 
avro:record:: OntologyTerm - - :field id: - Ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - It differs from the standard GA4GH schema's :ref:`id ` - in that it is a URI pointing to an information resource outside of the scope - of the schema or its resource implementation. - :type id: string - :field term: - Ontology term - the representation the id is pointing to. - :type term: null|string - :field sourceName: - Ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type sourceName: null|string - :field sourceVersion: - Ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type sourceVersion: null|string - - An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) - -.. avro:record:: Experiment - - :field id: - The experiment UUID. This is globally unique. - :type id: string - :field name: - The name of the experiment. - :type name: null|string - :field description: - A description of the experiment. - :type description: null|string - :field recordCreateTime: - The time at which this record was created. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordCreateTime: string - :field recordUpdateTime: - The time at which this record was last updated. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordUpdateTime: string - :field runTime: - The time at which this experiment was performed. - Granularity here is variable (e.g. date only). - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) - :type runTime: null|string - :field molecule: - The molecule examined in this experiment. (e.g. 
genomics DNA, total RNA) - :type molecule: null|string - :field strategy: - The experiment technique or strategy applied to the sample. - (e.g. whole genome sequencing, RNA-seq, RIP-seq) - :type strategy: null|string - :field selection: - The method used to enrich the target. (e.g. immunoprecipitation, size - fractionation, MNase digestion) - :type selection: null|string - :field library: - The name of the library used as part of this experiment. - :type library: null|string - :field libraryLayout: - The configuration of sequenced reads. (e.g. Single or Paired) - :type libraryLayout: null|string - :field instrumentModel: - The instrument model used as part of this experiment. - This maps to sequencing technology in BAM. - :type instrumentModel: null|string - :field instrumentDataFile: - The data file generated by the instrument. - TODO: This isn't actually a file is it? - Should this be `instrumentData` instead? - :type instrumentDataFile: null|string - :field sequencingCenter: - The sequencing center used as part of this experiment. - :type sequencingCenter: null|string - :field platformUnit: - The platform unit used as part of this experiment. This is a flowcell-barcode - or slide unique identifier. - :type platformUnit: null|string - :field info: - A map of additional experiment information. - :type info: map> - - An experimental preparation of a sample. - -.. avro:record:: Dataset - - :field id: - The dataset's id, locally unique to the server instance. - :type id: string - :field name: - The name of the dataset. - :type name: null|string - :field description: - Additional, human-readable information on the dataset. - :type description: null|string - - A Dataset is a collection of related data of multiple types. - Data providers decide how to group data into datasets. - See [Metadata API](../api/metadata.html) for a more detailed discussion. 
- diff --git a/doc/source/schemas/metadatamethods.rst b/doc/source/schemas/metadatamethods.rst deleted file mode 100644 index 536a9895..00000000 --- a/doc/source/schemas/metadatamethods.rst +++ /dev/null @@ -1,263 +0,0 @@ -MetadataMethods -*************** - - .. function:: searchDatasets(request) - - :param request: SearchDatasetsRequest: This request maps to the body of `POST /datasets/search` as JSON. - :return type: SearchDatasetsResponse - :throws: GAException - -Gets a list of datasets accessible through the API. - -TODO: Reads and variants both want to have datasets. Are they the same object? - -`POST /datasets/search` must accept a JSON version of -`SearchDatasetsRequest` as the post body and will return a JSON version -of `SearchDatasetsResponse`. - - .. function:: getDataset(id) - - :param id: string: The ID of the `Dataset`. - :return type: org.ga4gh.models.Dataset - :throws: GAException - -Gets a `Dataset` by ID. -`GET /datasets/{id}` will return a JSON version of `Dataset`. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. 
- (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. 
This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. avro:record:: OntologyTerm - - :field id: - Ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - It differs from the standard GA4GH schema's :ref:`id ` - in that it is a URI pointing to an information resource outside of the scope - of the schema or its resource implementation. - :type id: string - :field term: - Ontology term - the representation the id is pointing to. 
- :type term: null|string - :field sourceName: - Ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type sourceName: null|string - :field sourceVersion: - Ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type sourceVersion: null|string - - An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) - -.. avro:record:: Experiment - - :field id: - The experiment UUID. This is globally unique. - :type id: string - :field name: - The name of the experiment. - :type name: null|string - :field description: - A description of the experiment. - :type description: null|string - :field recordCreateTime: - The time at which this record was created. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordCreateTime: string - :field recordUpdateTime: - The time at which this record was last updated. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordUpdateTime: string - :field runTime: - The time at which this experiment was performed. - Granularity here is variable (e.g. date only). - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) - :type runTime: null|string - :field molecule: - The molecule examined in this experiment. (e.g. genomics DNA, total RNA) - :type molecule: null|string - :field strategy: - The experiment technique or strategy applied to the sample. - (e.g. whole genome sequencing, RNA-seq, RIP-seq) - :type strategy: null|string - :field selection: - The method used to enrich the target. (e.g. immunoprecipitation, size - fractionation, MNase digestion) - :type selection: null|string - :field library: - The name of the library used as part of this experiment. 
- :type library: null|string - :field libraryLayout: - The configuration of sequenced reads. (e.g. Single or Paired) - :type libraryLayout: null|string - :field instrumentModel: - The instrument model used as part of this experiment. - This maps to sequencing technology in BAM. - :type instrumentModel: null|string - :field instrumentDataFile: - The data file generated by the instrument. - TODO: This isn't actually a file is it? - Should this be `instrumentData` instead? - :type instrumentDataFile: null|string - :field sequencingCenter: - The sequencing center used as part of this experiment. - :type sequencingCenter: null|string - :field platformUnit: - The platform unit used as part of this experiment. This is a flowcell-barcode - or slide unique identifier. - :type platformUnit: null|string - :field info: - A map of additional experiment information. - :type info: map> - - An experimental preparation of a sample. - -.. avro:record:: Dataset - - :field id: - The dataset's id, locally unique to the server instance. - :type id: string - :field name: - The name of the dataset. - :type name: null|string - :field description: - Additional, human-readable information on the dataset. - :type description: null|string - - A Dataset is a collection of related data of multiple types. - Data providers decide how to group data into datasets. - See [Metadata API](../api/metadata.html) for a more detailed discussion. - -.. avro:error:: GAException - - A general exception type. - -.. avro:record:: SearchDatasetsRequest - - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. 
- :type pageToken: null|string - - This request maps to the body of `POST /datasets/search` as JSON. - -.. avro:record:: SearchDatasetsResponse - - :field datasets: - The list of datasets. - :type datasets: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /datasets/search` expressed as JSON. - diff --git a/doc/source/schemas/methods.rst b/doc/source/schemas/methods.rst deleted file mode 100644 index e02a4cb3..00000000 --- a/doc/source/schemas/methods.rst +++ /dev/null @@ -1,7 +0,0 @@ -RPC -*** - -.. avro:error:: GAException - - A general exception type. - diff --git a/doc/source/schemas/readmethods.rst b/doc/source/schemas/readmethods.rst deleted file mode 100644 index 1a8687ee..00000000 --- a/doc/source/schemas/readmethods.rst +++ /dev/null @@ -1,580 +0,0 @@ -ReadMethods -*********** - - .. function:: searchReads(request) - - :param request: SearchReadsRequest: This request maps to the body of `POST /reads/search` as JSON. - :return type: SearchReadsResponse - :throws: GAException - -Gets a list of `ReadAlignment`s for one or more `ReadGroup`s. - -`searchReads` operates over a genomic coordinate space of reference sequence -and position defined by the `Reference`s to which the requested `ReadGroup`s are -aligned. - -If a target positional range is specified, search returns all reads whose -alignment to the reference genome *overlap* the range. A query which specifies -only read group IDs yields all reads in those read groups, including unmapped -reads. - -All reads returned (including reads on subsequent pages) are ordered by genomic -coordinate (by reference sequence, then position). Reads with equivalent genomic -coordinates are returned in an unspecified order. 
This order must be consistent -for a given repository, such that two queries for the same content (regardless -of page size) yield reads in the same order across their respective streams of -paginated responses. - -`POST /reads/search` must accept a JSON version of `SearchReadsRequest` as -the post body and will return a JSON version of `SearchReadsResponse`. - - .. function:: searchReadGroupSets(request) - - :param request: SearchReadGroupSetsRequest: This request maps to the body of `POST /readgroupsets/search` as JSON. - :return type: SearchReadGroupSetsResponse - :throws: GAException - -Gets a list of `ReadGroupSet` matching the search criteria. - -`POST /readgroupsets/search` must accept a JSON version of -`SearchReadGroupSetsRequest` as the post body and will return a JSON -version of `SearchReadGroupSetsResponse`. - - .. function:: getReadGroupSet(id) - - :param id: string: The ID of the `ReadGroupSet`. - :return type: org.ga4gh.models.ReadGroupSet - :throws: GAException - -Gets a `org.ga4gh.models.ReadGroupSet` by ID. -`GET /readgroupsets/{id}` will return a JSON version of `ReadGroupSet`. - - .. function:: getReadGroup(id) - - :param id: string: The ID of the `ReadGroup`. - :return type: org.ga4gh.models.ReadGroup - :throws: GAException - -Gets a `org.ga4gh.models.ReadGroup` by ID. -`GET /readgroups/{id}` will return a JSON version of `ReadGroup`. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. 
- :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. 
- * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. 
avro:error:: GAException - - A general exception type. - -.. avro:record:: OntologyTerm - - :field id: - Ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - It differs from the standard GA4GH schema's :ref:`id ` - in that it is a URI pointing to an information resource outside of the scope - of the schema or its resource implementation. - :type id: string - :field term: - Ontology term - the representation the id is pointing to. - :type term: null|string - :field sourceName: - Ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type sourceName: null|string - :field sourceVersion: - Ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type sourceVersion: null|string - - An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) - -.. avro:record:: Experiment - - :field id: - The experiment UUID. This is globally unique. - :type id: string - :field name: - The name of the experiment. - :type name: null|string - :field description: - A description of the experiment. - :type description: null|string - :field recordCreateTime: - The time at which this record was created. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordCreateTime: string - :field recordUpdateTime: - The time at which this record was last updated. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordUpdateTime: string - :field runTime: - The time at which this experiment was performed. - Granularity here is variable (e.g. date only). - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 
2015-02-10T00:03:42) - :type runTime: null|string - :field molecule: - The molecule examined in this experiment. (e.g. genomics DNA, total RNA) - :type molecule: null|string - :field strategy: - The experiment technique or strategy applied to the sample. - (e.g. whole genome sequencing, RNA-seq, RIP-seq) - :type strategy: null|string - :field selection: - The method used to enrich the target. (e.g. immunoprecipitation, size - fractionation, MNase digestion) - :type selection: null|string - :field library: - The name of the library used as part of this experiment. - :type library: null|string - :field libraryLayout: - The configuration of sequenced reads. (e.g. Single or Paired) - :type libraryLayout: null|string - :field instrumentModel: - The instrument model used as part of this experiment. - This maps to sequencing technology in BAM. - :type instrumentModel: null|string - :field instrumentDataFile: - The data file generated by the instrument. - TODO: This isn't actually a file is it? - Should this be `instrumentData` instead? - :type instrumentDataFile: null|string - :field sequencingCenter: - The sequencing center used as part of this experiment. - :type sequencingCenter: null|string - :field platformUnit: - The platform unit used as part of this experiment. This is a flowcell-barcode - or slide unique identifier. - :type platformUnit: null|string - :field info: - A map of additional experiment information. - :type info: map> - - An experimental preparation of a sample. - -.. avro:record:: Dataset - - :field id: - The dataset's id, locally unique to the server instance. - :type id: string - :field name: - The name of the dataset. - :type name: null|string - :field description: - Additional, human-readable information on the dataset. - :type description: null|string - - A Dataset is a collection of related data of multiple types. - Data providers decide how to group data into datasets. - See [Metadata API](../api/metadata.html) for a more detailed discussion. 
- -.. avro:record:: Program - - :field commandLine: - The command line used to run this program. - :type commandLine: null|string - :field id: - The user specified ID of the program. - :type id: null|string - :field name: - The name of the program. - :type name: null|string - :field prevProgramId: - The ID of the program run before this one. - :type prevProgramId: null|string - :field version: - The version of the program run. - :type version: null|string - - Program can be used to track the provenance of how read data was generated. - -.. avro:record:: ReadStats - - :field alignedReadCount: - The number of aligned reads. - :type alignedReadCount: null|long - :field unalignedReadCount: - The number of unaligned reads. - :type unalignedReadCount: null|long - :field baseCount: - The total number of bases. - This is equivalent to the sum of `alignedSequence.length` for all reads. - :type baseCount: null|long - - ReadStats can be used to provide summary statistics about read data. - -.. avro:record:: ReadGroup - - :field id: - The read group ID. - :type id: string - :field datasetId: - The ID of the dataset this read group belongs to. - :type datasetId: null|string - :field name: - The read group name. - :type name: null|string - :field description: - The read group description. - :type description: null|string - :field sampleId: - The sample this read group's data was generated from. - Note: the current API does not have a rigorous definition of sample. Therefore, this - field actually contains an arbitrary string, typically corresponding to the SM tag in a - BAM file. - :type sampleId: null|string - :field experiment: - The experiment used to generate this read group. - :type experiment: null|Experiment - :field predictedInsertSize: - The predicted insert size of this read group. - :type predictedInsertSize: null|int - :field created: - The time at which this read group was created in milliseconds from the epoch. 
- :type created: null|long - :field updated: - The time at which this read group was last updated in milliseconds - from the epoch. - :type updated: null|long - :field stats: - Statistical data on reads in this read group. - :type stats: null|ReadStats - :field programs: - The programs used to generate this read group. - :type programs: array - :field referenceSetId: - The ID of the reference set to which the reads in this read group are aligned. - Required if there are any read alignments. - :type referenceSetId: null|string - :field info: - A map of additional read group information. - :type info: map> - - A ReadGroup is a set of reads derived from one physical sequencing process. - -.. avro:record:: ReadGroupSet - - :field id: - The read group set ID. - :type id: string - :field datasetId: - The ID of the dataset this read group set belongs to. - :type datasetId: null|string - :field name: - The read group set name. - :type name: null|string - :field stats: - Statistical data on reads in this read group set. - :type stats: null|ReadStats - :field readGroups: - The read groups in this set. - :type readGroups: array - - A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet - represents all the reads from one experimental sample. - -.. avro:record:: LinearAlignment - - :field position: - The position of this alignment. - :type position: Position - :field mappingQuality: - The mapping quality of this alignment, meaning the likelihood that the read - maps to this position. - - Specifically, this is -10 log10 Pr(mapping position is wrong), rounded to the - nearest integer. - :type mappingQuality: null|int - :field cigar: - Represents the local alignment of this sequence (alignment matches, indels, etc) - versus the reference. - :type cigar: array - - A linear alignment describes the alignment of a read to a Reference, using a - position and CIGAR array. - -.. avro:record:: ReadAlignment - - :field id: - The read alignment ID. 
This ID is unique within the read group this - alignment belongs to. - - For performance reasons, this field may be omitted by a backend. - If provided, its intended use is to make caching and UI display easier for - genome browsers and other lightweight clients. - :type id: null|string - :field readGroupId: - The ID of the read group this read belongs to. - (Every read must belong to exactly one read group.) - :type readGroupId: string - :field fragmentName: - The fragment name. Equivalent to QNAME (query template name) in SAM. - :type fragmentName: string - :field properPlacement: - The orientation and the distance between reads from the fragment are - consistent with the sequencing protocol (equivalent to SAM flag 0x2) - :type properPlacement: null|boolean - :field duplicateFragment: - The fragment is a PCR or optical duplicate (SAM flag 0x400). - :type duplicateFragment: null|boolean - :field numberReads: - The number of reads in the fragment (extension to SAM flag 0x1) - :type numberReads: null|int - :field fragmentLength: - The observed length of the fragment, equivalent to TLEN in SAM. - :type fragmentLength: null|int - :field readNumber: - The read ordinal in the fragment, 0-based and less than numberReads. This - field replaces SAM flag 0x40 and 0x80 and is intended to more cleanly - represent multiple reads per fragment. - :type readNumber: null|int - :field failedVendorQualityChecks: - The read fails platform or vendor quality checks (SAM flag 0x200). - :type failedVendorQualityChecks: null|boolean - :field alignment: - The alignment for this alignment record. This field will be null if the read - is unmapped. - :type alignment: null|LinearAlignment - :field secondaryAlignment: - Whether this alignment is secondary. Equivalent to SAM flag 0x100. - A secondary alignment represents an alternative to the primary alignment - for this read. Aligners may return secondary alignments if a read can map - ambiguously to multiple coordinates in the genome. 
- - By convention, each read has one and only one alignment where both - secondaryAlignment and supplementaryAlignment are false. - :type secondaryAlignment: null|boolean - :field supplementaryAlignment: - Whether this alignment is supplementary. Equivalent to SAM flag 0x800. - Supplementary alignments are used in the representation of a chimeric - alignment. In a chimeric alignment, a read is split into multiple - linear alignments that map to different reference contigs. The first - linear alignment in the read will be designated as the representative alignment; - the remaining linear alignments will be designated as supplementary alignments. - These alignments may have different mapping quality scores. - - In each linear alignment in a chimeric alignment, the read will be hard clipped. - The `alignedSequence` and `alignedQuality` fields in the alignment record will - only represent the bases for its respective linear alignment. - :type supplementaryAlignment: null|boolean - :field alignedSequence: - The bases of the read sequence contained in this alignment record (equivalent - to SEQ in SAM). - - `alignedSequence` and `alignedQuality` may be shorter than the full read sequence - and quality. This will occur if the alignment is part of a chimeric alignment, - or if the read was trimmed. When this occurs, the CIGAR for this read will - begin/end with a hard clip operator that will indicate the length of the - excised sequence. - :type alignedSequence: null|string - :field alignedQuality: - The quality of the read sequence contained in this alignment record - (equivalent to QUAL in SAM). - - `alignedSequence` and `alignedQuality` may be shorter than the full read sequence - and quality. This will occur if the alignment is part of a chimeric alignment, - or if the read was trimmed. When this occurs, the CIGAR for this read will - begin/end with a hard clip operator that will indicate the length of the excised sequence. 
- :type alignedQuality: array - :field nextMatePosition: - The mapping of the primary alignment of the `(readNumber+1)%numberReads` - read in the fragment. It replaces mate position and mate strand in SAM. - :type nextMatePosition: null|Position - :field info: - A map of additional read alignment information. - :type info: map> - - Each read alignment describes an alignment with additional information - about the fragment and the read. A read alignment object is equivalent to a - line in a SAM file. - -.. avro:record:: SearchReadsRequest - - :field readGroupIds: - The ReadGroups to search. At least one id must be specified. - :type readGroupIds: array - :field referenceId: - The reference to query. Leaving blank returns results from all - references, including unmapped reads - this could be very large. - :type referenceId: null|string - :field start: - The start position (0-based) of this query. - If a reference is specified, this defaults to 0. - Genomic positions are non-negative integers less than reference length. - Requests spanning the join of circular genomes are represented as - two requests one on each side of the join (position 0). - :type start: null|long - :field end: - The end position (0-based, exclusive) of this query. - If a reference is specified, this defaults to the - reference's length. - :type end: null|long - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /reads/search` as JSON. - - If a reference is specified, all queried `ReadGroup`s must be aligned - to `ReferenceSet`s containing that same `Reference`. 
If no reference is - specified, all queried `ReadGroup`s must be aligned to the same `ReferenceSet`. - -.. avro:record:: SearchReadsResponse - - :field alignments: - The list of matching alignment records, sorted by position. - Unmapped reads, which have no position, are returned last. - :type alignments: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /reads/search` expressed as JSON. - -.. avro:record:: SearchReadGroupSetsRequest - - :field datasetId: - The dataset to search. - :type datasetId: string - :field name: - Only return read group sets with this name (case-sensitive, exact match). - :type name: null|string - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /readgroupsets/search` as JSON. - - TODO: Factor this out to a common API patterns section. - - If searching by a resource ID, and that resource is not found, the method - will return a `404` HTTP status code (`NOT_FOUND`). - - If searching by other attributes, e.g. `name`, and no matches are found, the - method will return a `200` HTTP status code (`OK`) with an empty result list. - -.. avro:record:: SearchReadGroupSetsResponse - - :field readGroupSets: - The list of matching read group sets. - :type readGroupSets: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. 
- Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /readgroupsets/search` expressed as JSON. - diff --git a/doc/source/schemas/reads.rst b/doc/source/schemas/reads.rst deleted file mode 100644 index 4d674e5b..00000000 --- a/doc/source/schemas/reads.rst +++ /dev/null @@ -1,433 +0,0 @@ -Reads -***** - -This file defines the objects used to represent a reads and alignments, most importantly -ReadGroupSet, ReadGroup, and ReadAlignment. -See {TODO: LINK TO READS OVERVIEW} for more information. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. 
avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). 
This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. avro:record:: OntologyTerm - - :field id: - Ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - It differs from the standard GA4GH schema's :ref:`id ` - in that it is a URI pointing to an information resource outside of the scope - of the schema or its resource implementation. - :type id: string - :field term: - Ontology term - the representation the id is pointing to. - :type term: null|string - :field sourceName: - Ontology source name - the name of ontology from which the term is obtained - e.g. 
'Human Phenotype Ontology' - :type sourceName: null|string - :field sourceVersion: - Ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type sourceVersion: null|string - - An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) - -.. avro:record:: Experiment - - :field id: - The experiment UUID. This is globally unique. - :type id: string - :field name: - The name of the experiment. - :type name: null|string - :field description: - A description of the experiment. - :type description: null|string - :field recordCreateTime: - The time at which this record was created. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordCreateTime: string - :field recordUpdateTime: - The time at which this record was last updated. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordUpdateTime: string - :field runTime: - The time at which this experiment was performed. - Granularity here is variable (e.g. date only). - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) - :type runTime: null|string - :field molecule: - The molecule examined in this experiment. (e.g. genomics DNA, total RNA) - :type molecule: null|string - :field strategy: - The experiment technique or strategy applied to the sample. - (e.g. whole genome sequencing, RNA-seq, RIP-seq) - :type strategy: null|string - :field selection: - The method used to enrich the target. (e.g. immunoprecipitation, size - fractionation, MNase digestion) - :type selection: null|string - :field library: - The name of the library used as part of this experiment. - :type library: null|string - :field libraryLayout: - The configuration of sequenced reads. (e.g. 
Single or Paired) - :type libraryLayout: null|string - :field instrumentModel: - The instrument model used as part of this experiment. - This maps to sequencing technology in BAM. - :type instrumentModel: null|string - :field instrumentDataFile: - The data file generated by the instrument. - TODO: This isn't actually a file is it? - Should this be `instrumentData` instead? - :type instrumentDataFile: null|string - :field sequencingCenter: - The sequencing center used as part of this experiment. - :type sequencingCenter: null|string - :field platformUnit: - The platform unit used as part of this experiment. This is a flowcell-barcode - or slide unique identifier. - :type platformUnit: null|string - :field info: - A map of additional experiment information. - :type info: map> - - An experimental preparation of a sample. - -.. avro:record:: Dataset - - :field id: - The dataset's id, locally unique to the server instance. - :type id: string - :field name: - The name of the dataset. - :type name: null|string - :field description: - Additional, human-readable information on the dataset. - :type description: null|string - - A Dataset is a collection of related data of multiple types. - Data providers decide how to group data into datasets. - See [Metadata API](../api/metadata.html) for a more detailed discussion. - -.. avro:record:: Program - - :field commandLine: - The command line used to run this program. - :type commandLine: null|string - :field id: - The user specified ID of the program. - :type id: null|string - :field name: - The name of the program. - :type name: null|string - :field prevProgramId: - The ID of the program run before this one. - :type prevProgramId: null|string - :field version: - The version of the program run. - :type version: null|string - - Program can be used to track the provenance of how read data was generated. - -.. avro:record:: ReadStats - - :field alignedReadCount: - The number of aligned reads. 
- :type alignedReadCount: null|long - :field unalignedReadCount: - The number of unaligned reads. - :type unalignedReadCount: null|long - :field baseCount: - The total number of bases. - This is equivalent to the sum of `alignedSequence.length` for all reads. - :type baseCount: null|long - - ReadStats can be used to provide summary statistics about read data. - -.. avro:record:: ReadGroup - - :field id: - The read group ID. - :type id: string - :field datasetId: - The ID of the dataset this read group belongs to. - :type datasetId: null|string - :field name: - The read group name. - :type name: null|string - :field description: - The read group description. - :type description: null|string - :field sampleId: - The sample this read group's data was generated from. - Note: the current API does not have a rigorous definition of sample. Therefore, this - field actually contains an arbitrary string, typically corresponding to the SM tag in a - BAM file. - :type sampleId: null|string - :field experiment: - The experiment used to generate this read group. - :type experiment: null|Experiment - :field predictedInsertSize: - The predicted insert size of this read group. - :type predictedInsertSize: null|int - :field created: - The time at which this read group was created in milliseconds from the epoch. - :type created: null|long - :field updated: - The time at which this read group was last updated in milliseconds - from the epoch. - :type updated: null|long - :field stats: - Statistical data on reads in this read group. - :type stats: null|ReadStats - :field programs: - The programs used to generate this read group. - :type programs: array - :field referenceSetId: - The ID of the reference set to which the reads in this read group are aligned. - Required if there are any read alignments. - :type referenceSetId: null|string - :field info: - A map of additional read group information. 
- :type info: map> - - A ReadGroup is a set of reads derived from one physical sequencing process. - -.. avro:record:: ReadGroupSet - - :field id: - The read group set ID. - :type id: string - :field datasetId: - The ID of the dataset this read group set belongs to. - :type datasetId: null|string - :field name: - The read group set name. - :type name: null|string - :field stats: - Statistical data on reads in this read group set. - :type stats: null|ReadStats - :field readGroups: - The read groups in this set. - :type readGroups: array - - A ReadGroupSet is a logical collection of ReadGroups. Typically one ReadGroupSet - represents all the reads from one experimental sample. - -.. avro:record:: LinearAlignment - - :field position: - The position of this alignment. - :type position: Position - :field mappingQuality: - The mapping quality of this alignment, meaning the likelihood that the read - maps to this position. - - Specifically, this is -10 log10 Pr(mapping position is wrong), rounded to the - nearest integer. - :type mappingQuality: null|int - :field cigar: - Represents the local alignment of this sequence (alignment matches, indels, etc) - versus the reference. - :type cigar: array - - A linear alignment describes the alignment of a read to a Reference, using a - position and CIGAR array. - -.. avro:record:: ReadAlignment - - :field id: - The read alignment ID. This ID is unique within the read group this - alignment belongs to. - - For performance reasons, this field may be omitted by a backend. - If provided, its intended use is to make caching and UI display easier for - genome browsers and other lightweight clients. - :type id: null|string - :field readGroupId: - The ID of the read group this read belongs to. - (Every read must belong to exactly one read group.) - :type readGroupId: string - :field fragmentName: - The fragment name. Equivalent to QNAME (query template name) in SAM. 
- :type fragmentName: string - :field properPlacement: - The orientation and the distance between reads from the fragment are - consistent with the sequencing protocol (equivalent to SAM flag 0x2) - :type properPlacement: null|boolean - :field duplicateFragment: - The fragment is a PCR or optical duplicate (SAM flag 0x400). - :type duplicateFragment: null|boolean - :field numberReads: - The number of reads in the fragment (extension to SAM flag 0x1) - :type numberReads: null|int - :field fragmentLength: - The observed length of the fragment, equivalent to TLEN in SAM. - :type fragmentLength: null|int - :field readNumber: - The read ordinal in the fragment, 0-based and less than numberReads. This - field replaces SAM flag 0x40 and 0x80 and is intended to more cleanly - represent multiple reads per fragment. - :type readNumber: null|int - :field failedVendorQualityChecks: - The read fails platform or vendor quality checks (SAM flag 0x200). - :type failedVendorQualityChecks: null|boolean - :field alignment: - The alignment for this alignment record. This field will be null if the read - is unmapped. - :type alignment: null|LinearAlignment - :field secondaryAlignment: - Whether this alignment is secondary. Equivalent to SAM flag 0x100. - A secondary alignment represents an alternative to the primary alignment - for this read. Aligners may return secondary alignments if a read can map - ambiguously to multiple coordinates in the genome. - - By convention, each read has one and only one alignment where both - secondaryAlignment and supplementaryAlignment are false. - :type secondaryAlignment: null|boolean - :field supplementaryAlignment: - Whether this alignment is supplementary. Equivalent to SAM flag 0x800. - Supplementary alignments are used in the representation of a chimeric - alignment. In a chimeric alignment, a read is split into multiple - linear alignments that map to different reference contigs. 
The first - linear alignment in the read will be designated as the representative alignment; - the remaining linear alignments will be designated as supplementary alignments. - These alignments may have different mapping quality scores. - - In each linear alignment in a chimeric alignment, the read will be hard clipped. - The `alignedSequence` and `alignedQuality` fields in the alignment record will - only represent the bases for its respective linear alignment. - :type supplementaryAlignment: null|boolean - :field alignedSequence: - The bases of the read sequence contained in this alignment record (equivalent - to SEQ in SAM). - - `alignedSequence` and `alignedQuality` may be shorter than the full read sequence - and quality. This will occur if the alignment is part of a chimeric alignment, - or if the read was trimmed. When this occurs, the CIGAR for this read will - begin/end with a hard clip operator that will indicate the length of the - excised sequence. - :type alignedSequence: null|string - :field alignedQuality: - The quality of the read sequence contained in this alignment record - (equivalent to QUAL in SAM). - - `alignedSequence` and `alignedQuality` may be shorter than the full read sequence - and quality. This will occur if the alignment is part of a chimeric alignment, - or if the read was trimmed. When this occurs, the CIGAR for this read will - begin/end with a hard clip operator that will indicate the length of the excised sequence. - :type alignedQuality: array - :field nextMatePosition: - The mapping of the primary alignment of the `(readNumber+1)%numberReads` - read in the fragment. It replaces mate position and mate strand in SAM. - :type nextMatePosition: null|Position - :field info: - A map of additional read alignment information. - :type info: map> - - Each read alignment describes an alignment with additional information - about the fragment and the read. A read alignment object is equivalent to a - line in a SAM file. 
- diff --git a/doc/source/schemas/referencemethods.rst b/doc/source/schemas/referencemethods.rst deleted file mode 100644 index e287d4df..00000000 --- a/doc/source/schemas/referencemethods.rst +++ /dev/null @@ -1,379 +0,0 @@ -ReferenceMethods -**************** - - .. function:: getReferenceSet(id) - - :param id: string: The ID of the `ReferenceSet`. - :return type: org.ga4gh.models.ReferenceSet - :throws: GAException - -Gets a `ReferenceSet` by ID. -`GET /referencesets/{id}` will return a JSON version of `ReferenceSet`. - - .. function:: getReference(id) - - :param id: string: The ID of the `Reference`. - :return type: org.ga4gh.models.Reference - :throws: GAException - -Gets a `Reference` by ID. -`GET /references/{id}` will return a JSON version of `Reference`. - - .. function:: searchReferences(request) - - :param request: SearchReferencesRequest: This request maps to the body of `POST /references/search` - as JSON. - :return type: SearchReferencesResponse - :throws: GAException - -Gets a list of `Reference` matching the search criteria. - -`POST /references/search` must accept a JSON version of -`SearchReferencesRequest` as the post body and will return a JSON -version of `SearchReferencesResponse`. - - .. function:: getReferenceBases(id, request) - - :param id: string: The ID of the `Reference`. - :param request: ListReferenceBasesRequest: Additional request parameters to restrict the query. - :return type: ListReferenceBasesResponse - :throws: GAException - -Lists `Reference` bases by ID and optional range. -`GET /references/{id}/bases` will return a JSON version of -`ListReferenceBasesResponse`. - - .. function:: searchReferenceSets(request) - - :param request: SearchReferenceSetsRequest: This request maps to the body of `POST /referencesets/search` - as JSON. - :return type: SearchReferenceSetsResponse - :throws: GAException - -Gets a list of `ReferenceSet` matching the search criteria. 
- -`POST /referencesets/search` must accept a JSON version of -`SearchReferenceSetsRequest` as the post body and will return a JSON -version of `SearchReferenceSetsResponse`. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. 
This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. 
This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. avro:error:: GAException - - A general exception type. - -.. avro:record:: Reference - - :field id: - The reference ID. Unique within the repository. - :type id: string - :field length: - The length of this reference's sequence. - :type length: long - :field md5checksum: - The MD5 checksum uniquely representing this `Reference` as a lower-case - hexadecimal string, calculated as the MD5 of the upper-case sequence - excluding all whitespace characters (this is equivalent to SQ:M5 in SAM). - :type md5checksum: string - :field name: - The name of this reference. (e.g. '22'). - :type name: string - :field sourceURI: - The URI from which the sequence was obtained. Specifies a FASTA format - file/string with one name, sequence pair. In most cases, clients should call - the `getReferenceBases()` method to obtain sequence bases for a `Reference` - instead of attempting to retrieve this URI. - :type sourceURI: null|string - :field sourceAccessions: - All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) which must include - a version number, e.g. `GCF_000001405.26`. - :type sourceAccessions: array - :field isDerived: - A sequence X is said to be derived from source sequence Y, if X and Y - are of the same length and the per-base sequence divergence at A/C/G/T bases - is sufficiently small. 
Two sequences derived from the same official - sequence share the same coordinates and annotations, and - can be replaced with the official sequence for certain use cases. - :type isDerived: boolean - :field sourceDivergence: - The `sourceDivergence` is the fraction of non-indel bases that do not match the - reference this record was derived from. - :type sourceDivergence: null|float - :field ncbiTaxonId: - ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human). - :type ncbiTaxonId: null|int - - A `Reference` is a canonical assembled contig, intended to act as a - reference coordinate space for other genomic annotations. A single - `Reference` might represent the human chromosome 1, for instance. - - `Reference`s are designed to be immutable. - -.. avro:record:: ReferenceSet - - :field id: - The reference set ID. Unique in the repository. - :type id: string - :field name: - The reference set name. - :type name: null|string - :field md5checksum: - Order-independent MD5 checksum which identifies this `ReferenceSet`. - - To compute this checksum, make a list of `Reference.md5checksum` for all - `Reference`s in this set. Then sort that list, and take the MD5 hash of - all the strings concatenated together. Express the hash as a lower-case - hexadecimal string. - :type md5checksum: string - :field ncbiTaxonId: - ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human) indicating - the species which this assembly is intended to model. Note that contained - `Reference`s may specify a different `ncbiTaxonId`, as assemblies may - contain reference sequences which do not belong to the modeled species, e.g. - EBV in a human reference genome. - :type ncbiTaxonId: null|int - :field description: - Optional free text description of this reference set. - :type description: null|string - :field assemblyId: - Public id of this reference set, such as `GRCh37`. - :type assemblyId: null|string - :field sourceURI: - Specifies a FASTA format file/string. 
- :type sourceURI: null|string - :field sourceAccessions: - All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) ideally - with a version number, e.g. `NC_000001.11`. - :type sourceAccessions: array - :field isDerived: - A reference set may be derived from a source if it contains - additional sequences, or some of the sequences within it are derived - (see the definition of `isDerived` in `Reference`). - :type isDerived: boolean - - A `ReferenceSet` is a set of `Reference`s which typically comprise a - reference assembly, such as `GRCh38`. A `ReferenceSet` defines a common - coordinate space for comparing reference-aligned experimental data. - -.. avro:record:: SearchReferenceSetsRequest - - :field md5checksum: - If not null, return the reference sets for which the - `md5checksum` matches this string (case-sensitive, exact match). - See `ReferenceSet::md5checksum` for details. - :type md5checksum: null|string - :field accession: - If not null, return the reference sets for which the `accession` - matches this string (case-sensitive, exact match). - :type accession: null|string - :field assemblyId: - If not null, return the reference sets for which the `assemblyId` - matches this string (case-sensitive, exact match). - :type assemblyId: null|string - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /referencesets/search` - as JSON. - -.. avro:record:: SearchReferenceSetsResponse - - :field referenceSets: - The list of matching reference sets. 
- :type referenceSets: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /referencesets/search` - expressed as JSON. - -.. avro:record:: SearchReferencesRequest - - :field referenceSetId: - The `ReferenceSet` to search. - :type referenceSetId: string - :field md5checksum: - If not null, return the references for which the - `md5checksum` matches this string (case-sensitive, exact match). - See `ReferenceSet::md5checksum` for details. - :type md5checksum: null|string - :field accession: - If not null, return the references for which the `accession` - matches this string (case-sensitive, exact match). - :type accession: null|string - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /references/search` - as JSON. - -.. avro:record:: SearchReferencesResponse - - :field references: - The list of matching references. - :type references: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /references/search` expressed as JSON. - -.. avro:record:: ListReferenceBasesRequest - - :field start: - The start position (0-based) of this query. 
Defaults to 0. - Genomic positions are non-negative integers less than reference length. - Requests spanning the join of circular genomes are represented as - two requests one on each side of the join (position 0). - :type start: long - :field end: - The end position (0-based, exclusive) of this query. Defaults - to the length of this `Reference`. - :type end: null|long - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - The query parameters for a request to `GET /references/{id}/bases`, for - example: - - `GET /references/{id}/bases?start=100&end=200` - -.. avro:record:: ListReferenceBasesResponse - - :field offset: - The offset position (0-based) of the given sequence from the start of this - `Reference`. This value will differ for each page in a paginated request. - :type offset: long - :field sequence: - A substring of the bases that make up this reference. Bases are represented - as IUPAC-IUB codes; this string matches the regexp `[ACGTMRWSYKVHDBN]*`. - :type sequence: string - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - The response from `GET /references/{id}/bases` expressed as JSON. - diff --git a/doc/source/schemas/references.rst b/doc/source/schemas/references.rst deleted file mode 100644 index 9473784d..00000000 --- a/doc/source/schemas/references.rst +++ /dev/null @@ -1,199 +0,0 @@ -References -********** - -Defines types used by the GA4GH References API. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. 
- * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. 
This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. 
- :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. avro:record:: Reference - - :field id: - The reference ID. Unique within the repository. - :type id: string - :field length: - The length of this reference's sequence. - :type length: long - :field md5checksum: - The MD5 checksum uniquely representing this `Reference` as a lower-case - hexadecimal string, calculated as the MD5 of the upper-case sequence - excluding all whitespace characters (this is equivalent to SQ:M5 in SAM). - :type md5checksum: string - :field name: - The name of this reference. (e.g. '22'). - :type name: string - :field sourceURI: - The URI from which the sequence was obtained. Specifies a FASTA format - file/string with one name, sequence pair. In most cases, clients should call - the `getReferenceBases()` method to obtain sequence bases for a `Reference` - instead of attempting to retrieve this URI. - :type sourceURI: null|string - :field sourceAccessions: - All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) which must include - a version number, e.g. `GCF_000001405.26`. - :type sourceAccessions: array - :field isDerived: - A sequence X is said to be derived from source sequence Y, if X and Y - are of the same length and the per-base sequence divergence at A/C/G/T bases - is sufficiently small. Two sequences derived from the same official - sequence share the same coordinates and annotations, and - can be replaced with the official sequence for certain use cases. 
- :type isDerived: boolean - :field sourceDivergence: - The `sourceDivergence` is the fraction of non-indel bases that do not match the - reference this record was derived from. - :type sourceDivergence: null|float - :field ncbiTaxonId: - ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human). - :type ncbiTaxonId: null|int - - A `Reference` is a canonical assembled contig, intended to act as a - reference coordinate space for other genomic annotations. A single - `Reference` might represent the human chromosome 1, for instance. - - `Reference`s are designed to be immutable. - -.. avro:record:: ReferenceSet - - :field id: - The reference set ID. Unique in the repository. - :type id: string - :field name: - The reference set name. - :type name: null|string - :field md5checksum: - Order-independent MD5 checksum which identifies this `ReferenceSet`. - - To compute this checksum, make a list of `Reference.md5checksum` for all - `Reference`s in this set. Then sort that list, and take the MD5 hash of - all the strings concatenated together. Express the hash as a lower-case - hexadecimal string. - :type md5checksum: string - :field ncbiTaxonId: - ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human) indicating - the species which this assembly is intended to model. Note that contained - `Reference`s may specify a different `ncbiTaxonId`, as assemblies may - contain reference sequences which do not belong to the modeled species, e.g. - EBV in a human reference genome. - :type ncbiTaxonId: null|int - :field description: - Optional free text description of this reference set. - :type description: null|string - :field assemblyId: - Public id of this reference set, such as `GRCh37`. - :type assemblyId: null|string - :field sourceURI: - Specifies a FASTA format file/string. - :type sourceURI: null|string - :field sourceAccessions: - All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) ideally - with a version number, e.g. `NC_000001.11`. 
- :type sourceAccessions: array - :field isDerived: - A reference set may be derived from a source if it contains - additional sequences, or some of the sequences within it are derived - (see the definition of `isDerived` in `Reference`). - :type isDerived: boolean - - A `ReferenceSet` is a set of `Reference`s which typically comprise a - reference assembly, such as `GRCh38`. A `ReferenceSet` defines a common - coordinate space for comparing reference-aligned experimental data. - diff --git a/doc/source/schemas/sequenceAnnotationmethods.rst b/doc/source/schemas/sequenceAnnotationmethods.rst deleted file mode 100644 index ab0d18e1..00000000 --- a/doc/source/schemas/sequenceAnnotationmethods.rst +++ /dev/null @@ -1,370 +0,0 @@ -SequenceAnnotationMethods -************************* - - .. function:: getFeature(id) - - :param id: string: The ID of the `Feature`. - :return type: org.ga4gh.models.Feature - :throws: GAException - -Gets a `org.ga4gh.models.Feature` by ID. - `GET /features/{id}` will return a JSON version of `Feature`. - - .. function:: searchFeatures(request) - - :param request: SearchFeaturesRequest: This request maps to the body of `POST /features/search` as JSON. - :return type: SearchFeaturesResponse - :throws: GAException - -Gets a list of `Feature` matching the search criteria. - - `POST /features/search` must accept a JSON version of - `SearchFeaturesRequest` as the post body and will return a JSON version of - `SearchFeaturesResponse`. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associate for some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The postive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. 
- Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. 
This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. 
- :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. avro:error:: GAException - - A general exception type. - -.. avro:record:: OntologyTerm - - :field id: - Ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - It differs from the standard GA4GH schema's :ref:`id ` - in that it is a URI pointing to an information resource outside of the scope - of the schema or its resource implementation. - :type id: string - :field term: - Ontology term - the representation the id is pointing to. - :type term: null|string - :field sourceName: - Ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type sourceName: null|string - :field sourceVersion: - Ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. - There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type sourceVersion: null|string - - An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) - -.. avro:record:: Experiment - - :field id: - The experiment UUID. This is globally unique. - :type id: string - :field name: - The name of the experiment. - :type name: null|string - :field description: - A description of the experiment. - :type description: null|string - :field recordCreateTime: - The time at which this record was created. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordCreateTime: string - :field recordUpdateTime: - The time at which this record was last updated. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordUpdateTime: string - :field runTime: - The time at which this experiment was performed. 
- Granularity here is variable (e.g. date only). - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) - :type runTime: null|string - :field molecule: - The molecule examined in this experiment. (e.g. genomics DNA, total RNA) - :type molecule: null|string - :field strategy: - The experiment technique or strategy applied to the sample. - (e.g. whole genome sequencing, RNA-seq, RIP-seq) - :type strategy: null|string - :field selection: - The method used to enrich the target. (e.g. immunoprecipitation, size - fractionation, MNase digestion) - :type selection: null|string - :field library: - The name of the library used as part of this experiment. - :type library: null|string - :field libraryLayout: - The configuration of sequenced reads. (e.g. Single or Paired) - :type libraryLayout: null|string - :field instrumentModel: - The instrument model used as part of this experiment. - This maps to sequencing technology in BAM. - :type instrumentModel: null|string - :field instrumentDataFile: - The data file generated by the instrument. - TODO: This isn't actually a file is it? - Should this be `instrumentData` instead? - :type instrumentDataFile: null|string - :field sequencingCenter: - The sequencing center used as part of this experiment. - :type sequencingCenter: null|string - :field platformUnit: - The platform unit used as part of this experiment. This is a flowcell-barcode - or slide unique identifier. - :type platformUnit: null|string - :field info: - A map of additional experiment information. - :type info: map> - - An experimental preparation of a sample. - -.. avro:record:: Dataset - - :field id: - The dataset's id, locally unique to the server instance. - :type id: string - :field name: - The name of the dataset. - :type name: null|string - :field description: - Additional, human-readable information on the dataset. - :type description: null|string - - A Dataset is a collection of related data of multiple types. 
- Data providers decide how to group data into datasets. - See [Metadata API](../api/metadata.html) for a more detailed discussion. - -.. avro:record:: Attributes - - :field vals: - :type vals: map> - - Type defining a collection of attributes associated with various protocol - records. Each attribute is a name that maps to an array of one or more - values. Values can be strings, external identifiers, or ontology terms. - Values should be split into the array elements instead of using a separator - syntax that needs to parsed. - -.. avro:record:: Feature - - :field id: - Id of this annotation node. - :type id: string - :field parentIds: - Ids of the parents of this annotation node. - :type parentIds: array - :field featureSetId: - Identifier for the containing feature set. - :type featureSetId: string - :field referenceName: - The reference on which this feature occurs. - (e.g. `chr20` or `X`) - :type referenceName: string - :field start: - The start position at which this feature occurs (0-based). - This corresponds to the first base of the string of reference bases. - Genomic positions are non-negative integers less than reference length. - Features spanning the join of circular genomes are represented as - two features one on each side of the join (position 0). - :type start: long - :field end: - The end position (exclusive), resulting in [start, end) closed-open interval. - This is typically calculated by `start + referenceBases.length`. - :type end: long - :field featureType: - Feature that is annotated by this region. Normally, this will be a term in - the Sequence Ontology. - :type featureType: OntologyTerm - :field attributes: - Name/value attributes of the annotation. Attribute names follow the GFF3 - naming convention of reserved names starting with an upper cases - character, and user-define names start with lower-case. Most GFF3 - pre-defined attributes apply, the exceptions are ID and Parent, which are - defined as fields. 
Additional, the following attributes are added: - * Score - the GFF3 score column - * Phase - the GFF3 phase column for CDS features. - :type attributes: Attributes - - Node in the annotation graph that annotates a contiguous region of a - sequence. - -.. avro:record:: FeatureSet - - :field id: - The ID of this annotation set. - :type id: string - :field datasetId: - The ID of the dataset this annotation set belongs to. - :type datasetId: null|string - :field referenceSetId: - The ID of the reference set which defines the coordinate-space for this - set of annotations. - :type referenceSetId: null|string - :field name: - The display name for this annotation set. - :type name: null|string - :field sourceURI: - The source URI describing the file from which this annotation set was - generated, if any. - :type sourceURI: null|string - :field attributes: - Set of additional attributes - :type attributes: Attributes - -.. avro:record:: SearchFeaturesRequest - - :field featureSetId: - The annotation set to search within. Either `featureSetId` or - `parentId` must be non-empty. - :type featureSetId: null|string - :field parentId: - Restricts the search to direct children of the given parent `feature` - ID. Either `featureSetId` or `parentId` must be non-empty. - :type parentId: null|string - :field referenceName: - Only return features with on the reference with this name. One of this - field or `referenceId` is required. (case-sensitive, exact match) - :type referenceName: null|string - :field referenceId: - Only return feature on the reference with this ID. One of this field or - `referenceName` is required. - :type referenceId: null|string - :field start: - Required. The beginning of the window (0-based, inclusive) for which - overlapping features should be returned. Genomic positions are - non-negative integers less than reference length. Requests spanning the - join of circular genomes are represented as two requests one on each side - of the join (position 0). 
- :type start: long - :field end: - Required. The end of the window (0-based, exclusive) for which overlapping - features should be returned. - :type end: long - :field features: - If specified, this query matches only annotations which match one of the - provided feature types. - :type features: array - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /features/search` as JSON. - -.. avro:record:: SearchFeaturesResponse - - :field features: - The list of matching annotations, sorted by start position. Annotations which - share a start position are returned in a deterministic order. - :type features: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /features/search` expressed as JSON. - diff --git a/doc/source/schemas/sequenceAnnotations.rst b/doc/source/schemas/sequenceAnnotations.rst deleted file mode 100644 index 5813ba5a..00000000 --- a/doc/source/schemas/sequenceAnnotations.rst +++ /dev/null @@ -1,302 +0,0 @@ -SequenceAnnotations -******************* - -This protocol defines annotations on GA4GH genomic sequences. It includes two -types of annotations: continuous and discrete hierarchical.
- -The discrete hierarchical annotations are derived from the Sequence Ontology -(SO) and GFF3 work - - http://www.sequenceontology.org/gff3.shtml - -The goal is to be able to store annotations using the GFF3 and SO conceptual -model, although there is not necessarily a one-to-one mapping in Avro records -to GFF3 records. - -The minimum requirement is to be able to accurately represent the current -state of the art annotation data and the full SO model. Feature is the -core generic record which corresponds to a GFF3 record. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associated with some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The positive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used.
The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. 
- * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. avro:record:: OntologyTerm - - :field id: - Ontology source identifier - the identifier, a CURIE (preferred) or - PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo - It differs from the standard GA4GH schema's :ref:`id ` - in that it is a URI pointing to an information resource outside of the scope - of the schema or its resource implementation. - :type id: string - :field term: - Ontology term - the representation the id is pointing to. - :type term: null|string - :field sourceName: - Ontology source name - the name of ontology from which the term is obtained - e.g. 'Human Phenotype Ontology' - :type sourceName: null|string - :field sourceVersion: - Ontology source version - the version of the ontology from which the - OntologyTerm is obtained; e.g. 2.6.1. 
- There is no standard for ontology versioning and some frequently - released ontologies may use a datestamp, or build number. - :type sourceVersion: null|string - - An ontology term describing an attribute. (e.g. the phenotype attribute - 'polydactyly' from HPO) - -.. avro:record:: Experiment - - :field id: - The experiment UUID. This is globally unique. - :type id: string - :field name: - The name of the experiment. - :type name: null|string - :field description: - A description of the experiment. - :type description: null|string - :field recordCreateTime: - The time at which this record was created. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordCreateTime: string - :field recordUpdateTime: - The time at which this record was last updated. - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) - :type recordUpdateTime: string - :field runTime: - The time at which this experiment was performed. - Granularity here is variable (e.g. date only). - Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) - :type runTime: null|string - :field molecule: - The molecule examined in this experiment. (e.g. genomic DNA, total RNA) - :type molecule: null|string - :field strategy: - The experiment technique or strategy applied to the sample. - (e.g. whole genome sequencing, RNA-seq, RIP-seq) - :type strategy: null|string - :field selection: - The method used to enrich the target. (e.g. immunoprecipitation, size - fractionation, MNase digestion) - :type selection: null|string - :field library: - The name of the library used as part of this experiment. - :type library: null|string - :field libraryLayout: - The configuration of sequenced reads. (e.g. Single or Paired) - :type libraryLayout: null|string - :field instrumentModel: - The instrument model used as part of this experiment. - This maps to sequencing technology in BAM.
- :type instrumentModel: null|string - :field instrumentDataFile: - The data file generated by the instrument. - TODO: This isn't actually a file is it? - Should this be `instrumentData` instead? - :type instrumentDataFile: null|string - :field sequencingCenter: - The sequencing center used as part of this experiment. - :type sequencingCenter: null|string - :field platformUnit: - The platform unit used as part of this experiment. This is a flowcell-barcode - or slide unique identifier. - :type platformUnit: null|string - :field info: - A map of additional experiment information. - :type info: map> - - An experimental preparation of a sample. - -.. avro:record:: Dataset - - :field id: - The dataset's id, locally unique to the server instance. - :type id: string - :field name: - The name of the dataset. - :type name: null|string - :field description: - Additional, human-readable information on the dataset. - :type description: null|string - - A Dataset is a collection of related data of multiple types. - Data providers decide how to group data into datasets. - See [Metadata API](../api/metadata.html) for a more detailed discussion. - -.. avro:record:: Attributes - - :field vals: - :type vals: map> - - Type defining a collection of attributes associated with various protocol - records. Each attribute is a name that maps to an array of one or more - values. Values can be strings, external identifiers, or ontology terms. - Values should be split into the array elements instead of using a separator - syntax that needs to be parsed. - -.. avro:record:: Feature - - :field id: - Id of this annotation node. - :type id: string - :field parentIds: - Ids of the parents of this annotation node. - :type parentIds: array - :field featureSetId: - Identifier for the containing feature set. - :type featureSetId: string - :field referenceName: - The reference on which this feature occurs. - (e.g.
`chr20` or `X`) - :type referenceName: string - :field start: - The start position at which this feature occurs (0-based). - This corresponds to the first base of the string of reference bases. - Genomic positions are non-negative integers less than reference length. - Features spanning the join of circular genomes are represented as - two features, one on each side of the join (position 0). - :type start: long - :field end: - The end position (exclusive), resulting in [start, end) closed-open interval. - This is typically calculated by `start + referenceBases.length`. - :type end: long - :field featureType: - Feature that is annotated by this region. Normally, this will be a term in - the Sequence Ontology. - :type featureType: OntologyTerm - :field attributes: - Name/value attributes of the annotation. Attribute names follow the GFF3 - naming convention of reserved names starting with an upper-case - character, and user-defined names start with lower-case. Most GFF3 - pre-defined attributes apply; the exceptions are ID and Parent, which are - defined as fields. Additionally, the following attributes are added: - * Score - the GFF3 score column - * Phase - the GFF3 phase column for CDS features. - :type attributes: Attributes - - Node in the annotation graph that annotates a contiguous region of a - sequence. - -.. avro:record:: FeatureSet - - :field id: - The ID of this annotation set. - :type id: string - :field datasetId: - The ID of the dataset this annotation set belongs to. - :type datasetId: null|string - :field referenceSetId: - The ID of the reference set which defines the coordinate-space for this - set of annotations. - :type referenceSetId: null|string - :field name: - The display name for this annotation set. - :type name: null|string - :field sourceURI: - The source URI describing the file from which this annotation set was - generated, if any.
- :type sourceURI: null|string - :field attributes: - Set of additional attributes - :type attributes: Attributes - diff --git a/doc/source/schemas/variantmethods.rst b/doc/source/schemas/variantmethods.rst deleted file mode 100644 index a3641a85..00000000 --- a/doc/source/schemas/variantmethods.rst +++ /dev/null @@ -1,475 +0,0 @@ -VariantMethods -************** - - .. function:: searchVariants(request) - - :param request: SearchVariantsRequest: This request maps to the body of `POST /variants/search` as JSON. - :return type: SearchVariantsResponse - :throws: GAException - -Gets a list of `Variant` matching the search criteria. - -`POST /variants/search` must accept a JSON version of `SearchVariantsRequest` -as the post body and will return a JSON version of `SearchVariantsResponse`. - - .. function:: getCallSet(id) - - :param id: string: The ID of the `CallSet`. - :return type: org.ga4gh.models.CallSet - :throws: GAException - -Gets a `CallSet` by ID. -`GET /callsets/{id}` will return a JSON version of `CallSet`. - - .. function:: searchVariantSets(request) - - :param request: SearchVariantSetsRequest: This request maps to the body of `POST /variantsets/search` as JSON. - :return type: SearchVariantSetsResponse - :throws: GAException - -Gets a list of `VariantSet` matching the search criteria. - -`POST /variantsets/search` must accept a JSON version of -`SearchVariantSetsRequest` as the post body and will return a JSON version -of `SearchVariantSetsResponse`. - - .. function:: getVariantSet(id) - - :param id: string: The ID of the `VariantSet`. - :return type: org.ga4gh.models.VariantSet - :throws: GAException - -Gets a `VariantSet` by ID. -`GET /variantsets/{id}` will return a JSON version of `VariantSet`. - - .. function:: getVariant(id) - - :param id: string: The ID of the `Variant`. - :return type: org.ga4gh.models.Variant - :throws: GAException - -Gets a `Variant` by ID. -`GET /variants/{id}` will return a JSON version of `Variant`. - - .. 
function:: searchCallSets(request) - - :param request: SearchCallSetsRequest: This request maps to the body of `POST /callsets/search` as JSON. - :return type: SearchCallSetsResponse - :throws: GAException - -Gets a list of `CallSet` matching the search criteria. - -`POST /callsets/search` must accept a JSON version of `SearchCallSetsRequest` -as the post body and will return a JSON version of `SearchCallSetsResponse`. - -.. avro:error:: GAException - - A general exception type. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associated with some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The positive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with. - :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used.
The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. - * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. 
- * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. avro:record:: VariantSetMetadata - - :field key: - The top-level key. - :type key: string - :field value: - The value field for simple metadata. - :type value: string - :field id: - User-provided ID field, not enforced by this API. - Two or more pieces of structured metadata with identical - id and key fields are considered equivalent. - `FIXME: If it's not enforced, then why can't it be null?` - :type id: string - :field type: - The type of data. - :type type: string - :field number: - The number of values that can be included in a field described by this - metadata. - :type number: string - :field description: - A textual description of this metadata. - :type description: string - :field info: - Remaining structured metadata key-value pairs. - :type info: map> - - Optional metadata associated with a variant set. - -.. 
avro:record:: VariantSet - - :field id: - The variant set ID. - :type id: string - :field name: - The variant set name. - :type name: null|string - :field datasetId: - The ID of the dataset this variant set belongs to. - :type datasetId: string - :field referenceSetId: - The ID of the reference set that describes the sequences used by the variants in this set. - :type referenceSetId: string - :field metadata: - Optional metadata associated with this variant set. - This array can be used to store information about the variant set, such as information found - in VCF header fields, that isn't already available in first class fields such as "name". - :type metadata: array - - A VariantSet is a collection of variants and variant calls intended to be analyzed together. - -.. avro:record:: CallSet - - :field id: - The call set ID. - :type id: string - :field name: - The call set name. - :type name: null|string - :field sampleId: - The sample this call set's data was generated from. - Note: the current API does not have a rigorous definition of sample. Therefore, this - field actually contains an arbitrary string, typically corresponding to the sampleId - field in the read groups used to generate this call set. - :type sampleId: null|string - :field variantSetIds: - The IDs of the variant sets this call set has calls in. - :type variantSetIds: array - :field created: - The date this call set was created in milliseconds from the epoch. - :type created: null|long - :field updated: - The time at which this call set was last updated in - milliseconds from the epoch. - :type updated: null|long - :field info: - A map of additional call set information. - :type info: map> - - A CallSet is a collection of calls that were generated by the same analysis of the same sample. - -.. avro:record:: Call - - :field callSetName: - The name of the call set this variant call belongs to. 
- If this field is not present, the ordering of the call sets from a - `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match - the ordering of the calls on this `Variant`. - The number of results will also be the same. - :type callSetName: null|string - :field callSetId: - The ID of the call set this variant call belongs to. - - If this field is not present, the ordering of the call sets from a - `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match - the ordering of the calls on this `Variant`. - The number of results will also be the same. - :type callSetId: null|string - :field genotype: - The genotype of this variant call. - - A 0 value represents the reference allele of the associated `Variant`. Any - other value is a 1-based index into the alternate alleles of the associated - `Variant`. - - If a variant had a referenceBases field of "T", an alternateBases - value of ["A", "C"], and the genotype was [2, 1], that would mean the call - represented the heterozygous value "CA" for this variant. If the genotype - was instead [0, 1] the represented value would be "TA". Ordering of the - genotype values is important if the phaseset field is present. - :type genotype: array - :field phaseset: - If this field is not null, this variant call's genotype ordering implies - the phase of the bases and is consistent with any other variant calls on - the same contig which have the same phaseset string. - :type phaseset: null|string - :field genotypeLikelihood: - The genotype likelihoods for this variant call. Each array entry - represents how likely a specific genotype is for this call as - log10(P(data | genotype)), analogous to the GL tag in the VCF spec. The - value ordering is defined by the GL tag in the VCF spec. - :type genotypeLikelihood: array - :field info: - A map of additional variant call information. - :type info: map> - - A `Call` represents the determination of genotype with respect to a - particular `Variant`. 
- - It may include associated information such as quality - and phasing. For example, a call might assign a probability of 0.32 to - the occurrence of a SNP named rs1234 in a call set with the name NA12345. - -.. avro:record:: Variant - - :field id: - The variant ID. - :type id: string - :field variantSetId: - The ID of the `VariantSet` this variant belongs to. This transitively defines - the `ReferenceSet` against which the `Variant` is to be interpreted. - :type variantSetId: string - :field names: - Names for the variant, for example a RefSNP ID. - :type names: array - :field created: - The date this variant was created in milliseconds from the epoch. - :type created: null|long - :field updated: - The time at which this variant was last updated in - milliseconds from the epoch. - :type updated: null|long - :field referenceName: - The reference on which this variant occurs. - (e.g. `chr20` or `X`) - :type referenceName: string - :field start: - The start position at which this variant occurs (0-based). - This corresponds to the first base of the string of reference bases. - Genomic positions are non-negative integers less than reference length. - Variants spanning the join of circular genomes are represented as - two variants one on each side of the join (position 0). - :type start: long - :field end: - The end position (exclusive), resulting in [start, end) closed-open interval. - This is typically calculated by `start + referenceBases.length`. - :type end: long - :field referenceBases: - The reference bases for this variant. They start at the given start position. - :type referenceBases: string - :field alternateBases: - The bases that appear instead of the reference bases. Multiple alternate - alleles are possible. - :type alternateBases: array - :field info: - A map of additional variant information. - :type info: map> - :field calls: - The variant calls for this particular variant. 
Each one represents the - determination of genotype with respect to this variant. `Call`s in this array - are implicitly associated with this `Variant`. - :type calls: array - - A `Variant` represents a change in DNA sequence relative to some reference. - For example, a variant could represent a SNP or an insertion. - Variants belong to a `VariantSet`. - This is equivalent to a row in VCF. - -.. avro:record:: SearchVariantSetsRequest - - :field datasetId: - The `Dataset` to search. - :type datasetId: string - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /variantsets/search` as JSON. - -.. avro:record:: SearchVariantSetsResponse - - :field variantSets: - The list of matching variant sets. - :type variantSets: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /variantsets/search` expressed as JSON. - -.. avro:record:: SearchVariantsRequest - - :field variantSetId: - The `VariantSet` to search. - :type variantSetId: string - :field callSetIds: - Only return variant calls which belong to call sets with these IDs. - If an empty array, returns variants without any call objects. - If null, returns all variant calls. - :type callSetIds: null|array - :field referenceName: - Required. Only return variants on this reference. - :type referenceName: string - :field start: - Required. 
The beginning of the window (0-based, inclusive) for - which overlapping variants should be returned. - Genomic positions are non-negative integers less than reference length. - Requests spanning the join of circular genomes are represented as - two requests one on each side of the join (position 0). - :type start: long - :field end: - Required. The end of the window (0-based, exclusive) for which overlapping - variants should be returned. - :type end: long - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /variants/search` as JSON. - -.. avro:record:: SearchVariantsResponse - - :field variants: - The list of matching variants. - If the `callSetId` field on the returned calls is not present, - the ordering of the call sets from a `SearchCallSetsRequest` - over the parent `VariantSet` is guaranteed to match the ordering - of the calls on each `Variant`. The number of results will also be - the same. - :type variants: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /variants/search` expressed as JSON. - -.. avro:record:: SearchCallSetsRequest - - :field variantSetId: - The VariantSet to search. - :type variantSetId: string - :field name: - Only return call sets with this name (case-sensitive, exact match). 
- :type name: null|string - :field pageSize: - Specifies the maximum number of results to return in a single page. - If unspecified, a system default will be used. - :type pageSize: null|int - :field pageToken: - The continuation token, which is used to page through large result sets. - To get the next page of results, set this parameter to the value of - `nextPageToken` from the previous response. - :type pageToken: null|string - - This request maps to the body of `POST /callsets/search` as JSON. - -.. avro:record:: SearchCallSetsResponse - - :field callSets: - The list of matching call sets. - :type callSets: array - :field nextPageToken: - The continuation token, which is used to page through large result sets. - Provide this value in a subsequent request to return the next page of - results. This field will be empty if there aren't any additional results. - :type nextPageToken: null|string - - This is the response from `POST /callsets/search` expressed as JSON. - diff --git a/doc/source/schemas/variants.rst b/doc/source/schemas/variants.rst deleted file mode 100644 index f51ccd9c..00000000 --- a/doc/source/schemas/variants.rst +++ /dev/null @@ -1,297 +0,0 @@ -Variants -******** - -This file defines the objects used to represent variant calls, most importantly -VariantSet, Variant, and Call. -See {TODO: LINK TO VARIANTS OVERVIEW} for more information. - -.. avro:enum:: Strand - - :symbols: NEG_STRAND|POS_STRAND - Indicates the DNA strand associated with some data item. - * `NEG_STRAND`: The negative (-) strand. - * `POS_STRAND`: The positive (+) strand. - -.. avro:record:: Position - - :field referenceName: - The name of the `Reference` on which the `Position` is located. - :type referenceName: string - :field position: - The 0-based offset from the start of the forward strand for that `Reference`. - Genomic positions are non-negative integers less than `Reference` length. - :type position: long - :field strand: - Strand the position is associated with.
- :type strand: Strand - - A `Position` is an unoriented base in some `Reference`. A `Position` is - represented by a `Reference` name, and a base number on that `Reference` - (0-based). - -.. avro:record:: ExternalIdentifier - - :field database: - The source of the identifier. - (e.g. `Ensembl`) - :type database: string - :field identifier: - The ID defined by the external database. - (e.g. `ENST00000000000`) - :type identifier: string - :field version: - The version of the object or the database - (e.g. `78`) - :type version: string - - Identifier from a public database - -.. avro:enum:: CigarOperation - - :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH - An enum for the different types of CIGAR alignment operations that exist. - Used wherever CIGAR alignments are used. The different enumerated values - have the following usage: - - * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be - aligned to the reference without evidence of an INDEL. Unlike the - `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` - operator does not indicate whether the reference and read sequences are an - exact match. This operator is equivalent to SAM's `M`. - * `INSERT`: The insert operator indicates that the read contains evidence of - bases being inserted into the reference. This operator is equivalent to - SAM's `I`. - * `DELETE`: The delete operator indicates that the read contains evidence of - bases being deleted from the reference. This operator is equivalent to - SAM's `D`. - * `SKIP`: The skip operator indicates that this read skips a long segment of - the reference, but the bases have not been deleted. This operator is - commonly used when working with RNA-seq data, where reads may skip long - segments of the reference between exons. This operator is equivalent to - SAM's 'N'. 
- * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end - of a read have not been considered during alignment. This may occur if the - majority of a read maps, except for low quality bases at the start/end of - a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped - will still be stored in the read. - * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of - a read have been omitted from this alignment. This may occur if this linear - alignment is part of a chimeric alignment, or if the read has been trimmed - (e.g., during error correction, or to trim poly-A tails for RNA-seq). This - operator is equivalent to SAM's 'H'. - * `PAD`: The pad operator indicates that there is padding in an alignment. - This operator is equivalent to SAM's 'P'. - * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned - sequence exactly matches the reference (e.g., all bases are equal to the - reference bases). This operator is equivalent to SAM's '='. - * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the - aligned sequence is an alignment match to the reference, but a sequence - mismatch (e.g., the bases are not equal to the reference). This can - indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. - -.. avro:record:: CigarUnit - - :field operation: - The operation type. - :type operation: CigarOperation - :field operationLength: - The number of bases that the operation runs for. - :type operationLength: long - :field referenceSequence: - `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) - and deletions (`DELETE`). Filling this field replaces the MD tag. - If the relevant information is not available, leave this field as `null`. - :type referenceSequence: null|string - - A structure for an instance of a CIGAR operation. - `FIXME: This belongs under Reads (only readAlignment refers to this)` - -.. 
avro:record:: VariantSetMetadata - - :field key: - The top-level key. - :type key: string - :field value: - The value field for simple metadata. - :type value: string - :field id: - User-provided ID field, not enforced by this API. - Two or more pieces of structured metadata with identical - id and key fields are considered equivalent. - `FIXME: If it's not enforced, then why can't it be null?` - :type id: string - :field type: - The type of data. - :type type: string - :field number: - The number of values that can be included in a field described by this - metadata. - :type number: string - :field description: - A textual description of this metadata. - :type description: string - :field info: - Remaining structured metadata key-value pairs. - :type info: map> - - Optional metadata associated with a variant set. - -.. avro:record:: VariantSet - - :field id: - The variant set ID. - :type id: string - :field name: - The variant set name. - :type name: null|string - :field datasetId: - The ID of the dataset this variant set belongs to. - :type datasetId: string - :field referenceSetId: - The ID of the reference set that describes the sequences used by the variants in this set. - :type referenceSetId: string - :field metadata: - Optional metadata associated with this variant set. - This array can be used to store information about the variant set, such as information found - in VCF header fields, that isn't already available in first class fields such as "name". - :type metadata: array - - A VariantSet is a collection of variants and variant calls intended to be analyzed together. - -.. avro:record:: CallSet - - :field id: - The call set ID. - :type id: string - :field name: - The call set name. - :type name: null|string - :field sampleId: - The sample this call set's data was generated from. - Note: the current API does not have a rigorous definition of sample. 
Therefore, this - field actually contains an arbitrary string, typically corresponding to the sampleId - field in the read groups used to generate this call set. - :type sampleId: null|string - :field variantSetIds: - The IDs of the variant sets this call set has calls in. - :type variantSetIds: array - :field created: - The date this call set was created in milliseconds from the epoch. - :type created: null|long - :field updated: - The time at which this call set was last updated in - milliseconds from the epoch. - :type updated: null|long - :field info: - A map of additional call set information. - :type info: map> - - A CallSet is a collection of calls that were generated by the same analysis of the same sample. - -.. avro:record:: Call - - :field callSetName: - The name of the call set this variant call belongs to. - If this field is not present, the ordering of the call sets from a - `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match - the ordering of the calls on this `Variant`. - The number of results will also be the same. - :type callSetName: null|string - :field callSetId: - The ID of the call set this variant call belongs to. - - If this field is not present, the ordering of the call sets from a - `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match - the ordering of the calls on this `Variant`. - The number of results will also be the same. - :type callSetId: null|string - :field genotype: - The genotype of this variant call. - - A 0 value represents the reference allele of the associated `Variant`. Any - other value is a 1-based index into the alternate alleles of the associated - `Variant`. - - If a variant had a referenceBases field of "T", an alternateBases - value of ["A", "C"], and the genotype was [2, 1], that would mean the call - represented the heterozygous value "CA" for this variant. If the genotype - was instead [0, 1] the represented value would be "TA". 
Ordering of the - genotype values is important if the phaseset field is present. - :type genotype: array - :field phaseset: - If this field is not null, this variant call's genotype ordering implies - the phase of the bases and is consistent with any other variant calls on - the same contig which have the same phaseset string. - :type phaseset: null|string - :field genotypeLikelihood: - The genotype likelihoods for this variant call. Each array entry - represents how likely a specific genotype is for this call as - log10(P(data | genotype)), analogous to the GL tag in the VCF spec. The - value ordering is defined by the GL tag in the VCF spec. - :type genotypeLikelihood: array - :field info: - A map of additional variant call information. - :type info: map> - - A `Call` represents the determination of genotype with respect to a - particular `Variant`. - - It may include associated information such as quality - and phasing. For example, a call might assign a probability of 0.32 to - the occurrence of a SNP named rs1234 in a call set with the name NA12345. - -.. avro:record:: Variant - - :field id: - The variant ID. - :type id: string - :field variantSetId: - The ID of the `VariantSet` this variant belongs to. This transitively defines - the `ReferenceSet` against which the `Variant` is to be interpreted. - :type variantSetId: string - :field names: - Names for the variant, for example a RefSNP ID. - :type names: array - :field created: - The date this variant was created in milliseconds from the epoch. - :type created: null|long - :field updated: - The time at which this variant was last updated in - milliseconds from the epoch. - :type updated: null|long - :field referenceName: - The reference on which this variant occurs. - (e.g. `chr20` or `X`) - :type referenceName: string - :field start: - The start position at which this variant occurs (0-based). - This corresponds to the first base of the string of reference bases. 
- Genomic positions are non-negative integers less than reference length. - Variants spanning the join of circular genomes are represented as - two variants one on each side of the join (position 0). - :type start: long - :field end: - The end position (exclusive), resulting in [start, end) closed-open interval. - This is typically calculated by `start + referenceBases.length`. - :type end: long - :field referenceBases: - The reference bases for this variant. They start at the given start position. - :type referenceBases: string - :field alternateBases: - The bases that appear instead of the reference bases. Multiple alternate - alleles are possible. - :type alternateBases: array - :field info: - A map of additional variant information. - :type info: map> - :field calls: - The variant calls for this particular variant. Each one represents the - determination of genotype with respect to this variant. `Call`s in this array - are implicitly associated with this `Variant`. - :type calls: array - - A `Variant` represents a change in DNA sequence relative to some reference. - For example, a variant could represent a SNP or an insertion. - Variants belong to a `VariantSet`. - This is equivalent to a row in VCF. - diff --git a/src/main/resources/avro/sequenceAnnotationmethods.avdl b/src/main/resources/avro/sequenceAnnotationmethods.avdl index f96911cc..148f3ded 100644 --- a/src/main/resources/avro/sequenceAnnotationmethods.avdl +++ b/src/main/resources/avro/sequenceAnnotationmethods.avdl @@ -6,6 +6,65 @@ protocol SequenceAnnotationMethods { import idl "methods.avdl"; import idl "sequenceAnnotations.avdl"; + /****************** /featuresets/search *********************/ + /** This request maps to the body of `POST /featuresets/search` as JSON. */ + record SearchFeatureSetsRequest { + /** + The `Dataset` to search. + */ + string datasetId; + + /** + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. 
+ */ + union { null, int } pageSize = null; + + /** + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + */ + union { null, string } pageToken = null; + } + + /** This is the response from `POST /featuresets/search` expressed as JSON. */ + record SearchFeatureSetsResponse { + /** The list of matching feature sets. */ + array featureSets = []; + + /** + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + */ + union { null, string } nextPageToken = null; + } + + /** + Gets a list of `FeatureSet` matching the search criteria. + + `POST /featuresets/search` must accept a JSON version of + `SearchFeatureSetsRequest` as the post body and will return a JSON version + of `SearchFeatureSetsResponse`. + */ + SearchFeatureSetsResponse searchFeatureSets( + /** This request maps to the body of `POST /featuresets/search` as JSON. */ + SearchFeatureSetsRequest request) throws GAException; + + /**************** /featuresets/{id} *******************/ + /** + Gets a `FeatureSet` by ID. + `GET /featuresets/{id}` will return a JSON version of `FeatureSet`. + */ + org.ga4gh.models.FeatureSet getFeatureSet( + /** + The ID of the `FeatureSet`. + */ + string id) throws GAException; + + + /****************** /features/search *****************/ /** This request maps to the body of `POST /features/search` as JSON. */ @@ -23,14 +82,7 @@ protocol SequenceAnnotationMethods { union { null, string } parentId; /** - Only return features with on the reference with this name. One of this - field or `referenceId` is required. (case-sensitive, exact match) - */ - union { null, string } referenceName = null; - - /** - Only return feature on the reference with this ID. 
One of this field or - `referenceName` is required. + Only return features on the reference with this ID. */ union { null, string } referenceId = null; @@ -49,13 +101,13 @@ protocol SequenceAnnotationMethods { */ long end; - // TODO: Fix this field. Clarify semantics around how OntologyTerm - // matching works, or change the Ontology term field on the feature. + // TODO: To be replaced with a fully featured ontology search + // once the Metadata definitions are rounded out. /** If specified, this query matches only annotations which match one of the - provided feature types. + provided ontology terms. */ - array features = []; + array ontologyTerms = []; /** Specifies the maximum number of results to return in a single page. @@ -111,5 +163,3 @@ protocol SequenceAnnotationMethods { string id) throws GAException; } - -// TODO: AnnotationSet methods. diff --git a/src/main/resources/avro/sequenceAnnotations.avdl b/src/main/resources/avro/sequenceAnnotations.avdl index d5cb2cf0..1d2a1499 100644 --- a/src/main/resources/avro/sequenceAnnotations.avdl +++ b/src/main/resources/avro/sequenceAnnotations.avdl @@ -75,6 +75,11 @@ protocol SequenceAnnotations { */ long end; + /** + The strand on which the feature is present. + */ + Strand strand; + /** Feature that is annotated by this region. Normally, this will be a term in the Sequence Ontology. From 118c9dde038c86a04ed34791851dd74b6a1e6a4e Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Mon, 1 Feb 2016 11:43:39 -0800 Subject: [PATCH 04/13] Introduced changes addressing Mark and Sarah's feedback. 
--- doc/source/api/sequence_annotations.rst | 30 +++++++++---------- .../avro/sequenceAnnotationmethods.avdl | 2 +- .../resources/avro/sequenceAnnotations.avdl | 12 ++++++-- 3 files changed, 26 insertions(+), 18 deletions(-) diff --git a/doc/source/api/sequence_annotations.rst b/doc/source/api/sequence_annotations.rst index 327b170a..9000c9bd 100644 --- a/doc/source/api/sequence_annotations.rst +++ b/doc/source/api/sequence_annotations.rst @@ -9,28 +9,28 @@ For the Sequence Annotation schema definitions, see `Sequence Annotation schema ------------------------ Feature Based Hierarchy ------------------------ -The central object of the GA4GH Sequence Annotation API is a Feature. The Feature describes an interval of interest on some reference(s). It has a span from a start position to a stop position as well as descriptive data. A Feature has one or more parent Features which enables the construction of more complex representations in a hierarchical way. +The central object of the GA4GH Sequence Annotation API is a Feature. The Feature describes an interval of interest on some reference(s). It has a span from a start position to a stop position as well as descriptive data. A Feature has one parent Feature, and can have an ordered array of child Features, which enables the construction of more complex representations in a hierarchical way. -For example, a top level Feature may be a single Gene. The different transcripts would have the gene Feature as parent. Similarly, the specific exons for each transcript would have both gene and transcript as parent. This structure can also exend to annotating CDS, binding sites or any other sub-gene level features. +For example, a single gene Feature may be parent to several different transcript Features. The specific exons for each transcript would have that transcript Feature as parent. 
The same physical exon may occur as part of two different transcript Features, but in our notation, it would be +encoded as two separate exon Features, each with a different parent, both occupying the same genomic coordinates. This structure can also extend to annotating CDS, binding sites or any other sub-gene level features. -This model is very similar to that used by `GFF3`_. + +------------------------------ +The Sequence Annotation Schema +------------------------------ + +This model is similar to that used by the standard `GFF3`_ file format. .. _GFF3: http://sequenceontology.org/resources/gff3.html -A FeatureSet is simply a collection of features from the same source. An implementer may, for example, choose to gather -all Features from the same GFF3 file into a common FeatureSet. +The main differences concern the deprecation and replacement of discontinuous features, the replacing +of multi-parent features with multiple copies of that feature, and the ability to impose an explicit order on child features. ---------------------------- -The Sequence Annotation Schema ---------------------------- -TODO: insert an example annotation translation from GFF3 to GA4GH +In the first case, a CDS composed of multiple regions is sometimes encoded as multiple rows of a GFF3 file, each with the same feature ID. This is translated in our hierarchy into a single CDS Feature with an ordered set of CDS_region Feature children, each corresponding to a single row of the original record. ---------------------------------------- -Annotation Design - RNA Considerations ---------------------------------------- +In the second case, as explained above, features with multiple parents in a GFF3 record are simply replicated and assigned a new identifier as many times as needed to ensure a unique parent for every feature. -Read data derived from RNA samples can differ from genomic read data due to the presence of non-genomic sequences.
An example would be a read that spans a splice junction. It describes a contiguous sequence of reads, but a dis-continuous genomic region due to the missing intron. Feature level read assignment is further complicated by the existence of multiple splice isoforms. A read that can be definitely assigned to a particular feature (an exon in this case) may still not be definitely assigned to a particular transcript if multiple transcript share that exon. The annotation API needs to be able to report assignment at the feature level as well as aggregate assignment at the transcript or even the whole gene level if assignment is not more specific than that. +In the final case, an explicit mechanism is provided for ordering child Features. Most of the time this ordering is trivially derived from the genomic coordinate ordering of the children, but in some biologically important cases this order can differ, such as in non-canonical splicing of exons into transcripts (also known as back splicing - see below). -Splicing (other post-transcriptional modifications?) can occur with degrees of complexity. A ‘typical’ splice will result in a mature transcript with exon in positional (numerical) order in a head-to-tail orientation. Back splicing (tail-to-head) can result in transcripts with the exon order reversed (1-3-2-4 instead of 1-2-3-4) and even circular RNA. The exon order in a transcript as well as the orientation of the splice should be discoverable via the API. In a more general case, the API should allow child features to have an ordered relationship. +A FeatureSet is simply a collection of features from the same source. An implementer may, for example, choose to gather all Features from the same GFF3 file into a common FeatureSet. -The annotation API needs to also be flexible enough to handle multiple references in the same gene or transcript. This is needed to cover the cases of fusion genes or inter-chromosomal translocations.
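Reviewer note: the multi-parent rule described in this file (a GFF3 feature with several parents is replicated, with a new identifier per copy, so that every Feature has exactly one parent) can be illustrated with a minimal Python sketch. The record layout, the function name `split_multi_parent`, and the `id@parent` naming scheme are all assumptions made for illustration; the patch does not prescribe how new identifiers are generated.

```python
from typing import Dict, List


def split_multi_parent(records: List[Dict]) -> List[Dict]:
    """Replicate multi-parent records so each output has a single parentId.

    Each input record is a dict with 'id', 'parents' (list of GFF3 Parent
    IDs), 'start' and 'end'. The 'id@parent' naming is a hypothetical
    scheme, not something defined by the schema.
    """
    features = []
    for rec in records:
        parents = rec["parents"]
        if len(parents) <= 1:
            # Zero or one parent: keep the ID; an empty string means "no parent".
            features.append({
                "id": rec["id"],
                "parentId": parents[0] if parents else "",
                "start": rec["start"],
                "end": rec["end"],
            })
            continue
        for parent in parents:
            # One copy per parent, same coordinates, new identifier.
            features.append({
                "id": "{}@{}".format(rec["id"], parent),
                "parentId": parent,
                "start": rec["start"],
                "end": rec["end"],
            })
    return features


gff3_records = [
    {"id": "gene1", "parents": [], "start": 0, "end": 1000},
    {"id": "tx1", "parents": ["gene1"], "start": 0, "end": 1000},
    {"id": "tx2", "parents": ["gene1"], "start": 0, "end": 800},
    {"id": "exon1", "parents": ["tx1", "tx2"], "start": 100, "end": 200},
]
converted = split_multi_parent(gff3_records)
```

In this sketch the shared exon becomes two Features with identical coordinates and different parents, which matches the description of the same physical exon being encoded once per transcript.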
diff --git a/src/main/resources/avro/sequenceAnnotationmethods.avdl b/src/main/resources/avro/sequenceAnnotationmethods.avdl index 148f3ded..7efb2d0e 100644 --- a/src/main/resources/avro/sequenceAnnotationmethods.avdl +++ b/src/main/resources/avro/sequenceAnnotationmethods.avdl @@ -84,7 +84,7 @@ protocol SequenceAnnotationMethods { /** Only return features on the reference with this ID. */ - union { null, string } referenceId = null; + string referenceId; /** Required. The beginning of the window (0-based, inclusive) for which diff --git a/src/main/resources/avro/sequenceAnnotations.avdl b/src/main/resources/avro/sequenceAnnotations.avdl index 1d2a1499..53eb1560 100644 --- a/src/main/resources/avro/sequenceAnnotations.avdl +++ b/src/main/resources/avro/sequenceAnnotations.avdl @@ -45,9 +45,17 @@ protocol SequenceAnnotations { string id; /** - Ids of the parents of this annotation node. + Parent Id of this node. Set to empty string if node has no parent. */ - array parentIds; + string parentId; + + /** + Ordered array of Child Ids of this node. + Since not all child nodes are ordered by genomic coordinates, + this can't always be reconstructed from parentId's of the children alone. + */ + + array childIds = []; /** Identifier for the containing feature set. From 18629463828cf72977fef4c535158b7543cb1101 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Mon, 1 Feb 2016 15:16:02 -0800 Subject: [PATCH 05/13] Re-added RNA considerations to sequence_annotations.rst --- doc/source/api/sequence_annotations.rst | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/doc/source/api/sequence_annotations.rst b/doc/source/api/sequence_annotations.rst index 9000c9bd..a784b54f 100644 --- a/doc/source/api/sequence_annotations.rst +++ b/doc/source/api/sequence_annotations.rst @@ -34,3 +34,13 @@ In the final case, an explicit mechanism is provided for ordering child Features A FeatureSet is simply a collection of features from the same source. 
An implementer may, for example, choose to gather all Features from the same GFF3 file into a common FeatureSet. + +-------------------------------------- +Annotation Design - RNA Considerations +-------------------------------------- + +Read data derived from RNA samples can differ from genomic read data due to the presence of non-genomic sequences. An example would be a read that spans a splice junction. It describes a contiguous sequence of reads, but a discontinuous genomic region due to the missing intron. Feature level read assignment is further complicated by the existence of multiple splice isoforms. A read that can be definitely assigned to a particular feature (an exon in this case) may still not be definitely assigned to a particular transcript if multiple transcripts share that exon. The annotation API needs to be able to report assignment at the feature level as well as aggregate assignment at the transcript or even the whole gene level if assignment is not more specific than that. + +Splicing (other post-transcriptional modifications?) can occur with varying degrees of complexity. A ‘typical’ splice will result in a mature transcript with exons in positional (numerical) order in a head-to-tail orientation. Back splicing (tail-to-head) can result in transcripts with the exon order reversed (1-3-2-4 instead of 1-2-3-4) and even circular RNA. The exon order in a transcript as well as the orientation of the splice should be discoverable via the API. In a more general case, the API should allow child features to have an ordered relationship. + +The annotation API also needs to be flexible enough to handle multiple references in the same gene or transcript. This is needed to cover the cases of fusion genes or inter-chromosomal translocations.
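Reviewer note: the ordered-children requirement in the RNA considerations above (exon order 1-3-2-4 after back splicing) is what the `childIds` array added earlier in this series is meant to capture. The following Python sketch shows one way a client might use it. The `Feature` class here is a hypothetical, cut-down stand-in for the Avro record (only `id`, `parentId`, `childIds`, `start` and `end` are modelled), and both helper functions are assumptions for illustration, not part of the API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """Hypothetical stand-in for the Feature record; most fields omitted."""
    id: str
    parentId: str   # empty string when the node has no parent
    start: int      # 0-based, inclusive
    end: int        # 0-based, exclusive
    childIds: List[str] = field(default_factory=list)


def children_in_order(parent: Feature, by_id: Dict[str, Feature]) -> List[Feature]:
    """Return the children in the explicit order stored on the parent."""
    return [by_id[child_id] for child_id in parent.childIds]


def is_coordinate_ordered(parent: Feature, by_id: Dict[str, Feature]) -> bool:
    """True when the stored child order matches genomic coordinate order."""
    starts = [child.start for child in children_in_order(parent, by_id)]
    return starts == sorted(starts)


# Four exons laid out left to right on the reference.
exons = [Feature("exon%d" % i, "tx_back", 100 * i, 100 * i + 50) for i in range(1, 5)]
# A back-spliced transcript: exon order 1-3-2-4 rather than 1-2-3-4.
tx_back = Feature("tx_back", "gene1", 100, 450,
                  childIds=["exon1", "exon3", "exon2", "exon4"])
by_id = {f.id: f for f in exons + [tx_back]}

spliced_order = [child.id for child in children_in_order(tx_back, by_id)]
```

Sorting the exons by `start` alone would give 1-2-3-4 and lose the splice order, which is why the order cannot always be reconstructed from the children's `parentId` values and has to be stored on the parent.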
From bf7a01d1eef2799f27618ad8b65a57de57b0d109 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Fri, 12 Feb 2016 09:57:22 -0800 Subject: [PATCH 06/13] Removed attributes field in FeatureSet in favor of a more generic key-value info map. --- src/main/resources/avro/sequenceAnnotations.avdl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/main/resources/avro/sequenceAnnotations.avdl b/src/main/resources/avro/sequenceAnnotations.avdl index 53eb1560..adeb12b4 100644 --- a/src/main/resources/avro/sequenceAnnotations.avdl +++ b/src/main/resources/avro/sequenceAnnotations.avdl @@ -131,7 +131,7 @@ protocol SequenceAnnotations { */ union { null, string } sourceURI = null; - /** Set of additional attributes */ - Attributes attributes; + /** Remaining structured metadata key-value pairs. */ + map> info = {}; } } From 5ccbe68a478a97e07b647d06097b14d390f7f5ab Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Fri, 12 Feb 2016 15:01:24 -0800 Subject: [PATCH 07/13] referenceId -> referenceName in featureSearch --- src/main/resources/avro/sequenceAnnotationmethods.avdl | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/main/resources/avro/sequenceAnnotationmethods.avdl b/src/main/resources/avro/sequenceAnnotationmethods.avdl index 7efb2d0e..8b690c7e 100644 --- a/src/main/resources/avro/sequenceAnnotationmethods.avdl +++ b/src/main/resources/avro/sequenceAnnotationmethods.avdl @@ -82,9 +82,10 @@ protocol SequenceAnnotationMethods { union { null, string } parentId; /** - Only return features on the reference with this ID. + Only return features on the reference with this name + (matched to literal reference name as imported from the GFF3). */ - string referenceId; + string referenceName; /** Required. 
The beginning of the window (0-based, inclusive) for which From 63fe2029aa9e63fbeef1fb159f982fb134cf8a96 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Wed, 9 Mar 2016 17:09:28 -0800 Subject: [PATCH 08/13] finished merge from master to sequence annotations. --- .gitignore | 5 +- doc/source/schemas/common.rst | 107 ++++++ doc/source/schemas/metadatamethods.rst | 237 ++++++++++++ doc/source/schemas/methods.rst | 7 + doc/source/schemas/referencemethods.rst | 379 +++++++++++++++++++ doc/source/schemas/references.rst | 199 ++++++++++ doc/source/schemas/variantmethods.rst | 475 ++++++++++++++++++++++++ doc/source/schemas/variants.rst | 297 +++++++++++++++ 8 files changed, 1702 insertions(+), 4 deletions(-) create mode 100644 doc/source/schemas/common.rst create mode 100644 doc/source/schemas/metadatamethods.rst create mode 100644 doc/source/schemas/methods.rst create mode 100644 doc/source/schemas/referencemethods.rst create mode 100644 doc/source/schemas/references.rst create mode 100644 doc/source/schemas/variantmethods.rst create mode 100644 doc/source/schemas/variants.rst diff --git a/.gitignore b/.gitignore index 0a94a474..fded6d94 100644 --- a/.gitignore +++ b/.gitignore @@ -1,11 +1,8 @@ *.py[cod] target *~ - +#* doc/source/schemas/*.avpr -doc/source/schemas/*.rst -!doc/source/schemas/index.rst - build #********** windows template********** diff --git a/doc/source/schemas/common.rst b/doc/source/schemas/common.rst new file mode 100644 index 00000000..99605783 --- /dev/null +++ b/doc/source/schemas/common.rst @@ -0,0 +1,107 @@ +Common +****** + +This file defines common types used in other parts of the schema. +There are no directly associated methods. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associate for some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The postive (+) strand. + +.. 
avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. `ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. 
+ * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). 
Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + diff --git a/doc/source/schemas/metadatamethods.rst b/doc/source/schemas/metadatamethods.rst new file mode 100644 index 00000000..703c69cb --- /dev/null +++ b/doc/source/schemas/metadatamethods.rst @@ -0,0 +1,237 @@ +MetadataMethods +*************** + + .. function:: searchDatasets(request) + + :param request: SearchDatasetsRequest: This request maps to the body of `POST /datasets/search` as JSON. + :return type: SearchDatasetsResponse + :throws: GAException + +Gets a list of datasets accessible through the API. + +TODO: Reads and variants both want to have datasets. Are they the same object? + +`POST /datasets/search` must accept a JSON version of +`SearchDatasetsRequest` as the post body and will return a JSON version +of `SearchDatasetsResponse`. + + .. function:: getDataset(id) + + :param id: string: The ID of the `Dataset`. + :return type: org.ga4gh.models.Dataset + :throws: GAException + +Gets a `Dataset` by ID. +`GET /datasets/{id}` will return a JSON version of `Dataset`. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associated with some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The positive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`.
A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. `ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. 
This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:record:: Experiment + + :field id: + The experiment UUID. This is globally unique. + :type id: string + :field name: + The name of the experiment. 
+ :type name: null|string + :field description: + A description of the experiment. + :type description: null|string + :field recordCreateTime: + The time at which this record was created. + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) + :type recordCreateTime: string + :field recordUpdateTime: + The time at which this record was last updated. + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS.SSS (e.g. 2015-02-10T00:03:42.123Z) + :type recordUpdateTime: string + :field runTime: + The time at which this experiment was performed. + Granularity here is variable (e.g. date only). + Format: ISO 8601, YYYY-MM-DDTHH:MM:SS (e.g. 2015-02-10T00:03:42) + :type runTime: null|string + :field molecule: + The molecule examined in this experiment. (e.g. genomic DNA, total RNA) + :type molecule: null|string + :field strategy: + The experiment technique or strategy applied to the sample. + (e.g. whole genome sequencing, RNA-seq, RIP-seq) + :type strategy: null|string + :field selection: + The method used to enrich the target. (e.g. immunoprecipitation, size + fractionation, MNase digestion) + :type selection: null|string + :field library: + The name of the library used as part of this experiment. + :type library: null|string + :field libraryLayout: + The configuration of sequenced reads. (e.g. Single or Paired) + :type libraryLayout: null|string + :field instrumentModel: + The instrument model used as part of this experiment. + This maps to sequencing technology in BAM. + :type instrumentModel: null|string + :field instrumentDataFile: + The data file generated by the instrument. + TODO: This isn't actually a file is it? + Should this be `instrumentData` instead? + :type instrumentDataFile: null|string + :field sequencingCenter: + The sequencing center used as part of this experiment. + :type sequencingCenter: null|string + :field platformUnit: + The platform unit used as part of this experiment. This is a flowcell-barcode + or slide unique identifier.
+ :type platformUnit: null|string + :field info: + A map of additional experiment information. + :type info: map> + + An experimental preparation of a sample. + +.. avro:record:: Dataset + + :field id: + The dataset's id, locally unique to the server instance. + :type id: string + :field name: + The name of the dataset. + :type name: null|string + :field description: + Additional, human-readable information on the dataset. + :type description: null|string + + A Dataset is a collection of related data of multiple types. + Data providers decide how to group data into datasets. + See [Metadata API](../api/metadata.html) for a more detailed discussion. + +.. avro:error:: GAException + + A general exception type. + +.. avro:record:: SearchDatasetsRequest + + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /datasets/search` as JSON. + +.. avro:record:: SearchDatasetsResponse + + :field datasets: + The list of datasets. + :type datasets: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /datasets/search` expressed as JSON. + diff --git a/doc/source/schemas/methods.rst b/doc/source/schemas/methods.rst new file mode 100644 index 00000000..e02a4cb3 --- /dev/null +++ b/doc/source/schemas/methods.rst @@ -0,0 +1,7 @@ +RPC +*** + +.. avro:error:: GAException + + A general exception type. 
+ diff --git a/doc/source/schemas/referencemethods.rst b/doc/source/schemas/referencemethods.rst new file mode 100644 index 00000000..e287d4df --- /dev/null +++ b/doc/source/schemas/referencemethods.rst @@ -0,0 +1,379 @@ +ReferenceMethods +**************** + + .. function:: getReferenceSet(id) + + :param id: string: The ID of the `ReferenceSet`. + :return type: org.ga4gh.models.ReferenceSet + :throws: GAException + +Gets a `ReferenceSet` by ID. +`GET /referencesets/{id}` will return a JSON version of `ReferenceSet`. + + .. function:: getReference(id) + + :param id: string: The ID of the `Reference`. + :return type: org.ga4gh.models.Reference + :throws: GAException + +Gets a `Reference` by ID. +`GET /references/{id}` will return a JSON version of `Reference`. + + .. function:: searchReferences(request) + + :param request: SearchReferencesRequest: This request maps to the body of `POST /references/search` + as JSON. + :return type: SearchReferencesResponse + :throws: GAException + +Gets a list of `Reference` matching the search criteria. + +`POST /references/search` must accept a JSON version of +`SearchReferencesRequest` as the post body and will return a JSON +version of `SearchReferencesResponse`. + + .. function:: getReferenceBases(id, request) + + :param id: string: The ID of the `Reference`. + :param request: ListReferenceBasesRequest: Additional request parameters to restrict the query. + :return type: ListReferenceBasesResponse + :throws: GAException + +Lists `Reference` bases by ID and optional range. +`GET /references/{id}/bases` will return a JSON version of +`ListReferenceBasesResponse`. + + .. function:: searchReferenceSets(request) + + :param request: SearchReferenceSetsRequest: This request maps to the body of `POST /referencesets/search` + as JSON. + :return type: SearchReferenceSetsResponse + :throws: GAException + +Gets a list of `ReferenceSet` matching the search criteria. 
+ +`POST /referencesets/search` must accept a JSON version of +`SearchReferenceSetsRequest` as the post body and will return a JSON +version of `SearchReferenceSetsResponse`. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associated with some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The positive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. `ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match.
This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. 
This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:error:: GAException + + A general exception type. + +.. avro:record:: Reference + + :field id: + The reference ID. Unique within the repository. + :type id: string + :field length: + The length of this reference's sequence. + :type length: long + :field md5checksum: + The MD5 checksum uniquely representing this `Reference` as a lower-case + hexadecimal string, calculated as the MD5 of the upper-case sequence + excluding all whitespace characters (this is equivalent to SQ:M5 in SAM). + :type md5checksum: string + :field name: + The name of this reference. (e.g. '22'). + :type name: string + :field sourceURI: + The URI from which the sequence was obtained. Specifies a FASTA format + file/string with one name, sequence pair. In most cases, clients should call + the `getReferenceBases()` method to obtain sequence bases for a `Reference` + instead of attempting to retrieve this URI. + :type sourceURI: null|string + :field sourceAccessions: + All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) which must include + a version number, e.g. `GCF_000001405.26`. + :type sourceAccessions: array + :field isDerived: + A sequence X is said to be derived from source sequence Y, if X and Y + are of the same length and the per-base sequence divergence at A/C/G/T bases + is sufficiently small. 
Two sequences derived from the same official + sequence share the same coordinates and annotations, and + can be replaced with the official sequence for certain use cases. + :type isDerived: boolean + :field sourceDivergence: + The `sourceDivergence` is the fraction of non-indel bases that do not match the + reference this record was derived from. + :type sourceDivergence: null|float + :field ncbiTaxonId: + ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human). + :type ncbiTaxonId: null|int + + A `Reference` is a canonical assembled contig, intended to act as a + reference coordinate space for other genomic annotations. A single + `Reference` might represent the human chromosome 1, for instance. + + `Reference`s are designed to be immutable. + +.. avro:record:: ReferenceSet + + :field id: + The reference set ID. Unique in the repository. + :type id: string + :field name: + The reference set name. + :type name: null|string + :field md5checksum: + Order-independent MD5 checksum which identifies this `ReferenceSet`. + + To compute this checksum, make a list of `Reference.md5checksum` for all + `Reference`s in this set. Then sort that list, and take the MD5 hash of + all the strings concatenated together. Express the hash as a lower-case + hexadecimal string. + :type md5checksum: string + :field ncbiTaxonId: + ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human) indicating + the species which this assembly is intended to model. Note that contained + `Reference`s may specify a different `ncbiTaxonId`, as assemblies may + contain reference sequences which do not belong to the modeled species, e.g. + EBV in a human reference genome. + :type ncbiTaxonId: null|int + :field description: + Optional free text description of this reference set. + :type description: null|string + :field assemblyId: + Public id of this reference set, such as `GRCh37`. + :type assemblyId: null|string + :field sourceURI: + Specifies a FASTA format file/string. 
+ :type sourceURI: null|string + :field sourceAccessions: + All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) ideally + with a version number, e.g. `NC_000001.11`. + :type sourceAccessions: array + :field isDerived: + A reference set may be derived from a source if it contains + additional sequences, or some of the sequences within it are derived + (see the definition of `isDerived` in `Reference`). + :type isDerived: boolean + + A `ReferenceSet` is a set of `Reference`s which typically comprise a + reference assembly, such as `GRCh38`. A `ReferenceSet` defines a common + coordinate space for comparing reference-aligned experimental data. + +.. avro:record:: SearchReferenceSetsRequest + + :field md5checksum: + If not null, return the reference sets for which the + `md5checksum` matches this string (case-sensitive, exact match). + See `ReferenceSet::md5checksum` for details. + :type md5checksum: null|string + :field accession: + If not null, return the reference sets for which the `accession` + matches this string (case-sensitive, exact match). + :type accession: null|string + :field assemblyId: + If not null, return the reference sets for which the `assemblyId` + matches this string (case-sensitive, exact match). + :type assemblyId: null|string + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /referencesets/search` + as JSON. + +.. avro:record:: SearchReferenceSetsResponse + + :field referenceSets: + The list of matching reference sets. 
+ :type referenceSets: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /referencesets/search` + expressed as JSON. + +.. avro:record:: SearchReferencesRequest + + :field referenceSetId: + The `ReferenceSet` to search. + :type referenceSetId: string + :field md5checksum: + If not null, return the references for which the + `md5checksum` matches this string (case-sensitive, exact match). + See `ReferenceSet::md5checksum` for details. + :type md5checksum: null|string + :field accession: + If not null, return the references for which the `accession` + matches this string (case-sensitive, exact match). + :type accession: null|string + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /references/search` + as JSON. + +.. avro:record:: SearchReferencesResponse + + :field references: + The list of matching references. + :type references: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /references/search` expressed as JSON. + +.. avro:record:: ListReferenceBasesRequest + + :field start: + The start position (0-based) of this query. 
Defaults to 0. + Genomic positions are non-negative integers less than reference length. + Requests spanning the join of circular genomes are represented as + two requests one on each side of the join (position 0). + :type start: long + :field end: + The end position (0-based, exclusive) of this query. Defaults + to the length of this `Reference`. + :type end: null|long + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + The query parameters for a request to `GET /references/{id}/bases`, for + example: + + `GET /references/{id}/bases?start=100&end=200` + +.. avro:record:: ListReferenceBasesResponse + + :field offset: + The offset position (0-based) of the given sequence from the start of this + `Reference`. This value will differ for each page in a paginated request. + :type offset: long + :field sequence: + A substring of the bases that make up this reference. Bases are represented + as IUPAC-IUB codes; this string matches the regexp `[ACGTMRWSYKVHDBN]*`. + :type sequence: string + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + The response from `GET /references/{id}/bases` expressed as JSON. + diff --git a/doc/source/schemas/references.rst b/doc/source/schemas/references.rst new file mode 100644 index 00000000..9473784d --- /dev/null +++ b/doc/source/schemas/references.rst @@ -0,0 +1,199 @@ +References +********** + +Defines types used by the GA4GH References API. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associated with some data item. + * `NEG_STRAND`: The negative (-) strand.
+ * `POS_STRAND`: The positive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. `ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference.
This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. 
+ :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:record:: Reference + + :field id: + The reference ID. Unique within the repository. + :type id: string + :field length: + The length of this reference's sequence. + :type length: long + :field md5checksum: + The MD5 checksum uniquely representing this `Reference` as a lower-case + hexadecimal string, calculated as the MD5 of the upper-case sequence + excluding all whitespace characters (this is equivalent to SQ:M5 in SAM). + :type md5checksum: string + :field name: + The name of this reference. (e.g. '22'). + :type name: string + :field sourceURI: + The URI from which the sequence was obtained. Specifies a FASTA format + file/string with one name, sequence pair. In most cases, clients should call + the `getReferenceBases()` method to obtain sequence bases for a `Reference` + instead of attempting to retrieve this URI. + :type sourceURI: null|string + :field sourceAccessions: + All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) which must include + a version number, e.g. `GCF_000001405.26`. + :type sourceAccessions: array + :field isDerived: + A sequence X is said to be derived from source sequence Y, if X and Y + are of the same length and the per-base sequence divergence at A/C/G/T bases + is sufficiently small. Two sequences derived from the same official + sequence share the same coordinates and annotations, and + can be replaced with the official sequence for certain use cases. 
+ :type isDerived: boolean + :field sourceDivergence: + The `sourceDivergence` is the fraction of non-indel bases that do not match the + reference this record was derived from. + :type sourceDivergence: null|float + :field ncbiTaxonId: + ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human). + :type ncbiTaxonId: null|int + + A `Reference` is a canonical assembled contig, intended to act as a + reference coordinate space for other genomic annotations. A single + `Reference` might represent the human chromosome 1, for instance. + + `Reference`s are designed to be immutable. + +.. avro:record:: ReferenceSet + + :field id: + The reference set ID. Unique in the repository. + :type id: string + :field name: + The reference set name. + :type name: null|string + :field md5checksum: + Order-independent MD5 checksum which identifies this `ReferenceSet`. + + To compute this checksum, make a list of `Reference.md5checksum` for all + `Reference`s in this set. Then sort that list, and take the MD5 hash of + all the strings concatenated together. Express the hash as a lower-case + hexadecimal string. + :type md5checksum: string + :field ncbiTaxonId: + ID from http://www.ncbi.nlm.nih.gov/taxonomy (e.g. 9606->human) indicating + the species which this assembly is intended to model. Note that contained + `Reference`s may specify a different `ncbiTaxonId`, as assemblies may + contain reference sequences which do not belong to the modeled species, e.g. + EBV in a human reference genome. + :type ncbiTaxonId: null|int + :field description: + Optional free text description of this reference set. + :type description: null|string + :field assemblyId: + Public id of this reference set, such as `GRCh37`. + :type assemblyId: null|string + :field sourceURI: + Specifies a FASTA format file/string. + :type sourceURI: null|string + :field sourceAccessions: + All known corresponding accession IDs in INSDC (GenBank/ENA/DDBJ) ideally + with a version number, e.g. `NC_000001.11`. 
+ :type sourceAccessions: array + :field isDerived: + A reference set may be derived from a source if it contains + additional sequences, or some of the sequences within it are derived + (see the definition of `isDerived` in `Reference`). + :type isDerived: boolean + + A `ReferenceSet` is a set of `Reference`s which typically comprise a + reference assembly, such as `GRCh38`. A `ReferenceSet` defines a common + coordinate space for comparing reference-aligned experimental data. + diff --git a/doc/source/schemas/variantmethods.rst b/doc/source/schemas/variantmethods.rst new file mode 100644 index 00000000..a3641a85 --- /dev/null +++ b/doc/source/schemas/variantmethods.rst @@ -0,0 +1,475 @@ +VariantMethods +************** + + .. function:: searchVariants(request) + + :param request: SearchVariantsRequest: This request maps to the body of `POST /variants/search` as JSON. + :return type: SearchVariantsResponse + :throws: GAException + +Gets a list of `Variant` matching the search criteria. + +`POST /variants/search` must accept a JSON version of `SearchVariantsRequest` +as the post body and will return a JSON version of `SearchVariantsResponse`. + + .. function:: getCallSet(id) + + :param id: string: The ID of the `CallSet`. + :return type: org.ga4gh.models.CallSet + :throws: GAException + +Gets a `CallSet` by ID. +`GET /callsets/{id}` will return a JSON version of `CallSet`. + + .. function:: searchVariantSets(request) + + :param request: SearchVariantSetsRequest: This request maps to the body of `POST /variantsets/search` as JSON. + :return type: SearchVariantSetsResponse + :throws: GAException + +Gets a list of `VariantSet` matching the search criteria. + +`POST /variantsets/search` must accept a JSON version of +`SearchVariantSetsRequest` as the post body and will return a JSON version +of `SearchVariantSetsResponse`. + + .. function:: getVariantSet(id) + + :param id: string: The ID of the `VariantSet`. 
+ :return type: org.ga4gh.models.VariantSet + :throws: GAException + +Gets a `VariantSet` by ID. +`GET /variantsets/{id}` will return a JSON version of `VariantSet`. + + .. function:: getVariant(id) + + :param id: string: The ID of the `Variant`. + :return type: org.ga4gh.models.Variant + :throws: GAException + +Gets a `Variant` by ID. +`GET /variants/{id}` will return a JSON version of `Variant`. + + .. function:: searchCallSets(request) + + :param request: SearchCallSetsRequest: This request maps to the body of `POST /callsets/search` as JSON. + :return type: SearchCallSetsResponse + :throws: GAException + +Gets a list of `CallSet` matching the search criteria. + +`POST /callsets/search` must accept a JSON version of `SearchCallSetsRequest` +as the post body and will return a JSON version of `SearchCallSetsResponse`. + +.. avro:error:: GAException + + A general exception type. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associated with some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The positive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g.
`ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. 
This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:record:: VariantSetMetadata + + :field key: + The top-level key. + :type key: string + :field value: + The value field for simple metadata. + :type value: string + :field id: + User-provided ID field, not enforced by this API. + Two or more pieces of structured metadata with identical + id and key fields are considered equivalent. + `FIXME: If it's not enforced, then why can't it be null?` + :type id: string + :field type: + The type of data. 
+ :type type: string + :field number: + The number of values that can be included in a field described by this + metadata. + :type number: string + :field description: + A textual description of this metadata. + :type description: string + :field info: + Remaining structured metadata key-value pairs. + :type info: map> + + Optional metadata associated with a variant set. + +.. avro:record:: VariantSet + + :field id: + The variant set ID. + :type id: string + :field name: + The variant set name. + :type name: null|string + :field datasetId: + The ID of the dataset this variant set belongs to. + :type datasetId: string + :field referenceSetId: + The ID of the reference set that describes the sequences used by the variants in this set. + :type referenceSetId: string + :field metadata: + Optional metadata associated with this variant set. + This array can be used to store information about the variant set, such as information found + in VCF header fields, that isn't already available in first class fields such as "name". + :type metadata: array + + A VariantSet is a collection of variants and variant calls intended to be analyzed together. + +.. avro:record:: CallSet + + :field id: + The call set ID. + :type id: string + :field name: + The call set name. + :type name: null|string + :field sampleId: + The sample this call set's data was generated from. + Note: the current API does not have a rigorous definition of sample. Therefore, this + field actually contains an arbitrary string, typically corresponding to the sampleId + field in the read groups used to generate this call set. + :type sampleId: null|string + :field variantSetIds: + The IDs of the variant sets this call set has calls in. + :type variantSetIds: array + :field created: + The date this call set was created in milliseconds from the epoch. + :type created: null|long + :field updated: + The time at which this call set was last updated in + milliseconds from the epoch. 
+ :type updated: null|long + :field info: + A map of additional call set information. + :type info: map> + + A CallSet is a collection of calls that were generated by the same analysis of the same sample. + +.. avro:record:: Call + + :field callSetName: + The name of the call set this variant call belongs to. + If this field is not present, the ordering of the call sets from a + `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match + the ordering of the calls on this `Variant`. + The number of results will also be the same. + :type callSetName: null|string + :field callSetId: + The ID of the call set this variant call belongs to. + + If this field is not present, the ordering of the call sets from a + `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match + the ordering of the calls on this `Variant`. + The number of results will also be the same. + :type callSetId: null|string + :field genotype: + The genotype of this variant call. + + A 0 value represents the reference allele of the associated `Variant`. Any + other value is a 1-based index into the alternate alleles of the associated + `Variant`. + + If a variant had a referenceBases field of "T", an alternateBases + value of ["A", "C"], and the genotype was [2, 1], that would mean the call + represented the heterozygous value "CA" for this variant. If the genotype + was instead [0, 1] the represented value would be "TA". Ordering of the + genotype values is important if the phaseset field is present. + :type genotype: array + :field phaseset: + If this field is not null, this variant call's genotype ordering implies + the phase of the bases and is consistent with any other variant calls on + the same contig which have the same phaseset string. + :type phaseset: null|string + :field genotypeLikelihood: + The genotype likelihoods for this variant call. 
Each array entry + represents how likely a specific genotype is for this call as + log10(P(data | genotype)), analogous to the GL tag in the VCF spec. The + value ordering is defined by the GL tag in the VCF spec. + :type genotypeLikelihood: array + :field info: + A map of additional variant call information. + :type info: map> + + A `Call` represents the determination of genotype with respect to a + particular `Variant`. + + It may include associated information such as quality + and phasing. For example, a call might assign a probability of 0.32 to + the occurrence of a SNP named rs1234 in a call set with the name NA12345. + +.. avro:record:: Variant + + :field id: + The variant ID. + :type id: string + :field variantSetId: + The ID of the `VariantSet` this variant belongs to. This transitively defines + the `ReferenceSet` against which the `Variant` is to be interpreted. + :type variantSetId: string + :field names: + Names for the variant, for example a RefSNP ID. + :type names: array + :field created: + The date this variant was created in milliseconds from the epoch. + :type created: null|long + :field updated: + The time at which this variant was last updated in + milliseconds from the epoch. + :type updated: null|long + :field referenceName: + The reference on which this variant occurs. + (e.g. `chr20` or `X`) + :type referenceName: string + :field start: + The start position at which this variant occurs (0-based). + This corresponds to the first base of the string of reference bases. + Genomic positions are non-negative integers less than reference length. + Variants spanning the join of circular genomes are represented as + two variants one on each side of the join (position 0). + :type start: long + :field end: + The end position (exclusive), resulting in [start, end) closed-open interval. + This is typically calculated by `start + referenceBases.length`. + :type end: long + :field referenceBases: + The reference bases for this variant. 
They start at the given start position. + :type referenceBases: string + :field alternateBases: + The bases that appear instead of the reference bases. Multiple alternate + alleles are possible. + :type alternateBases: array + :field info: + A map of additional variant information. + :type info: map> + :field calls: + The variant calls for this particular variant. Each one represents the + determination of genotype with respect to this variant. `Call`s in this array + are implicitly associated with this `Variant`. + :type calls: array + + A `Variant` represents a change in DNA sequence relative to some reference. + For example, a variant could represent a SNP or an insertion. + Variants belong to a `VariantSet`. + This is equivalent to a row in VCF. + +.. avro:record:: SearchVariantSetsRequest + + :field datasetId: + The `Dataset` to search. + :type datasetId: string + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /variantsets/search` as JSON. + +.. avro:record:: SearchVariantSetsResponse + + :field variantSets: + The list of matching variant sets. + :type variantSets: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /variantsets/search` expressed as JSON. + +.. avro:record:: SearchVariantsRequest + + :field variantSetId: + The `VariantSet` to search. 
+ :type variantSetId: string + :field callSetIds: + Only return variant calls which belong to call sets with these IDs. + If an empty array, returns variants without any call objects. + If null, returns all variant calls. + :type callSetIds: null|array + :field referenceName: + Required. Only return variants on this reference. + :type referenceName: string + :field start: + Required. The beginning of the window (0-based, inclusive) for + which overlapping variants should be returned. + Genomic positions are non-negative integers less than reference length. + Requests spanning the join of circular genomes are represented as + two requests one on each side of the join (position 0). + :type start: long + :field end: + Required. The end of the window (0-based, exclusive) for which overlapping + variants should be returned. + :type end: long + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /variants/search` as JSON. + +.. avro:record:: SearchVariantsResponse + + :field variants: + The list of matching variants. + If the `callSetId` field on the returned calls is not present, + the ordering of the call sets from a `SearchCallSetsRequest` + over the parent `VariantSet` is guaranteed to match the ordering + of the calls on each `Variant`. The number of results will also be + the same. + :type variants: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. 
+ :type nextPageToken: null|string + + This is the response from `POST /variants/search` expressed as JSON. + +.. avro:record:: SearchCallSetsRequest + + :field variantSetId: + The VariantSet to search. + :type variantSetId: string + :field name: + Only return call sets with this name (case-sensitive, exact match). + :type name: null|string + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /callsets/search` as JSON. + +.. avro:record:: SearchCallSetsResponse + + :field callSets: + The list of matching call sets. + :type callSets: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /callsets/search` expressed as JSON. + diff --git a/doc/source/schemas/variants.rst b/doc/source/schemas/variants.rst new file mode 100644 index 00000000..f51ccd9c --- /dev/null +++ b/doc/source/schemas/variants.rst @@ -0,0 +1,297 @@ +Variants +******** + +This file defines the objects used to represent variant calls, most importantly +VariantSet, Variant, and Call. +See {TODO: LINK TO VARIANTS OVERVIEW} for more information. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associated with some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The positive (+) strand. + +..
avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. `ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. 
+ * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). 
Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:record:: VariantSetMetadata + + :field key: + The top-level key. + :type key: string + :field value: + The value field for simple metadata. + :type value: string + :field id: + User-provided ID field, not enforced by this API. + Two or more pieces of structured metadata with identical + id and key fields are considered equivalent. + `FIXME: If it's not enforced, then why can't it be null?` + :type id: string + :field type: + The type of data. + :type type: string + :field number: + The number of values that can be included in a field described by this + metadata. + :type number: string + :field description: + A textual description of this metadata. + :type description: string + :field info: + Remaining structured metadata key-value pairs. + :type info: map> + + Optional metadata associated with a variant set. + +.. avro:record:: VariantSet + + :field id: + The variant set ID. + :type id: string + :field name: + The variant set name. + :type name: null|string + :field datasetId: + The ID of the dataset this variant set belongs to. + :type datasetId: string + :field referenceSetId: + The ID of the reference set that describes the sequences used by the variants in this set. + :type referenceSetId: string + :field metadata: + Optional metadata associated with this variant set. + This array can be used to store information about the variant set, such as information found + in VCF header fields, that isn't already available in first class fields such as "name". + :type metadata: array + + A VariantSet is a collection of variants and variant calls intended to be analyzed together. + +.. avro:record:: CallSet + + :field id: + The call set ID. 
+ :type id: string + :field name: + The call set name. + :type name: null|string + :field sampleId: + The sample this call set's data was generated from. + Note: the current API does not have a rigorous definition of sample. Therefore, this + field actually contains an arbitrary string, typically corresponding to the sampleId + field in the read groups used to generate this call set. + :type sampleId: null|string + :field variantSetIds: + The IDs of the variant sets this call set has calls in. + :type variantSetIds: array + :field created: + The date this call set was created in milliseconds from the epoch. + :type created: null|long + :field updated: + The time at which this call set was last updated in + milliseconds from the epoch. + :type updated: null|long + :field info: + A map of additional call set information. + :type info: map> + + A CallSet is a collection of calls that were generated by the same analysis of the same sample. + +.. avro:record:: Call + + :field callSetName: + The name of the call set this variant call belongs to. + If this field is not present, the ordering of the call sets from a + `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match + the ordering of the calls on this `Variant`. + The number of results will also be the same. + :type callSetName: null|string + :field callSetId: + The ID of the call set this variant call belongs to. + + If this field is not present, the ordering of the call sets from a + `SearchCallSetsRequest` over this `VariantSet` is guaranteed to match + the ordering of the calls on this `Variant`. + The number of results will also be the same. + :type callSetId: null|string + :field genotype: + The genotype of this variant call. + + A 0 value represents the reference allele of the associated `Variant`. Any + other value is a 1-based index into the alternate alleles of the associated + `Variant`. 
+ + If a variant had a referenceBases field of "T", an alternateBases + value of ["A", "C"], and the genotype was [2, 1], that would mean the call + represented the heterozygous value "CA" for this variant. If the genotype + was instead [0, 1] the represented value would be "TA". Ordering of the + genotype values is important if the phaseset field is present. + :type genotype: array + :field phaseset: + If this field is not null, this variant call's genotype ordering implies + the phase of the bases and is consistent with any other variant calls on + the same contig which have the same phaseset string. + :type phaseset: null|string + :field genotypeLikelihood: + The genotype likelihoods for this variant call. Each array entry + represents how likely a specific genotype is for this call as + log10(P(data | genotype)), analogous to the GL tag in the VCF spec. The + value ordering is defined by the GL tag in the VCF spec. + :type genotypeLikelihood: array + :field info: + A map of additional variant call information. + :type info: map> + + A `Call` represents the determination of genotype with respect to a + particular `Variant`. + + It may include associated information such as quality + and phasing. For example, a call might assign a probability of 0.32 to + the occurrence of a SNP named rs1234 in a call set with the name NA12345. + +.. avro:record:: Variant + + :field id: + The variant ID. + :type id: string + :field variantSetId: + The ID of the `VariantSet` this variant belongs to. This transitively defines + the `ReferenceSet` against which the `Variant` is to be interpreted. + :type variantSetId: string + :field names: + Names for the variant, for example a RefSNP ID. + :type names: array + :field created: + The date this variant was created in milliseconds from the epoch. + :type created: null|long + :field updated: + The time at which this variant was last updated in + milliseconds from the epoch. 
+ :type updated: null|long + :field referenceName: + The reference on which this variant occurs. + (e.g. `chr20` or `X`) + :type referenceName: string + :field start: + The start position at which this variant occurs (0-based). + This corresponds to the first base of the string of reference bases. + Genomic positions are non-negative integers less than reference length. + Variants spanning the join of circular genomes are represented as + two variants one on each side of the join (position 0). + :type start: long + :field end: + The end position (exclusive), resulting in [start, end) closed-open interval. + This is typically calculated by `start + referenceBases.length`. + :type end: long + :field referenceBases: + The reference bases for this variant. They start at the given start position. + :type referenceBases: string + :field alternateBases: + The bases that appear instead of the reference bases. Multiple alternate + alleles are possible. + :type alternateBases: array + :field info: + A map of additional variant information. + :type info: map> + :field calls: + The variant calls for this particular variant. Each one represents the + determination of genotype with respect to this variant. `Call`s in this array + are implicitly associated with this `Variant`. + :type calls: array + + A `Variant` represents a change in DNA sequence relative to some reference. + For example, a variant could represent a SNP or an insertion. + Variants belong to a `VariantSet`. + This is equivalent to a row in VCF. + From 3d6fdfdee592ec38de3a1c28d5415f46372dffed Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Fri, 18 Mar 2016 11:19:51 -0700 Subject: [PATCH 09/13] Added back in autogenerated rst doc files for sequence annotations. 
--- .../schemas/sequenceAnnotationmethods.rst | 457 ++++++++++++++++++ doc/source/schemas/sequenceAnnotations.rst | 342 +++++++++++++ 2 files changed, 799 insertions(+) create mode 100644 doc/source/schemas/sequenceAnnotationmethods.rst create mode 100644 doc/source/schemas/sequenceAnnotations.rst diff --git a/doc/source/schemas/sequenceAnnotationmethods.rst b/doc/source/schemas/sequenceAnnotationmethods.rst new file mode 100644 index 00000000..c0c836b3 --- /dev/null +++ b/doc/source/schemas/sequenceAnnotationmethods.rst @@ -0,0 +1,457 @@ +SequenceAnnotationMethods +************************* + + .. function:: searchFeatureSets(request) + + :param request: SearchFeatureSetsRequest: This request maps to the body of `POST /featuresets/search` as JSON. + :return type: SearchFeatureSetsResponse + :throws: GAException + +Gets a list of `FeatureSet` matching the search criteria. + + `POST /featuresets/search` must accept a JSON version of + `SearchFeatureSetsRequest` as the post body and will return a JSON version + of `SearchFeatureSetsResponse`. + + .. function:: getFeatureSet(id) + + :param id: string: The ID of the `FeatureSet`. + :return type: org.ga4gh.models.FeatureSet + :throws: GAException + +Gets a `FeatureSet` by ID. + `GET /featuresets/{id}` will return a JSON version of `FeatureSet`. + + .. function:: getFeature(id) + + :param id: string: The ID of the `Feature`. + :return type: org.ga4gh.models.Feature + :throws: GAException + +Gets a `org.ga4gh.models.Feature` by ID. + `GET /features/{id}` will return a JSON version of `Feature`. + + .. function:: searchFeatures(request) + + :param request: SearchFeaturesRequest: This request maps to the body of `POST /features/search` as JSON. + :return type: SearchFeaturesResponse + :throws: GAException + +Gets a list of `Feature` matching the search criteria. + + `POST /features/search` must accept a JSON version of + `SearchFeaturesRequest` as the post body and will return a JSON version of + `SearchFeaturesResponse`. 
+ +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associate for some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The postive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. `ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. 
This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. 
+ :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:error:: GAException + + A general exception type. + +.. avro:record:: OntologyTerm + + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. + :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + +.. avro:record:: Experiment + + :field id: + The experiment UUID. This is globally unique. + :type id: string + :field name: + The name of the experiment. + :type name: null|string + :field description: + A description of the experiment. 
+ :type description: null|string + :field createDateTime: + The time at which this record was created. + Format: :ref:`ISO 8601 ` + :type createDateTime: string + :field updateDateTime: + The time at which this record was last updated. + Format: :ref:`ISO 8601 ` + :type updateDateTime: string + :field runTime: + The time at which this experiment was performed. + Granularity here is variable (e.g. date only). + Format: :ref:`ISO 8601 ` + :type runTime: null|string + :field molecule: + The molecule examined in this experiment. (e.g. genomics DNA, total RNA) + :type molecule: null|string + :field strategy: + The experiment technique or strategy applied to the sample. + (e.g. whole genome sequencing, RNA-seq, RIP-seq) + :type strategy: null|string + :field selection: + The method used to enrich the target. (e.g. immunoprecipitation, size + fractionation, MNase digestion) + :type selection: null|string + :field library: + The name of the library used as part of this experiment. + :type library: null|string + :field libraryLayout: + The configuration of sequenced reads. (e.g. Single or Paired) + :type libraryLayout: null|string + :field instrumentModel: + The instrument model used as part of this experiment. + This maps to sequencing technology in BAM. + :type instrumentModel: null|string + :field instrumentDataFile: + The data file generated by the instrument. + TODO: This isn't actually a file is it? + Should this be `instrumentData` instead? + :type instrumentDataFile: null|string + :field sequencingCenter: + The sequencing center used as part of this experiment. + :type sequencingCenter: null|string + :field platformUnit: + The platform unit used as part of this experiment. This is a flowcell-barcode + or slide unique identifier. + :type platformUnit: null|string + :field info: + A map of additional experiment information. + :type info: map> + + An experimental preparation of a sample. + +.. 
avro:record:: Dataset + + :field id: + The dataset's id, locally unique to the server instance. + :type id: string + :field name: + The name of the dataset. + :type name: null|string + :field description: + Additional, human-readable information on the dataset. + :type description: null|string + + A Dataset is a collection of related data of multiple types. + Data providers decide how to group data into datasets. + See [Metadata API](../api/metadata.html) for a more detailed discussion. + +.. avro:record:: Analysis + + :field id: + Formats of id | name | description | accessions are described in the + documentation on general attributes and formats. + :type id: string + :field name: + :type name: null|string + :field description: + :type description: null|string + :field createDateTime: + The time at which this record was created. + Format: :ref:`ISO 8601 ` + :type createDateTime: null|string + :field updateDateTime: + The time at which this record was last updated. + Format: :ref:`ISO 8601 ` + :type updateDateTime: string + :field type: + The type of analysis. + :type type: null|string + :field software: + The software run to generate this analysis. + :type software: array + :field info: + A map of additional analysis information. + :type info: map> + + An analysis contains an interpretation of one or several experiments. + (e.g. SNVs, copy number variations, methylation status) together with + information about the methodology used. + +.. avro:record:: Attributes + + :field vals: + :type vals: map> + + Type defining a collection of attributes associated with various protocol + records. Each attribute is a name that maps to an array of one or more + values. Values can be strings, external identifiers, or ontology terms. + Values should be split into the array elements instead of using a separator + syntax that needs to parsed. + +.. avro:record:: Feature + + :field id: + Id of this annotation node. + :type id: string + :field parentId: + Parent Id of this node. 
Set to empty string if node has no parent. + :type parentId: string + :field childIds: + Ordered array of Child Ids of this node. + Since not all child nodes are ordered by genomic coordinates, + this can't always be reconstructed from parentId's of the children alone. + :type childIds: array + :field featureSetId: + Identifier for the containing feature set. + :type featureSetId: string + :field referenceName: + The reference on which this feature occurs. + (e.g. `chr20` or `X`) + :type referenceName: string + :field start: + The start position at which this feature occurs (0-based). + This corresponds to the first base of the string of reference bases. + Genomic positions are non-negative integers less than reference length. + Features spanning the join of circular genomes are represented as + two features one on each side of the join (position 0). + :type start: long + :field end: + The end position (exclusive), resulting in [start, end) closed-open interval. + This is typically calculated by `start + referenceBases.length`. + :type end: long + :field strand: + The strand on which the feature is present. + :type strand: Strand + :field featureType: + Feature that is annotated by this region. Normally, this will be a term in + the Sequence Ontology. + :type featureType: OntologyTerm + :field attributes: + Name/value attributes of the annotation. Attribute names follow the GFF3 + naming convention of reserved names starting with an upper cases + character, and user-define names start with lower-case. Most GFF3 + pre-defined attributes apply, the exceptions are ID and Parent, which are + defined as fields. Additional, the following attributes are added: + * Score - the GFF3 score column + * Phase - the GFF3 phase column for CDS features. + :type attributes: Attributes + + Node in the annotation graph that annotates a contiguous region of a + sequence. + +.. avro:record:: FeatureSet + + :field id: + The ID of this annotation set. 
+ :type id: string + :field datasetId: + The ID of the dataset this annotation set belongs to. + :type datasetId: null|string + :field referenceSetId: + The ID of the reference set which defines the coordinate-space for this + set of annotations. + :type referenceSetId: null|string + :field name: + The display name for this annotation set. + :type name: null|string + :field sourceURI: + The source URI describing the file from which this annotation set was + generated, if any. + :type sourceURI: null|string + :field info: + Remaining structured metadata key-value pairs. + :type info: map> + +.. avro:record:: SearchFeatureSetsRequest + + :field datasetId: + The `Dataset` to search. + :type datasetId: string + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /featuresets/search` as JSON. + +.. avro:record:: SearchFeatureSetsResponse + + :field featureSets: + The list of matching feature sets. + :type featureSets: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /featuresets/search` expressed as JSON. + +.. avro:record:: SearchFeaturesRequest + + :field featureSetId: + The annotation set to search within. Either `featureSetId` or + `parentId` must be non-empty. + :type featureSetId: null|string + :field parentId: + Restricts the search to direct children of the given parent `feature` + ID. 
Either `featureSetId` or `parentId` must be non-empty. + :type parentId: null|string + :field referenceName: + Only return features on the reference with this name + (matched to literal reference name as imported from the GFF3). + :type referenceName: string + :field start: + Required. The beginning of the window (0-based, inclusive) for which + overlapping features should be returned. Genomic positions are + non-negative integers less than reference length. Requests spanning the + join of circular genomes are represented as two requests one on each side + of the join (position 0). + :type start: long + :field end: + Required. The end of the window (0-based, exclusive) for which overlapping + features should be returned. + :type end: long + :field ontologyTerms: + If specified, this query matches only annotations which match one of the + provided ontology terms. + :type ontologyTerms: array + :field pageSize: + Specifies the maximum number of results to return in a single page. + If unspecified, a system default will be used. + :type pageSize: null|int + :field pageToken: + The continuation token, which is used to page through large result sets. + To get the next page of results, set this parameter to the value of + `nextPageToken` from the previous response. + :type pageToken: null|string + + This request maps to the body of `POST /features/search` as JSON. + +.. avro:record:: SearchFeaturesResponse + + :field features: + The list of matching annotations, sorted by start position. Annotations which + share a start position are returned in a deterministic order. + :type features: array + :field nextPageToken: + The continuation token, which is used to page through large result sets. + Provide this value in a subsequent request to return the next page of + results. This field will be empty if there aren't any additional results. + :type nextPageToken: null|string + + This is the response from `POST /features/search` expressed as JSON. 
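Reviewer note: the request/response pair above implies a paging loop on the client side. The sketch below shows one way to build a `SearchFeaturesRequest` body and follow `nextPageToken` until it comes back empty. It assumes a Python client and a caller-supplied `post(path, body)` function that returns the decoded JSON response; `build_search_features_request()`, `iter_features()` and `post` are hypothetical names, not part of the schema. The ontology-term filter is left out on purpose.

```python
def build_search_features_request(reference_name, start, end,
                                  feature_set_id=None, parent_id=None,
                                  page_size=None, page_token=None):
    """Build the JSON body for POST /features/search (sketch, not normative)."""
    # The schema text says either featureSetId or parentId must be non-empty.
    if not feature_set_id and not parent_id:
        raise ValueError("either featureSetId or parentId must be non-empty")
    # The window is 0-based and half-open: [start, end).
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return {
        "featureSetId": feature_set_id,
        "parentId": parent_id,
        "referenceName": reference_name,
        "start": start,
        "end": end,
        "pageSize": page_size,
        "pageToken": page_token,
    }


def iter_features(post, request):
    """Yield every feature across all pages.

    `post(path, body)` is supplied by the caller (assumption) and must
    return the SearchFeaturesResponse as a dict.
    """
    body = dict(request)
    while True:
        response = post("/features/search", body)
        for feature in response.get("features", []):
            yield feature
        token = response.get("nextPageToken")
        if not token:  # null or empty means there are no further pages
            break
        body["pageToken"] = token
```

Keeping the transport as a parameter means the loop can be exercised without a running server, for instance with a stub that returns canned pages.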
+ diff --git a/doc/source/schemas/sequenceAnnotations.rst b/doc/source/schemas/sequenceAnnotations.rst new file mode 100644 index 00000000..a50b12d3 --- /dev/null +++ b/doc/source/schemas/sequenceAnnotations.rst @@ -0,0 +1,342 @@ +SequenceAnnotations +******************* + +This protocol defines annotations on GA4GH genomic sequences It includes two +types of annotations: continuous and discrete hierarchical. + +The discrete hierarchical annotations are derived from the Sequence Ontology +(SO) and GFF3 work + + http://www.sequenceontology.org/gff3.shtml + +The goal is to be able to store annotations using the GFF3 and SO conceptual +model, although there is not necessarly a one-to-one mapping in Avro records +to GFF3 records. + +The minimum requirement is to be able to accurately represent the current +state of the art annotation data and the full SO model. Feature is the +core generic record which corresponds to the a GFF3 record. + +.. avro:enum:: Strand + + :symbols: NEG_STRAND|POS_STRAND + Indicates the DNA strand associate for some data item. + * `NEG_STRAND`: The negative (-) strand. + * `POS_STRAND`: The postive (+) strand. + +.. avro:record:: Position + + :field referenceName: + The name of the `Reference` on which the `Position` is located. + :type referenceName: string + :field position: + The 0-based offset from the start of the forward strand for that `Reference`. + Genomic positions are non-negative integers less than `Reference` length. + :type position: long + :field strand: + Strand the position is associated with. + :type strand: Strand + + A `Position` is an unoriented base in some `Reference`. A `Position` is + represented by a `Reference` name, and a base number on that `Reference` + (0-based). + +.. avro:record:: ExternalIdentifier + + :field database: + The source of the identifier. + (e.g. `Ensembl`) + :type database: string + :field identifier: + The ID defined by the external database. + (e.g. 
`ENST00000000000`) + :type identifier: string + :field version: + The version of the object or the database + (e.g. `78`) + :type version: string + + Identifier from a public database + +.. avro:enum:: CigarOperation + + :symbols: ALIGNMENT_MATCH|INSERT|DELETE|SKIP|CLIP_SOFT|CLIP_HARD|PAD|SEQUENCE_MATCH|SEQUENCE_MISMATCH + An enum for the different types of CIGAR alignment operations that exist. + Used wherever CIGAR alignments are used. The different enumerated values + have the following usage: + + * `ALIGNMENT_MATCH`: An alignment match indicates that a sequence can be + aligned to the reference without evidence of an INDEL. Unlike the + `SEQUENCE_MATCH` and `SEQUENCE_MISMATCH` operators, the `ALIGNMENT_MATCH` + operator does not indicate whether the reference and read sequences are an + exact match. This operator is equivalent to SAM's `M`. + * `INSERT`: The insert operator indicates that the read contains evidence of + bases being inserted into the reference. This operator is equivalent to + SAM's `I`. + * `DELETE`: The delete operator indicates that the read contains evidence of + bases being deleted from the reference. This operator is equivalent to + SAM's `D`. + * `SKIP`: The skip operator indicates that this read skips a long segment of + the reference, but the bases have not been deleted. This operator is + commonly used when working with RNA-seq data, where reads may skip long + segments of the reference between exons. This operator is equivalent to + SAM's 'N'. + * `CLIP_SOFT`: The soft clip operator indicates that bases at the start/end + of a read have not been considered during alignment. This may occur if the + majority of a read maps, except for low quality bases at the start/end of + a read. This operator is equivalent to SAM's 'S'. Bases that are soft clipped + will still be stored in the read. + * `CLIP_HARD`: The hard clip operator indicates that bases at the start/end of + a read have been omitted from this alignment. 
This may occur if this linear + alignment is part of a chimeric alignment, or if the read has been trimmed + (e.g., during error correction, or to trim poly-A tails for RNA-seq). This + operator is equivalent to SAM's 'H'. + * `PAD`: The pad operator indicates that there is padding in an alignment. + This operator is equivalent to SAM's 'P'. + * `SEQUENCE_MATCH`: This operator indicates that this portion of the aligned + sequence exactly matches the reference (e.g., all bases are equal to the + reference bases). This operator is equivalent to SAM's '='. + * `SEQUENCE_MISMATCH`: This operator indicates that this portion of the + aligned sequence is an alignment match to the reference, but a sequence + mismatch (e.g., the bases are not equal to the reference). This can + indicate a SNP or a read error. This operator is equivalent to SAM's 'X'. + +.. avro:record:: CigarUnit + + :field operation: + The operation type. + :type operation: CigarOperation + :field operationLength: + The number of bases that the operation runs for. + :type operationLength: long + :field referenceSequence: + `referenceSequence` is only used at mismatches (`SEQUENCE_MISMATCH`) + and deletions (`DELETE`). Filling this field replaces the MD tag. + If the relevant information is not available, leave this field as `null`. + :type referenceSequence: null|string + + A structure for an instance of a CIGAR operation. + `FIXME: This belongs under Reads (only readAlignment refers to this)` + +.. avro:record:: OntologyTerm + + :field id: + Ontology source identifier - the identifier, a CURIE (preferred) or + PURL for an ontology source e.g. http://purl.obolibrary.org/obo/hp.obo + It differs from the standard GA4GH schema's :ref:`id ` + in that it is a URI pointing to an information resource outside of the scope + of the schema or its resource implementation. + :type id: string + :field term: + Ontology term - the representation the id is pointing to. 
+ :type term: null|string + :field sourceName: + Ontology source name - the name of ontology from which the term is obtained + e.g. 'Human Phenotype Ontology' + :type sourceName: null|string + :field sourceVersion: + Ontology source version - the version of the ontology from which the + OntologyTerm is obtained; e.g. 2.6.1. + There is no standard for ontology versioning and some frequently + released ontologies may use a datestamp, or build number. + :type sourceVersion: null|string + + An ontology term describing an attribute. (e.g. the phenotype attribute + 'polydactyly' from HPO) + +.. avro:record:: Experiment + + :field id: + The experiment UUID. This is globally unique. + :type id: string + :field name: + The name of the experiment. + :type name: null|string + :field description: + A description of the experiment. + :type description: null|string + :field createDateTime: + The time at which this record was created. + Format: :ref:`ISO 8601 ` + :type createDateTime: string + :field updateDateTime: + The time at which this record was last updated. + Format: :ref:`ISO 8601 ` + :type updateDateTime: string + :field runTime: + The time at which this experiment was performed. + Granularity here is variable (e.g. date only). + Format: :ref:`ISO 8601 ` + :type runTime: null|string + :field molecule: + The molecule examined in this experiment. (e.g. genomics DNA, total RNA) + :type molecule: null|string + :field strategy: + The experiment technique or strategy applied to the sample. + (e.g. whole genome sequencing, RNA-seq, RIP-seq) + :type strategy: null|string + :field selection: + The method used to enrich the target. (e.g. immunoprecipitation, size + fractionation, MNase digestion) + :type selection: null|string + :field library: + The name of the library used as part of this experiment. + :type library: null|string + :field libraryLayout: + The configuration of sequenced reads. (e.g. 
Single or Paired) + :type libraryLayout: null|string + :field instrumentModel: + The instrument model used as part of this experiment. + This maps to sequencing technology in BAM. + :type instrumentModel: null|string + :field instrumentDataFile: + The data file generated by the instrument. + TODO: This isn't actually a file is it? + Should this be `instrumentData` instead? + :type instrumentDataFile: null|string + :field sequencingCenter: + The sequencing center used as part of this experiment. + :type sequencingCenter: null|string + :field platformUnit: + The platform unit used as part of this experiment. This is a flowcell-barcode + or slide unique identifier. + :type platformUnit: null|string + :field info: + A map of additional experiment information. + :type info: map> + + An experimental preparation of a sample. + +.. avro:record:: Dataset + + :field id: + The dataset's id, locally unique to the server instance. + :type id: string + :field name: + The name of the dataset. + :type name: null|string + :field description: + Additional, human-readable information on the dataset. + :type description: null|string + + A Dataset is a collection of related data of multiple types. + Data providers decide how to group data into datasets. + See [Metadata API](../api/metadata.html) for a more detailed discussion. + +.. avro:record:: Analysis + + :field id: + Formats of id | name | description | accessions are described in the + documentation on general attributes and formats. + :type id: string + :field name: + :type name: null|string + :field description: + :type description: null|string + :field createDateTime: + The time at which this record was created. + Format: :ref:`ISO 8601 ` + :type createDateTime: null|string + :field updateDateTime: + The time at which this record was last updated. + Format: :ref:`ISO 8601 ` + :type updateDateTime: string + :field type: + The type of analysis. 
+ :type type: null|string + :field software: + The software run to generate this analysis. + :type software: array + :field info: + A map of additional analysis information. + :type info: map> + + An analysis contains an interpretation of one or several experiments. + (e.g. SNVs, copy number variations, methylation status) together with + information about the methodology used. + +.. avro:record:: Attributes + + :field vals: + :type vals: map> + + Type defining a collection of attributes associated with various protocol + records. Each attribute is a name that maps to an array of one or more + values. Values can be strings, external identifiers, or ontology terms. + Values should be split into the array elements instead of using a separator + syntax that needs to parsed. + +.. avro:record:: Feature + + :field id: + Id of this annotation node. + :type id: string + :field parentId: + Parent Id of this node. Set to empty string if node has no parent. + :type parentId: string + :field childIds: + Ordered array of Child Ids of this node. + Since not all child nodes are ordered by genomic coordinates, + this can't always be reconstructed from parentId's of the children alone. + :type childIds: array + :field featureSetId: + Identifier for the containing feature set. + :type featureSetId: string + :field referenceName: + The reference on which this feature occurs. + (e.g. `chr20` or `X`) + :type referenceName: string + :field start: + The start position at which this feature occurs (0-based). + This corresponds to the first base of the string of reference bases. + Genomic positions are non-negative integers less than reference length. + Features spanning the join of circular genomes are represented as + two features one on each side of the join (position 0). + :type start: long + :field end: + The end position (exclusive), resulting in [start, end) closed-open interval. + This is typically calculated by `start + referenceBases.length`. 
+ :type end: long + :field strand: + The strand on which the feature is present. + :type strand: Strand + :field featureType: + Feature that is annotated by this region. Normally, this will be a term in + the Sequence Ontology. + :type featureType: OntologyTerm + :field attributes: + Name/value attributes of the annotation. Attribute names follow the GFF3 + naming convention of reserved names starting with an upper cases + character, and user-define names start with lower-case. Most GFF3 + pre-defined attributes apply, the exceptions are ID and Parent, which are + defined as fields. Additional, the following attributes are added: + * Score - the GFF3 score column + * Phase - the GFF3 phase column for CDS features. + :type attributes: Attributes + + Node in the annotation graph that annotates a contiguous region of a + sequence. + +.. avro:record:: FeatureSet + + :field id: + The ID of this annotation set. + :type id: string + :field datasetId: + The ID of the dataset this annotation set belongs to. + :type datasetId: null|string + :field referenceSetId: + The ID of the reference set which defines the coordinate-space for this + set of annotations. + :type referenceSetId: null|string + :field name: + The display name for this annotation set. + :type name: null|string + :field sourceURI: + The source URI describing the file from which this annotation set was + generated, if any. + :type sourceURI: null|string + :field info: + Remaining structured metadata key-value pairs. + :type info: map> + From 61d2b1ffd8f784ed7e0a7f4dcf522d9ac32120b0 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Fri, 18 Mar 2016 09:54:17 -0700 Subject: [PATCH 10/13] changed ontologyTerms to featureTypes in the features/search method request. 
--- doc/source/schemas/sequenceAnnotationmethods.rst | 8 ++++---- src/main/resources/avro/sequenceAnnotationmethods.avdl | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/source/schemas/sequenceAnnotationmethods.rst b/doc/source/schemas/sequenceAnnotationmethods.rst index c0c836b3..01815cc6 100644 --- a/doc/source/schemas/sequenceAnnotationmethods.rst +++ b/doc/source/schemas/sequenceAnnotationmethods.rst @@ -425,10 +425,10 @@ Gets a list of `Feature` matching the search criteria. Required. The end of the window (0-based, exclusive) for which overlapping features should be returned. :type end: long - :field ontologyTerms: - If specified, this query matches only annotations which match one of the - provided ontology terms. - :type ontologyTerms: array + :field featureTypes: + If specified, this query matches only annotations whose `featureType` + matches one of the provided ontology terms. + :type featureTypes: array :field pageSize: Specifies the maximum number of results to return in a single page. If unspecified, a system default will be used. diff --git a/src/main/resources/avro/sequenceAnnotationmethods.avdl b/src/main/resources/avro/sequenceAnnotationmethods.avdl index 8b690c7e..82e74a50 100644 --- a/src/main/resources/avro/sequenceAnnotationmethods.avdl +++ b/src/main/resources/avro/sequenceAnnotationmethods.avdl @@ -105,10 +105,10 @@ protocol SequenceAnnotationMethods { // TODO: To be replaced with a fully featured ontology search // once the Metadata definitions are rounded out. /** - If specified, this query matches only annotations which match one of the - provided ontology terms. + If specified, this query matches only annotations whose `featureType` + matches one of the provided ontology terms. */ - array ontologyTerms = []; + array featureTypes = []; /** Specifies the maximum number of results to return in a single page. 
From 434ec0b76c10df07beaa4159c809184b50e7a72f Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Thu, 31 Mar 2016 18:16:18 -0700 Subject: [PATCH 11/13] Added sphinx fix. --- requirements.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index a936916f..9eff163c 100644 --- a/requirements.txt +++ b/requirements.txt @@ -4,4 +4,5 @@ flake8 humanize nose requests -sphinx \ No newline at end of file +sphinx +sphinx_rtd_theme From e4120f9ed61995120e6c55e528fd4938cea0d239 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Fri, 1 Apr 2016 15:17:28 -0700 Subject: [PATCH 12/13] Fixed two schema issues pointed out by Sarah Hunt. --- doc/source/api/sequence_annotations.rst | 2 +- src/main/resources/avro/sequenceAnnotations.avdl | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/source/api/sequence_annotations.rst b/doc/source/api/sequence_annotations.rst index a784b54f..90a368fa 100644 --- a/doc/source/api/sequence_annotations.rst +++ b/doc/source/api/sequence_annotations.rst @@ -9,7 +9,7 @@ For the Sequence Annotation schema definitions, see `Sequence Annotation schema ------------------------ Feature Based Hierarchy ------------------------ -The central object of the GA4GH Sequence Annotation API is a Feature. The Feature describes an interval of interest on some reference(s). It has a span from a start position to a stop position as well as descriptive data. A Feature has one parent Feature, and can have an ordered array of child Features, which enables the construction of more complex representations in a hierarchical way. +The central object of the GA4GH Sequence Annotation API is a Feature. The Feature describes an interval of interest on some reference(s). It has a span from a start position to a stop position as well as descriptive data. 
A Feature can have a parent Feature, and can have an ordered array of child Features, which enables the construction of more complex representations in a hierarchical way. For example, a single gene Feature may be parent to several different transcript Features. The specific exons for each transcript would have that transcript Feature as parent. The same physical exon may occur as part of two different transcript Features, but in our notation, it would be encoded as two separate exon Features, each with a different parent, both occupying the same genomic coordinates. This structure can also extend to annotating CDS, binding sites or any other sub-gene level features. diff --git a/src/main/resources/avro/sequenceAnnotations.avdl b/src/main/resources/avro/sequenceAnnotations.avdl index adeb12b4..546241b5 100644 --- a/src/main/resources/avro/sequenceAnnotations.avdl +++ b/src/main/resources/avro/sequenceAnnotations.avdl @@ -114,7 +114,7 @@ protocol SequenceAnnotations { string id; /** The ID of the dataset this annotation set belongs to. */ - union { null, string } datasetId = null; + string datasetId; /** The ID of the reference set which defines the coordinate-space for this From 23a7a93450a01cec793fa7bf9391dcdb9a122506 Mon Sep 17 00:00:00 2001 From: Maciek Smuga-Otto Date: Tue, 5 Apr 2016 11:50:38 -0700 Subject: [PATCH 13/13] Removed extra TODO (instead moving it to GIT issue), and aspirational comment about multi-reference features. --- doc/source/api/sequence_annotations.rst | 2 -- src/main/resources/avro/sequenceAnnotations.avdl | 2 -- 2 files changed, 4 deletions(-) diff --git a/doc/source/api/sequence_annotations.rst b/doc/source/api/sequence_annotations.rst index 90a368fa..9b14de1c 100644 --- a/doc/source/api/sequence_annotations.rst +++ b/doc/source/api/sequence_annotations.rst @@ -42,5 +42,3 @@ Annotation Design - RNA Considerations Read data derived from RNA samples can differ from genomic read data due to the presence of non-genomic sequences.
An example would be a read that spans a splice junction. It describes a contiguous sequence of reads, but a discontinuous genomic region due to the missing intron. Feature level read assignment is further complicated by the existence of multiple splice isoforms. A read that can be definitely assigned to a particular feature (an exon in this case) may still not be definitely assigned to a particular transcript if multiple transcripts share that exon. The annotation API needs to be able to report assignment at the feature level as well as aggregate assignment at the transcript or even the whole gene level if assignment is not more specific than that. Splicing (other post-transcriptional modifications?) can occur with degrees of complexity. A ‘typical’ splice will result in a mature transcript with exons in positional (numerical) order in a head-to-tail orientation. Back splicing (tail-to-head) can result in transcripts with the exon order reversed (1-3-2-4 instead of 1-2-3-4) and even circular RNA. The exon order in a transcript as well as the orientation of the splice should be discoverable via the API. In a more general case, the API should allow child features to have an ordered relationship. - -The annotation API needs to also be flexible enough to handle multiple references in the same gene or transcript. This is needed to cover the cases of fusion genes or inter-chromosomal translocations. diff --git a/src/main/resources/avro/sequenceAnnotations.avdl b/src/main/resources/avro/sequenceAnnotations.avdl index 546241b5..4e3b6f15 100644 --- a/src/main/resources/avro/sequenceAnnotations.avdl +++ b/src/main/resources/avro/sequenceAnnotations.avdl @@ -28,8 +28,6 @@ protocol SequenceAnnotations { Values should be split into the array elements instead of using a separator syntax that needs to be parsed. */ - // TODO: how are multiple instances of a given attribute vs multiple values - // for an attribute distinguished record Attributes { map> vals = {}; }