Epic Type: Implementation Epic
Epic planning and implementation follow the Epic Planning and Marathon SOP.
The aim is to rewrite all existing metldata transformations to use schemapack. This includes:
- adding the possibility to use a custom embedding profile to configure the denormalization in schemapack
- migrating the already schemapack-based transformations to spec version 0.2.0
  (see https://github.com/ghga-de/metldata/tree/poc/src/metldata/schemapack_/builtin_transformations):
  - delete_properties
  - infer_relations
- re-implementing the following LinkML-based transformations to use schemapack
  (see https://github.com/ghga-de/metldata/tree/poc/src/metldata/builtin_transformations):
  - merge_slots
- re-evaluating and potentially refactoring the following transformations:
  - aggregate

Out of scope:
- A re-implementation of the custom_embeddings transformation is not required since it is already covered by built-in functionality of schemapack.
- A re-implementation of the add_accessions transformation is postponed since it might be handled by the submission store instead of being implemented as a transformation.
- A full re-implementation of the GHGA transformation workflow (this depends on other changes to be in place first and would make this epic dependent on other lines of work).
- The normalize_model transformation is not needed anymore.
A custom embedding profile controls which relations (and relations of relations) will be embedded and which won't. It can even be used to deal with circular dependencies that would otherwise lead to an exception when denormalizing.
Here is an example. Given the following rooted schemapack:
```yaml
schemapack: 0.2.0
description: A schema used to describe relationships between team members.
classes:
  Person:
    id:
      propertyName: name
    content: ../../content_schemas/AnyObject.schema.json # any object allowed
    relations:
      teammates:
        targetClass: Person # referring to itself
        mandatory:
          origin: false
          target: false
        multiple:
          origin: true
          target: true
      manager:
        targetClass: Person # referring to itself
        mandatory:
          origin: false
          target: false
        multiple:
          origin: true
          target: false
rootClass: Person
```
There is the following datapack:

```yaml
datapack: 0.2.0
resources:
  Person:
    Alice:
      content: {}
      relations:
        teammates:
          - Bob
          - Charlie
        manager: Dave
    Bob:
      content: {}
      relations:
        teammates:
          - Alice
          - Charlie
        manager: Dave
    Charlie:
      content: {}
      relations:
        teammates:
          - Alice
          - Bob
        manager: Dave
    Dave:
      content: {}
      relations:
        teammates: []
        manager: Eve
    Eve: # the boss of Dave
      content: {}
      relations:
        teammates: []
        manager: null
rootResource: Alice
```
Running a denormalization without configuration would not work because there are circular dependencies between Alice, Bob, and Charlie regarding the teammates relation.
To resolve that, the following embedding profile can be used to configure the denormalization process:
```yaml
teammates:
  # do embed teammates; however, for each embedded teammate the following
  # embedding restrictions apply:
  teammates: false # do not embed teammates of teammates
  manager: false # do not embed managers of teammates
manager: true # do embed the manager and all the nested relations that the
              # manager might have
```
The outcome would be the following denormalized data:

```yaml
name: Alice
teammates:
  - name: Bob
    teammates: # only IDs, since teammates of teammates are not embedded
      - Alice
      - Charlie
    manager: Dave # only the ID, since managers of teammates are not embedded
  - name: Charlie
    teammates:
      - Alice
      - Bob
    manager: Dave
manager:
  name: Dave
  teammates: []
  manager:
    name: Eve
    teammates: []
    manager: null
```
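The effect of such a profile can be sketched in Python. This is a hypothetical illustration of the denormalization logic, not the actual schemapack implementation; the `DATA` dict mirrors the relations of the datapack above. A profile value of `true` embeds the full resource, `false` keeps the plain ID, and a nested mapping applies per-relation rules to the embedded resource:

```python
# Relations of the example datapack (illustrative in-memory layout):
DATA = {
    "Alice": {"teammates": ["Bob", "Charlie"], "manager": "Dave"},
    "Bob": {"teammates": ["Alice", "Charlie"], "manager": "Dave"},
    "Charlie": {"teammates": ["Alice", "Bob"], "manager": "Dave"},
    "Dave": {"teammates": [], "manager": "Eve"},
    "Eve": {"teammates": [], "manager": None},
}


def embed(name, profile):
    """Recursively embed the resource `name`, consulting `profile` to
    decide which relations to follow (sketch, not the schemapack API)."""
    if name is None:
        return None
    resource = {"name": name}
    for relation, target in DATA[name].items():
        rule = True if profile is True else profile.get(relation, True)
        if rule is False:
            # do not embed: keep the plain ID(s)
            resource[relation] = target
        else:
            sub_profile = rule if isinstance(rule, dict) else True
            if isinstance(target, list):
                resource[relation] = [embed(t, sub_profile) for t in target]
            else:
                resource[relation] = embed(target, sub_profile)
    return resource


# the embedding profile from above; without it, embedding Alice would
# recurse forever through the circular teammates relation:
PROFILE = {"teammates": {"teammates": False, "manager": False}, "manager": True}
result = embed("Alice", PROFILE)
```

Note how the `false` rules cut the cycle: Bob's and Charlie's teammates stay plain IDs instead of being embedded again.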
Currently, the pydantic models for interacting with DataPack and SchemaPack definitions are not yet fully frozen; in the future, however, they will be. Thus, while re-implementing the transformations, modifications to a DataPack or SchemaPack object should never be done in place but via methods that also work on frozen pydantic models (e.g. the `model_copy(update={...})` method). This also applies to the already migrated transformations.
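The copy-instead-of-mutate pattern looks like this in pydantic v2 (`DataPackStub` is a hypothetical, minimal stand-in for the real model):

```python
from pydantic import BaseModel, ConfigDict


class DataPackStub(BaseModel):
    """Hypothetical, minimal stand-in for a frozen DataPack model."""

    model_config = ConfigDict(frozen=True)

    datapack: str
    resources: dict


original = DataPackStub(datapack="0.2.0", resources={"Person": {}})

# Mutating a frozen model in place raises a ValidationError:
#   original.datapack = "0.3.0"  # would fail
# Instead, derive a modified copy:
updated = original.model_copy(update={"datapack": "0.3.0"})
```

Since `model_copy(update=...)` never mutates the receiver, the same code keeps working once the real models become frozen.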
The original merge_slots transformation should be replaced by two new transformations:

A. For merging content properties:
- nested properties (of nested JSON objects) must be supported
- if the source properties are not of the same type and the `assume_same_type` argument (see config example below) is set to `false`, a union type is created (anyOf)
- the new property is always a list, even if the source properties were not multivalued
- a config example might look like this:
```yaml
merge_content_properties:
  ClassA:
    # merged properties that shall be created in ClassA:
    new_property_x:
      source_properties:
        - old_property_a
        - old_property_b
      deduplicate: true # the default; values that occur multiple times across
                        # the source properties are only kept once
      assume_same_type: true # the default; raises an exception if the source
                             # properties do not have the same type
    old_property_c.new_property_y: # support for nesting:
      # A new property shall be created in the object contained in
      # `old_property_c`. The properties `nested_property_d` and
      # `nested_property_e` inside of the object contained in `old_property_c`
      # serve as sources. Dots ('.') are used to denote nesting. If a property
      # name contains a dot, it can be escaped using '\.'. We assume that
      # nobody would have '\.' in their property names, so that cannot be
      # escaped.
      source_properties:
        - old_property_c.nested_property_d
        - old_property_c.nested_property_e
      deduplicate: true
      assume_same_type: false # will create a union type if the source
                              # properties differ in type
  ClassB:
    # merged properties for ClassB:
    new_property_z:
      source_properties:
        - old_property_a
        - old_property_b
      # not specifying `deduplicate` and `assume_same_type` is equivalent to
      # their defaults
```
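The dot/escape convention for content paths can be sketched as a small helper. This is an illustration of the rule described above; the real transformation may parse paths differently:

```python
import re


def split_content_path(path: str) -> list[str]:
    """Split a dotted content path into individual property names.

    Dots denote nesting; a literal dot inside a property name is escaped
    as '\\.' (which itself cannot be escaped, as noted above).
    """
    # split on dots that are NOT preceded by a backslash:
    parts = re.split(r"(?<!\\)\.", path)
    # un-escape the remaining '\.' sequences:
    return [part.replace("\\.", ".") for part in parts]
```

For example, `old_property_c.nested_property_d` splits into two levels, while an escaped dot stays part of the property name.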
B. For merging relation properties:
- the source properties must have the same targetClass
- a config example might look like this:
```yaml
merge_relation_properties:
  ClassA:
    # merged relation properties that shall be created in ClassA:
    new_property_x:
      source_properties:
        - old_property_a
        - old_property_b
      # There is no option `deduplicate` because relation properties may never
      # contain duplicate values.
      # Moreover, there is no option `assume_same_type` since schemapack
      # currently does not support unions of multiple target classes for a
      # relation. Thus the source relations must all point to the same target
      # class.
      # Nesting is also not supported in relations, so no special treatment of
      # dots is required.
    new_property_y:
      # another merged property
      source_properties:
        - old_property_a
        - old_property_b
  ClassB:
    # merged properties for ClassB:
    new_property_z:
      source_properties:
        - old_property_a
        - old_property_b
```
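The same-targetClass constraint could be validated roughly like this (a sketch with an illustrative data layout, not the metldata API):

```python
def validate_same_target_class(relations: dict, source_properties: list) -> str:
    """Check that all source relation properties point to the same target
    class and return it; raise a ValueError otherwise.

    `relations` maps relation property names to their definitions
    (illustrative layout mirroring the schemapack examples above).
    """
    target_classes = {
        relations[name]["targetClass"] for name in source_properties
    }
    if len(target_classes) != 1:
        raise ValueError(
            f"source relations point to multiple target classes: "
            f"{sorted(target_classes)}"
        )
    return target_classes.pop()


relations = {
    "old_property_a": {"targetClass": "Sample"},
    "old_property_b": {"targetClass": "Sample"},
    "other_property": {"targetClass": "File"},
}
```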
The former `aggregate` transformation shall be replaced by a mixture of existing and yet-to-be-written transformations. The following new transformations shall be implemented:

A transformation that enables the insertion of a subschema into a new property of any object within the current content schema of a class, including an initial value. The schema is modified by inserting the subschema at the specified path; the data is modified by inserting the default value at the specified path. The user may control whether or not the newly added property is added to the parent property's `required` list.
```yaml
# generic example, defaults shown for optional parameters
- class_name: MyClass
  target_content:
    object_path: "" # optional; the default empty string refers to the content root object
    property_name: new_property
    required: true # optional
    schema: # optional, defaults to the following value
      type: "object"
      additionalProperties: false
    value: {} # optional

# examples from the current config
- class_name: Dataset
  target_content:
    property_name: datasets
- class_name: Dataset
  target_content:
    property_name: "studies_summary"
```
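The insertion logic can be sketched as follows. This is a hypothetical helper mirroring the config above (names and data layout are illustrative, not the actual metldata implementation):

```python
def add_content_property(schema, data, *, object_path="", property_name,
                         subschema=None, value=None, required=True):
    """Sketch: insert `subschema` and a default `value` at the location
    addressed by `object_path` plus `property_name`."""
    if subschema is None:
        subschema = {"type": "object", "additionalProperties": False}
    if value is None:
        value = {}
    schema_node, data_node = schema, data
    # descend to the target object (empty path = content root):
    for part in object_path.split(".") if object_path else []:
        schema_node = schema_node["properties"][part]
        data_node = data_node.setdefault(part, {})
    properties = schema_node.setdefault("properties", {})
    if property_name in properties:
        raise ValueError(f"property {property_name!r} already exists")
    properties[property_name] = subschema
    if required:
        schema_node.setdefault("required", []).append(property_name)
    data_node[property_name] = value


# example corresponding to the `studies_summary` config above:
schema = {"type": "object", "properties": {}}
data = {}
add_content_property(schema, data, property_name="studies_summary")
```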
Important: The following transformations are expected to add a new property to the schema and data as specified under `target_content`. They shall validate that the path specified under `object_path` is valid, i.e. leads to a property of type `"object"`, and that this object does not yet contain a property by the name specified under `property_name`. They shall then add that property with an appropriate subschema as required by the specific transformation.
- The transformation shall count how many target objects are referenced from each source object given the relation name.
- The transformation shall validate that the relation is defined with multiplicity towards the target (i.e. `multiple.target: true`) and fail otherwise.
Example config:
```yaml
- class_name: Dataset
  target_content:
    object_path: samples_summary
    property_name: count
  source_relation_path: "Dataset(samples)>Sample"
```
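The counting step itself is straightforward; a sketch with an illustrative, datapack-like layout (not the actual implementation):

```python
def count_relation_targets(resources: dict, class_name: str,
                           relation_name: str) -> dict:
    """Sketch: for each resource of `class_name`, count how many targets
    are referenced via `relation_name`."""
    return {
        resource_id: len(resource["relations"][relation_name])
        for resource_id, resource in resources[class_name].items()
    }


resources = {
    "Dataset": {
        "dataset_1": {"relations": {"samples": ["sample_1", "sample_2"]}},
        "dataset_2": {"relations": {"samples": []}},
    }
}
counts = count_relation_targets(resources, "Dataset", "samples")
```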
- The transformation shall count the values encountered at a specified property in the content of an object.
- The transformation shall validate that at least one of the traversed references is multi-valued by its cardinality or one of the traversed content elements is an array.
- The path to the source content property is specified in two stages: (1) a relation path string specifies the path to the class that holds the content; (2) a content path specifies the path to the source property within the content.
- A new content property with the name specified under `target_content.property_name` will be added to the schema by the transformation. It will have the following schema, where each observed value is mapped to an integer representing the number of times it was observed:

```yaml
type: object
additionalProperties: true
```
Example config:

```yaml
- class_name: Dataset
  target_content:
    object_path: samples_summary.stats
    property_name: sex
  source:
    relation_path: Dataset(individuals)>Individual
    content_path: sex
- class_name: Dataset
  target_content:
    object_path: samples_summary.stats
    property_name: tissues
  source:
    relation_path: Dataset(biospecimens)>Biospecimen
    content_path: tissue
```
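The resulting value object maps each observed value to its count, which can be sketched with a `Counter` (illustrative, not the actual implementation):

```python
from collections import Counter


def count_content_values(values: list) -> dict:
    """Sketch: map each observed value to the number of times it was
    observed, yielding an object that matches the
    `type: object, additionalProperties: true` schema above."""
    return dict(Counter(values))


# e.g. `sex` values gathered via the Dataset(individuals)>Individual path:
sex_counts = count_content_values(["female", "male", "female", "female"])
```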
- Similar to the previous transformation, but instead of counting element occurrences, this transformation sums up the values of the yielded elements.
- The transformation shall validate that the type of the configured source element allows arithmetic operations (number, integer, boolean).
Draft Config:
```yaml
- class_name: Dataset
  target_content:
    object_path: files_summary.stats
    property_name: size
  source:
    relation_path: Dataset(files)>File
    content_path: size
```
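The arithmetic counterpart to the counting transformation can be sketched as (illustrative helper, including the type validation mentioned above; note that in Python, `bool` is a subclass of `int`, so booleans count as 0/1):

```python
def sum_content_values(values: list):
    """Sketch: sum the values of the yielded source elements; only
    number, integer, and boolean values are allowed."""
    if not all(isinstance(value, (int, float)) for value in values):
        raise TypeError("source values must be numeric or boolean")
    return sum(values)


# e.g. file sizes gathered via the Dataset(files)>File path:
total_size = sum_content_values([1024, 2048, 512])
```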
- The transformation shall enable copying the values from one content location (potentially of a referenced class) to another content location.
- The type of the resulting subschema shall be that of the source subschema.
Example config:

```yaml
- class_name: Dataset
  target_content:
    property_name: dac_email
  source:
    relation_path: Dataset(data_access_policy)>DataAccessPolicy(data_access_committee)>DataAccessCommittee
    content_path: email
```
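Parsing such a relation path string can be sketched as follows. This is a hypothetical parser covering only forward (`>`) hops of the syntax shown in the examples; the real path grammar may be richer:

```python
def parse_relation_path(path: str) -> list[tuple[str, str, str]]:
    """Sketch: split a relation path string such as 'Dataset(files)>File'
    into (source_class, relation, target_class) hops."""
    parts = path.split(">")
    hops = []
    for index in range(len(parts) - 1):
        # e.g. 'Dataset(files)' -> ('Dataset', 'files')
        source_class, relation = parts[index].rstrip(")").split("(")
        # the target class is the next segment's class name:
        target_class = parts[index + 1].split("(")[0]
        hops.append((source_class, relation, target_class))
    return hops


hops = parse_relation_path(
    "Dataset(data_access_policy)>DataAccessPolicy"
    "(data_access_committee)>DataAccessCommittee"
)
```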
```yaml
- class_name: ClassName
  reference_name: reference_name
```
As the name suggests, this transformation follows the conventions used in the aforementioned transformations. It changes the schema and data by removing the specified property from the content schema and data.
Example config:

```yaml
- class_name: ClassName
  content_path: property.nested_property
```
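The data side of this deletion can be sketched as a small helper (hypothetical; the real transformation also removes the property from the content schema):

```python
def delete_content_property(content: dict, content_path: str) -> None:
    """Sketch: remove the property addressed by a dotted `content_path`
    from a content dict."""
    *parent_parts, leaf = content_path.split(".")
    node = content
    for part in parent_parts:
        node = node[part]
    del node[leaf]


content = {"property": {"nested_property": 1, "other_property": 2}}
delete_content_property(content, "property.nested_property")
```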
Number of sprints required: 1
Number of developers required: multiple