Berkeley schema ingest #1295

naglepuff · 2024-07-11T14:03:14Z

Breaking changes

This breaks the ingest process for any nmdc-schema version earlier than 11.0.0. Deployments of the NMDC Data Portal that use this code base and ingest must point ingest at a database that conforms to the "Berkeley Schema".

Changes

This set of changes updates ingest to accept data from a source Mongo database running version >=11.0.0 of the NMDC Schema (Berkeley). It does not attempt to update the data portal in any other way. From an end-user perspective, the changes here should have no impact: users should see the same data, presented in the same way, as data ingested from an older version of the Mongo database. It also does not attempt to rename endpoints, files, classes, functions, or variables to match the new schema (e.g. the term "omics processing" will still exist in our code).

Some specific changes include:

Update link from Biosample and Data Generation to Study (part_of -> associated_studies)
Processes that used to exist across several collections now exist in a single collection, so associating a Data Generation record with multiple Biosample inputs has been simplified
Similarly, workflow activities have been squashed into a single collection. Now, when building our tables for these workflow executions, we query a single collection, filtering on the type field.
Instead of being able to extract the instrument name from an Omics Processing record itself, we have to take the instrument_id field from a Data Generation object and do a lookup in the instrument_set collection.
The workflow enum has been updated. This should greatly reduce the number of warnings logged during ingest.
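
As a rough illustration of the single-collection approach to workflow executions, the sketch below buckets documents by their type field. It uses plain dictionaries in place of a MongoDB cursor; the helper name `group_by_type`, the document IDs, and the type values are hypothetical and not taken from this PR.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


def group_by_type(documents: Iterable[Document]) -> Dict[str, List[Document]]:
    """Bucket workflow execution documents by their `type` field."""
    grouped: Dict[str, List[Document]] = defaultdict(list)
    for doc in documents:
        # Documents without a type go under a sentinel key so they can be reported.
        grouped[doc.get("type", "unknown")].append(doc)
    return dict(grouped)


# Hypothetical documents standing in for the single workflow collection.
workflow_docs = [
    {"id": "wf-1", "type": "nmdc:MetagenomeAssembly"},
    {"id": "wf-2", "type": "nmdc:MetagenomeAnnotation"},
    {"id": "wf-3", "type": "nmdc:MetagenomeAssembly"},
]
assemblies = group_by_type(workflow_docs)["nmdc:MetagenomeAssembly"]
```

Against a live database the same effect would more likely come from a filtered query on the type field; the in-memory version is only meant to show the shape of the logic.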

nmdc_server/ingest/biosample.py

nmdc_server/ingest/all.py

nmdc_server/schemas.py

nmdc_server/data_object_filters.py

New slots were added to the NMDC schema that break our current usage of the function.

The NMDC schema changed the slot that links a Biosample to a study. Previously the relationship was contained in the `part_of` slot on class `Biosample`. That slot has been renamed to `associated_studies`. This change also temporarily disables the backup link through omics_processing, as that relationship has changed in a more complex way.

This isn't necessary since `associated_studies` is required and has cardinality of 1..*.
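
A minimal sketch of reading the study link straight from the renamed slot, assuming a biosample is available as a plain dictionary. The helper name `study_ids_for` is hypothetical. Because the slot is described above as required with cardinality 1..*, the sketch raises on a missing value instead of falling back to another lookup.

```python
from typing import Any, Dict, List


def study_ids_for(biosample: Dict[str, Any]) -> List[str]:
    """Return the study IDs listed in a biosample's `associated_studies` slot."""
    studies = biosample.get("associated_studies") or []
    if not studies:
        # The slot is required (1..*), so an empty value indicates bad source data.
        raise ValueError(f"Biosample {biosample.get('id')} has no associated_studies")
    return list(studies)
```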

These records were formerly known as omics_processing records, and some small schema tweaks needed to be reflected in ingest. Note how the process of obtaining input biosamples is simpler since we now only need to query one collection of processes. This is a result of the change in database structure that puts related objects into the same collection.
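
To make the single-collection lookup concrete, here is a hedged sketch of walking back from a Data Generation record to its input biosamples. It assumes each process document carries `has_input` and `has_output` lists of IDs; those field names, the function name `find_input_biosamples`, and the sample data are assumptions for illustration and do not reproduce the ingest code in this PR.

```python
from typing import Any, Dict, Iterable, List, Set

Document = Dict[str, Any]


def find_input_biosamples(
    data_generation: Document,
    processes: Iterable[Document],
    biosample_ids: Set[str],
) -> Set[str]:
    """Follow has_input/has_output links back to the originating biosamples."""
    # Index every process by the IDs it outputs, so one collection serves all lookups.
    by_output: Dict[str, List[Document]] = {}
    for process in processes:
        for output_id in process.get("has_output", []):
            by_output.setdefault(output_id, []).append(process)

    found: Set[str] = set()
    seen: Set[str] = set()
    stack = list(data_generation.get("has_input", []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated inputs
        seen.add(current)
        if current in biosample_ids:
            found.add(current)
            continue
        # Not a biosample: step back through whichever processes produced it.
        for process in by_output.get(current, []):
            stack.extend(process.get("has_input", []))
    return found


# Hypothetical data: two biosamples pooled into one processed sample.
processes = [{"id": "p1", "has_input": ["bs-1", "bs-2"], "has_output": ["ps-1"]}]
inputs = find_input_biosamples({"id": "dg-1", "has_input": ["ps-1"]}, processes, {"bs-1", "bs-2"})
```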

Note that this should be updated in the SQL schema to be optional/nullable.

Note that in the future we might want to make instrument a fully fledged model in our database.
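
Until instrument becomes its own model, the lookup described in the PR summary could be sketched roughly as below. The `instrument_id` field and the `instrument_set` collection come from the description above; the `id` and `name` keys on instrument documents and both helper names are assumptions.

```python
from typing import Any, Dict, Iterable, Optional

Document = Dict[str, Any]


def build_instrument_index(instrument_set: Iterable[Document]) -> Dict[str, Optional[str]]:
    """Map instrument ID to instrument name for quick lookups during ingest."""
    return {inst["id"]: inst.get("name") for inst in instrument_set}


def instrument_name_for(
    data_generation: Document, index: Dict[str, Optional[str]]
) -> Optional[str]:
    """Resolve a Data Generation record's instrument name, or None if unavailable."""
    instrument_id = data_generation.get("instrument_id")
    if instrument_id is None:
        return None
    return index.get(instrument_id)
```

Building the index once per ingest run avoids one database round trip per Data Generation record.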

naglepuff force-pushed the berkeley-schema-ingest branch from 1e075ce to c9e6e62 July 31, 2024 19:09

naglepuff force-pushed the berkeley-schema-migration branch from 7da694b to 481461a August 1, 2024 18:38

naglepuff marked this pull request as ready for review August 1, 2024 19:09

naglepuff requested a review from marySalvi August 1, 2024 19:09

naglepuff linked an issue Aug 1, 2024 that may be closed by this pull request

Update ingest for Berkeley Schema (nmdc-schema v11) #1291

Closed

naglepuff requested a review from jeffbaumes August 5, 2024 16:35