Berkeley schema ingest #1295

Merged (13 commits, Aug 7, 2024)

@naglepuff naglepuff commented Jul 11, 2024

Fix #1291

Breaking changes

This breaks the ingest process for any nmdc-schema version <11.0.0. For deployments of the NMDC Data Portal using this code base and ingest, ingest must be pointed at a database that conforms to the "Berkeley Schema".

Changes

This set of changes updates ingest so that it can accept data from a source mongo database running version >=11.0.0 of the NMDC Schema (Berkeley). It does not attempt to update the data portal in any other way. From an end-user perspective, the changes here should have no impact. Users should see the same data, presented in the same way as data ingested from an older version of the mongo database. This PR also does not attempt to rename endpoints, files, classes, functions, or variables to match the new schema (e.g. the term "omics processing" will still exist in our code).

Some specific changes include:

  • Update link from Biosample and Data Generation to Study (part_of -> associated_studies)
  • Processes that used to exist across several collections now exist in a single collection, so associating a Data Generation record with multiple Biosample inputs has been simplified
  • Similarly, workflow activities have been squashed into a single collection. Now, when building our tables for these workflow executions, we query a single collection, filtering on the type field.
  • Instead of being able to extract the instrument name from an Omics Processing record itself, we have to take the instrument_id field from a Data Generation object and do a lookup in the instrument_set collection.
  • The workflow enum has been updated. This should greatly reduce the number of warnings logged during ingest.
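
The single-collection query described in the list above can be sketched in a few lines. This is a minimal illustration, not code from this PR: the helper `filter_by_type`, the in-memory list standing in for the mongo collection, and the example `type` values are all assumptions.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def filter_by_type(records: Iterable[Record], type_name: str) -> List[Record]:
    """Return only the records whose `type` field equals `type_name`."""
    return [record for record in records if record.get("type") == type_name]


# Hypothetical documents standing in for the single workflow collection.
workflow_records = [
    {"id": "wf-1", "type": "nmdc:MetagenomeAssembly"},
    {"id": "wf-2", "type": "nmdc:ReadQcAnalysis"},
    {"id": "wf-3", "type": "nmdc:MetagenomeAssembly"},
]

assemblies = filter_by_type(workflow_records, "nmdc:MetagenomeAssembly")
assembly_ids = [record["id"] for record in assemblies]
```

Against a live database the same idea would presumably be a `find({"type": ...})` query on the collection rather than a list comprehension.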

@naglepuff naglepuff marked this pull request as ready for review August 1, 2024 19:09
@naglepuff naglepuff linked an issue Aug 1, 2024 that may be closed by this pull request
Review thread on nmdc_server/ingest/all.py (outdated, resolved)
New slots were added to the NMDC schema that break our current usage of the
function.
NMDC schema changed the slot that links a Biosample to a study.
Previously the relationship was contained in the `part_of` slot on class
`Biosample`. That slot has been renamed to `associated_studies`. Also,
temporarily disable the backup link through omics_processing, as that
relationship has changed in a more complex way.
This isn't necessary since `associated_studies` is required and has
cardinality of 1..*.
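
Because `associated_studies` is required with cardinality 1..*, the study link can be read straight off the Biosample record. The following is a rough sketch under that assumption, not the PR's actual code: the function name `study_ids_for_biosample` and the example identifiers are hypothetical.

```python
from typing import Any, Dict, List


def study_ids_for_biosample(biosample: Dict[str, Any]) -> List[str]:
    """Read study ids from `associated_studies` (formerly `part_of`).

    Raises ValueError if the slot is missing or empty, since the schema
    is described as requiring at least one associated study.
    """
    studies = biosample.get("associated_studies") or []
    if not studies:
        raise ValueError(
            f"Biosample {biosample.get('id')!r} has no associated_studies"
        )
    return list(studies)


# Hypothetical Biosample document.
example_biosample = {
    "id": "biosample-1",
    "associated_studies": ["study-a", "study-b"],
}
example_study_ids = study_ids_for_biosample(example_biosample)
```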
For the records formerly known as omics_processing records, some small
schema tweaks needed to be reflected in ingest. Note how the
process of obtaining input biosamples is simpler since we now only need
to query one collection of processes. This is a result of the change in
database structure that puts related objects into the same collection.
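
One way to picture the simplified biosample lookup is a backward walk over a single list of processes. This sketch is an assumption about the general approach rather than the code in this PR: it assumes each process document carries `has_input` and `has_output` lists of ids, and the function name and sample ids are hypothetical.

```python
from collections import deque
from typing import Any, Dict, Iterable, List, Set


def find_input_biosamples(
    start_inputs: Iterable[str],
    processes: List[Dict[str, Any]],
    biosample_ids: Set[str],
) -> Set[str]:
    """Walk backwards from a record's inputs to the biosamples behind them."""
    # Index every process by the ids it produces, so each step is one lookup.
    produced_by: Dict[str, List[Dict[str, Any]]] = {}
    for process in processes:
        for output_id in process.get("has_output", []):
            produced_by.setdefault(output_id, []).append(process)

    found: Set[str] = set()
    seen: Set[str] = set()
    queue = deque(start_inputs)
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        if current in biosample_ids:
            found.add(current)
            continue
        # Not a biosample: follow the processes that produced this object.
        for process in produced_by.get(current, []):
            queue.extend(process.get("has_input", []))
    return found


# Hypothetical data: two biosamples pooled, then processed once more.
example_processes = [
    {"id": "proc-1", "has_input": ["bsm-1", "bsm-2"], "has_output": ["sample-1"]},
    {"id": "proc-2", "has_input": ["sample-1"], "has_output": ["sample-2"]},
]
example_inputs = find_input_biosamples(
    ["sample-2"], example_processes, {"bsm-1", "bsm-2", "bsm-3"}
)
```

With all processes in one collection, the index can be built from a single query, which is the simplification the commit message points to.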
Note that this should be updated in the SQL schema to be
optional/nullable.
Note that in the future we might want to make instrument a fully
fledged model in our database.
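
Until instrument becomes its own model, the lookup from the change list (take `instrument_id` from a Data Generation object and resolve it in `instrument_set`) might look roughly like the sketch below. The helper name, the dictionary standing in for `instrument_set`, and the `name` field on instrument documents are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def instrument_name_for(
    data_generation: Dict[str, Any],
    instruments_by_id: Dict[str, Dict[str, Any]],
) -> Optional[str]:
    """Resolve a Data Generation record's instrument name, or None if unknown."""
    instrument_id = data_generation.get("instrument_id")
    if instrument_id is None:
        return None
    instrument = instruments_by_id.get(instrument_id)
    if instrument is None:
        return None
    return instrument.get("name")


# Hypothetical contents of the instrument_set collection, keyed by id.
example_instruments = {
    "inst-1": {"id": "inst-1", "name": "Example Sequencer"},
}
example_name = instrument_name_for(
    {"id": "dg-1", "instrument_id": "inst-1"}, example_instruments
)
```

Returning `None` instead of raising keeps ingest going when a record has no instrument or points at an id that is not in the collection.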
@naglepuff naglepuff merged commit 7edb861 into berkeley-schema-migration Aug 7, 2024
2 checks passed
@naglepuff naglepuff deleted the berkeley-schema-ingest branch August 7, 2024 20:05
@naglepuff naglepuff mentioned this pull request Sep 5, 2024
8 tasks
Development

Successfully merging this pull request may close these issues.

Update ingest for Berkeley Schema (nmdc-schema v11)
3 participants