Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First class files #657

Closed
wants to merge 48 commits into from
Closed

First class files #657

wants to merge 48 commits into from

Conversation

nevoodoo
Copy link
Collaborator

@nevoodoo nevoodoo commented Jan 23, 2024

Description

Currently within metamist, we record paths in a few different ways, but inherently they’re stored in a text field. This means it’s difficult to work out:

  • Where files are used in the centre,
  • Metadata about specific files

The way we represent analysis and outputs has also evolved through the usage of metamist, and this is a good time to reflect on our future usage.

This replaces the current Analysis model to use a proper File class for it's output field instead of a str. This will allow us to support multiple outputs in a structured way along with useful file stats.

This issue has also been raised before with some prior work being done on it a while ago on #376

Proposal

We should model our File off the CWL definition (https://www.commonwl.org/v1.2/CommandLineTool.html#File). This has been partially implemented in the parsers (github:metamist/parser/generic_parser.py#L1328-L1340). It’s probably reasonable to make this a table with foreign keys out to place.

We should additionally store whether a file exists or not, it will provide more utility than removing references to specific files.

Consider how to reference secondaryFiles to ensure efficient queries. It’s safe to assume that we’ll only see one level of nesting as secondaryFiles, and secondaryFiles won’t be linked directly to analysis / assays.

References to files

Analysis

  • Currently we link one output to an analysis object
  • As part of this, it would be great to support multiple outputs per analysis (potentially a JSON?)
    Assays
  • reads?: A list of reads that were provided as part of the experiment
  • variants?: A file with variants specified. This is likely more legacy
  • *?: We should have flexibility to provide arbitrary files to assays

Migration

We’ll need to write a migration that takes the existing references to files, and migrates them to this new structure. This can act directly on a database, and could be run manually.

Extension

Later, we should capture pub/sub notifications for google cloud storage buckets, so we can mark files as archived / deleted to automatically deprecate analysis.

Special Considerations

  • .mt files need to be treated differently as they aren't technically files but rather directories on cloud storage. This means it will not have a checksum. We should also store meta for the .mt derived from the matrix-type key under the corresponding metadata.json.gz file. - ON HOLD for now
  • Once we support a nested object containing the multiple outputs for each analysis, we should ideally be tracking the structure of the outputs received so that for all queries, we can restructure the response json based off this.
  • The current output field will be deprecated in favour of the newer outputs field which will accept the nested object containing multiple outputs. This is to avoid any confusion regarding interchangeability between the two fields.
  • Secondary files will be supplied during the creation/update of the analysis/assay as part of the nested object in the outputs field.
  • Files are never deleted, even if the relationship between the file and the analysis/assay is broken. We simply remove the connection between the two entities.
  • For output values that currently do not have proper gs prefixes, a separate output column will be added to the analysis_file relationship to store this, as a proper file will not be created without a valid checksum.
  • The checksum implementation requires use of the crc32c hash as MD5 does not support multi-part uploads, and a lot of files may be missing the checksum on cloud storage. We advise you to use Google's Python crc32c wrapper to validate the file checksum in your code.

Changes

  • New File model
  • Liquibase XML to new schema
  • Migration Script for existing objects
  • API/GraphQL support for new model
  • Test cases written
  • Update dbdiagrams UML

@codecov-commenter
Copy link

codecov-commenter commented Jan 23, 2024

Codecov Report

Attention: Patch coverage is 78.03468% with 76 lines in your changes are missing coverage. Please review.

Project coverage is 76.64%. Comparing base (45d2245) to head (abe04fd).

Files Patch % Lines
db/python/tables/analysis.py 46.15% 42 Missing ⚠️
db/python/layers/analysis.py 41.17% 20 Missing ⚠️
models/models/output_file.py 89.02% 9 Missing ⚠️
db/python/tables/output_file.py 94.52% 4 Missing ⚠️
test/data/generate_data.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #657      +/-   ##
==========================================
+ Coverage   76.47%   76.64%   +0.16%     
==========================================
  Files         143      145       +2     
  Lines       11532    11820     +288     
==========================================
+ Hits         8819     9059     +240     
- Misses       2713     2761      +48     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

illusional and others added 25 commits January 23, 2024 19:18
* Add path / ip to audit_log fields

* Bump version: 6.5.1 → 6.6.0

* Add more params to projectless connection

* Linting

* Remove now missing author flag, has been moved to audit logs

* Create audit_log generic stuff for analysis

* Add analysis-runner entries to generate_data

* Implement more fields onto audit_log

* Fix now missing author on front-end

* Linting fixes

---------

Co-authored-by: Michael Franklin <illusional@users.noreply.github.com>
* Update create_md5s.py script to use billing project in gsutil commands

* Bump version: 6.6.2 → 6.6.3

* Fix missing line

* Use GCP billing project not Hail billing project

* Linting

* Revert bumpversion, ignore mypy errors in api/server.py
…st-class-files-v2

Pulling changes from dev into the branch
@milo-hyben
Copy link
Contributor

Hey @nevoodoo, I had briefly look at the sql schema and I am not big fan of table name 'file'. :-)
Wonder if we should call it e.g. analysis_file, there are already various analysis_* tables.

Copy link
Contributor

@milo-hyben milo-hyben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments.

@@ -45,7 +45,7 @@ class AnalysisModel(BaseModel):
type: str
status: AnalysisStatus
meta: dict[str, Any] | None = None
output: str | None = None
outputs: str | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not outputs be of type Union[dict, str] ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, somehow missed this since I've been interacting from the layer level directly! I've also just reviewed all other type annotations for the Analysis model to use Union, Optional instead of the | operator.

db/python/layers/analysis.py Outdated Show resolved Hide resolved
db/python/layers/analysis.py Outdated Show resolved Hide resolved
@@ -48,6 +50,7 @@ def __init__(self, connection: Connection):

self.sampt = SampleTable(connection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not big fan of abbreviations like this, I know this is a bit of legacy, so no drama :-)
'self.sample_table' is clearer, the same for self.analysis_table, self.output_file_table

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, I've updated all those now :)

db/python/tables/file.py Outdated Show resolved Hide resolved
models/models/file.py Outdated Show resolved Hide resolved
models/models/file.py Outdated Show resolved Hide resolved
scripts/20240124_migrate_output_to_file.py Outdated Show resolved Hide resolved
test/testbase.py Outdated Show resolved Hide resolved
@nevoodoo
Copy link
Collaborator Author

nevoodoo commented Feb 28, 2024

Hey @nevoodoo, I had briefly look at the sql schema and I am not big fan of table name 'file'. :-) Wonder if we should call it e.g. analysis_file, there are already various analysis_* tables.

You make a good point @milo-hyben, I have updated the schema now to reflect a better name :) I think output_file probably makes sense here as we can reuse it for other models that have an output_file.

@nevoodoo nevoodoo self-assigned this Feb 28, 2024
@nevoodoo nevoodoo added the enhancement New feature or request label Feb 28, 2024
@nevoodoo
Copy link
Collaborator Author

I'd like to add some more tests here, for example:

  • What happens if I add a valid filename as a str to the output(s) fields of the Analysis. In theory, it should just present this back to me as a string in the result. But if it is a valid pathname, the current logic would create an OutputFile record and as such, the results would be presented back as a JSON, and not a str

@nevoodoo
Copy link
Collaborator Author

nevoodoo commented Mar 5, 2024

I'd like to add some more tests here, for example:

  • What happens if I add a valid filename as a str to the output(s) fields of the Analysis. In theory, it should just present this back to me as a string in the result. But if it is a valid pathname, the current logic would create an OutputFile record and as such, the results would be presented back as a JSON, and not a str

I was right, this is in fact what happened, all patched now however, and tests added to capture this :)

@nevoodoo nevoodoo marked this pull request as ready for review March 5, 2024 01:42
@nevoodoo nevoodoo mentioned this pull request Jul 5, 2024
11 tasks
@nevoodoo
Copy link
Collaborator Author

nevoodoo commented Jul 5, 2024

Closing, #856 has fresher changes

@nevoodoo nevoodoo closed this Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants