First class files #657
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #657      +/-   ##
==========================================
+ Coverage   76.47%   76.64%   +0.16%
==========================================
  Files         143      145       +2
  Lines       11532    11820     +288
==========================================
+ Hits         8819     9059     +240
- Misses       2713     2761      +48
☔ View full report in Codecov by Sentry.
* Add path / ip to audit_log fields
* Bump version: 6.5.1 → 6.6.0
* Add more params to projectless connection
* Linting
* Remove now missing author flag, has been moved to audit logs
* Create audit_log generic stuff for analysis
* Add analysis-runner entries to generate_data
* Implement more fields onto audit_log
* Fix now missing author on front-end
* Linting fixes
---------
Co-authored-by: Michael Franklin <illusional@users.noreply.github.com>

* Update create_md5s.py script to use billing project in gsutil commands
* Bump version: 6.6.2 → 6.6.3
* Fix missing line
* Use GCP billing project not Hail billing project
* Linting
* Revert bumpversion, ignore mypy errors in api/server.py
…st-class-files-v2 Pulling changes from dev into the branch
e6a79f4 to afa6778
Hey @nevoodoo, I had a brief look at the SQL schema and I am not a big fan of the table name 'file'. :-)
I've left a few comments.
api/routes/analysis.py
Outdated
@@ -45,7 +45,7 @@ class AnalysisModel(BaseModel):
     type: str
     status: AnalysisStatus
     meta: dict[str, Any] | None = None
-    output: str | None = None
+    outputs: str | None = None
Shouldn't `outputs` be of type `Union[dict, str]`?
You're right, I somehow missed this since I've been interacting from the layer level directly! I've also just reviewed all other type annotations for the `Analysis` model to use `Union` / `Optional` instead of the `|` operator.
db/python/layers/analysis.py
Outdated
@@ -48,6 +50,7 @@ def __init__(self, connection: Connection):

        self.sampt = SampleTable(connection)
I am not a big fan of abbreviations like this. I know this is a bit of legacy, so no drama :-)
`self.sample_table` is clearer, and the same goes for `self.analysis_table` and `self.output_file_table`.
Good suggestion, I've updated all those now :)
You make a good point @milo-hyben, I have updated the schema now to reflect a better name :) I think
I'd like to add some more tests here, for example:
I was right, this is in fact what happened; it's all patched now, and tests have been added to capture this :)
…st-class-files-v2 Merge from dev
Closing, #856 has fresher changes
Description
Currently within metamist, we record paths in a few different ways, but inherently they’re stored in a text field. This means it’s difficult to work out:
The way we represent analyses and outputs has also evolved as metamist has been used, and this is a good time to reflect on our future usage.
This replaces the current `Analysis` model's `output` field, using a proper `File` class instead of a `str`. This will allow us to support multiple outputs in a structured way, along with useful file stats. This issue has also been raised before, with some prior work done on it a while ago in #376.
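For illustration only (the keys and shape here are assumptions, not a finalised schema), a nested `outputs` value might look something like this:

```python
# Illustrative only: one possible shape for a nested `outputs` value holding
# multiple outputs with basic file stats. Keys and fields are assumptions.
outputs = {
    'cram': {
        'location': 'gs://bucket/sample.cram',
        'checksum': 'crc32c-base64-value',
        'size': 123_456_789,
    },
    'cram_index': {
        'location': 'gs://bucket/sample.cram.crai',
    },
}
```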
Proposal
We should model our File off the CWL definition (https://www.commonwl.org/v1.2/CommandLineTool.html#File). This has been partially implemented in the parsers (github:metamist/parser/generic_parser.py#L1328-L1340). It’s probably reasonable to make this a table with foreign keys out to place.
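As a rough sketch (not the final metamist schema), a CWL-style File record could look something like this in Python; the field names mirror the CWL File definition, while `exists` and `parent_id` anticipate the two points below and are assumptions for illustration:

```python
# Hedged sketch of a CWL-style File record, not the final metamist schema.
# Field names follow https://www.commonwl.org/v1.2/CommandLineTool.html#File;
# `exists` and `parent_id` are illustrative additions only.
import dataclasses
from typing import Optional


@dataclasses.dataclass
class File:
    id: int
    location: str                    # e.g. 'gs://bucket/path/to/file.cram'
    basename: str                    # 'file.cram'
    dirname: str                     # 'gs://bucket/path/to'
    nameroot: str                    # 'file'
    nameext: str                     # '.cram'
    checksum: Optional[str] = None   # crc32c; may be absent (e.g. .mt directories)
    size: Optional[int] = None       # size in bytes
    exists: bool = True              # whether the object is still present in storage
    parent_id: Optional[int] = None  # one level of nesting for secondaryFiles
```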
We should additionally store whether a file exists or not, as this will provide more utility than removing references to specific files.
Consider how to reference secondaryFiles to ensure efficient queries. It’s safe to assume that we’ll only see one level of nesting as secondaryFiles, and secondaryFiles won’t be linked directly to analysis / assays.
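Given the one-level nesting assumption, one option is a self-referencing `parent_id` column (as in the sketch above, an assumed column name), so a single flat query can be regrouped in Python:

```python
# Hedged sketch: regroup a flat query result into primary files with their
# secondaryFiles, assuming a self-referencing `parent_id` column and only one
# level of nesting. Column names are illustrative, not the metamist schema.
from typing import Any


def group_secondary_files(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Attach each child row to its parent under a 'secondary_files' key."""
    primaries = {
        r['id']: {**r, 'secondary_files': []} for r in rows if r['parent_id'] is None
    }
    for r in rows:
        if r['parent_id'] is not None and r['parent_id'] in primaries:
            primaries[r['parent_id']]['secondary_files'].append(r)
    return list(primaries.values())


# Example usage with two rows from a single query:
rows = [
    {'id': 1, 'location': 'gs://bucket/sample.cram', 'parent_id': None},
    {'id': 2, 'location': 'gs://bucket/sample.cram.crai', 'parent_id': 1},
]
print(group_secondary_files(rows))
```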
References to files
- Analysis
- Assays
Migration
We’ll need to write a migration that takes the existing references to files and migrates them to this new structure. This can act directly on a database, and could be run manually.
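A very rough sketch of what such a manual migration could look like, assuming the `file` / `analysis_file` tables from the proposal; the table names, columns, and connection string are placeholders, not the actual metamist migration:

```python
# Hedged sketch of the manual migration described above: copy each existing
# text `analysis.output` into a `file` row and link it via `analysis_file`.
# Table / column names and the DB URL are placeholders, not the real schema.
import asyncio

import databases

DB_URL = 'mysql://localhost/metamist'  # placeholder connection string


async def migrate_outputs() -> None:
    db = databases.Database(DB_URL)
    await db.connect()
    rows = await db.fetch_all(
        'SELECT id, output FROM analysis WHERE output IS NOT NULL'
    )
    for row in rows:
        # note: execute() returning the new row id is backend-dependent (MySQL here)
        file_id = await db.execute(
            'INSERT INTO file (location) VALUES (:location)',
            {'location': row['output']},
        )
        await db.execute(
            'INSERT INTO analysis_file (analysis_id, file_id) '
            'VALUES (:analysis_id, :file_id)',
            {'analysis_id': row['id'], 'file_id': file_id},
        )
    await db.disconnect()


if __name__ == '__main__':
    asyncio.run(migrate_outputs())
```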
Extension
Later, we should capture Pub/Sub notifications for Google Cloud Storage buckets, so we can mark files as archived / deleted and automatically deprecate the corresponding analyses.
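As a sketch of that extension, a small worker could consume the bucket's Pub/Sub notifications and flag deleted objects; the subscription name and the `mark_file_deleted` helper are assumptions for illustration, not existing metamist code:

```python
# Hedged sketch: consume GCS OBJECT_DELETE notifications from a Pub/Sub
# subscription and flag the matching file row as no longer existing.
import json

from google.cloud import pubsub_v1

SUBSCRIPTION = 'projects/my-project/subscriptions/gcs-file-events'  # placeholder


def mark_file_deleted(location: str) -> None:
    # placeholder: UPDATE file SET exists = FALSE WHERE location = :location
    print(f'would mark {location} as deleted')


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # GCS notifications carry the event type as a message attribute
    if message.attributes.get('eventType') == 'OBJECT_DELETE':
        obj = json.loads(message.data)  # JSON_API_V1 object resource
        mark_file_deleted(f"gs://{obj['bucket']}/{obj['name']}")
    message.ack()


if __name__ == '__main__':
    subscriber = pubsub_v1.SubscriberClient()
    subscriber.subscribe(SUBSCRIPTION, callback=callback).result()
```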
Special Considerations
- `.mt` files need to be treated differently as they aren't technically files but rather directories on cloud storage. This means they will not have a checksum. We should also store `meta` for the `.mt`, derived from the `matrix-type` key under the corresponding `metadata.json.gz` file. - ON HOLD for now
- The `output` field will be deprecated in favour of the newer `outputs` field, which will accept the nested object containing multiple outputs. This is to avoid any confusion regarding interchangeability between the two fields.
- For entries in the `outputs` field without `gs` prefixes, a separate `output` column will be added to the `analysis_file` relationship to store these, as a proper file will not be created without a valid checksum.
- We will use the `crc32c` hash, as MD5 does not support multi-part uploads and a lot of files may be missing the checksum on cloud storage. We advise you to use Google's Python crc32c wrapper to validate the file checksum in your code (see the sketch below).
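A minimal sketch of that validation using the `google-crc32c` wrapper and the checksum GCS reports; the function names are illustrative and not part of metamist:

```python
# Hedged sketch: compare a local file's crc32c with the checksum GCS stores
# for the corresponding blob. Uses the `google-crc32c` and
# `google-cloud-storage` packages; function names are illustrative.
import base64

import google_crc32c
from google.cloud import storage


def local_crc32c(path: str) -> str:
    """Return the base64-encoded big-endian crc32c of a local file,
    matching the format GCS reports in blob.crc32c."""
    checksum = google_crc32c.Checksum()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            checksum.update(chunk)
    return base64.b64encode(checksum.digest()).decode('utf-8')


def matches_gcs(bucket_name: str, blob_name: str, local_path: str) -> bool:
    """True if the object exists and its stored crc32c matches the local file."""
    # get_blob() returns None when the object doesn't exist
    blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
    return blob is not None and blob.crc32c == local_crc32c(local_path)
```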
Changes