First class files #657

nevoodoo · 2024-01-23T08:16:50Z

Description

Currently within metamist, we record paths in a few different ways, but inherently they’re stored in a text field. This means it’s difficult to work out:

where files are used in the centre, and
metadata about specific files.

The way we represent analysis and outputs has also evolved through the usage of metamist, and this is a good time to reflect on our future usage.

This PR changes the current Analysis model to use a proper File class for its output field instead of a str. This will allow us to support multiple outputs in a structured way, along with useful file stats.

This issue has been raised before, with some prior work done a while ago in #376.

Proposal

We should model our File off the CWL definition (https://www.commonwl.org/v1.2/CommandLineTool.html#File). This has been partially implemented in the parsers (github:metamist/parser/generic_parser.py#L1328-L1340). It’s probably reasonable to make this a table with foreign keys out to place.

We should additionally store whether a file exists or not; this will provide more utility than removing references to specific files.

Consider how to reference secondaryFiles to ensure efficient queries. It’s safe to assume that we’ll only see one level of nesting as secondaryFiles, and secondaryFiles won’t be linked directly to analysis / assays.
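As a rough illustration of the proposal above, here is a minimal sketch of a file record that borrows a subset of the CWL File field names (path, basename, dirname, nameroot, nameext, checksum, size, secondaryFiles). The class name `OutputFileSketch`, the `exists` flag, and the helper methods are assumptions made for this example and are not taken from the metamist code. It enforces the single level of secondary-file nesting described above.

```python
import posixpath
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OutputFileSketch:
    """Hypothetical file record using a subset of the CWL File field names."""

    path: str
    basename: str
    dirname: str
    nameroot: str
    nameext: str
    checksum: Optional[str] = None  # e.g. a crc32c digest, when known
    size: Optional[int] = None  # size in bytes, when known
    exists: bool = True  # assumption: kept as a flag instead of deleting rows
    secondary_files: List["OutputFileSketch"] = field(default_factory=list)

    @classmethod
    def from_path(cls, path: str, **kwargs) -> "OutputFileSketch":
        """Derive the CWL-style name parts from a path or gs:// URL."""
        basename = posixpath.basename(path)
        nameroot, nameext = posixpath.splitext(basename)
        return cls(
            path=path,
            basename=basename,
            dirname=posixpath.dirname(path),
            nameroot=nameroot,
            nameext=nameext,
            **kwargs,
        )

    def add_secondary(self, other: "OutputFileSketch") -> None:
        """Attach a secondary file, allowing only one level of nesting."""
        if other.secondary_files:
            raise ValueError("secondary files may not have secondary files")
        self.secondary_files.append(other)


cram = OutputFileSketch.from_path("gs://bucket/dir/sample.cram", size=10)
cram.add_secondary(OutputFileSketch.from_path("gs://bucket/dir/sample.cram.crai"))
```

Keeping the secondary files as a flat list on the parent is only one option; in a relational schema the same constraint could be expressed with a nullable parent reference.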

References to files

Analysis

Currently we link one output to an analysis object
As part of this, it would be great to support multiple outputs per analysis (potentially a JSON?)

Assays

reads?: A list of reads that were provided as part of the experiment
variants?: A file with variants specified. This is likely more legacy
*?: We should have flexibility to provide arbitrary files to assays
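To make the idea of multiple outputs per analysis more concrete, the sketch below flattens a nested outputs object into (structure, path) rows and rebuilds it again. The key names (`cram`, `secondary_files`, `qc`), the dotted structure string, and both function names are assumptions for illustration only; the real schema may differ. One limitation of this approach is that keys containing a dot would not round-trip.

```python
from typing import Any, Dict, Iterator, List, Tuple


def flatten_outputs(
    outputs: Dict[str, Any], prefix: Tuple[str, ...] = ()
) -> Iterator[Tuple[str, str]]:
    """Yield (dotted structure, path) pairs for every leaf string."""
    for key, value in outputs.items():
        if isinstance(value, dict):
            yield from flatten_outputs(value, prefix + (key,))
        else:
            yield ".".join(prefix + (key,)), value


def rebuild_outputs(rows: List[Tuple[str, str]]) -> Dict[str, Any]:
    """Reconstruct the nested object from the flattened rows."""
    result: Dict[str, Any] = {}
    for structure, path in rows:
        keys = structure.split(".")
        node = result
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = path
    return result


example_outputs = {
    "cram": {
        "path": "gs://bucket/sample.cram",
        "secondary_files": {"crai": {"path": "gs://bucket/sample.cram.crai"}},
    },
    "qc": {"path": "gs://bucket/qc.json"},
}
rows = list(flatten_outputs(example_outputs))
```

Storing the structure string next to each file link is one way to return the same JSON shape that was originally submitted.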

Migration

We’ll need to write a migration that takes the existing references to files, and migrates them to this new structure. This can act directly on a database, and could be run manually.

Extension

Later, we should capture pub/sub notifications for google cloud storage buckets, so we can mark files as archived / deleted to automatically deprecate analysis.
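A small sketch of how such notifications might be interpreted follows. It relies on the `eventType`, `bucketId` and `objectId` message attributes that Cloud Storage attaches to Pub/Sub notifications; the status strings and the function name are assumptions for this example, and no subscriber or database code is shown.

```python
from typing import Mapping, Optional, Tuple

# Assumption: the status labels are placeholders, not metamist values.
EVENT_TO_STATUS = {
    "OBJECT_DELETE": "deleted",
    "OBJECT_ARCHIVE": "archived",
}


def file_status_update(attributes: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Map notification attributes to (gs path, new status), or None to ignore."""
    status = EVENT_TO_STATUS.get(attributes.get("eventType", ""))
    if status is None:
        return None
    path = f"gs://{attributes['bucketId']}/{attributes['objectId']}"
    return path, status


update = file_status_update(
    {"eventType": "OBJECT_DELETE", "bucketId": "bucket", "objectId": "dir/a.cram"}
)
```

Events such as OBJECT_FINALIZE are ignored here, since only archival and deletion are relevant to deprecating an analysis.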

Special Considerations

.mt files need to be treated differently, as they aren't technically files but rather directories on cloud storage. This means they will not have a checksum. We should also store meta for the .mt, derived from the matrix-type key under the corresponding metadata.json.gz file. (ON HOLD for now.)
Once we support a nested object containing the multiple outputs for each analysis, we should ideally be tracking the structure of the outputs received so that for all queries, we can restructure the response json based off this.
The current output field will be deprecated in favour of the newer outputs field which will accept the nested object containing multiple outputs. This is to avoid any confusion regarding interchangeability between the two fields.
Secondary files will be supplied during the creation/update of the analysis/assay as part of the nested object in the outputs field.
Files are never deleted, even if the relationship between the file and the analysis/assay is broken. We simply remove the connection between the two entities.
For output values that currently do not have proper gs prefixes, a separate output column will be added to the analysis_file relationship to store this, as a proper file will not be created without a valid checksum.
The checksum implementation requires the crc32c hash, as MD5 does not support multi-part uploads, so a lot of files on cloud storage may be missing an MD5 checksum. We advise using Google's Python crc32c wrapper to validate the file checksum in your code.
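To show what validating a crc32c checksum involves, here is a dependency-free sketch of the CRC-32C (Castagnoli) algorithm, together with the base64, big-endian encoding that Cloud Storage uses when reporting an object's crc32c. This pure-Python version is for illustration only and is slow on large files; in practice the Google wrapper recommended above would be used instead. The function names are assumptions for this example.

```python
import base64

_POLY = 0x82F63B78  # reflected CRC-32C (Castagnoli) polynomial


def _make_table() -> list:
    """Precompute the 256-entry lookup table for byte-wise updates."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC-32C of data; pass a previous result to continue a stream."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(data: bytes) -> str:
    """Encode the checksum as base64 of its 4 big-endian bytes."""
    return base64.b64encode(crc32c(data).to_bytes(4, "big")).decode("ascii")


def matches_reported_checksum(data: bytes, reported: str) -> bool:
    """Compare local bytes against a checksum string reported by storage."""
    return crc32c_base64(data) == reported
```

Because the checksum can be continued from a previous value, a file can be validated in chunks without loading it fully into memory.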

Changes

codecov-commenter · 2024-01-23T08:18:26Z

Codecov Report

Attention: Patch coverage is 78.03468%, with 76 lines in your changes missing coverage. Please review.

Project coverage is 76.64%. Comparing base (45d2245) to head (abe04fd).

Files	Patch %	Lines
db/python/tables/analysis.py	46.15%	42 Missing ⚠️
db/python/layers/analysis.py	41.17%	20 Missing ⚠️
models/models/output_file.py	89.02%	9 Missing ⚠️
db/python/tables/output_file.py	94.52%	4 Missing ⚠️
test/data/generate_data.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #657      +/-   ##
==========================================
+ Coverage   76.47%   76.64%   +0.16%     
==========================================
  Files         143      145       +2     
  Lines       11532    11820     +288     
==========================================
+ Hits         8819     9059     +240     
- Misses       2713     2761      +48

* Add path / ip to audit_log fields
* Bump version: 6.5.1 → 6.6.0
* Add more params to projectless connection
* Linting
* Remove now missing author flag, has been moved to audit logs
* Create audit_log generic stuff for analysis
* Add analysis-runner entries to generate_data
* Implement more fields onto audit_log
* Fix now missing author on front-end
* Linting fixes
---------
Co-authored-by: Michael Franklin <illusional@users.noreply.github.com>

* Update create_md5s.py script to use billing project in gsutil commands
* Bump version: 6.6.2 → 6.6.3
* Fix missing line
* Use GCP billing project not Hail billing project
* Linting
* Revert bumpversion, ignore mypy errors in api/server.py

milo-hyben · 2024-02-26T23:44:31Z

Hey @nevoodoo, I had a brief look at the SQL schema and I'm not a big fan of the table name 'file'. :-)
I wonder if we should call it e.g. analysis_file; there are already various analysis_* tables.

milo-hyben

I've left a few comments.

milo-hyben · 2024-02-28T05:02:13Z

api/routes/analysis.py

@@ -45,7 +45,7 @@ class AnalysisModel(BaseModel):
    type: str
    status: AnalysisStatus
    meta: dict[str, Any] | None = None
-    output: str | None = None
+    outputs: str | None = None


Shouldn't outputs be of type Union[dict, str]?

You're right, somehow missed this since I've been interacting from the layer level directly! I've also just reviewed all other type annotations for the Analysis model to use Union, Optional instead of the | operator.
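For reference, the annotation discussed in this thread could look roughly like the sketch below. It uses a standard-library dataclass as a stand-in for the Pydantic `AnalysisModel` so that it runs without extra dependencies; the class name `AnalysisModelSketch`, the enum values, and the `__post_init__` check are assumptions for illustration, not the code in api/routes/analysis.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional, Union


class AnalysisStatus(Enum):
    # Placeholder values; the real enum is defined in metamist.
    QUEUED = "queued"
    COMPLETED = "completed"


@dataclass
class AnalysisModelSketch:
    type: str
    status: AnalysisStatus
    meta: Optional[Dict[str, Any]] = None
    # Either a nested object of outputs or a plain string.
    outputs: Optional[Union[Dict[str, Any], str]] = None

    def __post_init__(self) -> None:
        # Dataclasses do not validate types, so check explicitly.
        if self.outputs is not None and not isinstance(self.outputs, (dict, str)):
            raise TypeError("outputs must be a dict, a str, or None")


as_str = AnalysisModelSketch("cram", AnalysisStatus.COMPLETED, outputs="gs://b/a.cram")
as_dict = AnalysisModelSketch(
    "cram", AnalysisStatus.COMPLETED, outputs={"cram": {"path": "gs://b/a.cram"}}
)
```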

milo-hyben · 2024-02-28T05:05:44Z

db/python/layers/analysis.py

@@ -48,6 +50,7 @@ def __init__(self, connection: Connection):

        self.sampt = SampleTable(connection)


I'm not a big fan of abbreviations like this. I know this is a bit of legacy, so no drama :-)
'self.sample_table' is clearer; the same goes for self.analysis_table and self.output_file_table.

Good suggestion, I've updated all those now :)

db/python/tables/file.py

models/models/file.py

scripts/20240124_migrate_output_to_file.py

test/testbase.py

nevoodoo · 2024-02-28T07:07:44Z

Hey @nevoodoo, I had a brief look at the SQL schema and I'm not a big fan of the table name 'file'. :-) I wonder if we should call it e.g. analysis_file; there are already various analysis_* tables.

You make a good point @milo-hyben, I have updated the schema now to reflect a better name :) I think output_file probably makes sense here as we can reuse it for other models that have an output_file.

nevoodoo · 2024-02-28T08:36:47Z

I'd like to add some more tests here, for example:

What happens if I add a valid filename as a str to the output(s) fields of the Analysis? In theory, it should just present this back to me as a string in the result. But if it is a valid pathname, the current logic would create an OutputFile record, and as such the results would be presented back as JSON, and not a str.

nevoodoo · 2024-03-05T01:41:31Z

I'd like to add some more tests here, for example:

What happens if I add a valid filename as a str to the output(s) fields of the Analysis? In theory, it should just present this back to me as a string in the result. But if it is a valid pathname, the current logic would create an OutputFile record, and as such the results would be presented back as JSON, and not a str.

I was right, this is in fact what happened. It's all patched now, and tests have been added to capture this :)
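The behaviour being tested can be pictured with the sketch below, which is only one possible reading of the fix and not the patched metamist code. It assumes that a string with a gs:// prefix is linked to a file record, that any other string is kept as plain text (as the description says for outputs without proper gs prefixes), and that a string input is always presented back as a string. Both function names are hypothetical.

```python
from typing import Optional, Tuple

GS_PREFIX = "gs://"


def split_output(value: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (file_path, plain_output); exactly one of them is not None."""
    if value.startswith(GS_PREFIX) and len(value) > len(GS_PREFIX):
        return value, None  # would be linked to a file record
    return None, value  # would be stored as plain text on the relationship


def present_output(file_path: Optional[str], plain_output: Optional[str]) -> str:
    """Present a string input back as a string, whichever way it was stored."""
    if file_path is not None:
        return file_path
    if plain_output is not None:
        return plain_output
    raise ValueError("no output stored")


# A string should round-trip as a string in both cases.
for submitted in ("gs://bucket/sample.cram", "not-a-bucket-path.txt"):
    assert present_output(*split_output(submitted)) == submitted
```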

nevoodoo · 2024-07-05T09:10:37Z

Closing, #856 has fresher changes

nevoodoo and others added 4 commits January 15, 2024 12:02

added initial file class structure

043b81e

feat(analysis): Added initial File model definition

292e39f

feat(analysis): Draft migration script for the analysis model

6daed68

chore(fixed linting and added TODO):

e860302

illusional and others added 25 commits January 23, 2024 19:18

chore: fixed linting

7f0ddf6

feat(migration): added support for JSON output structure in existing …

b82c2c7

…outputs

feat(migration): migration script for analysis output

73f7849

chore(migration): fixed linting on script

c6f7da2

removed FileInternal

1ce8f9c

changed file to a many-many relationship model

5fd3f60

added output file querying, mutation via tables

f75d27d

added system versioning for analysis_file

7c9eb6d

updated tests

0d65777

deprecating output field on analysis model

74925ee

merged from dev

de4e465

working file class implementation

292387c

Merge branch 'dev' of github.com:populationgenomics/metamist into fir…

cf88350

…st-class-files-v2 Pulling changes from dev into the branch

Merged origin/dev into first-class-files-v2

2ea75dd

Merged origin/dev into first-class-files-v2

3b6b1ab

Merged origin/dev into first-class-files-v2

2975d7f

Fixed existing tests to use outputs

f5b7289

added FileInternal use for reconstructing json

d69ce87

Merged origin/dev into first-class-files-v2

975ef5e

fixed indentation caused by isort

64d0311

fixed breaking gitbutler changes

cb426ee

removed gitbutler

0ea3164

fixed indentation caused by gitbutler

9b7d379

nevoodoo added 6 commits February 22, 2024 19:27

updated front-end to use outputs

412c030

added output files tests

8429d59

added fake gcs server

eeacee7

patching requirements for cloudpathlib

2b1432d

add local env declaration to tests

a2d9860

fixed fileinternal from_db

afa6778

nevoodoo force-pushed the first-class-files-v2 branch from e6a79f4 to afa6778 Compare February 26, 2024 11:55

nevoodoo added 4 commits February 26, 2024 23:30

add validator for field

9f1f1bb

added logging to confirm fakegcs setup

633259a

added parse_sql_bool

3e34921

added parse_sql_bool

fbed3e8

update table name and fix str output

9442aa7

milo-hyben reviewed Feb 28, 2024

nevoodoo added 3 commits February 28, 2024 19:17

refactored file to output, added better typing annotations

78fd910

updated analysis table call from dev change

22a7b98

removed testbase comments, fixed migration file

b875b7e

nevoodoo self-assigned this Feb 28, 2024

nevoodoo added the enhancement New feature or request label Feb 28, 2024

added more tests to capture file str

f68bb63

nevoodoo marked this pull request as ready for review March 5, 2024 01:42

nevoodoo added 3 commits March 12, 2024 11:37

fixed file behaviour with gs str

dced3d6

Merge branch 'dev' of github.com:populationgenomics/metamist into fir…

fd021ad

…st-class-files-v2 Merge from dev

removed comments

abe04fd

nevoodoo mentioned this pull request Jul 5, 2024

First Class Files #856

Merged

11 tasks

nevoodoo closed this Jul 5, 2024
