Epic Type: Implementation Epic
This epic aims to create a new service that provides an authoritative source of truth for file information. File information includes (for now) file size and checksums.
The Dataset Information Service needs to listen to events produced in the file service backend reporting successful file uploads and dataset registration information. Corresponding deletion events also need to be handled.
The proposed event schema that should be used to extract and store file information by this service is FileInternallyRegistered.
The event payload contains all relevant information, and it is guaranteed that the file in question is actually stored in permanent storage at that point in time.
Additionally, the service needs to subscribe to file deletion events and remove all data as requested.
Analogous functionality also needs to be implemented for upsertion and deletion events that deal with datasets. This means that the events captured by this service need to include those that are currently handled by the work package service, i.e. those conforming to the MetadataDatasetOverview and MetadataDatasetID schemas.
File information should be provided through one simple REST endpoint returning all relevant information for a given file accession. Analogously, information for all files in a dataset identified by a dataset accession should be returned by another REST endpoint.
Return of internal-only file metadata, i.e. encrypted part checksums (MD5 and SHA256 hashes) and encrypted part size.
- GET /file_information/{file_id}: Get storage location alias, file size and sha256 for the unencrypted file content
Here the file_id is the public accession for the given file.
This should return a payload including the unencrypted file size in bytes, the SHA256 checksum for the given file ID and the alias of the storage location, hiding unnecessary details from the requester.
{
'accession': ...,
'size': ...,
'sha256_hash': ...,
'storage_alias': ...,
}
- GET /dataset_information/{dataset_id}: Get file size, storage location alias and sha256 for the unencrypted file content for all files in the given dataset
Here the dataset_id is the public accession for the given dataset.
This should return a list of objects, each containing the file_id/accession, the unencrypted file size in bytes, the SHA256 checksum and the storage location alias (preferably sorted by accession).
{
'accession': ...,
'file_information':
[
{
'accession': ...,
'size': ...,
'sha256_hash': ...,
'storage_alias': ...,
},
{
'accession': ...,
'size': ...,
'sha256_hash': ...,
'storage_alias': ...,
},
...
]
}
For files with no data available in the DB a smaller object containing only the accession is returned instead.
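The assembly of the dataset payload described above could be sketched as follows. This is only an illustration under assumptions: the function name build_dataset_information and the plain dictionary standing in for the database are hypothetical, and only the field names shown in the example payload are taken from this document.

```python
# Hypothetical stand-in for the service's database:
# maps a file accession to the stored information for that file.
FileInfo = dict[str, object]


def build_dataset_information(
    dataset_accession: str,
    file_accessions: list[str],
    known_files: dict[str, FileInfo],
) -> dict[str, object]:
    """Assemble the dataset payload with files sorted by accession."""
    file_information = []
    for accession in sorted(file_accessions):
        info = known_files.get(accession)
        if info is None:
            # No data available in the DB: return the accession only.
            file_information.append({"accession": accession})
        else:
            file_information.append({"accession": accession, **info})
    return {"accession": dataset_accession, "file_information": file_information}
```

A caller would pass the file accessions registered for the dataset together with whatever file information has been stored so far; files without stored data then appear as accession-only objects.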
Incoming population event schema: FileInternallyRegistered
From this event, only the file_id, decrypted_size and decrypted_sha256 need to be stored in the service.
These represent publicly accessible metadata that should be exposed by the service.
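As a minimal sketch of this extraction step, assuming the event payload arrives as a plain dictionary: the class StoredFileInformation and the function extract_file_information are hypothetical names, and only the three field names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredFileInformation:
    """Subset of a FileInternallyRegistered event kept by the service."""

    file_id: str
    decrypted_size: int
    decrypted_sha256: str


def extract_file_information(payload: dict) -> StoredFileInformation:
    """Pick the three stored fields and ignore everything else in the payload."""
    return StoredFileInformation(
        file_id=payload["file_id"],
        decrypted_size=payload["decrypted_size"],
        decrypted_sha256=payload["decrypted_sha256"],
    )
```

Any further fields carried by the event are simply dropped, so internal details never reach the stored record.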
Incoming deletion event schema: FileDeletionRequested
Incoming population/change event: MetadataDatasetOverview
From this event the accession (of the dataset) and files fields need to be extracted, and from the files field only the corresponding accession fields need to be saved.
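The extraction for dataset events could look roughly like this, assuming the payload is a dictionary whose files field is a list of dictionaries. The function name extract_dataset_files is hypothetical; the accession and files field names are the ones mentioned above.

```python
def extract_dataset_files(payload: dict) -> tuple[str, list[str]]:
    """Return the dataset accession and the accessions of its files."""
    dataset_accession = payload["accession"]
    # Only the accession of each file entry is kept; other fields are ignored.
    file_accessions = [file["accession"] for file in payload["files"]]
    return dataset_accession, file_accessions
```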
Incoming deletion event schema: MetadataDatasetID
Dataset information can exist in one of five states:
- The requested dataset is not yet registered
- The requested dataset is registered, but no file information is available
- The requested dataset is registered and some file information is available
- The requested dataset is registered and all file information is available
- The requested dataset is deleted
File information can exist in one of three states:
- The requested file is not yet registered
- The requested file is registered
- The requested file is deleted
If the reason why data is not available is of no interest and if all registered dataset states are treated the same, the following responses will be returned:
- The file information endpoint either returns a payload or 404
- The dataset information endpoint returns either a payload or 404. Within the payload, the fields of each file information object either are populated with the actual data or only contain the accession to signal that the data for the corresponding file is not yet available or has been deleted.
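One way to read the collapse of states into responses is sketched below. The enum and function names are hypothetical, the use of 200 for a successful payload is an assumption (the text only names 404), and mapping both "not yet registered" and "deleted" to 404 is an interpretation of the list above rather than a confirmed decision.

```python
from enum import Enum, auto


class DatasetState(Enum):
    NOT_REGISTERED = auto()
    REGISTERED_NO_FILE_INFO = auto()
    REGISTERED_SOME_FILE_INFO = auto()
    REGISTERED_ALL_FILE_INFO = auto()
    DELETED = auto()


class FileState(Enum):
    NOT_REGISTERED = auto()
    REGISTERED = auto()
    DELETED = auto()


def dataset_status_code(state: DatasetState) -> int:
    """All registered states yield a payload; everything else yields 404."""
    if state in (DatasetState.NOT_REGISTERED, DatasetState.DELETED):
        return 404
    return 200  # assumed success status, not stated in the epic


def file_status_code(state: FileState) -> int:
    """Only a registered file yields a payload."""
    return 200 if state is FileState.REGISTERED else 404
```

With this reading, a client cannot distinguish a deleted dataset from one that was never registered, which matches the premise that the reason for missing data is of no interest.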
Number of sprints required: 1
Number of developers required: 1