Design datalad remake-provision #12

Open

mih opened this issue May 2, 2024 · 1 comment

mih (Member) commented May 2, 2024

Presently blocked by: psychoinformatics-de/datalad-concepts#174

This is about the first half of #10 -- a datalad-based data source or data provisioning helper.
See #13 for the other half.

Purpose

Materialize data in the form of files (in directories) on the filesystem. Data are obtained via some method involving datalad (datasets) (i.e. clone, get, download, clone-from-metadata, etc.). The creation of a full clone, or a checkout, is not a necessity, nor is the result of a provisioning (necessarily) a Git repository.

Target use cases

  • (1) Provision a worktree of a (Git) repository (at a specific commit) (with certain files' content guaranteed to be locally available): output is a CWL Directory
  • (2) Provision a (set of) [git annex key(s)|Git blobs], known to a git(-annex) repository (under a given filename for each blob/key) (as named sources): outputs are CWL File, named according to the spec

(1) serves the standard use case of datalad run/rerun. (2) can be useful for composing workflows that do not require a particular directory layout. These could be executed without having to fiddle with checkouts of nested datasets. Instead, any required content (tracked via Git in some way) can be produced under any (fixed) given name, and fed to a workflow (which can run locally or remotely).
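Use case (1) maps onto plumbing that DataLad and Git already provide. Below is a minimal sketch, assuming a hypothetical helper that only *assembles* the command sequence a provisioner could run (the `datalad clone`, `datalad get`, and `git checkout` subcommands are real; the helper name and argument layout are illustrative, not part of any actual datalad-remake interface):

```python
# Hypothetical sketch: assemble the commands a provisioner might run to
# materialize a worktree of a dataset at a specific commit, with selected
# file content guaranteed to be locally available (use case 1).
# Nothing is executed here; the function only builds the command list.

def provision_worktree_commands(url, commit, paths, target_dir):
    """Return the command sequence for provisioning a pinned worktree."""
    cmds = [
        ["datalad", "clone", url, target_dir],          # obtain the dataset
        ["git", "-C", target_dir, "checkout", commit],  # pin the version
        # fetch content for the requested paths (annexed files get resolved)
        ["datalad", "get", "-d", target_dir, *paths],
    ]
    return cmds

cmds = provision_worktree_commands(
    "https://example.com/ds.git", "abc123", ["data/a.csv"], "/tmp/ds"
)
```

Whether the provisioner shells out like this or uses the DataLad Python API is an open implementation question; the point is only that use case (1) needs no machinery beyond clone/checkout/get.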

API

We need to be able to specify

  • (1) identifier for a datalad dataset. This could be a clone URL, a UUID (from registry?), some kind of query to materialize a datalad dataset from
  • (2) an (optional) dataset version to make relative paths inside a dataset precise identifiers (not needed for annex keys)
  • (3) a list of dataset content identifiers
    • relative paths
    • annex keys
    • blob SHAs
  • (4) an (optional) mapping from a dataset content identifier to an output/provision filename
  • (5) a mode switch that indicates whether to provision a worktree when (3/4) are given, or whether to extract the content blobs into a plain directory.

(2) would not be needed if relative paths always carried a version prefix (à la a Git tree-ish).
(4) could be merged with (3) via a configurable and optional delimiter (e.g. NULL byte by default) that turns the content identifier into a source/dest pair (only relevant for the CLI).


mih (Member, Author) commented May 5, 2024

Quick Sunday downtime realization to be integrated above:

  • I believe everything that is focused on provisioning something that is not a datalad dataset would be a distraction at this point. It can be done later and separately, possibly following the same or a different pattern -- at likely little additional cost/waste.
  • Dataset provisioning has a lot of overlap with the dataset generation side of RFD0041. The main difference in my mind is that dataset generation targets a broader range of dataset sources. Making these two developments more aligned, or even identical, would be beneficial (for both).
  • Rather than thinking about a possible CLI for the provisioning tool, we should focus on evaluating whether https://concepts.datalad.org/s/distribution/unreleased/ is already a good-enough specification for a provisioner to process. If it is not, it likely needs improvement. If it is, we only need to think about a CLI of a tool that processes such records.

Status: blocked