
add GITSHA1 value to supported digest functions #257

Open · wants to merge 1 commit into base: main

Conversation

asartori86 (Author) commented:

See #255 for more details.

google-cla bot commented May 26, 2023:

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

EdSchouten added a commit to EdSchouten/remote-apis that referenced this pull request May 26, 2023
As part of bazelbuild#257 we're discussing adding support for storing directories
in Git's format. This means OutputDirectory.tree_digest will no longer
point to an actual recursive tree object (like REv2 Tree). Instead of
doing that, I would like to investigate whether we can add native
support for storing output directories in decomposed form.
Comment on lines +1938 to +1957
// Tree Download
//
// Once an action is successfully executed, it might have generated output files or
// output directories in its staging area on the remote side. Each output file
// needs to be uploaded to its CAS with the corresponding git blob hash. Each
// output directory needs to be translated to a git tree object and uploaded to the
// CAS with the corresponding git tree hash. Only if the content of a tree is
// available in the CAS is the server side allowed to return the tree to the
// client.
//
// In case of a generated output directory, the server only returns the
// corresponding git tree id to the client instead of a flat list of all
// recursively generated output directories as part of a Tree Protobuf
// message. The remote side promises that each blob and subtree contained in
// the root tree is available in the remote CAS. Such blobs and trees must
// be accessible, using the streaming interface, without specifying the size
// (since sizes are not stored in a git tree). Due to the Protobuf 3
// specification, which is used in this remote execution API, not specifying
// the size means the default value 0 is used.
GITSHA1 = 10;
EdSchouten (Collaborator) commented:

So if I understand it correctly, OutputDirectory.tree_digest will now be used in two different ways:

  1. In case digest_function != GITSHA1, it will point to a single REv2 Tree object, which has a height of at most 1.
  2. In case digest_function == GITSHA1, it will point to a single Git Tree object, which is the equivalent of an REv2 Directory object. It can thus point to a hierarchy of CAS objects whose height is unconstrained.

In my opinion this is a feature that is independent of supporting Git's hashing algorithm and directory format. In #229 we essentially redefined REv2 Tree objects as being a simple binary container format around a sequence of directories. It is generic enough that you could also use that to pack multiple Git directory objects into a single CAS object.

But this is obviously not what you're interested in. You're interested in actually having them stored as separate CAS objects, so that it's possible to embed an output directory into the input root of a subsequent action in a fully opaque manner. And this is a valid request. I think this is one of the shortcomings of the protocol as it stands.

Could you please take a look at PR #258? In that PR I'm proposing that we add a new enumeration to the Command message that clients can use to specify the format in which they want workers to store output directories. By letting Justbuild set that to DIRECTORY_ONLY, it can access output directories in Git's native format.

aehlig replied:

Indeed. Besides the desire to make use of the fact that git naturally has files and directories pre-hashed, the other desire is to avoid communication overhead when the client only needs very few artifacts locally (say, the logs of failed tests) and otherwise wants to leave everything on the remote-execution side, as it is only used as input for other actions. The latter can well be seen as an independent feature that also works with other hashes and directory representations.

//
// Blob Upload
//
// A blob is uploaded to the remote side by passing its raw content as well as its
EdSchouten (Collaborator) commented May 26, 2023:

What is "raw content" in this case? Does this include the "blob ${size}\0" prefix? I think it sort of has to, right? Otherwise you can't validate the hash of a file in a streaming manner.

asartori86 (Author) replied:

The raw content is, for example, the actual content of the file. The blob ${size}\0 prefix must be added by the hashing function:

$ printf "foo" | git hash-object --stdin
19102815663d23f8b75a47e7a01965dcdc96468c

$ printf "blob 3\0foo" | sha1sum
19102815663d23f8b75a47e7a01965dcdc96468c  -
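For illustration, here is the same computation as a minimal Python sketch (standard library only; the expected hash is the one shown above):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    # Git hashes a blob as sha1("blob " + <decimal size> + "\0" + <content>).
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_hash(b"foo"))  # 19102815663d23f8b75a47e7a01965dcdc96468c
```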

EdSchouten (Collaborator) replied:

Oh wow. So combined with the fact that sizes are not stored in directories, this means that hash validation can only take place after a file has been fully downloaded, right? That's pretty brutal.

asartori86 (Author) replied:

I would say so... that's the format used by git :/

// blob from a tree without inspecting the (potentially large) content. The
// markers are
//
// - 0x62 for a git blob (0x62 corresponds to the character b)
EdSchouten (Collaborator) commented:

Which marker is used for Action and Command messages? Are those treated as plain blobs?

asartori86 (Author) replied:

No special markers :)

The marker is introduced just to distinguish a tree object from a blob because git itself hashes them differently

EdSchouten (Collaborator) replied:

So you're saying that Action and Command messages are stored as plain SHA-1 hashed objects, using regular SHA-1 hashes in their digests?

asartori86 (Author) replied:

Sorry, maybe I was unclear. They do get a marker. Action, for example, is a blob and so will be marked with 0x62.

Everything is consistently hashed according to git-sha1 (which is different from plain sha1). If the object is a blob, the hash is prefixed with 0x62; if it is a tree (for example the input root), with 0x74.

Did I add even more entropy? :)
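To make that concrete, here is a small sketch of one possible composition; the exact wire encoding is an assumption on my part, not something this PR pins down:

```python
def marked_git_digest(git_hash_hex: str, is_tree: bool) -> str:
    # Hypothetical composition: prepend the marker byte (0x74 't' for trees,
    # 0x62 'b' for blobs) to the git-sha1, so trees and blobs can be told
    # apart without inspecting the (potentially large) content.
    marker = "74" if is_tree else "62"
    return marker + git_hash_hex

# e.g. for the blob hashed earlier:
# marked_git_digest("19102815663d23f8b75a47e7a01965dcdc96468c", is_tree=False)
# -> "6219102815663d23f8b75a47e7a01965dcdc96468c"
```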

EdSchouten (Collaborator) left a comment:

In my honest opinion this change is making the semantics of the protocol a bit hairy. Because Git directories don't support object sizes, Digest.size_bytes will need to be set to zero, both for files and directories. Just approaching this from the Buildbarn side, it is unlikely that we can ever support this, as we use the size_bytes field quite intensively to allocate space and track transfer progress. bb_clientd would also be unable to display/list these files, as it relies on accurate file sizes to instantiate inodes.

One could also come up with a "lite" version of this proposal, where merely the Git-style SHA-1 hashing algorithm is added, while regular REv2 Directory nodes continue to be used. I can imagine that it would be easier to get traction for that.

Would you be interested in attending the 2023-06-13 working group meeting to discuss this proposal in more detail?

asartori86 (Author) commented:

> In my honest opinion this change is making the semantics of the protocol a bit hairy. Because Git directories don't support object sizes, Digest.size_bytes will need to be set to zero, both for files and directories. Just approaching this from the Buildbarn side, it is unlikely that we can ever support this, as we use the size_bytes field quite intensively to allocate space and track transfer progress. bb_clientd would also be unable to display/list these files, as it relies on accurate file sizes to instantiate inodes.

Thanks, @EdSchouten, for your thoughts. We do appreciate it. I know how much Buildbarn relies on size_bytes. I implemented some "workarounds" that I would love to discuss with you. But they work only with hard-linking rather than with FUSE.

Note that the size of a blob (and of a tree) is known when it is uploaded. When we traverse a tree already in the CAS, we don't know the sizes of the referenced objects. We might add a dedicated cache for that eventually...

> One could also come up with a "lite" version of this proposal, where merely the Git-style SHA-1 hashing algorithm is added, while regular REv2 Directory nodes continue to be used. I can imagine that it would be easier to get traction for that.

What do you mean by regular Directory nodes being used? Will they contain just the Digest of the git tree, or the whole list of contained objects? For the latter, I am not sure it is worthwhile. The main advantage is avoiding traversing a large tree only to discover that the tree is already present in the remote CAS. Think, for example, about external dependencies. On the client side, that dep needs to be traversed only once to upload all the blobs, and never again (at least as long as the whole tree is kept in the remote CAS). Afterward, it is sufficient to check that a single hash (the git tree hash) is present to be sure that the whole dep is still there.
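A minimal sketch of that short-circuit, using hypothetical client helpers (cas.find_missing, cas.upload, walk_git_tree) rather than any real REv2 API:

```python
def ensure_tree_in_cas(cas, root_tree_hash: str) -> None:
    # One round trip: if the root git tree digest is already present, the
    # server's promise implies the whole closure is present, so we are done.
    if not cas.find_missing([root_tree_hash]):
        return
    # Otherwise traverse the dependency once and upload what the server lacks.
    objects = list(walk_git_tree(root_tree_hash))  # hypothetical traversal
    missing = set(cas.find_missing([o.hash for o in objects]))
    for obj in objects:
        if obj.hash in missing:
            cas.upload(obj)
```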

> Would you be interested in attending the 2023-06-13 working group meeting to discuss this proposal in more detail?

I think @aehlig is the best person for this :) as he is also knowledgeable about the protocol.

EdSchouten (Collaborator) commented May 26, 2023:

> I know how much Buildbarn relies on size_bytes. I implemented some "workarounds" that I would love to discuss with you. But they work only with hard-linking rather than with FUSE.

Yeah, I already imagined that that would be the case. Unfortunately, I consider the hardlinking use case to be a dead end. Especially now that Buildbarn's NFSv4 support is rock solid (on macOS Ventura 13.3 and later), there isn't really an appealing use case for workers to use hardlinking. Supporting both FUSE and NFSv4 is good enough. At some point in the future we might as well just kick hardlinking support out of the tree.

> The main advantage is avoiding traversing a large tree only to discover that the tree is already present in the remote CAS. Think, for example, about external dependencies. On the client side, that dep needs to be traversed only once to upload all the blobs, and never again (at least as long as the whole tree is kept in the remote CAS). Afterward, it is sufficient to check that a single hash (the git tree hash) is present to be sure that the whole dep is still there.

You might be able to already achieve that by having some kind of separate cache that can map between Git SHA-1 and a hash function of choice, right? Whenever you're about to clone a Git repository, you may check this cache to obtain the SHA-256/MD5/SHA-1/.... hash of the corresponding REv2 Directory containing the same data...
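A sketch of what such a side table could look like; the names and shape here are assumptions for illustration, not an existing API:

```python
class DigestTranslationCache:
    # Hypothetical map from a git object id to the digest (hash and size)
    # of the REv2 object encoding the same content under the CAS's native
    # digest function (e.g. SHA-256).
    def __init__(self) -> None:
        self._table: dict[str, tuple[str, int]] = {}

    def record(self, git_id: str, native_hash: str, size_bytes: int) -> None:
        self._table[git_id] = (native_hash, size_bytes)

    def lookup(self, git_id: str):
        # On a hit, the repository content never needs to be rehashed.
        return self._table.get(git_id)
```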

asartori86 (Author) commented:

> Would you be interested in attending the 2023-06-13 working group meeting to discuss this proposal in more detail?

> I think @aehlig is the best person for this :) as he is also knowledgeable about the protocol.

Just to be clear, I also would love to participate :)

EdSchouten (Collaborator) commented:

> I think @aehlig is the best person for this :) as he is also knowledgeable about the protocol.

> Just to be clear, I also would love to participate :)

Great! @bergsieker, could you please add @asartori86 and @aehlig to the invite? Thanks!

asartori86 (Author) commented:

> You might be able to already achieve that by having some kind of separate cache that can map between Git SHA-1 and a hash function of choice, right? Whenever you're about to clone a Git repository, you may check this cache to obtain the SHA-256/MD5/SHA-1/... hash of the corresponding REv2 Directory containing the same data...

@aehlig, what do you think?

aehlig commented May 26, 2023:

> You might be able to already achieve that by having some kind of separate cache that can map between Git SHA-1 and a hash function of choice, right? Whenever you're about to clone a Git repository, you may check this cache to obtain the SHA-256/MD5/SHA-1/... hash of the corresponding REv2 Directory containing the same data...

I could also switch to a content-addressable version control system based on REv2 Directory messages; however, git is definitely more popular. The idea is to make use of the fact that git already provides suitable hashes to identify objects, so we can avoid rehashing while still having a suitable "file system" for unmodified sources.

So, if the only concern is the missing size, I would consider it more natural for the remote side to internally keep a cache from git object identifier to size. No object ever enters the CAS without its size being declared ahead of time.

sluongng (Contributor) commented May 26, 2023:

First of all, I just want to state that I am very happy to see a VCS-aware proposal. There is a lot of overlap between a build system and a version control system, and closing the gap would definitely be beneficial to both sides.

I wonder if we could handle git support a bit differently. For example, it's not common to store or transfer git objects (trees or blobs) one by one. Instead, objects are delta-compressed into a packfile, and the packfile is what's usually transferred and stored. See https://git-scm.com/docs/pack-format for more technical details.

So a hunch of mine is that we should do the same here: store and transfer packfiles as special CAS objects that could be chunked into smaller objects (see #233 for the chunking discussion).

In truth, the FindMissing RPC is very similar to the Git protocol, where the client and server negotiate which objects exist on the two sides to find the delta that needs to be transferred. The only difference is that instead of a BatchReadBlobs RPC returning multiple objects, the Git protocol combines (archives and compresses) all those objects into a single pack and returns that pack.

Comment on lines +1922 to +1923
// efficient since the effort of traversing and uploading the content of a
// git tree occurs only once and for each subsequent request, and the git
mostynb (Contributor) commented:

Does this mean that git trees would be uploaded in their entirety, even if only a single new file is included in the tree?

asartori86 (Author) replied:

@mostynb, yes. Git trees are simple files (aka blobs) that contain the ordered list of the files and trees that are directly included, with their names and hashes (i.e., the whole tree is not flattened as in Tree messages). So, if a file changes, the hash of that blob will change, and so the git tree will change as well.

Git trees are blobs whose content is constructed in a precise way, and the digest is computed as
tree <size of the content>\0<content> | sha1sum

The content of a tree is a sequence of entries. Each entry is defined as follows:
<hash bytes><object type> <object name>\0
where <object type> is an octal number that identifies a blob, an executable, a tree, or a symlink, and <object name> is the name of the file or directory.

For example, given the following directory

$ tree tree-playground/
tree-playground/
├── bar.txt
└── foo
    └── foo.txt

1 directory, 2 files

the git tree object reads

5716ca5987cbf97d6bb54920bea6adde242d87e6 0o10644 bar.txt
fcf0be4d7e45f0ef9592682ad68e42270b0366b4 0o40000 foo

From this you can imagine how more complicated trees will look.

If the content of bar.txt changes, its hash will change as well, and so will the tree that contains it. However, the tree foo will not change. So the client only needs to upload:

  • the new bar.txt
  • the new git tree, which contains the new hash for bar.txt.

HTH :)
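To make the construction concrete, here is a minimal Python sketch that computes a git tree id for the example above. One caveat: in git's raw on-disk encoding each entry is ordered as mode, space, name, NUL, then the 20 raw hash bytes; the listing shown above is a human-readable rendering of the same data.

```python
import hashlib

def git_tree_hash(entries) -> str:
    # entries: (mode, name, hex_hash) triples for the direct children,
    # e.g. ("100644", "bar.txt", "5716ca..."); mode "40000" marks a subtree.
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts entries by name, comparing trees as if named "name/".
        return name + "/" if mode == "40000" else name

    body = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_hash)
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()

# The example directory from above:
print(git_tree_hash([
    ("100644", "bar.txt", "5716ca5987cbf97d6bb54920bea6adde242d87e6"),
    ("40000",  "foo",     "fcf0be4d7e45f0ef9592682ad68e42270b0366b4"),
]))
```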

// Tree Download
//
// Once an action is successfully executed, it might have generated output files or
// output directories in its staging area on the remote side. Each output file
A contributor commented:

"Staging area" is a git concept, that does not necessarily exist on the server/worker side. If we make this part of REAPI I think it would need to be defined here.

aehlig replied:

Here the comment is simply talking about the directory in which the action was executed. Maybe "action execution directory" is the better term.

// corresponding git tree id to the client instead of a flat list of all
// recursively generated output directories as part of a Tree Protobuf
// message. The remote side promises that each blob and subtree contained in
// the root tree is available in the remote CAS. Such blobs and trees must
A contributor commented:

I guess this should state that the blobs must be available via their "GITSHA1" digest? Or is it expected that they are also available by other digests supported by the server?

EdSchouten (Collaborator) commented:

> I wonder if we could handle git support a bit differently. For example, it's not common to store or transfer git objects (trees or blobs) one by one. Instead, objects are delta-compressed into a packfile, and the packfile is what's usually transferred and stored. See https://git-scm.com/docs/pack-format for more technical details.

I'm not convinced that's worth the effort. Keep in mind that in the case of Git repositories, you're dealing with trees of commits that form very long chains. For those long chains it makes a lot of sense to store them in pack files, as you don't want to make constant trips back and forth to iterate over the full commit chain.

As far as I know, @asartori86 and @aehlig are not interested in pushing a full repository including its commit history into the CAS. They want to upload the contents of a given repository at a single commit, so that it may be built. Such trees tend to be relatively shallow.

aehlig commented Jun 13, 2023:

> I wonder if we could handle git support a bit differently. For example, it's not common to store or transfer git objects (trees or blobs) one by one. Instead, objects are delta-compressed into a packfile, and the packfile is what's usually transferred and stored. See https://git-scm.com/docs/pack-format for more technical details.

> I'm not convinced that's worth the effort. Keep in mind that in the case of Git repositories, you're dealing with trees of commits that form very long chains. For those long chains it makes a lot of sense to store them in pack files, as you don't want to make constant trips back and forth to iterate over the full commit chain.

> As far as I know, @asartori86 and @aehlig are not interested in pushing a full repository including its commit history into the CAS. They want to upload the contents of a given repository at a single commit, so that it may be built. Such trees tend to be relatively shallow.

Indeed. Moreover, it is likely that an "earlier version" of that tree (i.e., that directory at an earlier commit or in a different branch) was already used in remote execution. So if the tree is not already known (the typical case), it is worth the effort to use FindMissingBlobs to find the small part that is not yet known to the remote-execution side, rather than proactively sending everything.

EdSchouten added a commit to EdSchouten/remote-apis that referenced this pull request Jun 13, 2023
As part of bazelbuild#257 we're discussing adding support for storing directories
in Git's format. This means OutputDirectory.tree_digest will no longer
point to an actual recursive tree object (like REv2 Tree). Instead of
doing that, I would like to investigate whether we can add native
support for storing output directories in decomposed form.
sluongng (Contributor) commented:

I just want to note that it's true that, by default, a git tree does not store the size of each child object, but obtaining the sizes of the objects is relatively easy.

For a repository with a checked-out working copy, all files in the current working copy have their sizes stored inside git's index file, which you can query very quickly with:

$ git ls-files --debug

That means you would only need to manually check the sizes of (1) gitignored files and (2) newly added or modified files. Both of these lists can be obtained fairly quickly using some git commands.

For a repository without a working copy (a bare git repository), you would have to call git ls-tree to ask git to read the file sizes from the object database (loose or packed). But git forge vendors have no problem doing this (and caching the result) to power their web UIs, as it's a relatively fast operation.

So I suspect that we would still be able to fit the "size" into GITSHA1 at a relatively small cost.
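As a rough sketch of how an implementation could populate a git-id-to-size table from a bare repository (git ls-tree -r -l prints the object size for blob entries):

```python
import subprocess

def git_blob_sizes(repo: str, rev: str = "HEAD") -> dict:
    # Recursive long-format listing: "<mode> <type> <object> <size>\t<path>".
    out = subprocess.check_output(
        ["git", "-C", repo, "ls-tree", "-r", "-l", rev], text=True)
    sizes = {}
    for line in out.splitlines():
        meta, _path = line.split("\t", 1)
        _mode, otype, oid, size = meta.split()
        if otype == "blob":  # submodule entries have type "commit", size "-"
            sizes[oid] = int(size)
    return sizes
```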


Realistically, I think it would be very hard to implement server-side support for GITSHA1 without knowing the object size. Server implementations would want to use this information to bin-pack workers efficiently. For example: without knowing how big a tree is, I could only schedule a worker based on some heuristic, and potentially run out of disk space while downloading the tree, leading to execution failure.

So including the size in the digest would be very important if we want server implementations to adopt this proposal down the line.

EdSchouten added a commit to EdSchouten/remote-apis that referenced this pull request Jul 5, 2023
As part of bazelbuild#257 we're discussing adding support for storing directories
in Git's format. This means OutputDirectory.tree_digest will no longer
point to an actual recursive tree object (like REv2 Tree). Instead of
doing that, I would like to investigate whether we can add native
support for storing output directories in decomposed form.
EdSchouten added a commit to EdSchouten/remote-apis that referenced this pull request Dec 13, 2023
As part of bazelbuild#257 we're discussing adding support for storing directories
in Git's format. This means OutputDirectory.tree_digest will no longer
point to an actual recursive tree object (like REv2 Tree). Instead of
doing that, I would like to investigate whether we can add native
support for storing output directories in decomposed form.
bergsieker pushed a commit that referenced this pull request Dec 21, 2023
* Regenerate all .pb.go files

It looks like the current ones are out of sync.

* Allow emitting output directories as plain Directory messages

As part of #257 we're discussing adding support for storing directories
in Git's format. This means OutputDirectory.tree_digest will no longer
point to an actual recursive tree object (like REv2 Tree). Instead of
doing that, I would like to investigate whether we can add native
support for storing output directories in decomposed form.