Add support for chunking of blobs, using a variant of BLAKE3
Buildbarn has invested heavily in using virtual file systems. Both on
the worker and client side it's possible to lazily fault in data from
the CAS. As Buildbarn implements checksum verification where needed,
and a checksum can only be verified after reading an object in full,
randomly accessing large files may be slow. To address this, this change
adds support for composing and decomposing CAS objects, using newly
added ConcatenateBlobs() and SplitBlobs() operations.

If implemented naively (e.g., using SHA-256), these operations would not
be verifiable. To rephrase: when merely given the checksums of smaller
objects, there is no way to obtain the checksum of their concatenation.
This is why, at the same time, this change adds a new digest function
that closely resembles BLAKE3. BLAKE3 is based on a binary Merkle tree,
meaning that it's possible to efficiently concatenate and split objects
at 2^k byte boundaries (where k >= 10).
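
The tree property can be illustrated with a toy sketch. This is not
BLAKE3 and not the digest function added by this change: SHA-256 stands
in for the chunk and parent compression steps, and the names leaf(),
parent() and tree_hash() are invented for illustration. Only the 1024
byte chunk size and the power-of-two split rule are taken from BLAKE3.

```python
import hashlib

CHUNK = 1024  # 2^10 bytes, matching the k >= 10 lower bound above


def leaf(data: bytes) -> bytes:
    # Stand-in for hashing one chunk (NOT the BLAKE3 compression function).
    return hashlib.sha256(b"\x00" + data).digest()


def parent(left: bytes, right: bytes) -> bytes:
    # Stand-in for combining two child chaining values into a parent node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def tree_hash(data: bytes) -> bytes:
    if len(data) <= CHUNK:
        return leaf(data)
    # Left subtree: largest power-of-two size strictly smaller than the input.
    split = CHUNK
    while split * 2 < len(data):
        split *= 2
    return parent(tree_hash(data[:split]), tree_hash(data[split:]))


a = bytes(range(256)) * 8  # 2048 bytes = 2^11
b = b"x" * 1500            # any non-empty blob no longer than a

# The digest of a + b follows from the two part digests alone.
combined = parent(tree_hash(a), tree_hash(b))
assert combined == tree_hash(a + b)
```

With a flat hash such as SHA-256, the digest of a + b cannot be derived
from the digests of a and b; the data has to be rehashed from the start.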

With these new operations present, there is no true need to use the
ByteStream protocol any longer. Writes can be performed by uploading
smaller parts through BatchUpdateBlobs(), followed by calling
ConcatenateBlobs(). Conversely, reads of large objects can be performed
by calling SplitBlobs() and downloading individual parts through
BatchReadBlobs(). At no point is integrity compromised, as callers of
SplitBlobs() can validate the resulting tree nodes against the original
digests.
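
A minimal sketch of that write and read flow, under loud assumptions:
ToyCAS is an in-memory stand-in whose method names mirror the RPCs but
whose signatures are invented, blobs are only ever joined from two
parts, digests omit the size field, and SHA-256 again stands in for the
tree node hashing rather than the real BLAKE3 variant.

```python
import hashlib

CHUNK = 1024


def leaf(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()  # toy chunk hash


def parent(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()  # toy parent node


def tree_hash(data: bytes) -> bytes:
    if len(data) <= CHUNK:
        return leaf(data)
    split = CHUNK
    while split * 2 < len(data):
        split *= 2
    return parent(tree_hash(data[:split]), tree_hash(data[split:]))


class ToyCAS:
    """Hypothetical in-memory CAS; not the real Remote Execution API."""

    def __init__(self):
        self.blobs = {}     # digest -> bytes
        self.children = {}  # digest -> (left digest, right digest)

    def batch_update_blobs(self, blobs):
        digests = [tree_hash(b) for b in blobs]
        self.blobs.update(zip(digests, blobs))
        return digests

    def concatenate_blobs(self, left, right):
        # The server joins parts it already stores; no data is re-uploaded.
        digest = parent(left, right)
        self.blobs[digest] = self.blobs[left] + self.blobs[right]
        self.children[digest] = (left, right)
        return digest

    def split_blobs(self, digest):
        return self.children[digest]

    def batch_read_blobs(self, digests):
        return [self.blobs[d] for d in digests]


def read_verified(cas, digest):
    left, right = cas.split_blobs(digest)
    # Validate the returned tree nodes against the digest that was requested.
    if parent(left, right) != digest:
        raise ValueError("split result does not match requested digest")
    parts = cas.batch_read_blobs([left, right])
    for part, expected in zip(parts, (left, right)):
        if tree_hash(part) != expected:
            raise ValueError("part failed verification")
    return b"".join(parts)


cas = ToyCAS()
first, second = bytes(2048), b"tail" * 100  # first part is 2^11 bytes
left, right = cas.batch_update_blobs([first, second])
whole = cas.concatenate_blobs(left, right)
assert whole == tree_hash(first + second)
assert read_verified(cas, whole) == first + second
```

Each part is verified on its own, so a reader interested in one region
of a large object only needs to fetch and hash the parts covering it.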

One feature of BLAKE3 is that its hashes are variable length. Though
this is generally nice to have (allowing users to make size/security
tradeoffs), we don't want to use this in our case. The reason is that
the first 256 bits of output are identical to the chaining value, which
we need for concatenation/splitting.

Requiring the use of 256-bit hashes is problematic, as SHA-256 hashes
share the same length. The digest function therefore cannot be derived
by looking at the hash length. This has already become an issue with MD5 vs
MURMUR3. To solve that, we extend all operations that work with digests
to take a digest function explicitly. For compatibility, we allow this
to be UNKNOWN for all existing digest functions.
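
The ambiguity can be sketched as follows. The table and both helpers
are hypothetical, BLAKE3_VARIANT is a placeholder label rather than the
name used by this change, and the fallback shown for UNKNOWN is only
one possible interpretation of the compatibility rule.

```python
# Hash sizes in bits. BLAKE3_VARIANT is a placeholder name (assumption).
HASH_BITS = {
    "MD5": 128,
    "MURMUR3": 128,
    "SHA1": 160,
    "SHA256": 256,
    "BLAKE3_VARIANT": 256,
}
LEGACY = {"MD5", "MURMUR3", "SHA1", "SHA256"}  # functions predating this change


def candidates(hash_hex: str) -> list[str]:
    """All digest functions whose output length matches the given hex hash."""
    bits = len(hash_hex) * 4
    return sorted(name for name, size in HASH_BITS.items() if size == bits)


def resolve(hash_hex: str, digest_function: str = "UNKNOWN") -> str:
    """Prefer the explicit function; fall back to length for legacy ones."""
    if digest_function != "UNKNOWN":
        return digest_function
    found = [name for name in candidates(hash_hex) if name in LEGACY]
    if len(found) != 1:
        raise ValueError(f"cannot infer digest function, candidates: {found}")
    return found[0]


print(candidates("0" * 64))  # two functions share the 256-bit length
print(resolve("0" * 64, "BLAKE3_VARIANT"))
```

Passing the digest function explicitly removes the guesswork, while
callers that leave it as UNKNOWN keep the length-based behaviour.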
EdSchouten committed Nov 8, 2022
1 parent 7d1354e commit b8c052f
Showing 2 changed files with 1,553 additions and 564 deletions.