For discussion: full-file hashes in hyperdrive metadata #12
base: master
Conversation
Force-pushed from a4dffe0 to 645b9d7
Other advantages:
This is a nice addition to Dat. I added a few questions and conversation starters, but it all boils down to one question to me: How strict should we be with hashes? Making it optional for interoperability reasons is good imo.
I feel like there are various recommendations missing from the dat project: What expectations should DAT clients/users have about hashes? Should we encourage writers to add hashes? Should we encourage writers to add multiple hashes?
Since the data integrity of dat is already handled at the chunk level, I feel like we should have a blake2b (how many bits?) hash as a recommendation.
Full-file hashes are optionally included in hyperdrive metadata to complement the existing cryptographic-strength hashing of sub-file chunks. Multiple popular hash algorithms can be included at the same time.
Does it matter if they are popular?
Yes! If we are trying to be inter-operable with existing databases and large users, and in particular to bridge to older "legacy" systems which might not support newer ("better") hash functions. To be transparent, I work at the Internet Archive, and we have dozens of petabytes of files hashed and cataloged with the MD5 and SHA1 hashes (because they are popular, not because they are "strong" or the "best" in any sense). We'll probably re-hash with new algorithms some day, but would like to do so only rarely.
This was a little nitpicking around my impression that the sentence would have the same meaning and impact without "popular"; this blew out of proportion.
The design decisions to adopt hash variants are usually well-founded, motivated by security concerns (such as pre-image attacks), efficiency, and implementation concerns.
... but this is not the case in dat. The data is both validated and secured at the chunk level. Isn't it, in our case, merely for deduplication?
I'm not sure I understand. The dat implementation does use something other than popular full-file hash algorithms internally, for good reasons.
It uses chunk-based hashes, yes. But all data received through the dat protocol is signed. I am not sure how you would smuggle unsigned content into a dat.
Multiple hashes would be calculated in parallel with the existing chunking/hashing process, in a streaming fashion. Final hashes would be calculated when the chunking is complete, and included in the `Stat` metadata.
But multiple hashes are not required, right?
Correct; this is making the point that even with multiple hashes, the file only needs to be scanned (read from disk) once.
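To illustrate that point, here is a minimal sketch (not the actual hyperdrive implementation) of computing several full-file digests in a single streaming pass over a file; the algorithm names and the 64 KiB read size are illustrative assumptions, not anything prescribed by this draft:

```typescript
// Sketch only: compute several full-file digests in one read pass.
// Assumes Node.js; the algorithm names ("md5", "sha1", "sha256") and the
// 64 KiB highWaterMark are illustrative, not prescribed by the DEP.
import { createHash, Hash } from "crypto";
import { createReadStream } from "fs";

async function fullFileHashes(
  path: string,
  algorithms: string[] = ["md5", "sha1", "sha256"]
): Promise<Map<string, Buffer>> {
  const hashers = new Map<string, Hash>(
    algorithms.map((alg) => [alg, createHash(alg)])
  );

  // One pass over the file: every chunk read from disk is fed to all
  // hash contexts (and, in a real implementation, to the existing
  // chunking/hashing pipeline as well).
  for await (const chunk of createReadStream(path, { highWaterMark: 64 * 1024 })) {
    for (const hasher of hashers.values()) hasher.update(chunk as Buffer);
  }

  // Finalize all digests once chunking/reading is complete; these are the
  // values that would be attached to the `Stat` metadata.
  const digests = new Map<string, Buffer>();
  for (const [alg, hasher] of hashers) digests.set(alg, hasher.digest());
  return digests;
}
```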
What does the user-facing API look like, specifically?
Should we allow non-standard hashes, like the git "hash", or higher-level references like (single-file) bittorrent magnet links or IPFS file references?
We shouldn't allow arbitrary values for the hash field. There should be one clear specification of which type defines which hashing algorithm.
I'm not sure I follow. What I was getting at here was "what if there are additional hash or merkle tree references a user would want to include that are not in the multihash table"?
Let me rephrase: no, we should not allow non-standard hashes. Though "standard" in this context means the standard we set; I am okay with a "githash" being added, for example.
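For context on why the git "hash" is non-standard in the multihash sense: git's blob object ID is not a plain SHA-1 of the file contents, but a SHA-1 over a `blob <size>\0` header followed by the contents, so it would need its own type code. A small illustrative sketch:

```typescript
// Sketch only: git's blob object ID differs from a plain SHA-1 of the
// file bytes because of the "blob <size>\0" header git prepends.
import { createHash } from "crypto";

function gitBlobHash(content: Buffer): string {
  const header = Buffer.from(`blob ${content.length}\0`, "utf8");
  return createHash("sha1")
    .update(Buffer.concat([header, content]))
    .digest("hex");
}

function plainSha1(content: Buffer): string {
  return createHash("sha1").update(content).digest("hex");
}

// gitBlobHash(data) !== plainSha1(data) for the same bytes, which is why
// a "githash" would need its own entry in whatever type table we define.
```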
Modifying a small part of a large file would require re-hashing the entire file, which is slow. Should we skip including the updated hashes in this case? Currently mitigated by the fact that we duplicate the entire file when recording changes or additions.
Hashes are optional, so it is up to the implementor/user to add a hash or not, including for updates of a file.
`type` is a number representing the hash algorithm, and `value` is the bytestring of the hash output itself. The length of the hash digest (in bytes) is available from protobuf metadata for the value. This scheme, and the `type` value table, is intended to be interoperable with the [multihash][multihash]
We should reference the multihash table to specify what the types are?!
The table is linked from the multihash homepage... I'm wary of deep-linking directly into a GitHub blob (the repo could move to a new platform, or the file could be renamed), but maybe that's an overblown concern.
Having a link seems better than not having it. To remove/reduce that concern you could link to a commit number or make a mirror of it.
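To make the `type`/`value` scheme quoted above concrete, here is a rough sketch of building hash entries using codes from the multihash table (the codes shown, e.g. 0x11 for sha1 and 0x12 for sha2-256, come from that table; the `HashEntry` shape and helper are illustrative, not the final `Stat` protobuf schema):

```typescript
// Sketch only: (type, value) hash entries in the style of the multihash
// table. Codes below (0xd5 = md5, 0x11 = sha1, 0x12 = sha2-256) are from
// the multihash registry; the HashEntry shape is an illustrative stand-in
// for whatever the `Stat` protobuf extension ends up looking like.
import { createHash } from "crypto";

const MULTIHASH_CODES: Record<string, number> = {
  md5: 0xd5,
  sha1: 0x11,
  "sha2-256": 0x12,
};

interface HashEntry {
  type: number;  // multihash-style algorithm code
  value: Buffer; // raw digest bytes; length is implied by the field itself
}

function hashEntry(algorithm: "md5" | "sha1" | "sha2-256", data: Buffer): HashEntry {
  // Map the multihash name to Node's crypto algorithm name where they differ.
  const nodeName = algorithm === "sha2-256" ? "sha256" : algorithm;
  return {
    type: MULTIHASH_CODES[algorithm],
    value: createHash(nodeName).update(data).digest(),
  };
}

// e.g. hashEntry("sha2-256", fileBytes) -> { type: 0x12, value: <32-byte Buffer> }
```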
This was intentional. I don't think we should be overly prescriptive; folks should be able to adapt this feature to their own needs (including ones we aren't even thinking of at this time). For the sake of simplicity, if users or implementors don't want to be bothered choosing or coordinating algorithms to use by default, this draft does say:
I seem to have overlooked that section 😊
Now I am all good - though @mafintosh might have something to say about additionally having to compute SHA1.
Maybe "widely used"
…On March 20, 2018 2:29:29 PM PDT, Martin Heidegger ***@***.***> wrote:
martinheidegger commented on this pull request.
+Type: Standard
+
+Status: Undefined (as of YYYY-MM-DD)
+
+Github PR: [Discussion](#12)
+
+Authors: [Bryan Newbold](https://github.com/bnewbold)
+
+# Summary
+[summary]: #summary
+
+Full-file hashes are optionally included in hyperdrive metadata to complement the existing cryptographic-strength hashing of sub-file chunks. Multiple popular hash algorithms can be included at the same time.
This was a little nitpicking around my impression that the sentence would have the same meaning and impact without "popular"; this blew out of proportion.
Rendered pre-merge
"Full-file hashes are optionally included in hyperdrive metadata to complement the existing cryptographic-strength hashing of sub-file chunks. Multiple popular hash algorithms can be included at the same time."
Previous discussion:
Before seriously reviewing to merge as Draft, I would want to demonstrate working code and have a better idea of what the API would look like, but this is otherwise pretty fleshed out.
cc: @martinheidegger