Dat Has No Client Identifier // Include a client_id Field in Handshake #61

Open
Miserlou opened this issue Aug 7, 2019 · 2 comments

Comments

@Miserlou

Miserlou commented Aug 7, 2019

As a developer, user and service operator, I'd like to be able to block leeching, misbehaving and outdated clients to ensure the best service to my users. Currently, I don't think this is possible, as Dat has not defined a mechanism for client identification - all clients look the same.

The equivalent in the BitTorrent universe is BEP0020: peer_id Conventions. This is routinely used by both trackers and peers to block malicious torrent clients like BitTyrant/BitVomit/XunLei Thunder, as well as insecure clients like older versions of Azureus and Transmission. The HTTP equivalent is obviously the User-Agent string.

I see that there is already a "User data" field, although the documentation is unclear about what this field should actually be used for - it just says "any purpose", whereas BT's peer_id field is more specific. (If it is to be used for client identification, then this ticket is simply a documentation issue, not a feature proposal).

So, I propose that the dat handshake should include a 20 byte client_id field that could include client version information, e.g., BEAK-1-0-1 or DATCLI-2-8-0, etc.
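
To make the proposal concrete, here is a minimal sketch of how a fixed-width 20-byte client_id could be packed and unpacked. The `NAME-MAJOR-MINOR-PATCH` layout follows the examples above; the NUL padding, the function names and the ASCII-only restriction are assumptions made for illustration, not part of any Dat specification.

```python
from typing import Tuple

CLIENT_ID_LEN = 20  # field width proposed in this issue


def encode_client_id(name: str, version: Tuple[int, int, int]) -> bytes:
    """Build a fixed-width client_id such as b'BEAK-1-0-1', NUL-padded to 20 bytes.

    The NUL padding is an assumption; the proposal does not say how short
    identifiers should be padded.
    """
    text = "-".join([name] + [str(part) for part in version])
    raw = text.encode("ascii")
    if len(raw) > CLIENT_ID_LEN:
        raise ValueError("client_id does not fit in %d bytes" % CLIENT_ID_LEN)
    return raw.ljust(CLIENT_ID_LEN, b"\x00")


def decode_client_id(field: bytes) -> Tuple[str, Tuple[int, int, int]]:
    """Split a 20-byte client_id back into (name, (major, minor, patch))."""
    if len(field) != CLIENT_ID_LEN:
        raise ValueError("client_id must be exactly %d bytes" % CLIENT_ID_LEN)
    text = field.rstrip(b"\x00").decode("ascii")
    # Split from the right so a name containing '-' still parses.
    name, major, minor, patch = text.rsplit("-", 3)
    return name, (int(major), int(minor), int(patch))
```

For example, `decode_client_id(encode_client_id("DATCLI", (2, 8, 0)))` gives back `("DATCLI", (2, 8, 0))`.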

Alternatively, we could formally define a convention such that the ID handshake field is prefixed with client version information, à la BEP20, but I think that's a hack that we should avoid if at all possible.
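
For reference, this is roughly what the prefix convention looks like on the BitTorrent side. BEP 20's "Azureus-style" peer_id starts with `-`, a two-character client code, four version digits and another `-` (for example `-AZ2060-`), with the rest of the 20 bytes random. The sketch below builds and parses such an ID; the function names are made up for illustration and this describes BitTorrent peer_ids only, not the Dat id field.

```python
import os
import re
from typing import Optional, Tuple

PEER_ID_LEN = 20  # BitTorrent peer_id length

# '-', two-letter client code, four version digits, '-'
_AZ_STYLE = re.compile(rb"-([A-Za-z]{2})(\d{4})-")


def make_prefixed_id(code: str, version: str) -> bytes:
    """Return a 20-byte ID: 8-byte Azureus-style prefix plus 12 random bytes."""
    prefix = b"-" + code.encode("ascii") + version.encode("ascii") + b"-"
    if not _AZ_STYLE.fullmatch(prefix):
        raise ValueError("expected a 2-letter code and a 4-digit version")
    return prefix + os.urandom(PEER_ID_LEN - len(prefix))


def parse_prefixed_id(peer_id: bytes) -> Optional[Tuple[str, str]]:
    """Return (client_code, version_digits), or None if the prefix is absent."""
    match = _AZ_STYLE.match(peer_id)
    if match is None:
        return None
    return match.group(1).decode("ascii"), match.group(2).decode("ascii")
```

The downside is visible in the code: the client information eats into the random part of the ID, and a receiver can only guess whether a given ID follows the convention at all.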

Cheers,
Rich

@bnewbold
Contributor

bnewbold commented Aug 8, 2019

Hi @Miserlou!

I vaguely remember there being a public conversation about including a "user-agent" field in the protocol handshake at some point, but can't find any reference now.

My first instinct was "yeah, of course we should have user-agent header/field, every protocol should!". Some counterarguments/philosophies I remember us coming up with were:

  • user-agent headers add bits of entropy and "fingerprint" to otherwise anonymous clients, and can contribute to the creation of "permacookies" that identify human users and devices across connections and sessions, which erodes end-user privacy. This is particularly true when patch-level versions are included in the string.
  • feature flags ("extensions") and client behavior are the correct pieces of information to use when altering behavior. For instance, if a peer is "just leeching" (which in the context of dat should not even be considered anti-social behavior in most cases; perhaps the peer has a pay-per-byte data plan or low mobile battery charge, or is sensitive to privacy and doesn't want to join the swarm), it is more robust to track that behavior than to try and infer it from user-agent
  • related to the above, changing behavior based on user-agent leads to a lot of complexity and tangled hierarchies; the HTTP/browser/server cross-platform support situation can be considered an anti-pattern to avoid. It can lead to people faking the user-agent string to get the behavior they want, and implementations including large weird "agent to feature" tables internally which immediately get out of date
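
As an illustration of the second point, tracking observed behavior per peer needs no identification at all. The sketch below is hypothetical: `PeerStats`, `is_leeching` and both thresholds are invented for this example and are not part of any Dat implementation, and whether leeching should be acted on at all is a policy question for the operator.

```python
from dataclasses import dataclass


@dataclass
class PeerStats:
    """Byte counters observed for one connected peer."""
    bytes_sent_to_peer: int = 0
    bytes_received_from_peer: int = 0


def is_leeching(stats: PeerStats,
                grace_bytes: int = 1_000_000,
                min_ratio: float = 0.1) -> bool:
    """Flag a peer by what it does, not by what it claims to be.

    Both thresholds are arbitrary illustration values: a peer is only
    judged after we have sent it `grace_bytes`, and is flagged when it has
    returned less than `min_ratio` of that amount.
    """
    if stats.bytes_sent_to_peer < grace_bytes:
        return False  # too little traffic to judge
    ratio = stats.bytes_received_from_peer / stats.bytes_sent_to_peer
    return ratio < min_ratio
```

A peer that spoofs its user-agent is still caught by a check like this, which is the robustness argument made above.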

On the other hand, maybe the benefits outweigh the downsides?

An unmentioned potential advantage would be giving folks insight into what client software, and potentially which versions, are being used "in the wild". Personally I find this to be not worth it, and almost an anti-feature, as I don't think that "market share" is the best metric to use when judging or discussing software (it's just easy to measure), as opposed to quality, "hours well spent", bug reports, user experience, maintainability, etc (I acknowledge that this is probably a fringe position).

From a technical/protocol standpoint, I think the correct place to implement this would be either: a) extend the Handshake protobuf message type with a new userAgent field (a UTF-8 string, not limited to a fixed byte length) or b) packed into userData nested protobuf messages as a default/expected field. To be clear, the id field (bytes) in the Handshake message should not be used to try and encode user-agent or any other metadata; it should be a random nonce-like identifier to prevent clients from double-connecting over different transports (e.g., IPv4 + IPv6).
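
A rough sketch of option (a) and of the role of the id field, using a plain Python dataclass as a stand-in for the protobuf message. The `user_agent` attribute, the `ConnectionRegistry` class, the `"DATCLI/2.8.0"` string format and the 32-byte id length are all assumptions for illustration; the real Handshake message is defined by the protocol's protobuf schema.

```python
import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Handshake:
    """Stand-in for the Handshake message (not the real schema)."""
    id: bytes                         # random, nonce-like; carries no metadata
    user_agent: Optional[str] = None  # option (a): optional UTF-8 string


class ConnectionRegistry:
    """Uses the random id to detect the same peer on a second transport."""

    def __init__(self) -> None:
        self._transport_by_id: Dict[bytes, str] = {}

    def accept(self, handshake: Handshake, transport: str) -> bool:
        """Return False if a connection with this id is already open."""
        if handshake.id in self._transport_by_id:
            return False
        self._transport_by_id[handshake.id] = transport
        return True


# Usage: the same peer arriving over IPv4 and then IPv6 is only accepted once.
hello = Handshake(id=os.urandom(32), user_agent="DATCLI/2.8.0")
registry = ConnectionRegistry()
first = registry.accept(hello, "ipv4")   # True
second = registry.accept(hello, "ipv6")  # False, duplicate id
```

Keeping the two fields separate means the id stays fully random, and a client that prefers not to identify itself simply leaves `user_agent` unset.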

(open to other ideas on technical implementation though, that's just off the top of my head)

@Miserlou
Author

Miserlou commented Aug 8, 2019

Hi Bryan - thanks for your reply!

> I vaguely remember there being a public conversation about including a "user-agent" field in the protocol handshake at some point, but can't find any reference now.

Ah, okay. I think it'd be really good to dig that up if you can so we don't retread too much old ground.

Where would such a conversation happen anyway? There are so many different dat repos, from Beaker down to hypercore-crypto, I don't know where different issues should be discussed.

> otherwise anonymous clients

Surely this isn't a design requirement though? Anonymity can't just be an afterthought, or a "feature" - a protocol is either anonymous or it isn't. Since this is open P2P and there's no mixing, it's not going to be even remotely anonymous unless Tor/I2P integration is a future design requirement, like for the Onion Browser. In any case, there are hundreds of other ways to fingerprint clients, so I don't think fingerprinting alone is a good reason not to have a client identifier.

> feature flags ("extensions") and client behavior are the correct pieces of information to use when altering behavior.

I'd agree that extensions and behavior are good pieces of information to use, but I don't think that means that they're the correct pieces of information to use. If you ask any private tracker operator, they'll tell you that the client version is the correct thing to check, and that you only verify behavior with peers if you are suspicious that a user is a cheating leech. The reason is very simple - most of the leeching clients announce themselves as such, so in practice it's very easy to 80/20.

Ultimately, any determined individual can spoof a user agent, so that really isn't the point anyway - it's just another mechanism for service operators and users to ensure they're providing and receiving the best service. For instance, maybe a popular client has an accidental bug that ends up DoS'ing all of their peers? Without a user agent, how can we easily prevent those malicious connections from completing? And so on.
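
As a sketch of that last scenario, an operator could refuse handshakes from versions of a client known to carry a bug. Everything here is hypothetical: the `NAME/MAJOR.MINOR.PATCH` string format, the `BUGGYCLIENT` entry and the function names are invented for illustration, and since the string is self-reported this only filters honest but broken clients.

```python
from typing import Dict, Optional, Tuple

Version = Tuple[int, ...]

# Hypothetical table: client name -> first version in which the bug is fixed.
FIXED_IN: Dict[str, Version] = {"BUGGYCLIENT": (1, 4, 2)}


def parse_user_agent(user_agent: str) -> Tuple[str, Version]:
    """Split 'Name/1.2.3' into ('NAME', (1, 2, 3)); the format is assumed."""
    name, _, version_text = user_agent.partition("/")
    version = tuple(int(p) for p in version_text.split(".") if p.isdigit())
    return name.upper(), version


def should_refuse(user_agent: Optional[str]) -> bool:
    """Refuse only clients that identify as a known-buggy version."""
    if not user_agent:
        return False  # anonymous peers are not penalized in this sketch
    name, version = parse_user_agent(user_agent)
    fixed_in = FIXED_IN.get(name)
    # A listed client with no parseable version compares as older than any fix.
    return fixed_in is not None and version < fixed_in
```

Here `should_refuse("BuggyClient/1.4.1")` is true while `should_refuse("BuggyClient/1.4.2")` is not, so peers get through again as soon as they upgrade.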

(Also - how are clients supposed to implement "navigator.userAgent" in the client? Beaker already identifies as BeakerBrowser in the userAgent, so any "anonymity" provided by not including it in the handshake is already lost since it can be taken via the application content.)

I agree with you that "market share" is a vanity metric and we should instead just aim to produce quality software, but I think user agents will actually greatly, greatly help with bug reports and debugging client interoperability, and thus lead to more robust software. For instance, if I'm writing some server software, and I see that there is a client that keeps doing something "weird" that causes my server to hang, I'd much rather know that client's user agent so that I can try to reproduce it myself rather than only know that it's just "some client" coming from some IP and that's it.

So anyway, my vote is that we should go with your "a" proposal - a new userAgent field!

But! I'd also like it if we could get a better explanation of what userData is actually for - is it supposed to be like a POST?

(BTW - is there a "vision document" or something like that for Dat? I want to see a robust P2P web, but it seems like mayyybe what I want isn't exactly what Dat is aiming to provide, and that's okay, but maybe if there were a page describing some of the goals and practical applications of the project, then I could have a better understanding of what this is actually all about?)
