Handling dropped Events #202

Open
neiledgar opened this issue Sep 30, 2019 · 7 comments

@neiledgar

I am trying to implement reliable event delivery using the nexus Pub/Sub. I can tune WebsocketServer.OutQueueSize to match my use case, but at some point a slow client will cause messages to be dropped:

2019/09/30 15:58:45 !!! Dropped EVENT to session 1873459742953018: blocked
2019/09/30 15:58:45 !!! Dropped EVENT to session 1873459742953018: blocked
2019/09/30 15:58:45 !!! Dropped EVENT to session 1873459742953018: blocked

However, our application has no indication this has happened. In the discussion on #159 it was suggested that the behaviour could be configured on the realm, e.g. disconnect slow clients. Are there any plans to pursue this approach? Alternatives include registering callbacks with the realm to implement application logic, or meta-data channels for discarded events.

My use case consumes the OnJoin/OnLeave meta events to detect clients connected to the router. I have a repo that demonstrates a generic loss of events with OutQueueSize=1:

go run server/server.go
go run pubsub/subscriber/subscriber.go -scheme=ws
go run pubsub/publisher/publisher.go -scheme=ws

# Errors are displayed on the subscriber trace:
SUBSCRIBER> Client 10 Error 13 expected 6
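
For reference, a rough sketch of how the router side can be wired up with a tuned queue. The address, realm URI, and v3 import paths are placeholders/assumptions, and the OutQueueSize field and http.Handler usage should be checked against the nexus version in use:

package main

import (
	"log"
	"net/http"

	"github.com/gammazero/nexus/v3/router"
	"github.com/gammazero/nexus/v3/wamp"
)

func main() {
	nxr, err := router.NewRouter(&router.Config{
		RealmConfigs: []*router.RealmConfig{
			{URI: wamp.URI("nexus.realm1"), AnonymousAuth: true},
		},
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer nxr.Close()

	wss := router.NewWebsocketServer(nxr)
	// A larger queue absorbs bursts, but a sufficiently slow client still fills it.
	wss.OutQueueSize = 128

	// The websocket server is served as an ordinary HTTP handler.
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", wss))
}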
@gammazero
Owner

There are a few ways to handle this, but the best may depend on the cause. Are you seeing this because the receiving client is slow and the transport itself is backed up, preventing the router from sending outgoing messages fast enough, or does this happen due to a burst of messages that fills up the outbound queue before messages are written to the transport?

Generally, the way this has been handled so far is to increase the OutQueueSize until it is large enough to handle message bursts. If a client is very slow and simply cannot keep up with the messages sent to it, this may not be sufficient because any size queue will eventually fill up.

The current plan for handling this, where one or a few clients are very slow and you do not want to increase OutQueueSize for all clients, is to put the messages onto an overflow queue for the client(s) that messages cannot be delivered to. The router can be configured with some retention policy for keeping the undelivered messages (up to some size, for some amount of time). However, I first need to consider whether this is compatible with, or overlaps, the message history feature proposed in the WAMP spec. The goal is to not slow down all clients while waiting to send messages to one client.

The overflow queue could be combined with some application logic triggered when overflow happens. The queue is still necessary so that there is some place to write the message, allowing the router to continue delivering to other clients without having to wait for app logic to process the message.
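
A rough sketch of the idea (not actual nexus code; names are illustrative and locking is omitted):

// overflowQueue spills undelivered messages into a bounded per-client buffer
// so the router never blocks on one slow client, and triggers app logic when
// the retention limit is exceeded.
type overflowQueue struct {
	out      chan interface{}      // normal outbound queue, drained by the transport writer
	overflow []interface{}         // retained undelivered messages
	maxKeep  int                   // retention policy: cap on retained messages
	onDrop   func(msg interface{}) // hook for app logic (disconnect client, emit meta event, ...)
}

func (q *overflowQueue) send(msg interface{}) {
	select {
	case q.out <- msg:
		return // delivered to the outbound queue without blocking
	default:
	}
	// Outbound queue is full: retain the message rather than block the router.
	if len(q.overflow) < q.maxKeep {
		q.overflow = append(q.overflow, msg)
		return
	}
	if q.onDrop != nil {
		q.onDrop(msg)
	}
}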

@neiledgar
Author

The failure occurs during a burst of messages which causes high CPU loading. Both the wamp router and wamp client are running on the same host, although the communication is over a websocket. Hence I'm not sure whether the failure is the transport being backed up or the websocketpeer not writing the data quickly enough. How would I tell?

We have already tuned the OutQueueSize to 128. We don't wish to configure this much higher since it applies across all client connections. We have a large number of producers and a single subscriber.

I like the idea of a dynamic queue that would buffer events until the slow client recovered.

In our use case, after the burst the subscriber continues to operate, not knowing it has missed a number of events. If we knew the events had been dropped (e.g. the slow client gets disconnected, or an additional meta event is published), the subscriber could take some corrective action to re-establish sync with the consumers.
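
As an interim workaround we may detect the gap ourselves; a minimal sketch, assuming the publishers attach an increasing sequence number to each event (roughly what the subscriber in the demo repo checks):

// gapDetector tracks the next expected sequence number and reports how many
// events were missed when a higher number arrives.
type gapDetector struct {
	next uint64
}

func (g *gapDetector) check(seq uint64) (missed uint64) {
	if seq > g.next {
		missed = seq - g.next // events were dropped somewhere upstream
	}
	g.next = seq + 1
	return missed
}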

@neiledgar
Author

I modified our test to use a local client inside the wamp router binary, listening for wamp.session.on_join events via client.ConnectLocal(nxr, config) to avoid any socket serialisation. I still see the dropped events with the default linkedPeersOutQueueSize = 16, so I believe the problem is the outbound queue filling up before messages are written to the transport.

On a side note: I see fewer dropped events when increasing linkedPeersOutQueueSize. Can we configure the OutQueueSize of local clients in a similar way to WebSocket.OutQueueSize and RawSocket.OutQueueSize?
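
For reference, the local client looks roughly like this (the realm name is a placeholder, and the Subscribe handler signature shown is the v3 one, so it may differ between versions):

// watchJoins connects an in-process client to the embedded router and
// subscribes to session-join meta events. Assumes imports of "log" and the
// v3 "client", "router" and "wamp" packages.
func watchJoins(nxr router.Router) error {
	cli, err := client.ConnectLocal(nxr, client.Config{Realm: "nexus.realm1"})
	if err != nil {
		return err
	}
	// These meta events still pass through the local peer's fixed-size outbound
	// queue (linkedPeersOutQueueSize), so a burst of joins can overflow it.
	return cli.Subscribe("wamp.session.on_join", func(event *wamp.Event) {
		log.Println("session joined:", event.Arguments)
	}, nil)
}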

@gammazero
Owner

Unfortunately, the local transport does not have a configurable value. It is set statically here:
https://github.com/gammazero/nexus/blob/v3/transport/localpeer.go#L9

The outbound queue implementation needs to be reexamined, since any configured outbound queue size will always be wrong for some use pattern/client/network. I am looking at using an unlimited-size queue per client, something roughly based on https://github.com/gammazero/bigchan. That can be combined with an optional, configurable policy to disconnect clients that have excessive memory usage or queue size.
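
Roughly the pattern, sketched without the bigchan API (a real implementation would add the length/memory accounting that the disconnect policy needs):

// unboundedQueue bridges a sender-facing channel to a receiver-facing channel,
// buffering any backlog in a slice so the sender never blocks.
func unboundedQueue() (chan<- interface{}, <-chan interface{}) {
	in := make(chan interface{})
	out := make(chan interface{})
	go func() {
		var backlog []interface{}
		for in != nil || len(backlog) > 0 {
			var send chan interface{}
			var next interface{}
			if len(backlog) > 0 {
				send, next = out, backlog[0]
			}
			select {
			case msg, ok := <-in:
				if !ok {
					in = nil // sender closed; finish draining the backlog
					continue
				}
				backlog = append(backlog, msg)
			case send <- next:
				backlog = backlog[1:]
			}
		}
		close(out)
	}()
	return in, out
}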

@neiledgar
Author

Making the queue dynamic sounds like the best option, as not all clients are equal and a fixed queue size for all of them is a bit of an overhead, so I'm in favour of your suggestion. I agree the queue size can't be infinite, so the behaviour when the maximum is reached has to be configured. Would this behaviour be specified in RealmConfig, or would clients request the behaviour they prefer?

A configurable linkedPeersOutQueueSize for local clients sounds like a dead end. I can refactor our local clients to use web/raw sockets in order to specify a configurable limit.

@gammazero
Owner

If the maximum queue size is configurable, it should be so on the server, since that value is associated with server resources. I am debating whether the router should drop a client that has exceeded its limits, or should emit a meta event that allows a trusted admin client to do so. The former is simpler as it does not require users to implement a client for that purpose, but the latter allows the user to perform any other cleanup and notification work they may want if a client is dropped.

As far as refactoring your clients, the raw socket transport can use unix sockets, which will be more efficient, if that is an option for you. You could also change the value in a fork of this project, but as you said, that is a dead end.

@neiledgar
Author

The limit on the queue size should certainly be a server configuration option. However, the behaviour when the limit is reached (drop client, drop event, emit meta event) might be something a client could request. Some clients may tolerate dropped events; other clients may not.

For my use case I am happy to specify the same behaviour for all clients in the realm.
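
To make it concrete, a purely hypothetical shape for such a setting (none of these names exist in nexus today):

// SlowClientPolicy says what the router should do when a client's queue limit is hit.
type SlowClientPolicy int

const (
	DropEvents    SlowClientPolicy = iota // current behaviour: log and discard
	DropClient                            // disconnect the offending session
	EmitMetaEvent                         // publish a meta event so a trusted client can act
)

// QueueLimits would be set per realm (or per client, if requested at join time).
type QueueLimits struct {
	MaxQueuedMessages int              // per-client cap before Policy applies
	Policy            SlowClientPolicy // behaviour when the cap is reached
}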
