Implement Circuit-breaker assistance for recovery of a failing Akka Cluster #13

arunkpatra · 2020-06-19T07:44:25Z

Implement Circuit-breaker at appropriate inter-service interactions
The two micro-services, the API and the Backend, talk over gRPC. Typically, Thingverse would be installed in production on a Kubernetes cluster. We leverage the Linkerd service mesh for TLS offloading, retries and load balancing.

As discussed in #12, a network partition is a dreaded situation where things spiral out of control. So we need to be very calculated and conservative with retry logic, so that we do not further degrade an already precarious situation.

What could be done
I believe we could start with conservative retry mechanisms to deal with the situation. Beyond that, we should cut off inbound traffic to the failing nodes altogether and see whether the remaining nodes can handle requests, if the cluster is still in a healthy state. At the moment, we are focussing on dealing with retries at the service mesh layer, but we need to design with the Omnibus release in mind, which would run outside of K8s (2.x release train?).
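To make the idea concrete, here is a minimal, language-agnostic sketch of a circuit breaker combined with a deliberately conservative retry. It is written in Python only for brevity; it is not Thingverse code. All names (`CircuitBreaker`, `call_with_retry`, `max_failures`, `reset_timeout`) and the default values are assumptions made for illustration, and the sketch is not thread-safe. In the actual JVM code base, something like Akka's own `akka.pattern.CircuitBreaker` may well be a better fit.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before tripping
        self.reset_timeout = reset_timeout  # seconds to stay open before probing
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Cut traffic to the failing target instead of piling on.
            raise CircuitOpenError("backend marked unhealthy; call rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip the breaker, or re-trip it after a failed half-open probe.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, fn, max_attempts=2, base_delay=0.1, sleep=time.sleep):
    """Retry a few times with exponential backoff, but never against an open circuit."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not retry while the breaker is open
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The point of the design is that retries stay small and bounded, and once the breaker opens, callers fail fast until a single probe after `reset_timeout` shows the target has recovered. Whether this logic should live in the service mesh, in the API service, or in both is exactly the open question for the non-K8s Omnibus release.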

arunkpatra added the enhancement (New feature or request) and research (Requires specialized research) labels Jun 19, 2020

arunkpatra added this to the Thingverse - v1.0.0.M3 milestone Jun 19, 2020
