Implement Circuit-breaker at appropriate inter-service interactions
The two microservices, the API and the Backend, talk over gRPC. In production, Thingverse would typically be installed on a Kubernetes cluster. We use the Linkerd service mesh for TLS offloading, retries, and load balancing.
As discussed in #12, a network partition is a dreaded situation in which things can spiral out of control. We therefore need to be deliberate and conservative with retry logic, so that retries do not further degrade an already precarious situation.
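To make "conservative" concrete, here is a minimal sketch of a bounded retry with capped exponential backoff and jitter. It is an illustration only, not Thingverse or Linkerd code: `call_with_retry`, its parameters, and the default values are all hypothetical, and `ConnectionError` stands in for whatever transient failure the gRPC client would surface.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    op: Callable[[], T],
    max_attempts: int = 3,        # hypothetical default; total tries, not extra retries
    base_delay: float = 0.1,      # seconds; upper bound of the first backoff
    max_delay: float = 1.0,       # seconds; cap on any single backoff
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call op(), retrying only on ConnectionError, at most max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            # Out of attempts: surface the failure instead of piling on more load.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff, capped, with jitter so that many clients
            # do not retry in lockstep against a struggling backend.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```

The attempt limit and the delay cap are the two knobs that keep retries from amplifying load during a partition; the right values would have to come from measurements, not from this sketch.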
What could be done
I believe we could start with conservative retry mechanisms to deal with the situation. Beyond that, we should cut off inbound traffic to the failing nodes altogether and see whether the remaining nodes can handle the requests, provided the cluster is still in a healthy state. At the moment we are focusing on handling retries at the service mesh layer, but the design needs to keep in mind the Omnibus release, which would run outside of K8s (2.x release train?).
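For the case where there is no service mesh (for example, the Omnibus release), an application-level breaker could look roughly like the sketch below. This is an assumption-laden illustration, not a proposed implementation: `CircuitBreaker`, `CircuitOpenError`, the thresholds, and the injected `clock` are all hypothetical names and values, and the sketch is single-threaded (a real breaker would need locking and would likely limit the number of half-open probes).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,     # hypothetical: consecutive failures before opening
        reset_timeout: float = 30.0,    # hypothetical: seconds to wait before probing again
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"  # timeout elapsed: allow a trial call through
        return "open"

    def call(self, op: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast: do not send traffic to a node believed to be failing.
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = op()
        except Exception:
            self._record_failure()
            raise
        # Any success (closed or half-open) resets the breaker.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed half-open probe re-opens immediately; otherwise open
        # once the consecutive-failure threshold is reached.
        if self._opened_at is not None or self._failures >= self._failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap each outbound gRPC call in `breaker.call(...)`, typically with one breaker per downstream node, so that a failing node is cut off while healthy nodes keep serving.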