Feature Request

Describe your feature request related problem

In large TiDB clusters with hundreds of TiDB and TiKV nodes, the PD leader can become overwhelmed during certain failure conditions, leading to a "retry storm" or other feedback-loop scenarios. Once triggered, PD enters a metastable state and cannot recover on its own, leaving the cluster degraded or unavailable. Existing mechanisms, such as those discussed in Issue #4480 and PR #6834, introduce rate-limiting and backoff strategies, but they are insufficient: a high volume of traffic can overload PD even before the server-side limits are reached.
Describe the feature you'd like
I propose implementing a circuit breaker pattern to protect the PD leader from overloading due to retry storms or similar feedback loops. This circuit breaker would:
- Actively monitor incoming requests through both gRPC and HTTP channels from the TiDB, TiKV, TiFlash, CDC, and PD-ctl components.
- Trip when a predefined threshold of errors, retries, or resource exhaustion is detected, preventing further requests from overwhelming the PD leader.
- Allow PD to enter a fail-fast state, limiting incoming traffic until the system has had time to recover or the underlying issue has been resolved.
- Gradually restore normal operation by letting a limited number of requests flow through once the circuit has cooled down and conditions have stabilized.
This feature is especially critical for large clusters where a high number of pods can continuously hammer a single PD instance during failures, leading to cascading effects that worsen recovery times and overall cluster health.
Describe alternatives you've considered
While existing solutions like rate-limiting in Issue #4480 and PR #6834 provide some protection, they are reactive and dependent on the server-side limiter thresholds being hit. These protections do not adequately account for sudden traffic spikes or complex feedback loops that can overload PD before those thresholds are reached. A proactive circuit breaker would mitigate these scenarios by preemptively tripping before PD becomes overwhelmed, ensuring a smoother recovery process.
Teachability, Documentation, Adoption, Migration Strategy

Introducing the circuit breaker pattern would likely require adjustments to the client request logic across the TiDB, TiKV, TiFlash, CDC, and PD-ctl components. The feature could be made configurable, allowing users to set custom thresholds and recovery parameters to fit their specific cluster sizes and workloads.
Documentation would need to include clear guidelines on:
- How the circuit breaker operates across the different components.
- Configurable options for tuning the circuit breaker thresholds.
- Best practices for monitoring and adjusting circuit breaker behavior to avoid false positives or unnecessary tripping.
Scenarios where this feature could be helpful include:
- A large number of TiDB nodes restarting simultaneously after a failure, causing a surge of reconnection and region-cache-rebuild requests to the PD leader.
- Continuous retries from components such as TiKV, TiDB, TiFlash, or CDC during network partitions, creating feedback loops that overwhelm PD.