Feature Request

Describe your feature request related problem

In large TiDB clusters with hundreds of TiDB and TiKV nodes, the PD leader can become overwhelmed during certain failure conditions, leading to a "retry storm" or other feedback-loop scenarios. Once triggered, PD enters a metastable state and cannot recover on its own, leaving the cluster degraded or unavailable. Existing mechanisms, such as those discussed in Issue #4480 and PR #6834, introduce rate-limiting and backoff strategies, but they are insufficient: a high volume of traffic can overload PD even before the server-side limits are reached.
Describe the feature you'd like
I propose implementing a circuit breaker pattern to protect the PD leader from overloading due to retry storms or similar feedback loops. This circuit breaker would:
- Actively monitor incoming requests through both gRPC and HTTP channels from the TiDB, TiKV, TiFlash, CDC, and PD-ctl components.
- Trip when a predefined threshold of errors, retries, or resource exhaustion is detected, preventing further requests from overwhelming the PD leader.
- Allow PD to enter a fail-fast state, limiting incoming traffic until the system has had time to recover or the underlying issue has been resolved.
- Gradually restore normal operation by letting a limited number of requests flow through once the circuit has cooled down and conditions have stabilized.
This feature is especially critical for large clusters where a high number of pods can continuously hammer a single PD instance during failures, leading to cascading effects that worsen recovery times and overall cluster health.
Describe alternatives you've considered
While existing solutions like rate-limiting in Issue #4480 and PR #6834 provide some protection, they are reactive and dependent on the server-side limiter thresholds being hit. These protections do not adequately account for sudden traffic spikes or complex feedback loops that can overload PD before those thresholds are reached. A proactive circuit breaker would mitigate these scenarios by preemptively tripping before PD becomes overwhelmed, ensuring a smoother recovery process.
Teachability, Documentation, Adoption, Migration Strategy

Introducing the circuit breaker pattern would likely require adjustments to the client request logic across the TiDB, TiKV, TiFlash, CDC, and PD-ctl components. The feature could be made configurable, allowing users to set custom thresholds and recovery parameters to fit their specific cluster sizes and workloads.
Documentation would need to include clear guidelines on:
- How the circuit breaker operates across the different components.
- Configurable options for tuning the circuit breaker thresholds.
- Best practices for monitoring and adjusting circuit breaker behavior to avoid false positives or unnecessary tripping.
Scenarios where this feature could be helpful include:
- A large number of TiDB nodes restarting simultaneously after a failure, causing a surge of reconnection and region-cache-rebuild requests to the PD leader.
- Continuous retries from components such as TiKV, TiDB, TiFlash, or CDC during network partitions, creating feedback loops that overwhelm PD.