We are experiencing frequent timeout errors during the health checks of our services. While the health APIs work fine when invoked directly from the nodes, we encounter issues during the health checks performed by Kong's lua-resty-healthcheck library.
The timeout errors are logged as follows:
Unhealthy TIMEOUT increment (10/3) for 'my-service.my-domain.com(10.123.321.234:443)', context: ngx.timer
Failed to receive status line from 'my-service.my-domain.com(10.123.321.234:443)': timeout, context: ngx.timer
Failed SSL handshake with 'my-service.my-domain.com(10.123.321.234:443)': handshake failed, context: ngx.timer
It is important to note that this issue affects specific upstreams, and only one or two pods at a time experience this problem. The upstreams remain in an unhealthy state and do not recover automatically. The issue is resolved temporarily by restarting the affected Kong pod, which sets the upstream to a healthy state again.
Upon investigating the code used by Kong's lua-resty-healthcheck library, it appears that the health check query is performed using HTTP/1.0. The relevant code snippet is as follows:
local request = ("GET %s HTTP/1.0\r\n%sHost: %s\r\n\r\n"):format(path, headers, hostheader or hostname or ip)
Considering this, we suspect that the timeouts may be related to the use of HTTP/1.0 rather than HTTP/1.1. We believe that updating the health check request to use HTTP/1.1 might help mitigate these timeout errors.
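The change being proposed can be sketched as follows (shown in Python purely for illustration; it mirrors the Lua format string above, and `build_request` is a hypothetical helper, not part of lua-resty-healthcheck). One caveat worth noting: HTTP/1.1 connections are persistent by default, so an explicit `Connection: close` header would likely be needed to preserve the checker's close-after-each-probe behavior.

```python
# Hypothetical sketch: assembling the health-check request line with
# HTTP/1.1 instead of HTTP/1.0. This mirrors the Lua snippet
#   ("GET %s HTTP/1.0\r\n%sHost: %s\r\n\r\n"):format(path, headers, host)
# but adds "Connection: close" because HTTP/1.1 defaults to keep-alive.

def build_request(path: str, headers: str, host: str) -> str:
    # headers is a pre-formatted block of "Name: value\r\n" lines, as in
    # the Lua code; it may be empty.
    return "GET {} HTTP/1.1\r\nConnection: close\r\n{}Host: {}\r\n\r\n".format(
        path, headers, host
    )

request = build_request("/health", "", "my-service.my-domain.com")
print(request.splitlines()[0])  # → GET /health HTTP/1.1
```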
We kindly request that the lua-resty-healthcheck library be updated to use HTTP/1.1 for health checks. This change should improve the reliability of the health checks and prevent upstreams from getting stuck in an unhealthy state.
Thanks for this deep investigation and kind request. Just to make sure we're not having parallel conversations, I'd like to link this one to #128, as it seems to be a duplicate of that one 🤔