We are experiencing frequent timeout errors during the health checks of our services. While the health APIs work fine when invoked directly from the nodes, we encounter issues during the health checks performed by Kong's lua-resty-healthcheck library.
The timeout errors are logged as follows:
Unhealthy TIMEOUT increment (10/3) for 'my-service.my-domain.com(10.123.321.234:443)', context: ngx.timer
Failed to receive status line from 'my-service.my-domain.com(10.123.321.234:443)': timeout, context: ngx.timer
Failed SSL handshake with 'my-service.my-domain.com(10.123.321.234:443)': handshake failed, context: ngx.timer
It is important to note that this issue affects specific upstreams, and only one or two pods at a time experience this problem. The upstreams remain in an unhealthy state and do not recover automatically. The issue is resolved temporarily by restarting the affected Kong pod, which sets the upstream to a healthy state again.
Upon investigating the code used by Kong's lua-resty-healthcheck library, it appears that the health check query is performed using HTTP/1.0. The relevant code snippet is as follows:
local request = ("GET %s HTTP/1.0\r\n%sHost: %s\r\n\r\n"):format(path, headers, hostheader or hostname or ip)
Considering this, we suspect that the timeouts may be related to the use of HTTP/1.0 rather than HTTP/1.1. We believe that updating the health check request to use HTTP/1.1 might help mitigate these timeout errors.
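The change being proposed can be sketched as follows (shown in Python purely for illustration; it mirrors the Lua format string above, and `build_request` is a hypothetical helper, not part of lua-resty-healthcheck). One caveat worth noting: HTTP/1.1 connections are persistent by default, so an explicit `Connection: close` header would likely be needed to preserve the checker's close-after-each-probe behavior.

```python
# Hypothetical sketch: assembling the health-check request line with
# HTTP/1.1 instead of HTTP/1.0. This mirrors the Lua snippet
#   ("GET %s HTTP/1.0\r\n%sHost: %s\r\n\r\n"):format(path, headers, host)
# but adds "Connection: close" because HTTP/1.1 defaults to keep-alive.

def build_request(path: str, headers: str, host: str) -> str:
    # headers is a pre-formatted block of "Name: value\r\n" lines, as in
    # the Lua code; it may be empty.
    return "GET {} HTTP/1.1\r\nConnection: close\r\n{}Host: {}\r\n\r\n".format(
        path, headers, host
    )

request = build_request("/health", "", "my-service.my-domain.com")
print(request.splitlines()[0])  # → GET /health HTTP/1.1
```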
We kindly request that the lua-resty-healthcheck library be updated to use HTTP/1.1 for health checks. This change should improve the reliability of the health checks and prevent upstreams from getting stuck in an unhealthy state.
Thanks for this deep investigation and kind request. Just to make sure we're not having parallel conversations, I'd like to link this one to #128, as it seems to be a duplicate of that one 🤔