Describe the bug
To describe the bug, I'd like to look at the following "database outage" scenario:
A microservice with a fairly high workload that connects to RethinkDB
The RethinkDB server goes down (maybe due to a rolling update of a worker node in K8s or whatever...)
What can then happen is:
Response times for users of the microservice get slower and slower, even though all DB queries are run with contexts properly set (max 30s, but response times can quickly stack up to >600s)
Goroutines start to build up
Eventually the microservice gets OOM-killed
If I read the code correctly, the connection pool uses a mutex while distributing queries to a connection (to prevent concurrent creation of a new connection?). I guess that in my scenario creating a connection takes longer (because the connection has gone bad and needs to be re-created) than it takes for new requests to come in. So goroutines queue up waiting for the mutex until the database connection is re-established, which stops this behavior. In the logs of the application I eventually see the connection refused error from this driver.
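To illustrate the pattern I suspect (this is a hypothetical stand-in, not the driver's actual code; the names pool and conn are just placeholders): every caller grabs a mutex before it can use a connection, and while the holder is busy re-dialing a dead server, all other callers block in Lock() with no way to honor their context deadlines.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// pool is a hypothetical stand-in for the driver's connection pool.
type pool struct {
	mu sync.Mutex
}

// conn mimics the suspected behavior: the mutex is held while a broken
// connection is re-dialed, so every other caller blocks in Lock() with no
// way to honor its context deadline.
func (p *pool) conn(ctx context.Context) error {
	p.mu.Lock() // blocks regardless of ctx
	defer p.mu.Unlock()

	time.Sleep(5 * time.Second) // stand-in for a slow/failing reconnect
	return ctx.Err()            // ctx may long be expired by the time we get here
}

func main() {
	p := &pool{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Same 30s per-query deadline as in the real service.
			ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			defer cancel()
			_ = p.conn(ctx)
		}()
	}
	wg.Wait() // the last caller waits ~100 * 5s here, far past its 30s deadline
}
```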
This shows that using mutexes has some disadvantages in this kind of scenario, because a goroutine blocked on a mutex cannot back out even when a context is provided. From my perspective, the implementation should instead use something like go-lock or a channel-based construct, so that goroutines can be notified when the connection is ready and, at the same time, react to a cancelled context.
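A context-aware lock can be sketched with a buffered channel; this is only meant to illustrate the idea, not the exact change the driver should make:

```go
package main

import (
	"context"
	"errors"
)

// ctxMutex is a channel-based lock that can be acquired with a context,
// so waiters give up as soon as their context is cancelled.
type ctxMutex struct {
	ch chan struct{}
}

func newCtxMutex() *ctxMutex {
	return &ctxMutex{ch: make(chan struct{}, 1)}
}

// Lock blocks until the lock is free or ctx is done.
func (m *ctxMutex) Lock(ctx context.Context) error {
	select {
	case m.ch <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (m *ctxMutex) Unlock() {
	select {
	case <-m.ch:
	default:
		panic(errors.New("unlock of unlocked ctxMutex"))
	}
}
```

With something like this, the pool's conn path could call `m.Lock(ctx)` and return the context error as soon as the 30s deadline expires, instead of silently joining the queue.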
Maybe somebody else will stumble upon the same problem and this helps to better understand the observed behavior.
To Reproduce
Produce a high workload (roughly as in the sketch below)
Shut down the RethinkDB server
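A rough load-generator sketch along those lines, assuming a RethinkDB at localhost:28015 and an existing table called users (both placeholders); if I read the docs right, recent driver versions let you pass the per-query deadline via RunOpts.Context:

```go
package main

import (
	"context"
	"log"
	"time"

	r "gopkg.in/rethinkdb/rethinkdb-go.v6"
)

func main() {
	session, err := r.Connect(r.ConnectOpts{
		Address: "localhost:28015",
		MaxOpen: 10,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Fire queries continuously from many goroutines; each query carries a
	// 30s context, mirroring the setup described above. Shut down the
	// RethinkDB server while this runs and watch the goroutine count grow.
	for i := 0; i < 200; i++ {
		go func() {
			for {
				ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
				cur, err := r.Table("users").Run(session, r.RunOpts{Context: ctx})
				if err != nil {
					log.Println("query error:", err)
				} else {
					_ = cur.Close()
				}
				cancel()
			}
		}()
	}
	select {} // block forever
}
```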
Expected behavior
The queries to the database are cancelled by the context and do not queue up.
Screenshots
--> As soon as the DB server is shut down, goroutines start queueing up (how quickly depends on the workload)
--> This one is a bit complex as it was created with pprof for a real microservice, but the important information is at the bottom: goroutines are queuing up in the conn function of the connection pool
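For anyone who wants to reproduce the profile without my screenshots, exposing the standard pprof endpoints is enough to watch the goroutines stack up (the port is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// In the real service this runs next to the normal HTTP server.
	// After shutting down RethinkDB, check
	//   http://localhost:6060/debug/pprof/goroutine?debug=1
	// and look for goroutines parked in the connection pool's conn function.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```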
System info
RethinkDB Version: 2.4.1