fix(core): fix a deadlock issue on coroutine mutex #13019
Conversation
kong/concurrency.lua
Outdated
-- to resolve deadlock issue in case the worker event thread
-- associated with `pcall(fn)` below got killed
-- and the `:post(1)` followed is skipped.
if semaphore:count() <= 0 then
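The failure mode this guard targets, and the recovery it performs, can be modeled with a minimal asyncio sketch. This is an assumption about the patch's intent, not Kong's actual Lua code; `asyncio.Semaphore` stands in for `resty.semaphore`, and `locked()`/`release()` stand in for `count() <= 0`/`post(1)`:

```python
import asyncio

async def buggy_job(sem):
    # Model of the original code path: the light thread is killed
    # between acquire() and release(), so the permit is never returned.
    await sem.acquire()
    # ... thread killed here: no release() ever happens ...

async def guarded_wait(sem, timeout=0.1):
    # Model of the guard's intent: if the permit was lost because a
    # previous job was killed, repost one so later jobs do not dead-wait.
    if sem.locked():          # analogous to semaphore:count() <= 0
        sem.release()         # analogous to semaphore:post(1)
    await asyncio.wait_for(sem.acquire(), timeout)

async def main():
    sem = asyncio.Semaphore(1)
    await buggy_job(sem)      # the permit is now lost

    # Without the guard, the next job times out -- this is the deadlock:
    try:
        await asyncio.wait_for(sem.acquire(), 0.1)
        deadlocked = False
    except asyncio.TimeoutError:
        deadlocked = True

    # With the guard, the permit is restored and the job proceeds:
    await guarded_wait(sem)
    return deadlocked

deadlocked = asyncio.run(main())
print(deadlocked)  # True: the unguarded wait timed out
```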
- @samugi we should at least post 1 to unlock loop dependencies. Posting 1 is safe, just as is done here: https://github.com/Kong/kong/blob/master/kong/concurrency.lua#L101-L110.
- @dndx the `pcall` failure has multiple causes. It runs within a timer thread spawned by worker events. If the thread aborts abnormally in the middle of `pcall`, there is no chance to call `semaphore:post(1)`.
Why would the thread abort in the worker event callback? Is this an intended behavior?
@dndx for example, if worker 0 runs out of memory and is OOM-killed, other workers can no longer connect to it and error out. Those workers then close the reconfigure thread while it is still applying new config (inside `pcall`): https://github.com/Kong/lua-resty-events/blob/main/lualib/resty/events/worker.lua#L249-L253.
Again, this has nothing to do with workers being killed. Coroutine mutexes don't have any meaning across process boundaries, @outsinre, so please stop spreading this misinformation. I already answered in the original PR (which is exactly the same as this one). And stop reopening this. You are just creating a bug here (a much bigger one than the original it tries to fix).
Yes, this may happen because other workers prematurely kill their light threads (like event callbacks) when the broker is killed by OOM. But that is the bug: prematurely killing event callbacks.
@outsinre why is this reopened? I already said that this defeats the whole purpose of the coroutine mutex.
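This objection can be made concrete: if a `count() <= 0` guard reposts a permit while another coroutine legitimately holds the mutex, two jobs can run at once. Below is a deliberately naive Python model of that failure (an illustration only, ignoring any `running_jobs` bookkeeping the actual patch may also perform; `asyncio.Semaphore` stands in for `resty.semaphore`):

```python
import asyncio

async def main():
    sem = asyncio.Semaphore(1)   # stand-in for the coroutine mutex
    running = []

    async def naive_job(name):
        # The naive guard: repost whenever no permit is available,
        # even though the current holder is alive and well.
        if sem.locked():
            sem.release()
        await sem.acquire()
        running.append(name)
        await asyncio.sleep(0.05)  # simulated job body
        running.remove(name)
        sem.release()

    await sem.acquire()                      # a legitimate, live holder
    task = asyncio.create_task(naive_job("b"))
    await asyncio.sleep(0.01)
    overlap = "b" in running                 # "b" ran while the mutex was held
    sem.release()
    await task
    return overlap

overlap = asyncio.run(main())
print(overlap)  # True: mutual exclusion is broken
```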
Coroutine mutex uses a semaphore to ensure only one job is running for a specific event (e.g. reconfigure_handler). However, if the coroutine job is wrapped within a light thread (e.g. worker event's event_thread) and the associated thread was killed in the middle of the job, the semaphore post would be skipped. Subsequent jobs of the same type would wait and time out, as there is no semaphore resource available.
Force-pushed from 08caa04 to 51f8031.
Force-pushed from d2583eb to 6be1650.
@@ -125,11 +135,13 @@ function concurrency.with_coroutine_mutex(opts, fn)
      end
    end

    running_jobs[opts_name] = true
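Read together with the `count() <= 0` guard shown earlier, the visible fragments suggest the safe condition for reposting combines the semaphore count with this `running_jobs` flag. A hypothetical Python sketch of that predicate (the combined check is an assumption reconstructed from the diff fragments, not the verbatim patch):

```python
# Hypothetical model: repost a lost permit only when the count is exhausted
# AND no job of this name is actually running (i.e. the previous holder was
# killed before it could post).
running_jobs = {}

def should_repost(sem_count: int, name: str) -> bool:
    return sem_count <= 0 and not running_jobs.get(name)

running_jobs["reconfigure"] = True
alive = should_repost(0, "reconfigure")   # False: a live job holds the mutex
del running_jobs["reconfigure"]
lost = should_repost(0, "reconfigure")    # True: the permit was lost
print(alive, lost)  # False True
```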
The chance of being killed here is quite low.
Will handle this in the events lib.
Summary
fix(core): fix a deadlock issue on coroutine mutex

Coroutine mutex uses a semaphore to ensure only one job is running for a specific event (e.g. `reconfigure_handler`). However, if the coroutine job is wrapped within a light thread (e.g. worker event's `event_thread`) and the associated thread was killed in the middle of the job, the semaphore post that follows would be skipped.

Light threads would be killed when they cannot connect to worker 0, which is common if worker 0 is OOM-killed by the OS. All subsequent jobs of the same event would wait and time out, as there is no semaphore resource available.

Please refer to the analysis here: https://konghq.atlassian.net/browse/FTI-5930?focusedCommentId=131903.
Checklist
- `changelog/unreleased/kong` changelog added, or `skip-changelog` label added on PR if changelog is unnecessary.
- README.md

Issue reference
Fix #FTI-5930