prevent event loop polling from stopping on active redis connections #1734

pawl · 2023-05-26T16:14:33Z

Caused by: #1476

Currently, the Redis transport always removes the last function added to the event loop (on_tick), regardless of which connection is disconnected. If you restart Redis enough times (repeatedly causing connection errors), _on_connection_disconnect will eventually remove the event loop function that belongs to the only active connection, and the worker will get stuck.

I've been seeing these issues with workers getting stuck after I migrated from the RabbitMQ broker to Redis on Celery 5.2.6 and Kombu 5.2.3.

This PR changes the transport to track which event loop function belongs to which connection, so that a disconnect removes only that connection's function from on_tick.
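To illustrate the difference described above, here is a minimal sketch that does not use kombu at all. `ToyLoop`, `ToyTransport`, `disconnect_buggy`, and `disconnect_fixed` are hypothetical names invented for this example; only `on_tick` and `on_poll_start` echo names from the discussion, and the real transport code is structured differently.

```python
class ToyLoop:
    """Stand-in for an event loop that calls every function in on_tick."""

    def __init__(self):
        self.on_tick = set()


class ToyTransport:
    """Hypothetical transport that registers one tick function per connection."""

    def __init__(self, loop):
        self.loop = loop
        self._last_added = None        # the only thing the buggy path knows about
        self._tick_by_connection = {}  # per-connection bookkeeping for the fixed path

    def register(self, connection_id):
        def on_poll_start():
            # A real callback would start polling; here it just names its owner.
            return connection_id

        self.loop.on_tick.add(on_poll_start)
        self._last_added = on_poll_start
        self._tick_by_connection[connection_id] = on_poll_start

    def disconnect_buggy(self, connection_id):
        # Ignores connection_id and removes whichever function was added last.
        self.loop.on_tick.discard(self._last_added)

    def disconnect_fixed(self, connection_id):
        # Removes only the function that belongs to the disconnected connection.
        callback = self._tick_by_connection.pop(connection_id, None)
        if callback is not None:
            self.loop.on_tick.discard(callback)


def polled_connections(loop):
    """Return the owners of the tick functions that are still registered."""
    return sorted(tick() for tick in loop.on_tick)


def run_demo():
    buggy = ToyTransport(ToyLoop())
    buggy.register("old")
    buggy.register("active")
    buggy.disconnect_buggy("old")  # removes the tick function of "active"

    fixed = ToyTransport(ToyLoop())
    fixed.register("old")
    fixed.register("active")
    fixed.disconnect_fixed("old")  # removes the tick function of "old"

    return polled_connections(buggy.loop), polled_connections(fixed.loop)


if __name__ == "__main__":
    print(run_demo())  # (['old'], ['active'])
```

In the buggy path, the loop keeps ticking for the dead connection while the live one is no longer polled, which matches the stuck-worker symptom described above.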

pawl · 2023-05-26T17:33:12Z

@auvipy @thedrow would either of you guys be able to review this? And do you think this is the right approach?

I was able to reproduce this issue locally using this example repo: https://github.com/pb-dod/celery_pyamqp_memory_leak/tree/test_redis_disconnect

I also verified that this fix works there.

I think this fix is somewhat critical, because it's likely making running Celery on the Redis broker unreliable for everyone who upgraded to at least v5.2.3.

kombu/transport/redis.py

pawl · 2023-05-30T23:29:25Z

@auvipy I tested this change in production today, and I'm still seeing the same behavior where workers stop responding to the worker heartbeats after about an hour (in the graph, the red lines mark the start and finish of a deployment).

I have INFO level logging turned on, and I see no indication in the logs about why the workers get stuck (the last log message is often a successful task run). Workers on different queues with very different workloads get stuck too.

I'm not seeing the issue with workers getting stuck when I use RabbitMQ as a broker.

I think this PR fixes an issue where workers get stuck after several Redis disconnections, but now I'm not sure it fixes the primary cause of celery/celery#7276. This also explains why I was seeing this issue without any Redis connection errors in my logs.

pawl · 2023-05-31T05:15:32Z

@auvipy I added the integration test: 34e366d

It looks like it's revealing a problem. Good call on adding it!

I assumed the two connection objects here were the same:

kombu/kombu/transport/redis.py

Line 1307 in 2df5be2

def _on_disconnect(connection):

kombu/kombu/transport/redis.py

Line 1300 in 2df5be2

def register_with_event_loop(self, connection, loop):

However:

- register_with_event_loop's connection is a Transport.
- _on_disconnect's connection is a Redis Connection.

The way I have it currently won't work because I'm adding Transports as keys to on_poll_start_by_connection and those will never match with Redis connection objects.

I'll need to re-think the solution for this.
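The mismatch described above can be sketched with plain Python objects. `FakeTransport` and `FakeRedisConnection` are hypothetical stand-ins invented for this example, not kombu or redis-py classes; only the dictionary name `on_poll_start_by_connection` comes from the comment above.

```python
class FakeTransport:
    """Stand-in for the object passed to register_with_event_loop."""


class FakeRedisConnection:
    """Stand-in for the object passed to _on_disconnect."""


def lookup_after_disconnect():
    on_poll_start_by_connection = {}
    transport = FakeTransport()
    redis_connection = FakeRedisConnection()

    def on_poll_start():
        pass

    # Registration stores the callback under the Transport object.
    on_poll_start_by_connection[transport] = on_poll_start

    # The disconnect handler receives a different object, so the lookup misses
    # and the stale callback is never removed from the mapping.
    found = on_poll_start_by_connection.get(redis_connection)
    return found, len(on_poll_start_by_connection)


if __name__ == "__main__":
    print(lookup_after_disconnect())  # (None, 1)
```

Because plain objects hash by identity, a key of one type can never match an instance of the other, so any per-connection mapping needs to be keyed by something both code paths can see.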

What I tested in production today effectively disabled the fix from #1476 and workers were still getting stuck.

auvipy · 2024-06-16T05:45:40Z

Can we close this in favor of #2007?

Also, should we try to extract the relevant integration tests from here and add them to the test suite?

pawl force-pushed the fix_redis_reconnect branch from b584059 to 2e9f60b on May 26, 2023 22:15

auvipy requested review from Nusnus and auvipy May 27, 2023 05:46

auvipy added this to the 5.3 milestone May 27, 2023

auvipy requested a review from thedrow May 27, 2023 05:50

pawl mentioned this pull request May 29, 2023

fix: Prevent redis task loss when closing connection while in poll #1733

Merged

auvipy requested changes May 30, 2023

pawl added a commit to pawl/kombu that referenced this pull request May 31, 2023

add integration test for celery#1734

34e366d

auvipy modified the milestones: 5.3, 5.3.x Jun 1, 2023

pawl mentioned this pull request Jun 22, 2023

Pass socket_keepalive_options to redis client for result_backend celery/celery#8297

Open

awmackowiak mentioned this pull request May 23, 2024

Fix Redis connections after reconnect - consumer starts consuming the tasks after crash. awmackowiak/kombu#1

Merged

pawl mentioned this pull request May 27, 2024

Fix Redis connections after reconnect - consumer starts consuming the tasks after crash. #2007

Merged

auvipy requested changes Jun 16, 2024

pawl added 7 commits June 24, 2024 13:56

add test to reproduce celery issue 7276 (redis worker stuck)

5b29d7b

prevent event loop polling from stopping on active redis connections

fe3fb74

fix test_register_with_event_loop__on_disconnect__per_connection

a8ab5ad

fix line length flake8 errors

2d737c1

fix test_register_with_event_loop__on_disconnect__loop_cleanup

2d5a40b

add integration test for celery#1734

89c7228

improve test_register_with_event_loop__on_disconnect__per_connection docstring

8c16509

Nusnus force-pushed the fix_redis_reconnect branch from bab5ee5 to 8c16509 on June 24, 2024 10:56

auvipy and others added 3 commits September 15, 2024 15:11

Merge branch 'main' into fix_redis_reconnect

b007f8a

Merge branch 'main' into fix_redis_reconnect

5396bf8

Merge branch 'main' into fix_redis_reconnect

93a9c62
