Description
Hi,
This applies to 1.4.1, 1.4.3 and 1.4.4:
One of our Python applications has issues with the consumer disconnecting from the broker:
[W 181210 02:02:13 base:964] Heartbeat session expired, marking coordinator dead
[W 181210 02:02:13 base:698] Marking the coordinator dead (node 1002) for group cg_0: Heartbeat session expired.
After some investigation, I ran a packet capture to verify that the heartbeat is actually sent at the configured interval (10s in our case).
The results were quite surprising. Here is the timeline of heartbeats (capture timestamps, in seconds):
16.63
26.72
36.79
51.91
72.02
112.20
132.33
142.43
177.62
187.69
197.77
222.94
233.01
243.10
283.40
Not only are the heartbeats completely inconsistent, there are gaps of > 40 seconds.
Our session_timeout was 30 seconds, which explained the consumer disconnects.
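To make the gaps explicit, here is a minimal sketch that computes the interval between consecutive heartbeats from the capture above and lists those exceeding the session timeout. It assumes the timestamps are in seconds; the variable names are only illustrative.

```python
# Heartbeat timestamps copied from the packet capture (assumed to be seconds).
heartbeat_times = [
    16.63, 26.72, 36.79, 51.91, 72.02, 112.20, 132.33, 142.43,
    177.62, 187.69, 197.77, 222.94, 233.01, 243.10, 283.40,
]

SESSION_TIMEOUT_S = 30.0  # the session_timeout in use at the time

# Interval between each pair of consecutive heartbeats, rounded to 2 decimals.
gaps = [round(later - earlier, 2)
        for earlier, later in zip(heartbeat_times, heartbeat_times[1:])]

# Any gap longer than the session timeout is enough for the broker
# to consider the consumer dead.
too_long = [gap for gap in gaps if gap > SESSION_TIMEOUT_S]

print("gaps:", gaps)
print("gaps over session timeout:", too_long)
```

With these numbers, three of the fourteen intervals are longer than 30 seconds, and the shortest intervals are all about 10 seconds, matching the configured heartbeat interval.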
I raised the session_timeout to 3 minutes, but still eventually missed heartbeats during a soak test, leading to a consumer disconnect.
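For reference, the settings described here would look roughly like this with kafka-python's millisecond-based consumer parameters. This is a sketch, not our actual code: the `build_consumer` helper and its `topic` and `bootstrap_servers` arguments are hypothetical, and only the group name, the 10 s heartbeat and the 3 minute session timeout come from this report.

```python
# Consumer settings from this report, using kafka-python's parameter names.
consumer_config = {
    "group_id": "cg_0",                # group name taken from the log lines above
    "heartbeat_interval_ms": 10_000,   # 10 s heartbeat interval
    "session_timeout_ms": 180_000,     # raised from 30 s to 3 minutes
}


def build_consumer(topic, bootstrap_servers):
    """Hypothetical helper: create a consumer with the settings above."""
    # Imported lazily so the config can be inspected without kafka-python installed.
    from kafka import KafkaConsumer

    return KafkaConsumer(topic, bootstrap_servers=bootstrap_servers, **consumer_config)
```

With these values the session timeout is 18 heartbeat intervals, so a disconnect means no heartbeat reached the broker for a full 3 minutes.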
I ran a simultaneous capture of two other apps written in Golang and Java. Those have 3 second heartbeats configured and were consistently spot on.
Is this something that could be fixed, or simply a limitation due to the GIL?