Each time a Solid Queue supervisor receives a TERM signal, it initiates graceful termination by sending a TERM signal to every worker process it supervises. This triggers the shutdown of the worker process, which runs these hooks and steps in order:
before_shutdown
shutdown
after_shutdown
Any Registrable process (workers are Registrable processes) executes stop_heartbeat in the before_shutdown hook.
Only after that does the worker shut down the executor pool and wait up to shutdown_timeout for a graceful exit. This means workers stop heartbeating before they have finished shutting down. During graceful termination they still try to complete the claimed jobs in flight, but other supervisors see them as dead and prune them as dead processes, thereby failing the jobs claimed by the worker.
The visible consequence of this bug is that we end up with completed jobs that also have an entry in failed executions. These occurrences can be easily detected with the following query in a Rails console:
The fix seems very simple: move stop_heartbeat from the before_shutdown hook to the after_shutdown hook. We validated the fix in our deployments by patching the Registrable module.
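To illustrate why the hook placement matters, here is a toy model of the ordering. It is only a sketch: ToyWorker, heartbeat_alive_while_draining? and the event names are hypothetical and are not Solid Queue's code. It assumes only what is described above, namely that before_shutdown runs first, then the worker drains in-flight jobs, then after_shutdown runs.

```ruby
# Toy model only: none of these names come from Solid Queue itself.
class ToyWorker
  attr_reader :events

  # stop_heartbeat_in is the hook that stops the heartbeat:
  # :before_shutdown (current behaviour) or :after_shutdown (proposed fix).
  def initialize(stop_heartbeat_in:)
    @stop_heartbeat_in = stop_heartbeat_in
    @events = []
  end

  # Runs the shutdown sequence and returns the recorded order of events.
  def stop
    run_hook(:before_shutdown)
    # Stands in for shutting down the pool and waiting up to shutdown_timeout.
    @events << :drain_in_flight_jobs
    run_hook(:after_shutdown)
    @events
  end

  private

  # Records :stop_heartbeat when the given hook is the configured one.
  def run_hook(name)
    @events << :stop_heartbeat if @stop_heartbeat_in == name
  end
end

# True when the heartbeat is still running while in-flight jobs are drained.
def heartbeat_alive_while_draining?(worker)
  events = worker.stop
  events.index(:drain_in_flight_jobs) < events.index(:stop_heartbeat)
end

current  = heartbeat_alive_while_draining?(ToyWorker.new(stop_heartbeat_in: :before_shutdown))
proposed = heartbeat_alive_while_draining?(ToyWorker.new(stop_heartbeat_in: :after_shutdown))
puts "current: #{current}, proposed: #{proposed}"
```

In this model the heartbeat stops before the drain with the current placement, which is the window in which another supervisor could consider the worker dead. With the proposed placement the heartbeat outlives the drain.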
I can push a PR if this is acknowledged as the correct solution. I am not entirely sure how to test it, because there are no tests for the heartbeat functionality; I am thinking of adding a test to the integration process_lifecycle_test.