Jobs end up in inconsistent state both completed and with a failed execution when completed during a graceful worker shutdown #568

doctomarculescu · 2025-05-22T15:34:06Z

Each time a Solid Queue supervisor receives a TERM signal, it initiates graceful termination by sending a TERM signal to every worker process it supervises. This triggers the shutdown of each worker process, which goes through these steps in order:

before_shutdown
shutdown
after_shutdown

Any Registrable process (workers are Registrable processes) executes stop_heartbeat in the before_shutdown hook. Only after that does the worker shut down the executor pool and wait up to shutdown_timeout for a graceful exit. This means workers stop heartbeating before they have finished shutting down. During graceful termination they still try to complete the claimed jobs that are in flight, but once their last heartbeat becomes stale, other supervisors see them as dead and prune them as dead processes, which fails the jobs claimed by the worker.
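To make the timeline concrete, here is a toy simulation of the race on a logical clock. It is only a sketch, not Solid Queue code: ToyJob, simulate, and all tick values (including the staleness threshold) are made up for illustration.

```ruby
# Toy model of the race; none of these names are Solid Queue's API.
ALIVE_THRESHOLD = 5  # assumed: ticks without a heartbeat before a process is pruned
TERM_AT         = 2  # tick at which the worker receives TERM
JOB_FINISHES_AT = 8  # tick at which the in-flight job completes

ToyJob = Struct.new(:finished_at, :failed_execution)

def simulate(stop_heartbeat_before_drain:)
  job = ToyJob.new(nil, nil)
  last_heartbeat = 0
  heartbeating = true
  pruned = false

  (1..10).each do |tick|
    # TERM arrives: in the problematic order the heartbeat stops right away.
    heartbeating = false if tick == TERM_AT && stop_heartbeat_before_drain
    last_heartbeat = tick if heartbeating

    # Another supervisor prunes a process whose heartbeat is stale
    # and fails whatever job that process still has claimed.
    if !pruned && tick - last_heartbeat > ALIVE_THRESHOLD
      pruned = true
      job.failed_execution = "pruned at tick #{tick}" if job.finished_at.nil?
    end

    # The worker completes the in-flight job during the graceful shutdown,
    # and only now stops heartbeating (if it had not already).
    if tick == JOB_FINISHES_AT
      job.finished_at = tick
      heartbeating = false
    end
  end

  job
end

buggy = simulate(stop_heartbeat_before_drain: true)
fixed = simulate(stop_heartbeat_before_drain: false)
puts "heartbeat stopped first: finished_at=#{buggy.finished_at.inspect} failed_execution=#{buggy.failed_execution.inspect}"
puts "heartbeat stopped last:  finished_at=#{fixed.finished_at.inspect} failed_execution=#{fixed.failed_execution.inspect}"
```

With these made-up numbers, stopping the heartbeat first leaves the job both finished and failed, while keeping the heartbeat running until the job is done leaves it finished only.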

The visible consequence of this bug is that we end up with completed jobs that also have an entry in failed executions. Occurrences can be detected with the following query in a Rails console:

SolidQueue::Job.joins(:failed_execution)
               .where.not(finished_at: nil)
               .where.not(failed_execution: nil)

The fix seems very simple: move stop_heartbeat from the before_shutdown hook to after_shutdown. We validated the fix in our deployments by patching the Registrable module.
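As a rough illustration of what changing the hook does to the ordering, here is a self-contained toy model. ToyWorker and its tiny hook table are hypothetical stand-ins, not the real Registrable module or its callback machinery; only the hook names and stop_heartbeat come from the description above.

```ruby
# Toy model of the shutdown sequence; not Solid Queue's implementation.
class ToyWorker
  attr_reader :events

  # heartbeat_hook: which hook runs stop_heartbeat
  # (:before_shutdown today, :after_shutdown with the proposed change).
  def initialize(heartbeat_hook:)
    @events = []
    @hooks = { before_shutdown: [], after_shutdown: [] }
    @hooks.fetch(heartbeat_hook) << :stop_heartbeat
  end

  # Runs before_shutdown hooks, then shutdown, then after_shutdown hooks.
  def stop
    @hooks[:before_shutdown].each { |hook| send(hook) }
    shutdown
    @hooks[:after_shutdown].each { |hook| send(hook) }
  end

  private

  # Stands in for shutting down the pool and waiting for in-flight jobs.
  def shutdown
    @events << :drain_in_flight_jobs
  end

  def stop_heartbeat
    @events << :stop_heartbeat
  end
end

current  = ToyWorker.new(heartbeat_hook: :before_shutdown).tap(&:stop).events
proposed = ToyWorker.new(heartbeat_hook: :after_shutdown).tap(&:stop).events
puts "current:  #{current.inspect}"
puts "proposed: #{proposed.inspect}"
```

In this model the proposed order keeps the heartbeat alive for the whole drain, so other supervisors have no reason to treat the worker as dead while it is still finishing jobs.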

I can push a PR if this is acknowledged as the correct solution. I'm not entirely sure how to test it, because there are no tests for the heartbeat functionality; I am thinking of adding a test to the integration process_lifecycle_test.

doctomarculescu mentioned this issue May 22, 2025

fix(568): moving stop_heartbeat after shutdown #569

Open
