Jobs end up in an inconsistent state, both completed and with a failed execution, when completed during a graceful worker shutdown #568

Open
doctomarculescu opened this issue May 22, 2025 · 0 comments

Each time a Solid Queue supervisor receives a TERM signal, it initiates graceful termination by sending a TERM signal to every worker process it supervises. This triggers the worker's shutdown sequence, which runs in three steps:

  • before_shutdown
  • shutdown
  • after_shutdown

Every Registrable process (workers are Registrable processes) executes stop_heartbeat in the before_shutdown hook.
Only after that does the worker shut down the executor pool and wait up to shutdown_timeout for a graceful exit. In other words, workers stop heartbeating before they shut down. During graceful termination they still try to complete the claimed jobs in flight, but other supervisors see them as dead and prune them as dead processes, which fails the jobs the worker has claimed.
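To make the race concrete, here is a minimal toy timeline in plain Ruby. It is not Solid Queue code: `ToyProcess`, `considered_dead?`, `first_prunable_second`, and the numbers used are all hypothetical, chosen only to show how a process whose heartbeat stopped at the start of shutdown can look dead while it is still draining jobs.

```ruby
# Toy timeline, not Solid Queue code: all names and numbers are hypothetical.
ToyProcess = Struct.new(:last_heartbeat_at)

# Another supervisor treats a process as dead once its last heartbeat is
# older than the alive threshold.
def considered_dead?(process, now:, alive_threshold:)
  now - process.last_heartbeat_at > alive_threshold
end

# Replays a graceful shutdown that starts at t = 0 and needs `drain_seconds`
# to finish in-flight jobs. The heartbeat was stopped in before_shutdown, so
# last_heartbeat_at never advances while the pool drains.
# Returns the first second at which the process looks dead, or nil.
def first_prunable_second(drain_seconds:, alive_threshold:)
  worker = ToyProcess.new(0)
  (1..drain_seconds).find do |now|
    considered_dead?(worker, now: now, alive_threshold: alive_threshold)
  end
end

first_prunable_second(drain_seconds: 30, alive_threshold: 10) # => 11
first_prunable_second(drain_seconds: 5, alive_threshold: 10)  # => nil
```

In this model the problem only shows up when draining takes longer than the threshold other supervisors use to decide a process is dead. This is the window in which a job can be marked as failed and then still finish.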

The visible consequence of this bug is completed jobs that also have an entry in failed executions. These occurrences are easy to detect with the following query in a Rails console:

# joins performs an INNER JOIN, so only jobs that have a failed execution are returned
SolidQueue::Job.joins(:failed_execution)
               .where.not(finished_at: nil)

The fix seems very simple: move stop_heartbeat from the before_shutdown hook to after_shutdown. We validated the fix in our deployments by patching the Registrable module.
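The effect of moving the hook can be sketched with a toy lifecycle. This is only an illustration of the ordering, not the actual patch: `ToyLifecycle` and its methods are hypothetical and do not mirror Solid Queue's internals.

```ruby
# Toy lifecycle, not Solid Queue code: ToyLifecycle and its methods are hypothetical.
class ToyLifecycle
  attr_reader :events

  # stop_heartbeat_on is the hook that stop_heartbeat is registered on:
  # :before_shutdown (current behavior) or :after_shutdown (proposed fix).
  def initialize(stop_heartbeat_on:)
    @hooks = { before_shutdown: [], after_shutdown: [] }
    @hooks.fetch(stop_heartbeat_on) << :stop_heartbeat
    @events = []
  end

  # Runs the three shutdown steps in order and records what happened.
  def stop
    run_hooks(:before_shutdown)
    @events << :drain_in_flight_jobs # stands in for the "shutdown" step
    run_hooks(:after_shutdown)
    @events
  end

  private

  def run_hooks(name)
    @hooks.fetch(name).each { |hook| @events << hook }
  end
end

ToyLifecycle.new(stop_heartbeat_on: :before_shutdown).stop
# => [:stop_heartbeat, :drain_in_flight_jobs]
ToyLifecycle.new(stop_heartbeat_on: :after_shutdown).stop
# => [:drain_in_flight_jobs, :stop_heartbeat]
```

With the hook registered on after_shutdown, the heartbeat keeps running for the whole drain, so other supervisors have no reason to prune the process while it still holds claimed jobs.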

I can push a PR if this is acknowledged as the correct solution. I am not entirely sure how to test it, because there are no tests for the heartbeat functionality; I am thinking of adding a test to the process_lifecycle_test integration test.
