Skip to content

fix(app): guard against possible race conditions during enqueue #8098

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 13, 2025

Conversation

psychedelicious
Copy link
Collaborator

Summary

In #7724 we made a number of perf optimisations related to enqueuing. One of these optimisations included moving the enqueue logic - including expensive prep work and db writes - to a separate thread.

At the same time manual DB locking was abandoned in favor of WAL mode.

Finally, we set check_same_thread=False to allow multiple threads to access the connection at a given time.

I think this may be the cause of #7950:

  • We start an enqueue in a thread (running in bg)
  • We dequeue
  • Dequeue pulls a partially-written queue item from DB and we get the errors in the linked issue

To be honest, I don't understand enough about SQLite to confidently say that this kind of race condition is actually possible. But:

  • The error started popping up around the time we made this change.
  • I have reviewed the logic from enqueue to dequeue very carefully many times over the past month or so, and I am confident that the error is only possible if we are getting unexpectedly NULL values from the DB.
  • The DB schema includes NOT NULL constraints for the column that is apparently returning NULL.
  • Therefore, without some kind of race condition or schema issue, the error should not be possible.
  • The enqueue_batch call is the only place I can find where we have the possibility of a race condition due to async logic. Everywhere else, all DB interaction for the queue is synchronous, as far as I can tell.

This change retains the perf benefits by running the heavy enqueue prep logic in a separate thread, but moves back to the main thread for the DB write. It also uses an explicit transaction for the write.

Will just have to wait and see if this fixes the issue.

Related Issues / Discussions

QA Instructions

Queuing should still work and it should respect max queue item size limits. I've tested and its working fine. Still just as responsive as before.

I've not been able to reproduce the issue, though, so I can't test it directly.

Merge Plan

n/a

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions github-actions bot added python PRs that change python files services PRs that change app services labels Jun 11, 2025
@psychedelicious psychedelicious enabled auto-merge (rebase) June 13, 2025 13:36
In #7724 we made a number of perf optimisations related to enqueuing. One of these optimisations included moving the enqueue logic - including expensive prep work and db writes - to a separate thread.

At the same time manual DB locking was abandoned in favor of WAL mode.

Finally, we set `check_same_thread=False` to allow multiple threads to access the connection at a given time.

I think this may be the cause of #7950:
- We start an enqueue in a thread (running in bg)
- We dequeue
- Dequeue pulls a partially-written queue item from DB and we get the errors in the linked issue

To be honest, I don't understand enough about SQLite to confidently say that this kind of race condition is actually possible. But:
- The error started popping up around the time we made this change.
- I have reviewed the logic from enqueue to dequeue very carefully _many_ times over the past month or so, and I am confident that the error is only possible if we are getting unexpectedly `NULL` values from the DB.
- The DB schema includes `NOT NULL` constraints for the column that is apparently returning `NULL`.
- Therefore, without some kind of race condition or schema issue, the error should not be possible.
- The `enqueue_batch` call is the only place I can find where we have the possibility of a race condition due to async logic. Everywhere else, all DB interaction for the queue is synchronous, as far as I can tell.

This change retains the perf benefits by running the heavy enqueue prep logic in a separate thread, but moves back to the main thread for the DB write. It also uses an explicit transaction for the write.

Will just have to wait and see if this fixes the issue.
@psychedelicious psychedelicious force-pushed the psyche/fix/enqueue-race-cond branch from 7cced2e to 117a1aa Compare June 13, 2025 13:36
@psychedelicious psychedelicious merged commit 1ff3d44 into main Jun 13, 2025
17 of 24 checks passed
@psychedelicious psychedelicious deleted the psyche/fix/enqueue-race-cond branch June 13, 2025 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python PRs that change python files services PRs that change app services
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants