Use exponentially increasing retry delays for pending runs #2519

Merged
r4victor merged 1 commit into master from issue_2420_many_job_submissions
Apr 16, 2025

Conversation

r4victor
Collaborator

Fixes #2420

Previously, pending runs were retried with a 15s delay, which created a large number of job submissions when there was persistently no capacity. The new logic uses exponentially increasing retry delays capped at 10m. As a result, there will be at most 144 retries/day, compared to 5760 retries/day previously (a 40x reduction). The retry latency does not change for failed and retried jobs unless capacity is unavailable for a long time. Runs that wait for capacity for hours or days may wait an additional 5-10m.
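For illustration, here is a minimal sketch of a capped exponential delay and the retries/day arithmetic above. It is not the code from this PR: the function name `retry_delay`, the 15s starting delay, and the doubling factor are assumptions chosen for the example; only the 15s previous delay and the 10m cap come from the description.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical parameters: only the 10m cap and the old 15s delay
# are stated in the PR description; the rest are assumptions.
BASE_DELAY = 15    # seconds, assumed starting delay
FACTOR = 2         # assumed growth factor
MAX_DELAY = 600    # seconds, the 10m cap


def retry_delay(attempt: int) -> int:
    """Return the delay in seconds before retry number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Integer arithmetic avoids float overflow for large attempt counts.
    return min(BASE_DELAY * FACTOR**attempt, MAX_DELAY)


# Delays for the first eight retries: 15, 30, 60, 120, 240, 480, 600, 600
delays = [retry_delay(n) for n in range(8)]

# Upper bounds on retries per day once the delay is constant.
old_retries_per_day = SECONDS_PER_DAY // 15         # 5760
new_retries_per_day = SECONDS_PER_DAY // MAX_DELAY  # 144
```

With these assumed parameters the delay reaches the cap after about six retries, so only runs that stay pending for a long time see the full 10m interval.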

There is still an issue of too many job submissions being returned in the API, e.g. for a run that sits in pending for months. This will be addressed in a separate issue if it proves necessary.

@r4victor r4victor merged commit 20597ec into master Apr 16, 2025
24 checks passed
@r4victor r4victor deleted the issue_2420_many_job_submissions branch April 16, 2025 05:40
Development

Successfully merging this pull request may close these issues.

[Bug]: Run job submissions may grow infinitely leading to server slowdown