
Introduce JOB_DISCONNECTED_RETRY_TIMEOUT #2627


Merged
merged 2 commits into from
May 14, 2025

Conversation

r4victor
Collaborator

@r4victor r4victor commented May 13, 2025

Closes #2626

The PR changes how process_running_jobs handles SSH connection errors for pulling/running jobs. Previously, it failed jobs with a no-capacity termination reason as soon as establishing the SSH connection failed. Now, it keeps the job active for at least JOB_DISCONNECTED_RETRY_TIMEOUT (2 minutes). This allows jobs to survive temporary networking issues.

A possible regression is that dstack becomes slower to replace failed jobs/replicas. IMO, this is not a big problem, since there are no expectations or guarantees about how quickly replacement happens.

Later we can make the timeout smarter depending on the job (e.g. shorter for spots, or when retry is enabled). For now, it's kept as a constant for simplicity.

Also introduced JobTerminationReason.INSTANCE_UNREACHABLE – not used yet, to preserve backward compatibility.
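The retry window described above can be sketched as follows. This is an illustrative sketch, not dstack's actual code: the constant name and the disconnected_at field follow the PR, while the should_fail_job helper and its signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Constant introduced by the PR; the 2-minute value is stated in the description.
JOB_DISCONNECTED_RETRY_TIMEOUT = timedelta(minutes=2)


def should_fail_job(disconnected_at: Optional[datetime], now: datetime) -> bool:
    """Fail the job only after it has been unreachable for the full timeout."""
    if disconnected_at is None:
        # First observed SSH failure: start the window instead of failing.
        return False
    return now - disconnected_at >= JOB_DISCONNECTED_RETRY_TIMEOUT
```

On a successful connection, disconnected_at would be reset to None so a later, unrelated outage starts a fresh window.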

@r4victor r4victor requested a review from jvstme May 14, 2025 07:06
else:
    # No job_model.termination_reason set means the SSH connection failed
    if job_model.disconnected_at is None:
        job_model.disconnected_at = common_utils.get_current_datetime()
Collaborator


Note that runner_ssh_tunnel also does 3 blocking retries, and each can take 15 seconds. So this disconnected_at can be imprecise, since we only set it after the runner_ssh_tunnel retries.

But I'm not sure we need to do anything about it. I thought about disabling the runner_ssh_tunnel retries in favor of the new disconnected_at retries, but having retry logic in both places may further improve stability.
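As a rough illustration of the imprecision mentioned above, using the figures from this comment (3 retries of up to 15 seconds each; the variable names below are just for the arithmetic):

```python
# Worst-case figures from the review comment, not measured values.
SSH_TUNNEL_RETRIES = 3
SSH_TUNNEL_RETRY_SECONDS = 15
DISCONNECTED_RETRY_TIMEOUT_SECONDS = 2 * 60

# disconnected_at is only set after the blocking runner_ssh_tunnel retries,
# so the effective unreachable window is the sum of both mechanisms.
delay_before_disconnected_at = SSH_TUNNEL_RETRIES * SSH_TUNNEL_RETRY_SECONDS  # 45 s
effective_window = delay_before_disconnected_at + DISCONNECTED_RETRY_TIMEOUT_SECONDS
print(effective_window)  # 165 seconds, i.e. ~2 min 45 s before the job can fail
```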

Collaborator Author


Thanks for pointing that out. Let's keep both! :)

@r4victor r4victor merged commit 4ff4d63 into master May 14, 2025
25 checks passed
@r4victor r4victor deleted the issue_2626_job_no_connectivity branch May 14, 2025 09:33
Successfully merging this pull request may close these issues.

Do not fail jobs immediately in case of connectivity issues
2 participants