Do not fail jobs immediately in case of connectivity issues #2626

Closed
@r4victor

Description

Currently, the process_running_jobs background task fails jobs with INTERRUPTED_BY_NO_CAPACITY immediately if SSH connections cannot be established (3 attempts with a 1s interval). Connectivity issues should be handled more gracefully. For comparison, an idle instance has to be unavailable for 20m before it is terminated. We could start by introducing a larger timeout before the job is marked as failed, such as 1m or 2m.
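A rough sketch of what such a grace period could look like, not the actual dstack code. The names DISCONNECT_GRACE_PERIOD, JobConnectivity, and should_fail_job are hypothetical, and the 2m value is just one of the options suggested above. The idea is to remember when the job first became unreachable and fail it only after it has stayed unreachable for the whole interval:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical grace period; the issue suggests starting with 1m or 2m.
DISCONNECT_GRACE_PERIOD = timedelta(minutes=2)


@dataclass
class JobConnectivity:
    """Hypothetical per-job state, not an actual dstack model."""

    # Time of the first failed connection attempt in the current outage,
    # or None while the job is reachable.
    disconnected_at: Optional[datetime] = None


def should_fail_job(state: JobConnectivity, reachable: bool, now: datetime) -> bool:
    """Return True only once the job has been unreachable for the full grace period."""
    if reachable:
        # Connectivity restored: forget the previous outage.
        state.disconnected_at = None
        return False
    if state.disconnected_at is None:
        # First failed check: start the clock instead of failing the job.
        state.disconnected_at = now
        return False
    return now - state.disconnected_at >= DISCONNECT_GRACE_PERIOD
```

The background task would call should_fail_job on each pass with the result of its SSH check, so a short network blip resets the clock rather than failing the job.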

For on-demand instances, INTERRUPTED_BY_NO_CAPACITY could also be replaced with a new termination reason such as INSTANCE_UNREACHABLE.
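To illustrate the split, here is a minimal sketch. The enum below is an illustrative stand-in rather than dstack's real termination reason enum, its string values are made up, and unreachable_termination_reason and the is_spot flag are hypothetical. It assumes spot instances keep the current reason, since the proposal only covers on-demand instances:

```python
from enum import Enum


class TerminationReason(Enum):
    """Illustrative subset; the string values are assumptions."""

    INTERRUPTED_BY_NO_CAPACITY = "interrupted_by_no_capacity"
    # New reason proposed in this issue; it does not exist yet.
    INSTANCE_UNREACHABLE = "instance_unreachable"


def unreachable_termination_reason(is_spot: bool) -> TerminationReason:
    """Pick the reason to record when an instance stays unreachable."""
    if is_spot:
        # Assumption: spot instances keep the existing reason.
        return TerminationReason.INTERRUPTED_BY_NO_CAPACITY
    # On-demand instances get the more specific reason.
    return TerminationReason.INSTANCE_UNREACHABLE
```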

Labels: enhancement (a non-feature improvement)
