Do not fail jobs immediately in case of connectivity issues

Currently, the process_running_jobs background task fails a job with INTERRUPTED_BY_NO_CAPACITY as soon as SSH connections cannot be established (3 attempts at 1s intervals). Connectivity issues should be handled more gracefully; for comparison, an idle instance has to be unavailable for 20m before it is terminated. We could start by introducing a longer timeout, such as 1m or 2m, before the job is marked as failed.

For on-demand instances, INTERRUPTED_BY_NO_CAPACITY could also be replaced with a new termination reason such as INSTANCE_UNREACHABLE.
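
A rough sketch of what the proposed behavior could look like. This is not the actual process_running_jobs code: JobState, disconnected_at, handle_connectivity, and UNREACHABLE_GRACE_PERIOD are hypothetical names made up for illustration, and the 2m value is just one of the intervals suggested above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Hypothetical grace period; the issue suggests starting with 1m or 2m.
UNREACHABLE_GRACE_PERIOD = timedelta(minutes=2)


class TerminationReason(str, Enum):
    INTERRUPTED_BY_NO_CAPACITY = "interrupted_by_no_capacity"
    INSTANCE_UNREACHABLE = "instance_unreachable"  # new reason proposed in this issue


@dataclass
class JobState:
    """Hypothetical stand-in for the job model, reduced to what the sketch needs."""

    on_demand: bool
    disconnected_at: Optional[datetime] = None
    termination_reason: Optional[TerminationReason] = None


def handle_connectivity(job: JobState, reachable: bool, now: datetime) -> None:
    """Update the job after one connection check (all SSH attempts made or skipped)."""
    if reachable:
        # Connectivity is back: forget the earlier outage.
        job.disconnected_at = None
        return
    if job.disconnected_at is None:
        # First failed check: remember when the outage started, do not fail yet.
        job.disconnected_at = now
        return
    if now - job.disconnected_at < UNREACHABLE_GRACE_PERIOD:
        # Still inside the grace period: keep the job running.
        return
    # Unreachable for the whole grace period: fail with a reason that
    # distinguishes on-demand instances from interruptible ones.
    job.termination_reason = (
        TerminationReason.INSTANCE_UNREACHABLE
        if job.on_demand
        else TerminationReason.INTERRUPTED_BY_NO_CAPACITY
    )
```

The background task would call handle_connectivity on every pass, so a job is failed only when consecutive checks spanning the full grace period all fail, and a single successful connection resets the timer.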


Do not fail jobs immediately in case of connectivity issues #2626
