Closed
Description
Currently, the process_running_jobs background task fails jobs with INTERRUPTED_BY_NO_CAPACITY immediately if SSH connections cannot be established (3 attempts at 1s intervals). Connectivity issues should be handled more gracefully (e.g. an idle instance has to be unavailable for 20m before it is terminated). We could start by introducing a longer timeout before the job is marked as failed, such as 1m or 2m.
For on-demand instances, INTERRUPTED_BY_NO_CAPACITY could also be replaced with a new termination reason such as INSTANCE_UNREACHABLE.
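
A rough sketch of what the proposed behavior could look like, to make the discussion concrete. It is not the actual process_running_jobs code: JobState, check_connectivity, unreachable_since, and UNREACHABLE_GRACE_PERIOD are hypothetical names, the 2m value is one of the options suggested above, and keeping INTERRUPTED_BY_NO_CAPACITY for spot instances is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed grace period; the issue suggests something like 1m or 2m.
UNREACHABLE_GRACE_PERIOD = timedelta(minutes=2)


@dataclass
class JobState:
    # Hypothetical per-job record, not the project's actual model.
    is_spot: bool
    unreachable_since: Optional[datetime] = None


def check_connectivity(job: JobState, ssh_ok: bool, now: datetime) -> Optional[str]:
    """Return a termination reason, or None if the job should keep running."""
    if ssh_ok:
        # Connection restored: forget any earlier failure.
        job.unreachable_since = None
        return None
    if job.unreachable_since is None:
        # First failed check: start the clock instead of failing the job.
        job.unreachable_since = now
    if now - job.unreachable_since < UNREACHABLE_GRACE_PERIOD:
        return None
    # Grace period exhausted. Assumption: spot instances keep the old reason,
    # on-demand instances get the proposed new one.
    return "INTERRUPTED_BY_NO_CAPACITY" if job.is_spot else "INSTANCE_UNREACHABLE"
```

With this shape, the background task would call check_connectivity on every pass and only fail the job once a non-None reason comes back, so a brief network blip no longer terminates the job.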