GH Runners Orphaned when Spot Instance is Interrupted #4376

Open
@iNoahNothing

Description

Many of my spot runners are getting orphaned in my organization's GitHub runner pool, where they linger as offline runners. I believe this is because the scale-down lambda builds the set of runners to scale down and remove from GitHub from the list of active EC2 instances for that runner pool.

This code snippet shows the logic of the scale-down lambda. It iterates through the running EC2 instances that belong to this lambda/runner pool, checks whether each one should be spun down based on the defined constraints, and removes it if so.

    for (const ec2Runner of ec2RunnersFiltered) {
      const ghRunners = await listGitHubRunners(ec2Runner);
      const ghRunnersFiltered = ghRunners.filter((runner: { name: string }) =>
        runner.name.endsWith(ec2Runner.instanceId),
      );
      logger.debug(
        `Found: '${ghRunnersFiltered.length}' GitHub runners for AWS runner instance: '${ec2Runner.instanceId}'`,
      );
      logger.debug(
        `GitHub runners for AWS runner instance: '${ec2Runner.instanceId}': ${JSON.stringify(ghRunnersFiltered)}`,
      );
      if (ghRunnersFiltered.length) {
        if (runnerMinimumTimeExceeded(ec2Runner)) {
          if (idleCounter > 0) {
            idleCounter--;
            logger.info(`Runner '${ec2Runner.instanceId}' will be kept idle.`);
          } else {
            logger.info(`Terminating all non busy runners.`);
            await removeRunner(
              ec2Runner,
              ghRunnersFiltered.map((runner: { id: number }) => runner.id),
            );
          }
        }
      } else if (bootTimeExceeded(ec2Runner)) {
        await markOrphan(ec2Runner.instanceId);
      } else {
        logger.debug(`Runner ${ec2Runner.instanceId} has not yet booted.`);
      }
    }
  }

Because this iteration is based on the live EC2 instances, a spot instance that is terminated while a job is active on it leaves its runner behind: the scale-down lambda cannot remove the runner from GitHub when it tries to, and after the instance has been terminated it never tries again, so the runner is never removed from the GitHub runner pool.
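To make the gap concrete, here is a minimal, self-contained model of which GitHub runners the loop can ever reach. The types, the sample data, and the `runnersVisitedByScaleDown` helper are hypothetical; they only mimic the `endsWith(instanceId)` matching from the snippet above and are not the project's code.

```typescript
// Hypothetical, simplified shapes: only the fields needed for the illustration.
interface Ec2Instance {
  instanceId: string;
}
interface GhRunner {
  id: number;
  name: string;
}

// Models the coverage of the loop: a GitHub runner is only ever looked at
// if its name ends with the instance ID of an EC2 instance that is still listed.
function runnersVisitedByScaleDown(ec2: Ec2Instance[], gh: GhRunner[]): GhRunner[] {
  const visited: GhRunner[] = [];
  for (const instance of ec2) {
    for (const runner of gh) {
      if (runner.name.endsWith(instance.instanceId)) {
        visited.push(runner);
      }
    }
  }
  return visited;
}

// Made-up data: runner 2 was registered by a spot instance that has since
// been interrupted, so that instance no longer appears in the EC2 list.
const registeredRunners: GhRunner[] = [
  { id: 1, name: 'runner-i-0aaa' },
  { id: 2, name: 'runner-i-0bbb' },
];
const liveInstances: Ec2Instance[] = [{ instanceId: 'i-0aaa' }];

const visitedRunners = runnersVisitedByScaleDown(liveInstances, registeredRunners);
console.log(visitedRunners.map((r) => r.id));
```

In this model runner 2 is never visited, which matches the behavior I am seeing: nothing in the loop can reach a runner whose instance is already gone.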

This lambda should remove offline runners from GitHub even if there is no matching active EC2 instance in the account. The only workaround right now is to remove the runner manually in GitHub, which is not viable when AWS increases how often it interrupts spot instances.
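As a rough sketch of what I have in mind (not a finished patch), the lambda could make an extra pass that starts from the GitHub side instead of the EC2 side. The `findOrphanedGhRunners` helper and the types below are hypothetical. I am assuming the runner objects expose the `status` field (`online` / `offline`) that the GitHub API returns for self-hosted runners, and that runner names end with the instance ID as in the snippet above. Scoping the pass to runners owned by this pool, and the actual removal call, are left out.

```typescript
// Hypothetical, simplified shapes for the sketch.
interface Ec2Instance {
  instanceId: string;
}
interface GhRunner {
  id: number;
  name: string;
  status: string; // 'online' or 'offline', as reported by GitHub
}

// Returns the GitHub runners that are offline AND have no live EC2 instance
// whose ID matches the end of the runner name. These are removal candidates.
function findOrphanedGhRunners(gh: GhRunner[], ec2: Ec2Instance[]): GhRunner[] {
  return gh.filter(
    (runner) =>
      runner.status === 'offline' &&
      !ec2.some((instance) => runner.name.endsWith(instance.instanceId)),
  );
}

// Made-up data for illustration.
const poolRunners: GhRunner[] = [
  { id: 1, name: 'runner-i-0aaa', status: 'online' },
  { id: 2, name: 'runner-i-0bbb', status: 'offline' }, // instance interrupted
  { id: 3, name: 'runner-i-0ccc', status: 'offline' }, // instance still listed
];
const runningInstances: Ec2Instance[] = [
  { instanceId: 'i-0aaa' },
  { instanceId: 'i-0ccc' },
];

const orphaned = findOrphanedGhRunners(poolRunners, runningInstances);
console.log(orphaned.map((r) => r.id));
```

Runner 3 is deliberately not selected here: its instance still exists, so the existing loop remains responsible for it. Only runners with no backing instance would be handled by the new pass.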

I would be happy to submit a PR to address this if it is agreed this is a bug.

Thanks!
