Pagination Data Slippage Issue causing EC2 instance to be scaled down #4584

Open
@stuartp44

Description

We have been seeing an increase in users experiencing loss of communication during an otherwise healthy workflow run. After digging into the issue, we discovered that at some point the scale-down lambda was receiving blank runner info. As designed, this meant the runner was tagged as orphaned. From that point on we don't check it again, so it gets deleted and removed from AWS. This is what causes the loss of communication!

We then looked into why we were getting a blank response and found that it is potentially due to pagination and data slippage. A quick Postman check against the API returns fresh data on every request, and that data is not consistent across requests in terms of pagination. Speaking with GitHub, it seems this is a known issue with the REST API but not with GraphQL, where you can, for instance, use cursor-based pagination or sort the result before pagination. To us, this looks likely to be our issue. For example, if runner X is initially on page 7 but moves to page 5 before we reach page 7, it will no longer be there by the time we get to page 7, so it goes missing and gets orphaned. See https://docs.github.com/en/enterprise-cloud@latest/rest/actions/self-hosted-runners?apiVersion=2022-11-28#list-self-hosted-runners-for-an-organization.
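To make the slippage concrete, here is a minimal, self-contained simulation. It does not call the GitHub API and is not the scale-down lambda's code; the page size, the runner names, and helpers such as `sweep` and `confirmed_orphans` are all hypothetical. It shows a runner being skipped when the ordering shifts between page requests, and one possible guard (not necessarily the mitigation that is in progress): only treat a runner as orphaned if it is missing from several consecutive full sweeps.

```python
from typing import Callable, List, Set

PAGE_SIZE = 2  # assumed small page size, just to keep the example readable


def fetch_page(ordering: List[str], page: int) -> List[str]:
    """Return one 1-indexed page of the listing as it is ordered right now."""
    start = (page - 1) * PAGE_SIZE
    return ordering[start:start + PAGE_SIZE]


def sweep(ordering_at: Callable[[int], List[str]]) -> Set[str]:
    """Walk all pages once. ordering_at(page) models the server-side
    ordering at the moment that page is requested."""
    seen: Set[str] = set()
    page = 1
    while True:
        items = fetch_page(ordering_at(page), page)
        if not items:
            break
        seen.update(items)
        page += 1
    return seen


def shifting_order(page: int) -> List[str]:
    # Hypothetical slippage: "x" sits on the last page while page 1 is read,
    # then jumps to the front before page 2 is requested.
    if page == 1:
        return ["a", "b", "c", "d", "x"]
    return ["x", "a", "b", "c", "d"]


def stable_order(page: int) -> List[str]:
    return ["a", "b", "c", "d", "x"]


def confirmed_orphans(known: Set[str], sweeps: List[Set[str]]) -> Set[str]:
    """Keep only runners that were missing from every sweep."""
    missing = set(known)
    for seen in sweeps:
        missing &= known - seen
    return missing


known_instances = {"a", "b", "c", "d", "x", "z"}  # "z" is never listed

first = sweep(shifting_order)   # "x" slips between pages and is not seen
second = sweep(stable_order)    # "x" is seen on the next sweep

print("missing after one sweep:", sorted(known_instances - first))
print("confirmed orphans:", sorted(confirmed_orphans(known_instances, [first, second])))
```

In this simulation a single sweep would wrongly flag `x`, while requiring two consecutive misses leaves only `z`, which never appears in the listing. The trade-off is that genuinely orphaned instances live one sweep interval longer.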

Because this causes loss of communication, it is not a desired outcome at all, which is why I have logged it here. I do plan on doing something to help mitigate this, which is WIP at the moment.
