Description
We have seen an uptick in users hitting loss of communication during an otherwise healthy workflow run. After digging into the issue, we found that at some point the scale-down lambda was receiving blank runner info. As designed, this meant the runner was tagged as orphaned. Once tagged, the runner is not checked again; it is deleted and therefore removed from AWS. This is what causes the loss of communication!
We then looked into why we were getting a blank response and found that the likely cause is pagination and data slippage. A quick Postman check against the API returns fresh data on every request, and that data is not consistent between requests in terms of pagination. Speaking with GitHub, it seems this is a known behaviour of the REST API but not of the GraphQL API, where you can, for instance, use cursor-based pagination or sort the result before pagination. To us, this looks like the probable cause of our issue. For example, if runner X is initially on page 7 but moves to page 5 before we reach page 7, it will no longer be there by the time we fetch page 7; it therefore goes missing from our listing and gets orphaned. See https://docs.github.com/en/enterprise-cloud@latest/rest/actions/self-hosted-runners?apiVersion=2022-11-28#list-self-hosted-runners-for-an-organization.
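To make the suspected slippage concrete, here is a minimal, self-contained simulation of offset-based pagination over a list whose order changes between requests. It does not call the GitHub API: the runner names, the page size, and the `get_page` / `list_all_with_shift` helpers are all made up for illustration, and it only models the behaviour we believe we are seeing.

```python
from typing import List

PAGE_SIZE = 2  # tiny page size, chosen only to keep the example short


def get_page(runners: List[str], page: int, per_page: int = PAGE_SIZE) -> List[str]:
    """Return one offset-based page (1-indexed) of the list as it is *right now*."""
    start = (page - 1) * per_page
    return runners[start:start + per_page]


def list_all_with_shift() -> List[str]:
    """Walk all pages while the server-side order changes between requests."""
    # Hypothetical server-side state: runner "x" starts on page 3.
    server = ["a", "b", "c", "d", "x", "y"]
    seen: List[str] = []

    seen += get_page(server, 1)  # client reads page 1

    # Between requests the ordering changes: "x" moves to page 1,
    # which the client has already read.
    server.remove("x")
    server.insert(0, "x")

    seen += get_page(server, 2)  # client reads page 2 of the reordered list
    seen += get_page(server, 3)  # client reads page 3 of the reordered list
    return seen


if __name__ == "__main__":
    result = list_all_with_shift()
    print("runners seen:", result)
    print("x was listed:", "x" in result)
```

In this sketch the runner `x` still exists on the server the whole time, yet the client never sees it, and another runner is returned twice. A consumer that treats "not in the listing" as "no longer exists" would wrongly flag `x` as orphaned.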
Because this causes loss of communication, it is clearly not the desired outcome, which is why I have logged it here. I plan to work on something to help mitigate this; it is WIP at the moment.