GitHub doesn't have an API endpoint that returns all workflow jobs for a repository. Instead, we list all queued workflow runs for the repository, then fetch the workflow jobs for each workflow run.

respirate has a 2-minute limit for each run. If a run exceeds this limit, respirate considers it stuck and terminates itself. We hit this in production when a single iteration had to poll over 200 workflow runs, which took more than 2 minutes. As a result, respirate crashed multiple times. To stay under the limit, we poll only the first 200 workflow runs per iteration.

The tricky part is that runners are job/run agnostic: we sum up all queued labels and compare the totals with the existing runners for the repository. If there are fewer runners than queued jobs, we provision extra ones. Because we count only the first 200 runs per iteration, the queued total is undercounted when more runs are waiting, so the existing runner count will likely be higher and we won't provision extra ones. This is a rare case, though. Polling is only a nice-to-have fallback that runs every 5 minutes in case a webhook is missed, so the trade-off is acceptable. The number of queued runs also goes down as their jobs are assigned to runners, so it shouldn't stay high.
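The capped counting and the runner comparison described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual respirate code: `count_queued_labels`, `runners_to_provision`, the `fetch_jobs` callback, and the `status`/`label` job fields are all hypothetical names chosen for the example.

```python
from collections import Counter
from itertools import islice

MAX_RUNS_PER_POLL = 200  # cap on workflow runs inspected per iteration


def count_queued_labels(queued_runs, fetch_jobs, max_runs=MAX_RUNS_PER_POLL):
    """Sum queued job labels over at most max_runs workflow runs.

    fetch_jobs is a hypothetical callback that returns the jobs of one run
    as dicts with assumed "status" and "label" keys.
    """
    counts = Counter()
    for run in islice(queued_runs, max_runs):
        for job in fetch_jobs(run):
            if job["status"] == "queued":
                counts[job["label"]] += 1
    return counts


def runners_to_provision(queued_labels, existing_runners):
    """Return the per-label shortfall of runners; labels with no shortfall are omitted."""
    shortfall = {}
    for label, needed in queued_labels.items():
        missing = needed - existing_runners.get(label, 0)
        if missing > 0:
            shortfall[label] = missing
    return shortfall
```

With 250 queued runs of one job each, only 200 are counted. If 200 runners already exist for that label, the shortfall is empty and nothing is provisioned in that iteration, which is the undercounting trade-off described above.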