Infrastructure Failures on Container Runner Using Karpenter

Overview

Infrastructure failures, commonly known as "infra fails", can occur intermittently on some jobs. They arise when the pod running the job stops sending build information to CircleCI. Because that information never arrives, these failures are difficult to debug: the cause could be a network issue or an abrupt pod or node termination.

Check How Kubernetes Is Managed

When a cluster uses a Kubernetes management tool such as Karpenter, it is not uncommon for that tool to terminate a running pod. Karpenter, for example, can evict a pod that is still busy with a job when it removes the node the pod is running on. From CircleCI's side, the job then stops reporting and is marked as an infrastructure failure.

Solution: Preventing Pod Eviction

To avoid these disruptions, we recommend adding the annotation karpenter.sh/do-not-evict: "true" to the metadata of the agent's resource class. This instructs Karpenter not to evict the pod while it is running.

Here is an example of how to apply the annotation:

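agent:
  resourceClasses:
    # Resource class name, in the form namespace/resource-class
    circleci-runner/resourceClass:
      token: ***
      metadata:
        annotations:
          # Quoted because Kubernetes annotation values must be strings
          karpenter.sh/do-not-evict: "true"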
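
Newer Karpenter releases renamed this annotation. As a minimal sketch, assuming a cluster running Karpenter v0.32 or later (where karpenter.sh/do-not-evict was replaced by karpenter.sh/do-not-disrupt) and the same values layout as the example above, the equivalent configuration might look like the following. The resource class name and the redacted token placeholder are carried over from the example above; check the Karpenter documentation for your installed version before relying on it.

```yaml
agent:
  resourceClasses:
    circleci-runner/resourceClass:
      token: "***"  # redacted placeholder, as in the example above
      metadata:
        annotations:
          # Assumption: Karpenter v0.32+; older versions use karpenter.sh/do-not-evict
          karpenter.sh/do-not-disrupt: "true"
```

On a cluster that is partway through a Karpenter upgrade, setting both annotations on the resource class should be harmless, since each Karpenter version only acts on the annotation it recognizes.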


Additional Resource

https://github.com/aws/karpenter/blob/main/designs/termination.md#user-configuration
