Infrastructure failures can occur intermittently on some jobs. These failures, commonly known as 'infra fails', arise when the pod stops sending build information to CircleCI. The lack of incoming build information makes it challenging to debug these failures, as they could be associated with network issues or abrupt pod/node termination.
Check how Kubernetes is managed
When a cluster uses a Kubernetes management tool such as Karpenter, it is not uncommon for that tool to terminate a running pod. Karpenter, for example, can evict a pod that is still busy with a job, which CircleCI then reports as an infra fail.
Solution: Preventing Pod Eviction
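To avoid these disruptions, we recommend adding the annotation karpenter.sh/do-not-evict: "true" to the metadata of the resource class in the agent configuration. Kubernetes annotation values must be strings, so the value is quoted. This instructs Karpenter not to evict the pod while it is running.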
Here is an example of how to apply the annotation:
agent:
  resourceClasses:
    circleci-runner/resourceClass:
      token: ***
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
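Newer Karpenter releases (v0.32 and later, as far as we are aware) renamed this annotation to karpenter.sh/do-not-disrupt. The following is a minimal sketch of the same configuration for such clusters, assuming the same resource class name as in the example above and a placeholder token. Check the Karpenter version running in your cluster before choosing which annotation to use.

```yaml
agent:
  resourceClasses:
    # Resource class name reused from the example above; replace with your own.
    circleci-runner/resourceClass:
      # Placeholder, not a real token.
      token: "<your-resource-class-token>"
      metadata:
        annotations:
          # Assumption: the cluster runs a Karpenter version that uses the
          # renamed annotation. Older versions expect karpenter.sh/do-not-evict.
          karpenter.sh/do-not-disrupt: "true"
```

Either annotation only blocks voluntary disruption by Karpenter. It does not protect the pod from node failures or network issues, so some infra fails may still occur for those reasons.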