[Server] Crashloop State and OOM Errors in Test-Results-Service Pods

Overview

test-results-service pods can enter a crash loop because of Out of Memory (OOM) errors when they try to process large test files from the queuing service. The loop persists because the queuing service keeps waiting for the file to be processed, so after each restart the pod picks up the same large file and runs out of memory again.

Prerequisites

Access to the CircleCI server installation and its logs
Access to the Kubernetes cluster and the Helm chart values

Identifying the test-results-service crash loop

The issue shows up when the test-results-service pods enter a crash loop. Confirm it by checking the status of the pods (CrashLoopBackOff) and the logs (a JVM OutOfMemoryError).

$ kubectl get pod -l app=test-results-service -n circleci-server
NAME                                  READY STATUS            RESTARTS AGE
test-results-service-3f23b1703-a65a5a 0/1   CrashLoopBackOff  16       8m

$ kubectl logs deployment/test-results-service -n circleci-server
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/usr/local/bin/oom "
# Executing /bin/sh -c "/usr/local/bin/oom "...
Sending: _e{7,79}:JVM OOM|Experienced a JVM out of memory event for test-results-service-3f23b1703-a65a5a
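
The log check above can also be scripted. The following is a minimal sketch, not part of CircleCI's tooling: the function names has_jvm_oom and check_previous_logs are hypothetical, and it assumes kubectl is configured for the cluster and that the namespace is circleci-server as in the commands above. It reads the logs of the previous (crashed) container with kubectl's --previous flag, since a freshly restarted container may not have reached the error yet.

```shell
# Hypothetical helper: succeeds if the log text on stdin contains a JVM OOM error.
has_jvm_oom() {
  grep -q 'java.lang.OutOfMemoryError'
}

# Hypothetical helper: inspect logs from the previous (crashed) container of a pod.
# Assumes kubectl access; pass the pod name reported by `kubectl get pod`.
check_previous_logs() {
  pod="$1"
  if kubectl logs "$pod" --previous -n circleci-server | has_jvm_oom; then
    echo "JVM OOM found in previous container of $pod"
  else
    echo "no JVM OOM found in previous container of $pod"
  fi
}

# Usage (pod name taken from the output above):
#   check_previous_logs test-results-service-3f23b1703-a65a5a
```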

Resolving the Issue

The issue can be resolved by increasing the memory limit and the JVM heap size for the test-results-service pod, which is done by directly editing the service's deployment. Double the memory limit, for example from 8Gi to 16Gi, and set the JVM_HEAP_SIZE environment variable to 13g. This should resolve the OOM errors, and the test-results-service pods should stop entering the crash loop.

1. Run the following command to edit the test-results-service deployment:

$ kubectl edit deployments test-results-service -n circleci-server

2. Navigate to resources.limits.memory, which should be 8Gi, and change it to 16Gi.

resources:
  limits:
    cpu: "2"
    memory: 16Gi
  requests:
    cpu: 100m
    memory: 300Mi

3. Navigate to the env section and add an entry with name: JVM_HEAP_SIZE and value: 13g. This increases the heap size and reduces the chance of hitting the OOM issue. The value 13g is used because it is roughly 80% of resources.limits.memory (16Gi).

spec:
  containers:
    - env:
      - name: CIRCLE_ENV
        value: production
      ...
      ...truncated...
      ...
      - name: JVM_HEAP_SIZE
        value: 13g
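
If a memory limit other than 16Gi is chosen, the roughly-80% rule from step 3 can be worked out with shell arithmetic. This is only a sketch: heap_for_limit is a hypothetical helper, it assumes the limit is a whole number of Gi, and rounding to the nearest whole number is a choice made here for illustration.

```shell
# Hypothetical helper: suggest a JVM heap size (in g) as ~80% of a memory limit given in Gi.
heap_for_limit() {
  limit_gi="$1"
  # Integer arithmetic, rounded to the nearest whole number: (limit * 80 + 50) / 100
  echo $(( (limit_gi * 80 + 50) / 100 ))
}

# Example: a 16Gi limit gives 12.8, which rounds to 13, matching JVM_HEAP_SIZE=13g above.
echo "JVM_HEAP_SIZE=$(heap_for_limit 16)g"
```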

Important Note

Please note that any edits made directly to the deployment will not persist after the next Helm upgrade or other Helm operation. We are currently making these settings configurable from values.yaml.
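
Because the manual edits are lost on a Helm upgrade, it may be convenient to re-apply them non-interactively afterwards. The sketch below is one possible approach and not an official procedure: reapply_overrides and the KUBECTL override variable are hypothetical names, while kubectl set resources, kubectl set env, and kubectl rollout status are standard kubectl subcommands. It assumes the same deployment name and namespace as steps 1 to 3. Without a container flag, kubectl set resources applies the limit to every container in the deployment.

```shell
# Hypothetical helper: re-apply the memory limit and JVM heap size after a Helm upgrade.
# Arguments: memory limit (for example 16Gi) and heap size (for example 13g).
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
reapply_overrides() {
  mem_limit="$1"
  heap="$2"
  kc="${KUBECTL:-kubectl}"
  "$kc" set resources deployment test-results-service \
    --limits="memory=${mem_limit}" -n circleci-server &&
  "$kc" set env deployment/test-results-service \
    "JVM_HEAP_SIZE=${heap}" -n circleci-server &&
  # Wait for the new pods to roll out so a failure is visible right away.
  "$kc" rollout status deployment/test-results-service -n circleci-server
}

# Usage:
#   KUBECTL=echo reapply_overrides 16Gi 13g   # dry preview of the commands
#   reapply_overrides 16Gi 13g                # apply against the cluster
```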
