Overview
The test-results-service pods can enter a crash loop when they run out of memory (OOM) while processing large test result files from the queuing service. Because the queuing service keeps the file queued until it is processed, the pod retries the same large file after every restart, crashing with an OOM error each time and producing a continuous crash loop.
Prerequisites
- Access to the CircleCI server installation and its logs
- Access to the Kubernetes cluster (kubectl) and the Helm chart values
Identifying the test-results-service crash loop
The issue can be identified when the test-results-service pods enter a crash loop. Confirm it by checking the pod status (CrashLoopBackOff) and the pod logs (java.lang.OutOfMemoryError).
$ kubectl get pod -l app=test-results-service -n circleci-server
NAME                                    READY   STATUS             RESTARTS   AGE
test-results-service-3f23b1703-a65a5a   0/1     CrashLoopBackOff   16         8m
$ kubectl logs deployment/test-results-service -n circleci-server
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/usr/local/bin/oom "
# Executing /bin/sh -c "/usr/local/bin/oom "...
Sending: _e{7,79}:JVM OOM|Experienced a JVM out of memory event for test-results-service-3f23b1703-a65a5a
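If the logs alone are not conclusive, the pods' restart history can also help confirm the crash loop. The following is a minimal sketch using standard kubectl commands; the pod names and exact output will differ in your installation:
# Describe the crash-looping pods to see restart counts and the last
# termination state of each container.
$ kubectl describe pod -l app=test-results-service -n circleci-server

# Alternatively, print just the last termination reason per pod.
$ kubectl get pod -l app=test-results-service -n circleci-server \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'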
Resolving the Issue
The issue can be resolved by increasing the memory limit and the JVM heap size for the test-results-service pod, which is done by editing the service's deployment directly. Double the memory limit (for example, from 8Gi to 16Gi) and set the JVM_HEAP_SIZE environment variable to 13g. This should resolve the OOM errors, and the test-results-service pods should stop entering the crash loop. A non-interactive way to apply the same change is sketched after the steps below.
1. Run the following command to edit the test-results-service deployment:
$ kubectl edit deployments test-results-service -n circleci-server
2. Navigate to resources.limits.memory, which should be 8Gi, and change it to 16Gi:
resources:
  limits:
    cpu: "2"
    memory: 16Gi
  requests:
    cpu: 100m
    memory: 300Mi
3. Navigate to the env section and add an entry with name: JVM_HEAP_SIZE and value: 13g. This increases the JVM heap size and reduces the likelihood of hitting the OOM issue. The value 13g is used because it is roughly 80% of the 16Gi resources.limits.memory:
spec:
  containers:
  - env:
    - name: CIRCLE_ENV
      value: production
    ...
    ...truncated...
    ...
    - name: JVM_HEAP_SIZE
      value: 13g
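If you prefer not to edit the deployment interactively, the same change can be applied non-interactively with kubectl. This is a minimal sketch assuming the deployment and namespace names used above; verify the rollout afterwards:
# Raise the container memory limit on the deployment.
$ kubectl set resources deployment test-results-service -n circleci-server --limits=memory=16Gi

# Set the JVM heap size to roughly 80% of the new memory limit.
$ kubectl set env deployment/test-results-service -n circleci-server JVM_HEAP_SIZE=13g

# Wait for the updated pods to roll out and reach a Running state.
$ kubectl rollout status deployment/test-results-service -n circleci-server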
Important Note
Please note that any edits made directly to the deployment will not persist after the next Helm upgrade or similar operation, so they may need to be re-applied. We are currently working on making these values configurable from values.yaml.
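Until that is available, you can check after a Helm operation whether the manual edit is still in place and re-apply it if needed. This is a sketch assuming the same deployment name as above:
# Print the current memory limit and JVM_HEAP_SIZE on the deployment to see
# whether a Helm operation has reverted the manual edit.
$ kubectl get deployment test-results-service -n circleci-server \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}'
$ kubectl get deployment test-results-service -n circleci-server \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="JVM_HEAP_SIZE")].value}{"\n"}'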