[Server] Crashloop State and OOM Errors in Test-Results-Service Pods

Overview

The test-results-service pods can enter a crash loop because of out-of-memory (OOM) errors when they attempt to process large test files from the queuing service. Because the queuing service keeps waiting for the file to be processed, the pod picks up the same large file after every restart, runs out of memory again, and crashes, so the loop repeats.

Prerequisites

  • Access to the CircleCI server installation and its logs
  • Access to the Kubernetes cluster and the Helm chart values

Identifying the test-results-service crash loop

The issue shows up when the test-results-service pods enter a crash loop. Confirm it by checking the pod status (CrashLoopBackOff) and the logs (java.lang.OutOfMemoryError).

$ kubectl get pod -l app=test-results-service -n circleci-server
NAME                                    READY   STATUS             RESTARTS   AGE
test-results-service-3f23b1703-a65a5a   0/1     CrashLoopBackOff   16         8m
$ kubectl logs deployment/test-results-service -n circleci-server
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/usr/local/bin/oom "
# Executing /bin/sh -c "/usr/local/bin/oom "...
Sending: _e{7,79}:JVM OOM|Experienced a JVM out of memory event for test-results-service-3f23b1703-a65a5a
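The two checks above can also be scripted. The following is a minimal sketch, not an official tool: the function name show_oom_evidence is hypothetical, and it assumes the service runs in the first container of each pod. The reported termination reason may be a generic value rather than an explicit OOM reason when the JVM exits on its own, so the log search is the more reliable signal.

```shell
#!/bin/sh
# Hypothetical helper: gather OOM evidence for the test-results-service pods.
NAMESPACE="circleci-server"

show_oom_evidence() {
  # List each matching pod with the reason its first container last terminated.
  kubectl get pod -l app=test-results-service -n "$NAMESPACE" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

  # Search the logs of the previous (crashed) container of the given pod
  # for the JVM out-of-memory error.
  kubectl logs "$1" --previous -n "$NAMESPACE" | grep "OutOfMemoryError"
}
```

To use it, load the function into a shell and pass a pod name taken from the kubectl get pod output, for example: show_oom_evidence test-results-service-3f23b1703-a65a5a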

Resolving the Issue

The issue can be resolved by increasing the memory limit and the JVM heap size for the test-results-service pod. This is done by editing the deployment of the service directly. Double the memory limit, for example from 8Gi to 16Gi, and set the JVM_HEAP_SIZE environment variable to roughly 80% of the new limit (13g for a 16Gi limit). This should resolve the OOM errors, and the test-results-service pods should stop crash looping.

1. Run the following command to edit the test-results-service deployment:

$ kubectl edit deployments test-results-service -n circleci-server

2. Navigate to resources.limits.memory, which should be 8Gi, and change it to 16Gi.

resources:
  limits:
    cpu: "2"
    memory: 16Gi
  requests:
    cpu: 100m
    memory: 300Mi

3. Navigate to the env section and add an entry with name: JVM_HEAP_SIZE and value: 13g. This increases the heap size and reduces the likelihood of hitting the OOM error. The value is 13g because that is roughly 80% of resources.limits.memory (0.8 × 16Gi = 12.8Gi, rounded to 13g).

spec:
  containers:
  - env:
    - name: CIRCLE_ENV
      value: production
    ...
    ...truncated...
    ...
    - name: JVM_HEAP_SIZE
      value: 13g
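For a different memory limit, the 80% rule and the two changes can be sketched as a small script. This is an illustrative alternative to the interactive edit, not the documented procedure: the function names heap_size_for_limit and apply_memory_fix are hypothetical, the heap is rounded to the nearest whole gigabyte, and kubectl set resources changes every container in the deployment unless it is restricted with its --containers option. Each kubectl set command triggers its own rollout.

```shell
#!/bin/sh
NAMESPACE="circleci-server"

# Print a heap size of roughly 80% of a memory limit given in Gi,
# rounded to the nearest whole gigabyte (integer arithmetic only).
heap_size_for_limit() {
  echo "$(( ($1 * 80 + 50) / 100 ))g"
}

# Hypothetical helper: set the memory limit and the matching JVM_HEAP_SIZE
# on the deployment without opening an editor.
apply_memory_fix() {
  limit_gi="$1"
  heap="$(heap_size_for_limit "$limit_gi")"
  kubectl set resources deployment test-results-service \
    --limits="memory=${limit_gi}Gi" -n "$NAMESPACE"
  kubectl set env deployment/test-results-service \
    "JVM_HEAP_SIZE=${heap}" -n "$NAMESPACE"
}

heap_size_for_limit 16   # prints 13g
```

Calling apply_memory_fix 16 would apply the same 16Gi limit and 13g heap that the manual steps describe.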

Important Note

Please note that edits made directly to a deployment do not persist: the next Helm upgrade or other Helm operation overwrites them, and the change must be applied again. We are currently working on making these settings configurable from the Helm values.yaml.
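Because a Helm operation can silently revert the change, it may help to check the live values afterwards. The following is a minimal sketch under the assumption that the service is the first container in the deployment's pod template; the function name show_memory_settings is hypothetical.

```shell
#!/bin/sh
NAMESPACE="circleci-server"

# Hypothetical helper: print the current memory limit and JVM_HEAP_SIZE
# of the first container in the test-results-service deployment.
show_memory_settings() {
  kubectl get deployment test-results-service -n "$NAMESPACE" \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{.spec.template.spec.containers[0].env[?(@.name=="JVM_HEAP_SIZE")].value}{"\n"}'
}
```

If the first line shows the old limit or the second line is empty after a Helm upgrade, the edit was overwritten and needs to be reapplied.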
