How to troubleshoot duplicated pods during a CircleCI server upgrade

Overview

When you upgrade CircleCI server, old pods may not terminate cleanly and may keep running alongside the newly deployed ones, causing services to fail. If the pod status shows two pods for the same service, each with a different ReplicaSet hash in its name and both in CrashLoopBackOff, as in the example below, you are likely hitting this issue.

$ kubectl get pods -n <namespace>

NAME                                                        READY   STATUS              RESTARTS         AGE
api-service-84894f748c-5bw7z                                0/1     CrashLoopBackOff    9 (59s ago)      54m
api-service-f45d8b7c4-2fbc9                                 0/1     CrashLoopBackOff    9 (96s ago)      54m
audit-log-service-5857698db4-qtghq                          0/1     CrashLoopBackOff    9 (43s ago)      54m
audit-log-service-5b6b5dcb9c-qskzd                          0/1     CrashLoopBackOff    9 (86s ago)      54m
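
In a large namespace it can be hard to spot duplicates by eye. The following is a minimal sketch, not an official CircleCI tool, that reads the `kubectl get pods` output and prints every service whose pods come from more than one ReplicaSet. The function name `find_duplicated_deployments` is hypothetical, and the sketch assumes pod names follow the `<deployment>-<replicaset-hash>-<pod-suffix>` pattern seen above, so other multi-part pod names may produce false positives.

```shell
# Hypothetical helper: print Deployments whose pods belong to more than one ReplicaSet.
# Assumption: pod names look like <deployment>-<replicaset-hash>-<pod-suffix>.
# Usage: kubectl get pods -n "$NAMESPACE" | find_duplicated_deployments
find_duplicated_deployments() {
  awk 'NR > 1 && $1 != "" {
    n = split($1, parts, "-")
    if (n < 3) next                       # e.g. rabbitmq-0: not a Deployment-style name
    deploy = parts[1]                     # rebuild the Deployment name ...
    for (i = 2; i <= n - 2; i++) deploy = deploy "-" parts[i]
    key = deploy SUBSEP parts[n - 1]      # ... and pair it with the ReplicaSet hash
    if (!(key in seen)) { seen[key] = 1; count[deploy]++ }
  }
  END { for (d in count) if (count[d] > 1) print d }' | sort
}
```

With the output shown above, both `api-service` and `audit-log-service` would be reported, because each has pods from two different ReplicaSets.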

 

Check if a pod is failing to connect to RabbitMQ

Start by checking the logs of one of the duplicated pods. The following example uses an API service pod; its logs show a Connection refused error when the service tries to connect to RabbitMQ.

$ kubectl logs <pod_name> -n <namespace>

2025-10-07T08:33:07.689+0000 [] [main] ERROR circleci.backplane.trace backplane.rabbitmq/connect; attempt=15; canary=false; deploy_environment=production; duration_ms=30014.716913; exception.message=Connection refused; exception.type=class java.net.ConnectException; hostname=api-service-f45d8b7c4-2fbc9; k8s_pod_name=api-service-f45d8b7c4-2fbc9; k8s_pod_namespace=circleci-server-cj; k8s_replicaset=api-service-f45d8b7c4; meta.location=circleci.backplane.rabbitmq:99; revision=0b28f56bd4d926218074253c7d218887d6f4086d; service=api-service; span_kind=internal; status_code=2; status_desc=retries exceeded; version=1.0.23610
...
2025-10-07T08:33:07.691+0000 [] [main] ERROR circleci.backplane.exceptions Exiting due to uncaught exception; ... java.net.ConnectException: Connection refused
  ...
  at circleci.backplane.rabbitmq$connect.invokeStatic(rabbitmq.clj:99)
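
To run the same check across every crashing pod instead of one at a time, something like the sketch below may help. Both function names are hypothetical, and the `grep` pattern is only an assumption based on the log line shown above, so adjust it if your logs are formatted differently.

```shell
# Hypothetical helper: count RabbitMQ "Connection refused" lines in a pod's logs.
# $1 = pod name, $2 = namespace
count_rabbitmq_connect_errors() {
  kubectl logs "$1" -n "$2" < /dev/null | grep -c 'rabbitmq.*Connection refused'
}

# Hypothetical helper: print "<pod> <count>" for every pod in CrashLoopBackOff.
# $1 = namespace
check_crashing_pods() {
  kubectl get pods -n "$1" --no-headers |
    awk '$3 == "CrashLoopBackOff" { print $1 }' |
    while read -r pod; do
      printf '%s %s\n' "$pod" "$(count_rabbitmq_connect_errors "$pod" "$1")"
    done
}
```

If a container has just restarted and its current log is still empty, adding `--previous` to `kubectl logs` returns the log of the previous container instance.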

 

Check if the RabbitMQ pod is pending

Next, let's look at the events for the RabbitMQ pod. In this example, the pod is Pending and hasn't been assigned to any node because the cluster does not have enough CPU resources.

$ kubectl describe pod <pod_name> -n <namespace>

Name:             rabbitmq-0
Namespace:        circleci-server
...
Status:           Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  67m (x11 over 75m)  default-scheduler  0/4 nodes are available: 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
  Warning  FailedScheduling  56m                 default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  43m (x79 over 56m)  default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  43m                 default-scheduler  0/4 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 3 Insufficient cpu. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
  Warning  FailedScheduling  43m                 default-scheduler  0/4 nodes are available: 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
  Warning  FailedScheduling  30m (x11 over 43m)  default-scheduler  0/4 nodes are available: 4 Insufficient cpu. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.
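
The event list can be long, so it may help to narrow it down. The sketch below lists the pods that are stuck in Pending and then extracts the distinct resource shortages reported by the scheduler for one pod. The function names are hypothetical, and the `Insufficient ...` pattern is an assumption based on the scheduler messages shown above.

```shell
# Hypothetical helper: list pods that are stuck in Pending.
# $1 = namespace
pending_pods() {
  kubectl get pods -n "$1" --field-selector=status.phase=Pending
}

# Hypothetical helper: print the distinct resource shortages from a pod's
# FailedScheduling events, one per line.
# $1 = pod name, $2 = namespace
scheduling_shortages() {
  kubectl describe pod "$1" -n "$2" |
    grep 'FailedScheduling' |
    grep -o 'Insufficient [a-z-]*' |
    sort -u
}
```

For the events above, calling `scheduling_shortages rabbitmq-0 <namespace>` would point to CPU as the missing resource.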

 

Solution

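This can be resolved either by adding more nodes to the cluster or by resizing the existing nodes, so that there is enough CPU for the RabbitMQ pod to be scheduled. Once RabbitMQ is running, the other services should be able to connect to it again.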
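
How you add or resize nodes depends on your cloud provider or cluster tooling and is not covered here. As a rough sketch, the helpers below show one way to check how much CPU each node offers and to wait for the RabbitMQ pod to become Ready once capacity has been added. The function names and the 300-second timeout are hypothetical choices, not CircleCI recommendations.

```shell
# Hypothetical helper: show the allocatable CPU of each node.
node_allocatable_cpu() {
  kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu
}

# Hypothetical helper: block until the pod is Ready or the timeout expires.
# $1 = pod name (rabbitmq-0 in the example above), $2 = namespace
wait_for_pod_ready() {
  kubectl wait --for=condition=Ready "pod/$1" -n "$2" --timeout=300s
}
```

After the RabbitMQ pod is Ready, run `kubectl get pods -n <namespace>` again to confirm that the crashing pods recover.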

 
