Nomad Jobs Are Failing With a "No Servers Available" Message

Overview

Nomad fails to start new jobs. This problem can be caused by another CircleCI Server service being unavailable (for example, MongoDB).

The following errors may appear in the Nomad logs:

<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="no servers" rpc=Node.Register
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"<IP_ADDRESS:PORT>\": dial tcp <IP_ADDRESS:PORT>: connect: connection refused"
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] client.server_mgr: no servers available
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] agent.joiner: join failed: error=
<DATETIME> <IP_ADDRESS> nomad[11111]:     | 1 error occurred:
<DATETIME> <IP_ADDRESS> nomad[11111]:     | * Server at address <IP_ADDRESS> failed ping: rpc error: failed to get conn: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "<CERTIFICATE_NAME>")
<DATETIME> <IP_ADDRESS> nomad[11111]:     |
<DATETIME> <IP_ADDRESS> nomad[11111]:     retry=30s
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] client.server_mgr: no servers available

Prerequisites

kubectl access to your CircleCI Server cluster
Your CircleCI Server Kubernetes namespace name

Root Cause

These Nomad errors can be a symptom of an upstream service failure rather than a Nomad issue.

One of the services that can cause this is CircleCI Server's MongoDB.

Solution:

Find pods that are not in Running status
```
kubectl get pods -n <namespace> | grep -v "Running"
```
Look for pods in the CrashLoopBackOff, Error, or Pending status.
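Note that grep -v "Running" also prints the header row and pods in the Completed status, and it hides pods that are Running but not ready. As a minimal sketch, a stricter filter could look like the one below. The name unhealthy_pods is a hypothetical helper, not part of kubectl or CircleCI Server, and it assumes the default table output of kubectl get pods for a single namespace (NAME, READY, STATUS as the first three columns).

```shell
# Hypothetical helper: reads the default `kubectl get pods` table on stdin
# and prints NAME, READY, STATUS for pods that need attention.
unhealthy_pods() {
  awk 'NR == 1 { next }                 # skip the header row
       {
         split($2, ready, "/")          # READY column, e.g. "0/1"
         if ($3 == "Completed") next    # finished Job pods are expected
         if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
       }'
}

# Usage (replace the placeholder with your namespace):
#   kubectl get pods -n <namespace> | unhealthy_pods
```

An empty result suggests every pod is either Running with all containers ready or Completed.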
Inspect the affected pod
```
kubectl describe pod <pod_name> -n <namespace>
```
Check the "Events" section at the bottom of the output.
Check container logs
If the events do not show a clear reason for the failure, check the container logs:
```
kubectl logs <pod_name> -n <namespace>
```
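For a pod in CrashLoopBackOff, the current container may have restarted too recently to have logged the failure. kubectl logs accepts a --previous flag that prints the logs of the previous container instance instead. A minimal sketch, where crash_logs is a hypothetical wrapper name and the 100-line tail is an arbitrary choice:

```shell
# Hypothetical wrapper: show the last 100 log lines from the previous
# (crashed) container instance of a pod.
crash_logs() {
  pod_name="$1"
  namespace="$2"
  kubectl logs "$pod_name" -n "$namespace" --previous --tail=100
}

# Usage:
#   crash_logs <pod_name> <namespace>
```

If the pod runs more than one container, kubectl may also need the -c flag to select a container.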
Once the root cause is identified, fix the failing pod.

After the pod is up and running, you should see that Nomad is back online.
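One way to confirm that the repaired pod has actually become ready is kubectl wait. The sketch below only checks the pod's Ready condition and does not verify Nomad itself. The name wait_for_pod and the 300s default timeout are assumptions made for illustration.

```shell
# Hypothetical wrapper: block until the pod reports the Ready condition,
# or fail once the timeout (default 300s) expires.
wait_for_pod() {
  pod_name="$1"
  namespace="$2"
  timeout="${3:-300s}"
  kubectl wait --for=condition=Ready "pod/$pod_name" -n "$namespace" --timeout="$timeout"
}

# Usage:
#   wait_for_pod <pod_name> <namespace>
```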

Important

The MongoDB version requirement has changed for CircleCI Server 4.9 and later.

Starting with CircleCI Server 4.9.0, MongoDB 4.4 or later is required. Upgrading to Server 4.9.x without first upgrading MongoDB will cause the MongoDB pod to enter CrashLoopBackOff due to a WiredTiger incompatibility.

You will see log messages similar to the following:


"msg":"Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade."
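To check quickly whether a MongoDB pod is failing for this reason, its logs can be searched for the message above. A minimal sketch, where has_wiredtiger_error is a hypothetical helper name and the pod name and namespace in the usage comment are placeholders:

```shell
# Hypothetical helper: exits 0 if the log text on stdin contains the
# WiredTiger start-up failure message, non-zero otherwise.
has_wiredtiger_error() {
  grep -qF 'Failed to start up WiredTiger under any compatibility version'
}

# Usage:
#   kubectl logs <mongodb_pod_name> -n <namespace> | has_wiredtiger_error \
#     && echo "WiredTiger incompatibility found"
```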

To fix this, follow the steps described in the "Upgrade MongoDB to 4.4" guide.
