Overview
Nomad fails to start new jobs. This problem can be caused by another service being unavailable (for example, MongoDB).
The following errors may appear in the Nomad logs:
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="no servers" rpc=Node.Register
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"<IP_ADDRESS:PORT>\": dial tcp <IP_ADDRESS:PORT>: connect: connection refused"
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] client.server_mgr: no servers available
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] agent.joiner: join failed: error=
<DATETIME> <IP_ADDRESS> nomad[11111]: | 1 error occurred:
<DATETIME> <IP_ADDRESS> nomad[11111]: | * Server at address <IP_ADDRESS> failed ping: rpc error: failed to get conn: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "<CERTIFICATE_NAME>")
<DATETIME> <IP_ADDRESS> nomad[11111]: |
<DATETIME> <IP_ADDRESS> nomad[11111]: retry=30s
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] client.server_mgr: no servers available
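If the Nomad log has been saved to a file, a small text filter can show at a glance which of these symptoms are present. The following is only a sketch: the function name nomad_symptoms and the file name nomad.log are hypothetical, and the filter only counts matching lines in text it is given. It does not talk to Nomad.

```shell
#!/bin/sh
# Count how many log lines (read from stdin) match each symptom quoted above.
# This is a plain text filter and makes no calls to Nomad or Consul.
nomad_symptoms() {
  awk '
    /no servers/                              { no_servers++ }
    /connection refused/                      { refused++ }
    /certificate signed by unknown authority/ { bad_cert++ }
    END {
      printf "no_servers=%d\n", no_servers
      printf "connection_refused=%d\n", refused
      printf "tls_unknown_authority=%d\n", bad_cert
    }
  '
}

# Example (nomad.log is a hypothetical file holding the saved log output):
#   nomad_symptoms < nomad.log
```

Non-zero counts only confirm that the log matches this article. They do not identify which service is failing, so the pod checks in the Solution section are still needed.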
Prerequisites
- kubectl access to your CircleCI Server cluster
- Your CircleCI Server Kubernetes namespace name
Root Cause
These Nomad errors can be a symptom of an upstream service failure rather than a Nomad issue.
One service that can cause this is CircleCI Server's MongoDB.
Solution:
- Find pods that are not in Running status
  kubectl get pods -n <namespace> | grep -v "Running"
  Look for pods in CrashLoopBackOff, Error, or Pending status.
- Inspect the affected pod
  kubectl describe pod <pod_name> -n <namespace>
  Check the "Events" section at the bottom of the output.
- Check container logs
  If the events do not show a clear reason for the pod failure, check the container logs:
  kubectl logs <pod_name> -n <namespace>
Once the root cause is identified, fix the failing pod.
After the pod is up and running, Nomad should come back online.
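The three steps above can be strung together in a short script. This is a minimal sketch, not an official CircleCI tool: the function names unhealthy_pods and triage_namespace are hypothetical, and the filter assumes the default kubectl get pods column order (NAME, READY, STATUS, RESTARTS, AGE). Unlike a plain grep -v "Running", it also skips the header line and pods in Completed status.

```shell
#!/bin/sh
# Read `kubectl get pods --no-headers` output on stdin and print the name and
# status of every pod that is neither Running nor Completed.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# For each unhealthy pod in the namespace given as $1, print the describe
# output (including Events) followed by the last 50 log lines per container.
triage_namespace() {
  ns="$1"
  kubectl get pods -n "$ns" --no-headers | unhealthy_pods |
  while read -r pod status; do
    echo "=== $pod ($status) ==="
    kubectl describe pod "$pod" -n "$ns"
    kubectl logs "$pod" -n "$ns" --tail=50 --all-containers
  done
}

# Example:
#   triage_namespace <namespace>
```

A pod can also report Running while its containers are not ready (for example 0/1 in the READY column), so it is still worth scanning the full pod list by eye.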
Important
The MongoDB version requirement has changed for Server 4.9+.
Starting from CircleCI Server 4.9.0, MongoDB 4.4 or later is required. Upgrading to Server 4.9.x without first upgrading MongoDB will cause the MongoDB pod to enter CrashLoopBackOff due to a WiredTiger incompatibility.
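One way to confirm this condition is to search the MongoDB pod logs for the WiredTiger startup failure text. The helper below is a rough sketch under stated assumptions: the function name has_wiredtiger_failure is hypothetical, and <pod_name> stands for whatever your MongoDB pod is called in kubectl get pods.

```shell
#!/bin/sh
# Exit 0 if the log text on stdin contains the WiredTiger startup failure
# message that MongoDB reports in this situation, non-zero otherwise.
has_wiredtiger_failure() {
  grep -q 'Failed to start up WiredTiger under any compatibility version'
}

# Example, using the MongoDB pod name from `kubectl get pods`:
#   kubectl logs <pod_name> -n <namespace> | has_wiredtiger_failure \
#     && echo "WiredTiger incompatibility found"
#
# To see which image (and therefore which version tag) the pod is running:
#   kubectl get pod <pod_name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
```

If the pod has already restarted, adding --previous to kubectl logs shows the output of the last crashed container instead of the current one.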
You will see log messages similar to the following:
"msg":"Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade."
To fix this, follow the steps described in the "Upgrade MongoDB to 4.4" guide.