Nomad Jobs Are Failing With "No Servers Available" Message

Overview

Nomad fails to start new jobs. This problem can be caused by an upstream service being unavailable (for example, MongoDB).

The following errors may appear in the Nomad logs:

<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="no servers" rpc=Node.Register
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"<IP_ADDRESS:PORT>\": dial tcp <IP_ADDRESS:PORT>: connect: connection refused"
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] client.server_mgr: no servers available
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] agent.joiner: join failed: error=
<DATETIME> <IP_ADDRESS> nomad[11111]:     | 1 error occurred:
<DATETIME> <IP_ADDRESS> nomad[11111]:     | * Server at address <IP_ADDRESS> failed ping: rpc error: failed to get conn: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "<CERTIFICATE_NAME>")
<DATETIME> <IP_ADDRESS> nomad[11111]:     |
<DATETIME> <IP_ADDRESS> nomad[11111]:     retry=30s
<DATETIME> <IP_ADDRESS> nomad[11111]: <DATETIME> [WARN] client.server_mgr: no servers available
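
To check whether a Nomad client is logging these errors, the log output can be filtered for the signatures shown above. The following is a minimal sketch, not an official tool: the function name nomad_no_server_errors is hypothetical, and the usage line assumes Nomad runs as a systemd unit named "nomad" on the client host, which may differ in your environment.

```shell
# Hypothetical helper: reads log lines on stdin and prints only the lines
# that match the error signatures quoted in this article.
nomad_no_server_errors() {
  grep -E 'no servers available|error="no servers"|error discovering nomad servers|agent\.joiner: join failed|certificate signed by unknown authority'
}

# Example usage (assumes a systemd unit named "nomad"; adjust as needed):
#   journalctl -u nomad --no-pager | nomad_no_server_errors
```

If the filter prints nothing, the client is probably not affected by the problem described here.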

 

Prerequisites

  • kubectl access to your CircleCI Server cluster
  • Your CircleCI Server Kubernetes namespace name
     

Root Cause

These Nomad errors can be a symptom of an upstream service failure rather than a Nomad issue.

One of the services that can cause this is CircleCI Server's MongoDB.

Solution

  1. Find pods that are not in Running status 

    kubectl get pods -n <namespace> | grep -v "Running"

    Look for pods in CrashLoopBackOff, Error, or Pending status.

  2. Inspect the affected pod 

    kubectl describe pod <pod_name> -n <namespace>

    Check the "Events" section at the bottom of the output.

  3. Check container logs

    If the events do not show a clear reason for the pod failure, check the container logs.

    kubectl logs <pod_name> -n <namespace>

    Once the root cause is identified, fix the failing pod.

    After the pod is up and running, Nomad should come back online.
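
The three steps above can be combined into a single pass. The following is a minimal sketch under stated assumptions: the function name triage_pods is hypothetical, pods in Completed status are assumed to be healthy finished jobs and are skipped, and only the last 50 log lines per pod are shown.

```shell
# Hypothetical helper: walks through steps 1-3 for one namespace.
# Usage: triage_pods <namespace>
triage_pods() {
  ns="$1"
  # Step 1: list pods whose STATUS column (third field) is neither Running
  # nor Completed. Skipping Completed is an assumption of this sketch;
  # a plain `grep -v "Running"` would list those pods as well.
  kubectl get pods -n "$ns" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1 }' |
    while read -r pod; do
      echo "=== $pod ==="
      # Step 2: describe the pod; the "Events" section is at the bottom.
      kubectl describe pod "$pod" -n "$ns" < /dev/null
      # Step 3: show recent logs from every container in the pod.
      # A pod that never started has no logs, so errors are ignored here.
      kubectl logs "$pod" -n "$ns" --all-containers --tail=50 < /dev/null || true
    done
}

# Example usage:
#   triage_pods <namespace>
```

The output can be long when several pods are failing, so redirecting it to a file for review may be more practical.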

Important 

The MongoDB version requirement has changed for Server 4.9 and later.

Starting with CircleCI Server 4.9.0, MongoDB 4.4 or later is required. Upgrading to Server 4.9.x without first upgrading MongoDB will cause the MongoDB pod to enter CrashLoopBackOff due to a WiredTiger incompatibility.

You will see log messages similar to the following:


"msg":"Failed to start up WiredTiger under any compatibility version. This may be due to an unsupported upgrade or downgrade."

To fix this, follow the steps described in the "Upgrade MongoDB to 4.4" guide.
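
To confirm that a crashing MongoDB pod is affected by this incompatibility before starting the upgrade, its logs can be searched for the WiredTiger message. This is a minimal sketch: the function name has_wiredtiger_failure is hypothetical, and <mongodb_pod_name> is a placeholder for the pod name found in step 1 of the Solution section.

```shell
# Hypothetical helper: reads MongoDB log lines on stdin and returns exit
# status 0 if the WiredTiger startup failure message is present.
has_wiredtiger_failure() {
  grep -q 'Failed to start up WiredTiger under any compatibility version'
}

# Example usage. --previous reads the log of the last terminated container,
# which is useful when the pod is in CrashLoopBackOff.
#   kubectl logs <mongodb_pod_name> -n <namespace> --previous | has_wiredtiger_failure \
#     && echo "WiredTiger incompatibility detected"
```

If the message is not found, the MongoDB pod is likely failing for a different reason, and the Events section and full logs should be reviewed instead.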

 
