[SERVER] CircleCI Server Diagnostic Guide: Collecting Critical Information

This guide helps you collect the right diagnostic information when troubleshooting issues in your CircleCI self-hosted server environment. Providing complete logs upfront significantly reduces resolution time by eliminating back-and-forth requests.

Important: Timely Log Collection

CircleCI logs are retained for a limited time, and log rotation can discard critical information. Collect support bundles and relevant logs within 10 minutes of the issue occurring so that the evidence is still available.

Quick Reference

Issue Type | Essential Logs | Section
Docker Executor (Job Delay, Infra Fail) | Support bundle + Nomad alloc logs | Docker Executor Issues
Machine Executor (Job Delay, Infra Fail) | Support bundle + Journalctl + machine-provisioner logs | Machine Executor Issues
Runner (Infra Fail, Task Claim) | Support bundle + Runner pod/service logs | Runner Issues
Nomad Server / Scheduling | Support bundle + Nomad server debug logs | Nomad Server Issues
Nomad Autoscaler / Scaling | Support bundle + nomad-autoscaler logs | Nomad Autoscaler Issues
Nomad Client / Docker Network | Support bundle + Nomad client logs + docker network ls | Nomad Client Issues
Permission / Cloud Provider Errors | Support bundle + Cloud provider error messages | Cloud Provider Permission Issues
API Connection / Webhook | Support bundle + API request logs with -vvv | API Connection Issues
Custom Integration (VCS, Proxy) | Support bundle + Integration logs with -vvv | Custom Integration Issues
Widespread Infra Failures (Cascade) | Support bundle + distributor/execution-gateway/contexts-service logs | Cascade / Widespread Infra Failures

Initial Diagnostics

1. Support Bundle Collection (REQUIRED for all issues)

For every issue, start by collecting a support bundle:

kubectl support-bundle \
  https://raw.githubusercontent.com/CircleCI-Public/server-scripts/main/support/support-bundle.yaml \
  -n <namespace>

Replace <namespace> with your CircleCI Server namespace (commonly circleci-server).

Prerequisites:

Important notes:

  • If you receive a timeout or rate limiter error, the bundle may still contain valuable information. See Troubleshooting: Rate Limiter Error below.
  • Support bundles have a default limit of 10,000 lines per pod log. If your pods are generating high log volume, critical error messages may be pushed out before the bundle captures them. Generate the bundle immediately after reproducing the issue.
  • If the issue occurred in the past and you cannot reproduce it, generate a bundle anyway — it still captures cluster state, pod status, and recent logs.
  • Bundles do not contain job execution network detail (e.g., outbound connections from inside a Docker executor). For network-level forensics, rely on step logs, VPC Flow Logs, or proxy logs instead.
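
Because of the per-pod line limit, it can help to check how many lines a pod has written since the incident before relying on the bundle alone. The following is a minimal sketch, assuming the 10,000-line limit stated above; the function name exceeds_bundle_limit is illustrative and not part of any CircleCI tooling.

```shell
# Hypothetical helper: reads log text on stdin and reports whether it is
# longer than the per-pod line limit (default 10000, taken from the note above).
exceeds_bundle_limit() {
    local limit="${1:-10000}"
    local count
    count=$(wc -l | tr -d ' ')
    if [ "$count" -gt "$limit" ]; then
        echo "over limit: ${count} lines (limit ${limit})"
        return 1
    fi
    echo "within limit: ${count} lines (limit ${limit})"
}
```

For example: kubectl logs <pod-name> -n <namespace> --since=10m | exceeds_bundle_limit. If the count is over the limit, save that pod's log manually in addition to generating the bundle.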

2. Retrieving Job Details (IMPORTANT)

For job-specific issues, collect the job details via the API:

curl -H "Circle-Token:${CIRCLE_TOKEN}" -s \
  "https://<your-server-domain>/api/v1.1/project/<vcs-type>/<org>/<project>/<job-number>" \
  | tee job-details.json

This provides step timing details, build parameters, start/completion times, and job history.

3. Server Version Information

When submitting a support ticket, always include:

  • CircleCI Server version (e.g., 4.5.3)
  • Service image tags if known — you can retrieve these with:
kubectl get deploy -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

4. Support Bundle Log Triage (What to Search)

Once you have the support bundle, these are the most useful log files and search patterns for common issues:

Service Log (inside bundle) | What to Search For | Issue Type
distributor-external | infra_fail, INFRASTRUCTURE_FAIL, time_in_queue, context deadline exceeded | Job failures, queueing delays
execution-gateway-api (api.log) | gateway end, CompleteTask, task/config, timeout | Task lifecycle failures
contexts-service | HikariPool, connection timeout, vault, gRPC | Context fetch failures (cascade)
machine-provisioner (externalapi.log) | infra_fail, UnauthorizedOperation, No space left | Machine executor provisioning
nomad-server | rpc error, failed to get conn, lead thread | Nomad scheduling / leader
nomad-autoscaler | "expected 1 Autoscaling Group, got 0", policy | Autoscaler misconfiguration
runner-admin | task claim, infra_fail | Runner task failures
legacy-notifier | infra_fail | Notification pipeline
distributor-cleaner | infra_fail | Job cleanup failures
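
The searches above can be run in one pass over an extracted bundle. This is a sketch only: it assumes the bundle archive has already been unpacked into a directory, and because the file layout inside a bundle can vary it simply searches recursively. The function name triage_bundle is hypothetical.

```shell
# Hypothetical helper: for each pattern, list the files under an extracted
# support bundle directory that contain it, with a per-file match count.
# Usage: triage_bundle <bundle-dir> <pattern> [<pattern> ...]
triage_bundle() {
    local dir="$1"
    shift
    local pattern
    for pattern in "$@"; do
        # -r: recurse, -F: fixed string, -c: per-file count of matching lines.
        # grep exits non-zero when nothing matches, which is not an error here.
        { grep -rFc -- "$pattern" "$dir" 2>/dev/null || true; } \
            | awk -F: -v p="$pattern" '$NF > 0 { print p ": " $0 }'
    done
}
```

For example: triage_bundle ./support-bundle "infra_fail" "context deadline exceeded" "HikariPool". Each output line has the form pattern: path:count, so the noisiest files stand out quickly.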

Docker Executor Issues

If experiencing delays between job steps, infrastructure failures, or job timeouts:

  1. Collect a support bundle immediately (see above).
  2. Get the specific job ID from the CircleCI UI or API.
  3. Check the Nomad allocation:

    kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
      -n <namespace> -- nomad status <job-id>
  4. Examine allocation logs for the specific job:

    kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
      -n <namespace> -- nomad alloc logs -stderr <allocation-id>
  5. Check distributor for queueing delays — search the distributor logs for the time_in_queue metric associated with the job. High values indicate jobs are waiting for available Nomad capacity.
  6. For comprehensive logging of all running jobs, use the CircleCI support script to continuously capture Nomad job state and container logs:

    CircleCI Support Scripts - Server Docker Executor Logger

    Alternatively, you can use this inline script, which loops until you stop it and repeatedly captures a snapshot of all running Nomad jobs:

    #!/bin/bash
    # Repeatedly snapshot Nomad job status, job stderr, and container logs into ba-logs/.
    NS="<namespace>"   # replace with your CircleCI Server namespace
    NOMAD_POD=$(kubectl get pods -l app=nomad-server -n "$NS" -o jsonpath='{.items[0].metadata.name}')

    mkdir -p ba-logs

    while :; do
        # First column of `nomad status` (minus the header row) is the job ID
        kubectl exec "$NOMAD_POD" -n "$NS" -- nomad status | tail -n +2 | awk '{ print $1 }' | while read -r job; do
            date_dir=$(date +%s)
            mkdir -p "ba-logs/${date_dir}/${job}"

            kubectl exec "$NOMAD_POD" -n "$NS" -- nomad status "${job}" > "ba-logs/${date_dir}/${job}/status.txt"
            kubectl exec "$NOMAD_POD" -n "$NS" -- nomad logs -stderr -job "${job}" > "ba-logs/${date_dir}/${job}/stderr.txt"

            # Assumes allocation rows start at line 18 of the `nomad status <job>` output
            kubectl exec "$NOMAD_POD" -n "$NS" -- nomad status "${job}" | tail -n +18 | awk '{ print $1 }' | while read -r alloc; do
                # One docker-ps file per allocation so later allocations do not overwrite earlier ones
                kubectl exec "$NOMAD_POD" -n "$NS" -- nomad alloc exec "${alloc}" docker ps -a > "ba-logs/${date_dir}/${job}/docker-ps-${alloc}.txt"
                kubectl exec "$NOMAD_POD" -n "$NS" -- nomad alloc exec "${alloc}" docker ps -a | tail -n +2 | awk '{ print $1 }' | while read -r cid; do
                    # Capture both stdout and stderr of the container
                    kubectl exec "$NOMAD_POD" -n "$NS" -- nomad alloc exec "${alloc}" docker logs "$cid" > "ba-logs/${date_dir}/${job}/${cid}.txt" 2>&1
                done
            done
        done

        # Prune snapshot files older than one day, then remove emptied directories
        find ba-logs -type f -mtime +1 -exec rm {} \;
        find ba-logs -mindepth 1 -type d -empty -delete
        echo "Snapshot captured at $(date)"
        sleep 1
    done

    Note: Replace <namespace> with your CircleCI Server namespace before running.
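
Extracting allocation IDs by a fixed line offset depends on the exact layout of the nomad status output, which can shift when a job has more task groups or extra sections. A possible alternative, sketched here under the assumption that the output contains an "Allocations" heading followed by one column-header row, keys on that heading instead. The function name alloc_ids is hypothetical.

```shell
# Hypothetical helper: print allocation IDs from `nomad status <job>` output
# read on stdin. Assumes the usual layout: an "Allocations" heading, one
# column-header row, then one allocation per line until a blank line or EOF.
alloc_ids() {
    awk '
        /^Allocations$/     { in_allocs = 1; header = 1; next }
        in_allocs && header { header = 0; next }     # skip the column headers
        in_allocs && NF == 0 { in_allocs = 0 }       # a blank line ends the table
        in_allocs           { print $1 }
    '
}
```

For example: kubectl exec "$NOMAD_POD" -n "$NS" -- nomad status "$job" | alloc_ids, using the NOMAD_POD and NS variables from the script above.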


Machine Executor Issues

For issues with machine executors (delays, infra failures, provisioning errors):

  1. Collect a support bundle immediately (within 10 minutes of the issue).
  2. Add a background logging step to your CircleCI config to capture system logs during job execution:

    jobs:
      your-job-name:
        machine: true
        steps:
          # Start log streaming first so it keeps running alongside the steps below
          - run:
              name: Retrieve system logs
              command: journalctl --no-pager -f
              background: true
              when: always

          # Your regular job steps here
  3. Collect machine provisioner logs:

    kubectl logs -l app=machine-provisioner-provisioner -n <namespace> > machine-provisioner-logs.txt
  4. Look for common error patterns in the logs:
    • Disk space: No space left on device
    • Network: Connection timed out
    • Cloud provider permissions: UnauthorizedOperation
    • Resources: Cannot allocate memory
    • Instance provisioning: dpkg lock or unattended-upgrades (race condition at boot — see Nomad Client Issues)

Runner Issues

For issues with self-hosted runners (task claim failures, infrastructure failures):

Note: Container runners and machine runners do not use Nomad for job execution. However, they share upstream services with the Docker executor path (distributor, Postgres, RabbitMQ, etc.). This means a backend service outage can cause correlated failures across both Docker executor and runner jobs simultaneously.

  1. Collect a support bundle immediately.
  2. Collect runner service logs:

    kubectl logs -l app=runner-admin -n <namespace> --tail=5000 > runner-admin-logs.txt
  3. Collect runner agent logs from the machine or pod running the runner:
    • For container runner: kubectl logs -l app=container-agent -n <runner-namespace> --tail=5000 > container-agent-logs.txt
    • For machine runner: Check the runner agent log file (default location varies by OS)
  4. If using container runner, generate a container runner support bundle as well:

    kubectl support-bundle \
      https://raw.githubusercontent.com/CircleCI-Public/circleci-support-scripts/refs/heads/main/container-runner-support-bundle/support-bundle.yaml

Nomad Server Issues

For Nomad scheduling problems, leader election failures, or RPC errors:

  1. Collect a support bundle immediately.
  2. Enable debug-level logging on Nomad server if the issue is intermittent:
    • Set log_level for the Nomad server to DEBUG or TRACE via your Helm values
    • Reproduce the issue
    • Collect the support bundle
  3. Manually capture Nomad server logs (especially if the pod is about to restart or be deleted):

    kubectl logs deploy/nomad-server -n <namespace> --tail=1000000 > nomad-server.log
  4. Check for common Nomad errors:
    • RPC failures: rpc error: failed to get conn
    • Leader issues: lead thread didn't get connection
    • Eval failures: failed to update evaluation
    • Connection limits: rpc_max_conns_per_client (if many Nomad clients overwhelm server)
  5. RPC connection limits — If you see RPC connection errors at scale, note that rpc_max_conns_per_client and http_max_conns_per_client are Nomad server settings. Changing them requires only a Nomad server pod restart (via Helm rollout); Nomad clients will reconnect automatically without requiring a client-side restart.

Nomad Autoscaler Issues

For problems with Nomad client scaling (nodes not scaling up/down, drain storms, ASG issues):

  1. Collect a support bundle immediately.
  2. Collect nomad-autoscaler logs:

    kubectl logs -l app=nomad-autoscaler -n <namespace> --tail=5000 > nomad-autoscaler-logs.txt
  3. Common autoscaler issues:
    • expected 1 Autoscaling Group, got 0 — The autoscaler cannot find the configured ASG. Verify the ASG name in your autoscaler policy matches the actual AWS ASG name and that IAM permissions allow autoscaling:DescribeAutoScalingGroups.
    • min > max in policy — If min is set higher than max, the policy fails validation and is not loaded, meaning the autoscaler is effectively broken for that resource class. Check the autoscaler ConfigMap:

      kubectl get configmap -l app=nomad-autoscaler -n <namespace> -o yaml
    • Drain storms (mass node draining) — On older autoscaler versions, scale-in can drain too many nodes at once. Upgrade to autoscaler v0.4.6+ (Server 4.8.x+) which supports max_scale_down to limit how many nodes drain simultaneously. Also consider node_filter_ignore_drain to prevent cascading drain decisions.
    • Node selection for scale-in — The node_selector_strategy setting (e.g., empty, empty_ignore_system) controls which nodes are drained first. This is in the autoscaler target block, separate from CPU/memory checks.
  4. Verify ASG and node health:

    # Check current Nomad nodes and their status
    kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
      -n <namespace> -- nomad node status
    
    # Check if nodes match expected resource classes
    kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
      -n <namespace> -- nomad node status -verbose <node-id> | grep -i class
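
The min > max condition described above can be checked mechanically. This is a minimal sketch, assuming the policy text uses HCL-style "min = N" and "max = N" lines with each min appearing before its matching max; the actual ConfigMap layout in a given Server release may differ. The function name check_min_max is hypothetical.

```shell
# Hypothetical helper: read autoscaler policy text on stdin and flag any
# policy whose min exceeds max. Assumes HCL-style `min = N` / `max = N`
# lines, with each min followed by its matching max.
check_min_max() {
    awk '
        /^[ \t]*min[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, ""); min = $0 + 0; have_min = 1; next
        }
        /^[ \t]*max[ \t]*=/ && have_min {
            sub(/^[^=]*=[ \t]*/, ""); max = $0 + 0
            if (min > max) { printf "INVALID: min=%d > max=%d\n", min, max; bad = 1 }
            else           { printf "ok: min=%d max=%d\n", min, max }
            have_min = 0
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

For example: kubectl get configmap -l app=nomad-autoscaler -n <namespace> -o yaml | check_min_max. A non-zero exit status means at least one policy would fail validation.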

Nomad Client Issues

For problems with Nomad client nodes (EC2/ASG instances running Docker executor jobs):

  1. Collect a support bundle immediately.
  2. Docker network ci-privileged missing — If docker network prune was run on a Nomad client, the ci-privileged bridge network (required for Docker executor jobs) will be removed. This causes all jobs on that node to fail.
    • Why this happens: ci-privileged is created during instance bootstrap (cloud-init / user-data), which runs only on first boot. A simple reboot does not recreate it.
    • Recovery option A (recommended): Terminate the instance and let the ASG replace it. A new instance will run cloud-init and recreate the network.
    • Recovery option B (manual):

      docker network create --label keep --driver=bridge \
        --opt com.docker.network.bridge.name=ci-privileged ci-privileged
      systemctl restart docker-gc
    • Prevention: Never run docker network prune on Nomad client instances. Although the network carries the keep label, docker network prune does not honor labels unless a label filter is passed explicitly.
  3. Bootstrap race condition (unattended-upgrades) — On Ubuntu-based Nomad clients, the unattended-upgrades service can hold a dpkg lock at boot, causing the startup apt-get commands to fail. This results in a Nomad client that never joins the cluster.
    • Check for dpkg lock or Could not get lock in instance system logs
    • Workarounds: disable unattended-upgrades in the AMI, or reprovision the instance
  4. Collect Nomad client logs from the instance if accessible:

    # If you have SSH access to the Nomad client EC2 instance
    journalctl -u nomad --no-pager > nomad-client.log
    docker network ls > docker-networks.txt
    docker ps -a > docker-containers.txt
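
Once docker-networks.txt has been collected, the presence of the ci-privileged network can be confirmed without reading the listing by eye. A small sketch, assuming the standard docker network ls column order (NETWORK ID, NAME, DRIVER, SCOPE); the function name has_network is hypothetical.

```shell
# Hypothetical helper: read `docker network ls` output on stdin and report
# whether a network with the given name (default: ci-privileged) is listed.
has_network() {
    local name="${1:-ci-privileged}"
    # Column 2 of each data row is the network NAME; row 1 is the header.
    if awk -v n="$name" 'NR > 1 && $2 == n { found = 1 } END { exit !found }'; then
        echo "present: ${name}"
    else
        echo "MISSING: ${name}"
        return 1
    fi
}
```

For example: has_network < docker-networks.txt. A MISSING result on a Nomad client points to the recovery options in step 2.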

Cascade / Widespread Infra Failures

When many or all jobs fail simultaneously with infrastructure_fail, the root cause is typically a shared backend service under stress. The most common pattern involves the contexts-service:

Request path: build-agent → distributor → execution-gateway

  1. Collect a support bundle immediately.
  2. Collect targeted service logs:

    # Both services in the critical path
    kubectl logs -l app=execution-gateway -n <namespace> --tail=10000 > execution-gateway-logs.txt
    kubectl logs -l app=distributor-external -n <namespace> --tail=10000 > distributor-external-logs.txt
  3. What to look for:
    • execution-gateway: context deadline exceeded on downstream calls, /api/v2/task/config timeouts
    • distributor: gateway end: httpclient do: context deadline exceeded — indicates the execution-gateway was slow or unresponsive
  4. Understanding retry amplification — A single slow downstream service can cause cascading timeouts:
    • build-agent has a ~30s budget, retrying via ex/httpclient
    • The distributor → execution-gateway contexts call has a ~5s budget with ~2s per attempt
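
To see why the budgets above amplify load, a rough worst-case count can be worked out. This is illustrative arithmetic only, under assumptions not stated in this guide: no backoff between attempts, every attempt running to its full timeout, and each build-agent retry waiting out the full ~5s inner budget. Real retry behavior will differ.

```shell
# Rough upper bound on attempts started within a time budget, assuming no
# backoff and every attempt running to its full timeout (ceiling division).
max_attempts() {
    local budget="$1" per_attempt="$2"
    echo $(( (budget + per_attempt - 1) / per_attempt ))
}

outer=$(max_attempts 30 5)   # ~30s budget, each try waiting out the ~5s inner budget
inner=$(max_attempts 5 2)    # ~5s budget, ~2s per attempt
echo "worst case: ${outer} x ${inner} = $(( outer * inner )) downstream calls per job"
```

Under these assumptions, one stuck job can generate many times more downstream calls than a healthy one, which is how a single slow service turns into a widespread failure.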

Cloud Provider Permission Issues

When encountering cloud provider (AWS, GCP, Azure) errors:

  1. Collect a support bundle immediately.
  2. For AWS — decode authorization error messages:

    aws sts decode-authorization-message --encoded-message "<encoded_message>"
  3. Check CloudTrail logs for denied actions:

    # Search for errors related to a specific IAM role
    aws cloudtrail lookup-events \
      --lookup-attributes AttributeKey=Username,AttributeValue=<role-name> \
      --max-items 100
    
    # Check for errors in recent events
    aws cloudtrail lookup-events \
      --start-time $(date -u -d "1 hour ago" +"%Y-%m-%dT%H:%M:%SZ") \
      --query "Events[?contains(CloudTrailEvent, 'errorCode') || contains(CloudTrailEvent, 'errorMessage')]"
  4. Verify IAM policies for:
    • Correct resource ARNs (check account IDs, regions)
    • Service control policies (look for explicit denies)
    • Cross-account access permissions
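
The encoded message passed to decode-authorization-message usually has to be copied out of a longer error line. A small sketch that extracts it, assuming the error text contains the phrase "Encoded authorization failure message:" followed by the encoded value; the function name extract_encoded_message is hypothetical.

```shell
# Hypothetical helper: pull the encoded value out of AWS error text read on
# stdin. Assumes the error contains the phrase
# "Encoded authorization failure message: <value>" and that the value
# contains no spaces.
extract_encoded_message() {
    sed -n 's/.*Encoded authorization failure message: \([^ ]*\).*/\1/p'
}
```

For example: aws sts decode-authorization-message --encoded-message "$(extract_encoded_message < machine-provisioner-logs.txt | head -1)". Lines without the phrase produce no output.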

API Connection Issues

If experiencing issues with API connections, webhook failures, or approval delays:

  1. Collect a support bundle immediately.
  2. Capture API response details with verbose output:

    curl -vvv -X POST "https://<your-server-domain>/api/v2/workflow/<workflow-id>/approve/<approval-request-id>"
  3. Check Nginx/ingress logs:

    kubectl logs -l app=nginx -n <namespace> --tail=500 > nginx-logs.txt
  4. Look for specific HTTP response codes:
    • 404 — the resource may not be ready yet
    • 403 — permission issues (check token validity, presigned URL expiry)
    • 5xx — backend service errors
    • Slow responses (>1s) — backend processing delays
  5. For webhook or approval timing issues, capture timestamps of:
    • Job completion events in logs
    • API call attempts
    • Webhook delivery attempts
  6. S3 presigned URL 403 errors — If test results or artifacts return 403 Forbidden, the presigned URL may have expired (default TTL is 900 seconds / 15 minutes). Refreshing the page or re-fetching the resource generates a new presigned URL.
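
When investigating a presigned URL 403, the signing time and lifetime can be read directly from the URL. This sketch assumes the URL is a Signature Version 4 presigned URL carrying X-Amz-Date and X-Amz-Expires query parameters; the function name presigned_url_info is hypothetical.

```shell
# Hypothetical helper: print the signing time and lifetime embedded in a
# SigV4 presigned URL (the X-Amz-Date and X-Amz-Expires query parameters).
presigned_url_info() {
    local url="$1" signed expires
    signed=$(printf '%s\n' "$url" | sed -n 's/.*[?&]X-Amz-Date=\([^&]*\).*/\1/p')
    expires=$(printf '%s\n' "$url" | sed -n 's/.*[?&]X-Amz-Expires=\([^&]*\).*/\1/p')
    echo "signed_at=${signed:-unknown} expires_in_seconds=${expires:-unknown}"
}
```

Comparing signed_at plus expires_in_seconds against the time of the failed request shows whether expiry explains the 403 or whether permissions should be checked instead.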

Custom Integration Issues

For issues with VCS integrations (GitHub Enterprise, Bitbucket Data Center), proxy setups, or custom networking:

  1. Collect a support bundle immediately.
  2. For VCS integration issues:
    • Capture webhook delivery logs from your VCS provider's UI
    • Check TLS certificate configuration and expiry
    • Verify network connectivity between CircleCI and your VCS
  3. For proxy integrations:
    • Collect complete request/response cycles including headers
    • Log both incoming and outgoing payloads (if possible)
    • Verify that signatures and headers are preserved through the proxy

Monitoring and Observability

Proactive monitoring helps catch issues before they escalate. If you have Prometheus available in your cluster:

  1. Nomad metrics — Expose Nomad server metrics for Prometheus scraping by adding a scrape target for the Nomad /v1/metrics endpoint via a ConfigMap. Key metrics to watch:
    • nomad_nomad_rpc_request — RPC request rates and errors
    • nomad_client_allocs_running — running allocation count
    • nomad_nomad_blocked_evals — scheduling backlog
  2. Pod resource usage — Monitor CPU and memory for critical services:
    • contexts-service — watch for connection pool exhaustion under load
    • execution-gateway — watch for high response times
    • nomad-server — watch for memory pressure during high concurrency
  3. Health check endpoints — Most CircleCI services expose health endpoints. Use Kubernetes liveness/readiness probes to detect unresponsive services early.
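
Before wiring up a scrape target, the key Nomad metrics listed above can be spot-checked by hand. A sketch, assuming the metrics endpoint is reachable and returns Prometheus text format; the function name filter_key_metrics and the address placeholder are illustrative.

```shell
# Hypothetical helper: keep only the metric families named above from
# Prometheus text-format output read on stdin. Prefix matching is used, so
# suffixed series belonging to those families are kept as well.
filter_key_metrics() {
    # grep exits non-zero when nothing matches; treat that as "no output".
    grep -E '^(nomad_nomad_rpc_request|nomad_client_allocs_running|nomad_nomad_blocked_evals)' || true
}
```

For example: curl -s "http://<nomad-server-address>:4646/v1/metrics?format=prometheus" | filter_key_metrics, where 4646 is Nomad's default HTTP port and Prometheus output is assumed to be enabled in the Nomad telemetry settings.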

Troubleshooting: Rate Limiter Error

When generating support bundles, you might encounter:

failed to get log stream: client rate limiter Wait returned an error: context deadline exceeded

This happens when Kubernetes API rate limiting is too restrictive for the bundle collection process.

Solution:

  1. Check current settings:

    kubectl get configmap kube-proxy-config -n kube-system -o yaml
  2. Edit the configuration:

    kubectl edit configmap kube-proxy-config -n kube-system
  3. Update the rate limits (the support bundle plugin expects up to 100):

    clientConnection:
      burst: 100
      qps: 100
  4. Re-generate the support bundle after applying these changes.

For more details, see Creating a Support Bundle - Rate Limit Error.


Troubleshooting: Collecting Logs When Pods Are Crashing

If a pod is in a crash loop, standard kubectl logs shows the current (often empty) container. To get logs from the previous container:

kubectl logs <pod-name> -n <namespace> --previous > <service>-previous.log

If you need to restart a pod but want to preserve logs first:

# Capture logs BEFORE restarting
kubectl logs <pod-name> -n <namespace> --tail=1000000 > <service>-pre-restart.log

# Then restart
kubectl delete pod <pod-name> -n <namespace> --now
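
The two captures above can be combined into one step so that neither is forgotten before a restart. A minimal sketch using only the kubectl logs flags shown in this section; the function name save_pod_logs and the output file names are hypothetical.

```shell
# Hypothetical helper: save current and previous container logs for a pod
# before it is restarted. Writes <pod>-current.log and, when available,
# <pod>-previous.log in the current directory.
save_pod_logs() {
    local pod="$1" ns="$2"
    kubectl logs "$pod" -n "$ns" --tail=1000000 > "${pod}-current.log"
    # --previous fails if the container has never restarted; keep going.
    if ! kubectl logs "$pod" -n "$ns" --previous > "${pod}-previous.log" 2>/dev/null; then
        rm -f "${pod}-previous.log"
        echo "no previous container logs for ${pod}"
    fi
}
```

For example: save_pod_logs <pod-name> <namespace>, followed by the kubectl delete pod command above once the files have been written.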

Submitting a Support Ticket

When contacting CircleCI Support, include the following for fastest resolution:

Item | Details
Support bundle | Collected within 10 minutes of the issue
CircleCI Server version | e.g., 4.5.3
Job ID and URL | The specific job experiencing the issue
Timestamps | When the issue occurred (with timezone)
Error messages | Exact error text from UI or logs
Issue-specific logs | See the relevant section above for which logs to collect
Recent changes | Upgrades, config changes, infrastructure changes, new projects

Tips for effective log collection:

  • Reproduce the issue if possible — then immediately collect the bundle
  • Capture logs before restarting pods — restarting pods will lose the current log history
  • Include the full job URL — this helps us quickly identify the pipeline, workflow, and job
  • Note any recent changes — upgrades, config changes, infrastructure changes, or new projects added
  • If multiple executor types are failing simultaneously, this typically points to a shared backend issue rather than an executor-specific problem — collect the cascade logs described above
