This guide helps you collect the right diagnostic information when troubleshooting issues in your CircleCI self-hosted server environment. Providing complete logs upfront significantly reduces resolution time by eliminating back-and-forth requests.
Important: Timely Log Collection
CircleCI logs are retained for a limited time and log rotation may cause critical information to be lost. Collect support bundles and relevant logs within 10 minutes of the issue occurring to prevent loss of relevant data.
Quick Reference
| Issue Type | Essential Logs | Section |
|---|---|---|
| Docker Executor (Job Delay, Infra Fail) | Support bundle + Nomad alloc logs | Docker Executor Issues |
| Machine Executor (Job Delay, Infra Fail) | Support bundle + Journalctl + machine-provisioner logs | Machine Executor Issues |
| Runner (Infra Fail, Task Claim) | Support bundle + Runner pod/service logs | Runner Issues |
| Nomad Server / Scheduling | Support bundle + Nomad server debug logs | Nomad Server Issues |
| Nomad Autoscaler / Scaling | Support bundle + nomad-autoscaler logs | Nomad Autoscaler Issues |
| Nomad Client / Docker Network | Support bundle + Nomad client logs + `docker network ls` | Nomad Client Issues |
| Permission / Cloud Provider Errors | Support bundle + Cloud provider error messages | Cloud Provider Permission Issues |
| API Connection / Webhook | Support bundle + API request logs with `-vvv` | API Connection Issues |
| Custom Integration (VCS, Proxy) | Support bundle + Integration logs with `-vvv` | Custom Integration Issues |
| Widespread Infra Failures (Cascade) | Support bundle + distributor/execution-gateway/contexts-service logs | Cascade / Widespread Infra Failures |
Initial Diagnostics
1. Support Bundle Collection (REQUIRED for all issues)
For every issue, start by collecting a support bundle:
    kubectl support-bundle \
      https://raw.githubusercontent.com/CircleCI-Public/server-scripts/main/support/support-bundle.yaml \
      -n <namespace>
Replace <namespace> with your CircleCI Server namespace (commonly circleci-server).
Prerequisites:
- `kubectl` access to the cluster/namespace
- Krew installed
- `support-bundle` kubectl plugin installed (`kubectl krew install support-bundle`)
Important notes:
- If you receive a timeout or rate limiter error, the bundle may still contain valuable information. See Troubleshooting: Rate Limiter Error below.
- Support bundles have a default limit of 10,000 lines per pod log. If your pods are generating high log volume, critical error messages may be pushed out before the bundle captures them. Generate the bundle immediately after reproducing the issue.
- If the issue occurred in the past and you cannot reproduce it, generate a bundle anyway — it still captures cluster state, pod status, and recent logs.
- Bundles do not contain job execution network detail (e.g., outbound connections from inside a Docker executor). For network-level forensics, rely on step logs, VPC Flow Logs, or proxy logs instead.
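Given the 10,000-line cap mentioned above, it can help to check which pod logs in an extracted bundle reach that length, since those are the ones most likely to have lost earlier lines. The following is a minimal sketch, not part of the support-bundle tooling. It assumes the bundle has been extracted to a local directory and that pod logs are files ending in `.log`; the function name `find_capped_logs` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list log files whose line count reaches the cap,
# which suggests earlier lines may be missing from the bundle.
# Usage: find_capped_logs <extracted-bundle-dir> [line-cap]
find_capped_logs() {
  local dir="$1"
  local cap="${2:-10000}"   # default cap taken from the note above
  local f lines
  find "$dir" -type f -name '*.log' | while IFS= read -r f; do
    # tr strips the padding that some wc implementations add
    lines=$(wc -l < "$f" | tr -d ' ')
    if [ "$lines" -ge "$cap" ]; then
      printf '%s\t%s\n' "$lines" "$f"
    fi
  done
}
```

For example, `find_capped_logs ./support-bundle` prints one line count and path per capped log. For any service listed, consider collecting logs directly with `kubectl logs` as well.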
2. Retrieving Job Details (IMPORTANT)
For job-specific issues, collect the job details via the API:
    curl -H "Circle-Token:${CIRCLE_TOKEN}" -s \
      "https://<your-server-domain>/api/v1.1/project/<vcs-type>/<org>/<project>/<job-number>" \
      | tee job-details.json
This provides step timing details, build parameters, start/completion times, and job history.
3. Server Version Information
When submitting a support ticket, always include:
- CircleCI Server version (e.g., `4.5.3`)
- Service image tags if known. You can retrieve these with:
      kubectl get deploy -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
4. Support Bundle Log Triage (What to Search)
Once you have the support bundle, these are the most useful log files and search patterns for common issues:
| Service Log (inside bundle) | What to Search For | Issue Type |
|---|---|---|
| `distributor-external` | `infra_fail`, `INFRASTRUCTURE_FAIL`, `time_in_queue`, `context deadline exceeded` | Job failures, queueing delays |
| `execution-gateway-api` (`api.log`) | `gateway end`, `CompleteTask`, `task/config`, `timeout` | Task lifecycle failures |
| `contexts-service` | `HikariPool`, `connection timeout`, `vault`, `gRPC` | Context fetch failures (cascade) |
| `machine-provisioner` (`externalapi.log`) | `infra_fail`, `UnauthorizedOperation`, `No space left` | Machine executor provisioning |
| `nomad-server` | `rpc error`, `failed to get conn`, `lead thread` | Nomad scheduling / leader |
| `nomad-autoscaler` | `expected 1 Autoscaling Group, got 0`, `policy` | Autoscaler misconfiguration |
| `runner-admin` | `task claim`, `infra_fail` | Runner task failures |
| `legacy-notifier` | `infra_fail` | Notification pipeline |
| `distributor-cleaner` | `infra_fail` | Job cleanup failures |
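The searches in the table above can be run in one pass over an extracted bundle. The sketch below is illustrative only: it assumes the bundle is extracted to a local directory, uses a subset of the search strings from the table, and matches them as case-sensitive fixed strings. The function name `triage_bundle` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count lines matching each triage pattern under a directory.
# Usage: triage_bundle <extracted-bundle-dir>
triage_bundle() {
  local dir="$1"
  local pattern count
  # A subset of the search strings from the table, one per line.
  while IFS= read -r pattern; do
    # -r: recurse, -h: omit file names, -F: treat the pattern as a fixed string
    count=$(grep -rhF -- "$pattern" "$dir" | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$count" "$pattern"
  done <<'EOF'
infra_fail
context deadline exceeded
HikariPool
UnauthorizedOperation
No space left
failed to get conn
expected 1 Autoscaling Group, got 0
EOF
}
```

Running `triage_bundle ./support-bundle` prints a match count next to each pattern. A non-zero count points to the matching row of the table as a starting point; it is not a diagnosis by itself.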
Docker Executor Issues
If experiencing delays between job steps, infrastructure failures, or job timeouts:
- Collect a support bundle immediately (see above).
- Get the specific job ID from the CircleCI UI or API.
- Check the Nomad allocation:
      kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
        -n <namespace> -- nomad status <job-id>
- Examine allocation logs for the specific job:
      kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
        -n <namespace> -- nomad alloc logs -stderr <allocation-id>
- Check distributor for queueing delays: search the distributor logs for the `time_in_queue` metric associated with the job. High values indicate jobs are waiting for available Nomad capacity.
- For comprehensive logging of all running jobs, use the CircleCI support script to continuously capture Nomad job state and container logs:
  CircleCI Support Scripts - Server Docker Executor Logger
Alternatively, you can use this inline script to capture repeated snapshots of all running Nomad jobs:
    #!/bin/bash
    # Repeatedly snapshot the status, stderr, and container logs of every Nomad job.
    NS="<namespace>"
    NOMAD_POD=$(kubectl get pods -l app=nomad-server -n "$NS" -o jsonpath='{.items[0].metadata.name}')
    mkdir -p ba-logs
    while :; do
      # List job IDs, skipping the header row of `nomad status`.
      kubectl exec "$NOMAD_POD" -n "$NS" -- nomad status | tail -n +2 | awk '{ print $1 }' | while read -r job; do
        date_dir=$(date +%s)
        mkdir -p "ba-logs/${date_dir}/${job}"
        kubectl exec "$NOMAD_POD" -n "$NS" -- nomad status "${job}" > "ba-logs/${date_dir}/${job}/status.txt"
        kubectl exec "$NOMAD_POD" -n "$NS" -- nomad logs -stderr -job "${job}" > "ba-logs/${date_dir}/${job}/stderr.txt"
        # Skip the first 17 lines of the job status to reach the allocation rows
        # (this offset depends on the output layout).
        kubectl exec "$NOMAD_POD" -n "$NS" -- nomad status "${job}" | tail -n +18 | awk '{ print $1 }' | while read -r alloc; do
          kubectl exec "$NOMAD_POD" -n "$NS" -- nomad alloc exec "${alloc}" docker ps -a > "ba-logs/${date_dir}/${job}/docker-ps.txt"
          kubectl exec "$NOMAD_POD" -n "$NS" -- nomad alloc exec "${alloc}" docker ps -a | tail -n +2 | awk '{ print $1 }' | while read -r cid; do
            kubectl exec "$NOMAD_POD" -n "$NS" -- nomad alloc exec "${alloc}" docker logs "$cid" > "ba-logs/${date_dir}/${job}/${cid}.txt"
          done
        done
      done
      # Remove files older than one day, then any directories left empty.
      find ba-logs -type f -mtime +1 -exec rm {} \;
      find ba-logs -mindepth 1 -type d -empty -delete
      echo "Snapshot captured at $(date)"
      sleep 1
    done
Note: Replace `<namespace>` with your CircleCI Server namespace before running.
Machine Executor Issues
For issues with machine executors (delays, infra failures, provisioning errors):
- Collect a support bundle immediately (within 10 minutes of the issue).
- Add a background logging step to your CircleCI config to capture system logs during job execution:
      jobs:
        your-job-name:
          machine: true
          steps:
            # Your regular job steps here
            - run:
                name: Retrieve system logs
                command: journalctl --no-pager -f
                background: true
                when: always
- Collect machine provisioner logs:
      kubectl logs -l app=machine-provisioner-provisioner -n <namespace> > machine-provisioner-logs.txt
- Look for common error patterns in the logs:
  - Disk space: `No space left on device`
  - Network: `Connection timed out`
  - Cloud provider permissions: `UnauthorizedOperation`
  - Resources: `Cannot allocate memory`
  - Instance provisioning: `dpkg lock` or `unattended-upgrades` (race condition at boot; see Nomad Client Issues)
Runner Issues
For issues with self-hosted runners (task claim failures, infrastructure failures):
Note: Container runners and machine runners do not use Nomad for job execution. However, they share upstream services with the Docker executor path (distributor, Postgres, RabbitMQ, etc.). This means a backend service outage can cause correlated failures across both Docker executor and runner jobs simultaneously.
- Collect a support bundle immediately.
- Collect runner service logs:
      kubectl logs -l app=runner-admin -n <namespace> --tail=5000 > runner-admin-logs.txt
- Collect runner agent logs from the machine or pod running the runner:
  - For container runner:
        kubectl logs -l app=container-agent -n <runner-namespace> --tail=5000 > container-agent-logs.txt
  - For machine runner: Check the runner agent log file (default location varies by OS)
- If using container runner, generate a container runner support bundle as well:
      kubectl support-bundle \
        https://raw.githubusercontent.com/CircleCI-Public/circleci-support-scripts/refs/heads/main/container-runner-support-bundle/support-bundle.yaml
Nomad Server Issues
For Nomad scheduling problems, leader election failures, or RPC errors:
- Collect a support bundle immediately.
- Enable debug-level logging on Nomad server if the issue is intermittent:
  - Set `log_level` for the Nomad server to `DEBUG` or `TRACE` via your Helm values
  - Reproduce the issue
  - Collect the support bundle
- Manually capture Nomad server logs (especially if the pod is about to restart or be deleted):
      kubectl logs deploy/nomad-server -n <namespace> --tail=1000000 > nomad-server.log
- Check for common Nomad errors:
  - RPC failures: `rpc error: failed to get conn`
  - Leader issues: `lead thread didn't get connection`
  - Eval failures: `failed to update evaluation`
  - Connection limits: `rpc_max_conns_per_client` (if many Nomad clients overwhelm server)
- RPC connection limits: If you see RPC connection errors at scale, note that `rpc_max_conns_per_client` and `http_max_conns_per_client` are Nomad server settings. Changing them requires only a Nomad server pod restart (via Helm rollout); Nomad clients will reconnect automatically without requiring a client-side restart.
Nomad Autoscaler Issues
For problems with Nomad client scaling (nodes not scaling up/down, drain storms, ASG issues):
- Collect a support bundle immediately.
- Collect nomad-autoscaler logs:
      kubectl logs -l app=nomad-autoscaler -n <namespace> --tail=5000 > nomad-autoscaler-logs.txt
- Common autoscaler issues:
  - `expected 1 Autoscaling Group, got 0`: The autoscaler cannot find the configured ASG. Verify the ASG name in your autoscaler policy matches the actual AWS ASG name and that IAM permissions allow `autoscaling:DescribeAutoScalingGroups`.
  - `min > max` in policy: If `min` is set higher than `max`, the policy fails validation and is not loaded, meaning the autoscaler is effectively broken for that resource class. Check the autoscaler ConfigMap: `kubectl get configmap -l app=nomad-autoscaler -n <namespace> -o yaml`
  - Drain storms (mass node draining): On older autoscaler versions, scale-in can drain too many nodes at once. Upgrade to autoscaler v0.4.6+ (Server 4.8.x+), which supports `max_scale_down` to limit how many nodes drain simultaneously. Also consider `node_filter_ignore_drain` to prevent cascading drain decisions.
  - Node selection for scale-in: The `node_selector_strategy` setting (e.g., `empty`, `empty_ignore_system`) controls which nodes are drained first. This is in the autoscaler `target` block, separate from CPU/memory checks.
- Verify ASG and node health:
      # Check current Nomad nodes and their status
      kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
        -n <namespace> -- nomad node status
      # Check if nodes match expected resource classes
      kubectl exec -it $(kubectl get pods -l app=nomad-server -n <namespace> -o name | head -1) \
        -n <namespace> -- nomad node status -verbose <node-id> | grep -i class
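The `min > max` condition described above is easy to check by hand once the two values have been read from the autoscaler ConfigMap. The sketch below only compares two integers that you supply; it does not parse the ConfigMap, and the function name `check_scaling_bounds` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: validate that a policy's min does not exceed its max.
# Usage: check_scaling_bounds <min> <max>
# Returns 0 when the bounds are valid, 1 otherwise.
check_scaling_bounds() {
  local min="$1" max="$2"
  if [ "$min" -gt "$max" ]; then
    echo "invalid: min ($min) > max ($max)"
    return 1
  fi
  echo "ok: min ($min) <= max ($max)"
}
```

For example, `check_scaling_bounds 5 3` reports the bounds as invalid and returns a non-zero status, which matches the case where the policy would fail validation and not be loaded.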
Nomad Client Issues
For problems with Nomad client nodes (EC2/ASG instances running Docker executor jobs):
- Collect a support bundle immediately.
- Docker network `ci-privileged` missing: If `docker network prune` was run on a Nomad client, the `ci-privileged` bridge network (required for Docker executor jobs) will be removed. This causes all jobs on that node to fail.
  - Why this happens: `ci-privileged` is created during instance bootstrap (cloud-init / user-data), which runs only on first boot. A simple reboot does not recreate it.
  - Recovery option A (recommended): Terminate the instance and let the ASG replace it. A new instance will run cloud-init and recreate the network.
  - Recovery option B (manual):
        docker network create --label keep --driver=bridge \
          --opt com.docker.network.bridge.name=ci-privileged ci-privileged
        systemctl restart docker-gc
  - Prevention: Never run `docker network prune` on Nomad client instances. The `keep` label is on the network, but `docker network prune` ignores labels.
- Bootstrap race condition (`unattended-upgrades`): On Ubuntu-based Nomad clients, the `unattended-upgrades` service can hold a dpkg lock at boot, causing the startup `apt-get` commands to fail. This results in a Nomad client that never joins the cluster.
  - Check for `dpkg lock` or `Could not get lock` in instance system logs
  - Workarounds: disable `unattended-upgrades` in the AMI, or reprovision the instance
- Collect Nomad client logs from the instance if accessible:
      # If you have SSH access to the Nomad client EC2 instance
      journalctl -u nomad --no-pager > nomad-client.log
      docker network ls > docker-networks.txt
      docker ps -a > docker-containers.txt
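Once `docker network ls` output has been saved, a quick check can confirm whether `ci-privileged` is present on that node. This is a minimal sketch that assumes the default column layout of `docker network ls` (NETWORK ID, NAME, DRIVER, SCOPE); the function name `has_docker_network` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report whether a network name appears in saved
# `docker network ls` output.
# Usage: has_docker_network <saved-output-file> [network-name]
has_docker_network() {
  local file="$1"
  local name="${2:-ci-privileged}"
  # In data rows the NAME column is the second field; NR > 1 skips the header.
  awk -v n="$name" 'NR > 1 && $2 == n { found = 1 } END { exit found ? 0 : 1 }' "$file"
}
```

A possible use is `has_docker_network docker-networks.txt || echo "ci-privileged is missing on this node"`, which would point to the recovery options listed above.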
Cascade / Widespread Infra Failures
When many or all jobs fail simultaneously with infrastructure_fail, the root cause is typically a shared backend service under stress. The most common pattern involves the contexts-service:
Request path: build-agent → distributor → execution-gateway
- Collect a support bundle immediately.
- Collect targeted service logs:
      # Both services in the critical path
      kubectl logs -l app=execution-gateway -n <namespace> --tail=10000 > execution-gateway-logs.txt
      kubectl logs -l app=distributor-external -n <namespace> --tail=10000 > distributor-external-logs.txt
- What to look for:
  - execution-gateway: `context deadline exceeded` on downstream calls, `/api/v2/task/config` timeouts
  - distributor: `gateway end: httpclient do: context deadline exceeded`, which indicates the execution-gateway was slow or unresponsive
- Understanding retry amplification: A single slow downstream service can cause cascading timeouts:
  - `build-agent` has a ~30s budget, retrying via `ex/httpclient`
  - `distributor` → `execution-gateway` contexts call has a ~5s budget with ~2s per attempt
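To make the amplification concrete, the approximate budgets above can be turned into a rough upper bound. This is a back-of-the-envelope sketch, not a description of the actual retry logic: it assumes attempts run back-to-back with no backoff, and that the outer caller starts a new cycle each time the inner budget is exhausted. The function name `max_attempts` is hypothetical.

```shell
#!/bin/bash
# Hypothetical estimate: maximum attempts that can start within a time budget.
# Usage: max_attempts <budget-seconds> <seconds-per-attempt>
max_attempts() {
  local budget="$1" per_attempt="$2"
  # Ceiling division: another attempt can start while any budget remains.
  echo $(( (budget + per_attempt - 1) / per_attempt ))
}

inner=$(max_attempts 5 2)    # ~5s budget, ~2s per attempt
outer=$(max_attempts 30 5)   # ~30s budget, one inner cycle per ~5s
echo "inner attempts: $inner, outer cycles: $outer, worst-case downstream calls: $(( inner * outer ))"
```

Under these assumptions a single job could generate up to 18 downstream calls against a service that is already slow, which is why one degraded backend can surface as failures across many jobs at once.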
Cloud Provider Permission Issues
When encountering cloud provider (AWS, GCP, Azure) errors:
- Collect a support bundle immediately.
- For AWS, decode authorization error messages:
      aws sts decode-authorization-message --encoded-message "<encoded_message>"
- Check CloudTrail logs for denied actions:
      # Search for errors related to a specific IAM role
      aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=Username,AttributeValue=<role-name> \
        --max-items 100
      # Check for errors in recent events
      aws cloudtrail lookup-events \
        --start-time $(date -u -d "1 hour ago" +"%Y-%m-%dT%H:%M:%SZ") \
        --query "Events[?contains(CloudTrailEvent, 'errorCode') || contains(CloudTrailEvent, 'errorMessage')]"
- Verify IAM policies for:
  - Correct resource ARNs (check account IDs, regions)
  - Service control policies (look for explicit denies)
  - Cross-account access permissions
API Connection Issues
If experiencing issues with API connections, webhook failures, or approval delays:
- Collect a support bundle immediately.
- Capture API response details with verbose output:
      curl -vvv -X POST "https://<your-server-domain>/api/v2/workflow/<workflow-id>/approve/<approval-request-id>"
- Check Nginx/ingress logs:
      kubectl logs -l app=nginx -n <namespace> --tail=500 > nginx-logs.txt
- Look for specific HTTP response codes:
  - `404`: the resource may not be ready yet
  - `403`: permission issues (check token validity, presigned URL expiry)
  - `5xx`: backend service errors
  - Slow responses (>1s): backend processing delays
- For webhook or approval timing issues, capture timestamps of:
  - Job completion events in logs
  - API call attempts
  - Webhook delivery attempts
- S3 presigned URL 403 errors: If test results or artifacts return `403 Forbidden`, the presigned URL may have expired (default TTL is 900 seconds / 15 minutes). Refreshing the page or re-fetching the resource generates a new presigned URL.
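When correlating a `403 Forbidden` with presigned URL expiry, it helps to compare the time the URL was issued with the time of the failed request. The sketch below does only that arithmetic, using the 900-second default mentioned above. Both timestamps are supplied by you as Unix epoch seconds, and the function name `presigned_url_expired` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: decide whether a presigned URL has outlived its TTL.
# Usage: presigned_url_expired <issued-epoch> <request-epoch> [ttl-seconds]
# Returns 0 (true) when the URL had expired at the time of the request.
presigned_url_expired() {
  local issued="$1" now="$2"
  local ttl="${3:-900}"   # default TTL from the text above
  [ $(( now - issued )) -ge "$ttl" ]
}
```

For example, `presigned_url_expired "$issued_at" "$(date +%s)" && echo "expired; re-fetch the resource"`, where `issued_at` is a placeholder for the time the page or API response was loaded.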
Custom Integration Issues
For issues with VCS integrations (GitHub Enterprise, Bitbucket Data Center), proxy setups, or custom networking:
- Collect a support bundle immediately.
- For VCS integration issues:
  - Capture webhook delivery logs from your VCS provider's UI
  - Check TLS certificate configuration and expiry
  - Verify network connectivity between CircleCI and your VCS
- For proxy integrations:
  - Collect complete request/response cycles including headers
  - Log both incoming and outgoing payloads (if possible)
  - Verify that signatures and headers are preserved through the proxy
Monitoring and Observability
Proactive monitoring helps catch issues before they escalate. If you have Prometheus available in your cluster:
- Nomad metrics: Expose Nomad server metrics for Prometheus scraping by adding a scrape target for the Nomad `/v1/metrics` endpoint via a ConfigMap. Key metrics to watch:
  - `nomad_nomad_rpc_request`: RPC request rates and errors
  - `nomad_client_allocs_running`: running allocation count
  - `nomad_nomad_blocked_evals`: scheduling backlog
- Pod resource usage: Monitor CPU and memory for critical services:
  - `contexts-service`: watch for connection pool exhaustion under load
  - `execution-gateway`: watch for high response times
  - `nomad-server`: watch for memory pressure during high concurrency
- Health check endpoints: Most CircleCI services expose health endpoints. Use Kubernetes liveness/readiness probes to detect unresponsive services early.
Troubleshooting: Rate Limiter Error
When generating support bundles, you might encounter:
failed to get log stream: client rate limiter Wait returned an error: context deadline exceeded
This happens when Kubernetes API rate limiting is too restrictive for the bundle collection process.
Solution:
- Check current settings:
      kubectl get configmap kube-proxy-config -n kube-system -o yaml
- Edit the configuration:
      kubectl edit configmap kube-proxy-config -n kube-system
- Update the rate limits (the support bundle plugin expects up to 100):
      clientConnection:
        burst: 100
        qps: 100
- Re-generate the support bundle after applying these changes.
For more details, see Creating a Support Bundle - Rate Limit Error.
Troubleshooting: Collecting Logs When Pods Are Crashing
If a pod is in a crash loop, standard kubectl logs shows the current (often empty) container. To get logs from the previous container:
kubectl logs <pod-name> -n <namespace> --previous > <service>-previous.log
If you need to restart a pod but want to preserve logs first:
    # Capture logs BEFORE restarting
    kubectl logs <pod-name> -n <namespace> --tail=1000000 > <service>-pre-restart.log
    # Then restart
    kubectl delete pod <pod-name> -n <namespace> --now
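The two captures above (current and previous container) can be combined so that neither is forgotten before a restart. This is a sketch under stated assumptions rather than an official script: the function name `save_pod_logs`, the output file names, and the `KUBECTL` override variable are all hypothetical, and only the `kubectl logs` flags already shown in this section are used.

```shell
#!/bin/bash
# Hypothetical helper: save current and previous container logs for a pod.
# KUBECTL can be overridden, for example to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: save_pod_logs <pod-name> <namespace> <output-dir>
save_pod_logs() {
  local pod="$1" ns="$2" out="$3"
  mkdir -p "$out"
  "$KUBECTL" logs "$pod" -n "$ns" --tail=1000000 > "${out}/${pod}-current.log"
  # --previous fails when there is no earlier container; in that case
  # remove the empty file instead of leaving a misleading artifact.
  if ! "$KUBECTL" logs "$pod" -n "$ns" --previous > "${out}/${pod}-previous.log" 2>/dev/null; then
    rm -f "${out}/${pod}-previous.log"
  fi
}
```

A call such as `save_pod_logs <pod-name> <namespace> ./pre-restart-logs` would then be run before deleting the pod.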
Submitting a Support Ticket
When contacting CircleCI Support, include the following for fastest resolution:
| Item | Details |
|---|---|
| Support bundle | Collected within 10 minutes of the issue |
| CircleCI Server version | e.g., `4.5.3` |
| Job ID and URL | The specific job experiencing the issue |
| Timestamps | When the issue occurred (with timezone) |
| Error messages | Exact error text from UI or logs |
| Issue-specific logs | See the relevant section above for which logs to collect |
| Recent changes | Upgrades, config changes, infrastructure changes, new projects |
Tips for effective log collection:
- Reproduce the issue if possible — then immediately collect the bundle
- Capture logs before restarting pods — restarting pods will lose the current log history
- Include the full job URL — this helps us quickly identify the pipeline, workflow, and job
- Note any recent changes — upgrades, config changes, infrastructure changes, or new projects added
- If multiple executor types are failing simultaneously, this typically points to a shared backend issue rather than an executor-specific problem — collect the cascade logs described above