CircleCI Infrastructure Failures: Detection and Handling

What Are Infrastructure Failures?

Infrastructure failures occur when a job cannot complete due to issues with the underlying execution environment rather than problems in your code or configuration. CircleCI automatically distinguishes these from user errors to provide better diagnostics and enable automatic recovery through retries.

Infrastructure Failure vs. User Error

Infrastructure Failures:

Problems with the execution environment (container, VM, or runner)
Network connectivity issues during job setup
Resource allocation or provisioning failures
Communication loss between CircleCI and the execution environment

User Errors:

Test failures or build errors in your code
Misconfigured commands or invalid syntax
Missing dependencies or incorrect environment variables
Out-of-memory errors from your application

How CircleCI Detects Infrastructure Failures

CircleCI uses four primary detection mechanisms:

Failure Type	Description	What You'll See	Typical Cause
Startup Timeout	Job doesn't start within 10 minutes	"Job failed to start within the expected timeframe"	Resource constraints, provisioning delays
Heartbeat Timeout	No communication for 5+ minutes	"Lost connection to job executor"	Network issues, executor crash
Task Infrastructure Failure	Executor reports infrastructure problem	"Infrastructure error detected during job execution"	Container/VM issues, disk failures
Explicit Infrastructure Failure	Direct failure signal from executor	Specific error message from the system	Critical system errors requiring immediate failure
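
The two timeout-based rows of the table can be pictured as simple threshold checks. The Python sketch below is only an illustrative model of those rules, not CircleCI's implementation; the function name detect_timeout and its parameters are hypothetical.

```python
from typing import Optional

# Thresholds taken from the table above.
STARTUP_TIMEOUT_SECONDS = 10 * 60    # job must start within 10 minutes
HEARTBEAT_TIMEOUT_SECONDS = 5 * 60   # no communication for 5+ minutes


def detect_timeout(started: bool,
                   seconds_since_queued: float,
                   seconds_since_last_heartbeat: float) -> Optional[str]:
    """Return the timeout-based failure type, or None if neither applies.

    Hypothetical helper: it models only the two timeout rows of the table.
    """
    if not started:
        # A job that has not started can only hit the startup timeout.
        if seconds_since_queued >= STARTUP_TIMEOUT_SECONDS:
            return "startup_timeout"
        return None
    # Once the job is running, silence from the executor is what matters.
    if seconds_since_last_heartbeat >= HEARTBEAT_TIMEOUT_SECONDS:
        return "heartbeat_timeout"
    return None
```

For example, a job that is still waiting to start 11 minutes after being queued would be classified as a startup timeout under this model, while a running job whose last heartbeat was 10 seconds ago would not be flagged.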

Executor-Specific Behaviors

Docker Executor

Common Infrastructure Failures:

Container fails to start or pull image
Docker daemon becomes unresponsive
Network connectivity lost during execution

Example Scenario:

# Your job configuration
docker:
  - image: node:14

If the Docker daemon cannot start this container within 10 minutes, the job fails with a startup timeout infrastructure failure.

Machine Executor (Linux/Windows)

Common Infrastructure Failures:

VM fails to provision or boot
Network configuration fails during setup
Machine becomes unresponsive after starting

Example Scenario:

machine:
  image: ubuntu-2204:current

A machine executor job that loses network connectivity after provisioning will be marked as an infrastructure failure.

Remote Docker

Common Infrastructure Failures:

Cannot establish connection to remote Docker environment
Docker layer pull timeouts
Remote Docker daemon crashes

Example Scenario: When using setup_remote_docker, if the remote Docker environment fails to become available, the job fails with an infrastructure error.

Self-Hosted Runners (Container & Machine)

Common Infrastructure Failures:

Runner goes offline during job execution
Insufficient resources on self-hosted infrastructure
Network connectivity issues between runner and CircleCI

Important: For self-hosted runners, infrastructure failures often indicate issues with your infrastructure rather than CircleCI's platform. Check your runner logs and resource availability.

Troubleshooting Guide: Self-Hosted Runner Troubleshooting

Automatic Retry Behavior

How Retries Work

Infrastructure failures trigger automatic retries with these characteristics:

Retry Limit: Maximum of 2 automatic retries (3 total attempts, including the original)
Smart Retry Logic: Retries only when it is safe and likely to succeed, typically before any user steps have run

When Retries Occur

Always Retried (when under limit):

Startup timeouts on cloud executors
Pre-user-step failures (before your commands run)

Conditionally Retried:

Heartbeat timeouts (only if user steps haven't started)
Task infrastructure failures (only if user steps haven't started)

Never Retried:

Self-hosted runner failures (requires manual intervention)
Explicit infrastructure failures with critical errors
After 2 retry attempts have been exhausted
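
The retry rules listed above can be read as a small decision function. The following Python sketch is a simplified model for illustration only, not CircleCI's actual logic; the function name should_retry, its parameters, and the failure-type strings are hypothetical, and it assumes every explicit infrastructure failure is critical.

```python
MAX_AUTOMATIC_RETRIES = 2  # 3 total attempts, including the original


def should_retry(failure_type: str,
                 self_hosted: bool,
                 user_steps_started: bool,
                 retries_used: int) -> bool:
    """Model the retry rules listed above (illustrative only)."""
    # Never retried: retry limit exhausted.
    if retries_used >= MAX_AUTOMATIC_RETRIES:
        return False
    # Never retried: self-hosted runner failures need manual intervention.
    if self_hosted:
        return False
    # Never retried: explicit failures (assumed critical in this sketch).
    if failure_type == "explicit":
        return False
    # Always retried on cloud executors while under the limit.
    if failure_type == "startup_timeout":
        return True
    # Retried only if no user step has started yet.
    if failure_type in ("heartbeat_timeout", "task_infrastructure"):
        return not user_steps_started
    # Unknown failure types are not retried in this model.
    return False
```

Under this model, a heartbeat timeout on a cloud executor is retried only while your own commands have not yet begun, which avoids rerunning steps that may already have had side effects.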

Common Real-World Scenarios

Scenario 1: Docker Image Pull Timeout

What happens: Large Docker image takes too long to download
Detection: Startup timeout after 10 minutes
Result: Automatic retry with potential success on second attempt

Scenario 2: Spot Instance Termination

NOTE: This only applies to self-hosted runner customers

What happens: AWS spot instance terminated during job
Detection: Heartbeat timeout
Result: Automatic retry on different instance

Scenario 3: Runner Out of Disk Space

What happens: Self-hosted runner runs out of disk space
Detection: Task infrastructure failure
Result: Job marked as failed, manual intervention required

Scenario 4: Network Partition

What happens: Temporary network issue during job execution
Detection: Heartbeat timeout
Result: Automatic retry once connectivity is restored, provided user steps haven't started

Best Practices

Minimize Infrastructure Failures

For Cloud Executors:
- Use smaller Docker images when possible
- Implement proper timeout configurations
- Monitor job duration trends

For Self-Hosted Runners:
- Ensure adequate resources (CPU, memory, disk)
- Implement monitoring and alerting
- Keep runners updated to latest versions
- Maintain stable network connectivity
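
As one way to monitor resources on a self-hosted runner host, a small script can check free disk space before jobs are accepted. This is a minimal sketch using Python's standard library; the function name has_enough_disk and the 10% default threshold are assumptions chosen for illustration, not CircleCI recommendations.

```python
import shutil


def has_enough_disk(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """Return True if the filesystem holding `path` has enough free space.

    Hypothetical helper; tune the threshold to your own runner workloads.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Pseudo-filesystems can report zero size; treat that as "not enough".
        return False
    return usage.free / usage.total >= min_free_fraction
```

A check like this could run from a scheduled task or an alerting agent on the runner host so that low disk space is noticed before it surfaces as an infrastructure failure.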

Configure Retry Settings

version: 2.1

workflows:
  my-workflow:
    max_auto_reruns: 3
    jobs:
      - build
      - test
      - deploy:
          requires:
            - build
            - test

What You See in the UI

When an infrastructure failure occurs:

Job Status: Shows as INFRASTRUCTURE_FAILURE
Error Message: Clear description of the failure type
Retry Indicator: Shows retry attempts (e.g., "Attempt 2 of 3")

Getting Help

If you experience frequent infrastructure failures:

For Cloud Executors: Contact CircleCI Support with job URLs
For Self-Hosted Runners: Check runner logs and system resources first
Review: Status Page for platform-wide issues
