What Are Infrastructure Failures?
Infrastructure failures occur when a job cannot complete due to issues with the underlying execution environment rather than problems in your code or configuration. CircleCI automatically distinguishes these from user errors to provide better diagnostics and enable automatic recovery through retries.
Infrastructure Failure vs. User Error
Infrastructure Failures:
- Problems with the execution environment (container, VM, or runner)
- Network connectivity issues during job setup
- Resource allocation or provisioning failures
- Communication loss between CircleCI and the execution environment
User Errors:
- Test failures or build errors in your code
- Misconfigured commands or invalid syntax
- Missing dependencies or incorrect environment variables
- Out-of-memory errors from your application
How CircleCI Detects Infrastructure Failures
CircleCI uses four primary detection mechanisms:
| Failure Type | Description | What You'll See | Typical Cause |
| --- | --- | --- | --- |
| Startup Timeout | Job doesn't start within 10 minutes | "Job failed to start within the expected timeframe" | Resource constraints, provisioning delays |
| Heartbeat Timeout | No communication for 5+ minutes | "Lost connection to job executor" | Network issues, executor crash |
| Task Infrastructure Failure | Executor reports infrastructure problem | "Infrastructure error detected during job execution" | Container/VM issues, disk failures |
| Explicit Infrastructure Failure | Direct failure signal from executor | Specific error message from the system | Critical system errors requiring immediate failure |
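The table can be read as a small classification routine. The following Python sketch is illustrative only and is not CircleCI's implementation: the `JobSnapshot` fields, the returned labels, and the order in which the checks run are all assumptions made for the example. Only the 10-minute and 5-minute thresholds come from the table.

```python
from dataclasses import dataclass
from typing import Optional

STARTUP_TIMEOUT_S = 10 * 60    # "Job doesn't start within 10 minutes"
HEARTBEAT_TIMEOUT_S = 5 * 60   # "No communication for 5+ minutes"


@dataclass
class JobSnapshot:
    """Hypothetical view of a job at one moment; every field is illustrative."""
    started: bool                        # has the executor begun running the job?
    seconds_since_queued: float
    seconds_since_heartbeat: float
    executor_reported_infra_error: bool = False
    explicit_failure_message: Optional[str] = None


def detect_failure(job: JobSnapshot) -> Optional[str]:
    """Return a failure type from the table, or None if the job looks healthy."""
    # Assumed ordering: direct signals from the executor win over timeouts.
    if job.explicit_failure_message is not None:
        return "explicit_infrastructure_failure"
    if job.executor_reported_infra_error:
        return "task_infrastructure_failure"
    if not job.started and job.seconds_since_queued >= STARTUP_TIMEOUT_S:
        return "startup_timeout"
    if job.started and job.seconds_since_heartbeat >= HEARTBEAT_TIMEOUT_S:
        return "heartbeat_timeout"
    return None
```

For example, `detect_failure(JobSnapshot(started=False, seconds_since_queued=601, seconds_since_heartbeat=0))` classifies the job as a startup timeout, while a running job whose last heartbeat was 30 seconds ago returns `None`.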
Executor-Specific Behaviors
Docker Executor
Common Infrastructure Failures:
- Container fails to start or pull image
- Docker daemon becomes unresponsive
- Network connectivity lost during execution
Example Scenario:
# Your job configuration
docker:
  - image: node:14
If the container for this image cannot be started within 10 minutes, the job fails with a startup timeout infrastructure failure.
Machine Executor (Linux/Windows)
Common Infrastructure Failures:
- VM fails to provision or boot
- Network configuration fails during setup
- Machine becomes unresponsive after starting
Example Scenario:
machine:
  image: ubuntu-2204:current
A machine executor job that loses network connectivity after provisioning will be marked as an infrastructure failure.
Remote Docker
Common Infrastructure Failures:
- Cannot establish connection to remote Docker environment
- Docker layer pull timeouts
- Remote Docker daemon crashes
Example Scenario: When using setup_remote_docker, if the remote Docker environment fails to become available, the job fails with an infrastructure error.
Self-Hosted Runners (Container & Machine)
Common Infrastructure Failures:
- Runner goes offline during job execution
- Insufficient resources on self-hosted infrastructure
- Network connectivity issues between runner and CircleCI
Important: For self-hosted runners, infrastructure failures often indicate issues with your infrastructure rather than CircleCI's platform. Check your runner logs and resource availability.
Troubleshooting Guide: Self-Hosted Runner Troubleshooting
Automatic Retry Behavior
How Retries Work
Infrastructure failures trigger automatic retries with these characteristics:
- Retry Limit: Maximum of 2 automatic retries (3 total attempts, including the original)
- Smart Retry Logic: Retries only when a rerun is safe and likely to succeed, which generally means your own steps have not started running
When Retries Occur
Always Retried (when under limit):
- Startup timeouts on cloud executors
- Pre-user-step failures (before your commands run)
Conditionally Retried:
- Heartbeat timeouts (only if user steps haven't started)
- Task infrastructure failures (only if user steps haven't started)
Never Retried:
- Self-hosted runner failures (requires manual intervention)
- Explicit infrastructure failures with critical errors
- After 2 retry attempts have been exhausted
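The rules above can be condensed into a single decision function. This is a minimal Python sketch of the lists in this section, not CircleCI's actual retry logic: the `Failure` type, its field names, and the `kind` labels are hypothetical. It also assumes that any case the lists do not cover (such as a non-critical explicit failure after your steps have started) is not retried.

```python
from dataclasses import dataclass

MAX_AUTO_RETRIES = 2  # 3 total attempts, including the original


@dataclass
class Failure:
    """Hypothetical description of one infrastructure failure."""
    kind: str                 # "startup_timeout", "heartbeat_timeout",
                              # "task_infrastructure_failure", or "explicit"
    self_hosted: bool         # the job ran on a self-hosted runner
    user_steps_started: bool  # at least one of your own steps had begun
    critical: bool = False    # explicit failure flagged as critical
    retries_used: int = 0     # automatic retries already consumed


def should_retry(f: Failure) -> bool:
    # "Never Retried" rules are checked first.
    if f.retries_used >= MAX_AUTO_RETRIES:
        return False
    if f.self_hosted:
        return False
    if f.kind == "explicit" and f.critical:
        return False
    # "Always Retried": startup timeouts on cloud executors.
    if f.kind == "startup_timeout":
        return True
    # Pre-user-step failures are retried; heartbeat timeouts and task
    # infrastructure failures are retried only under that same condition.
    return not f.user_steps_started
```

With this sketch, a heartbeat timeout on a cloud executor is retried if it happens during setup (`user_steps_started=False`) and left as a failure once your commands have begun running.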
Common Real-World Scenarios
Scenario 1: Docker Image Pull Timeout
What happens: Large Docker image takes too long to download
Detection: Startup timeout after 10 minutes
Result: Automatic retry with potential success on second attempt
Scenario 2: Spot Instance Termination
NOTE: This only applies to self-hosted runner customers
What happens: AWS spot instance terminated during job
Detection: Heartbeat timeout
Result: Automatic retry on different instance
Scenario 3: Runner Out of Disk Space
What happens: Self-hosted runner runs out of disk space
Detection: Task infrastructure failure
Result: Job marked as failed, manual intervention required
Scenario 4: Network Partition
What happens: Temporary network issue during job execution
Detection: Heartbeat timeout if user steps haven't started
Result: Automatic retry once connectivity is restored
Best Practices
Minimize Infrastructure Failures
For Cloud Executors:
- Use smaller Docker images when possible
- Implement proper timeout configurations
- Monitor job duration trends
For Self-Hosted Runners:
- Ensure adequate resources (CPU, memory, disk)
- Implement monitoring and alerting
- Keep runners updated to latest versions
- Maintain stable network connectivity
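For self-hosted runners, one simple form of resource monitoring is a periodic disk-headroom check on the runner host. The sketch below uses only the Python standard library. The function name and the 10% threshold are arbitrary choices for illustration, not CircleCI recommendations, and the check is not part of the runner itself.

```python
import shutil


def disk_headroom_ok(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """Return True if the filesystem holding `path` has enough free space.

    `min_free_fraction` is an illustrative threshold; tune it to the size
    of the images and build artifacts your jobs produce.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return False                 # cannot judge headroom on an empty report
    return usage.free / usage.total >= min_free_fraction


if __name__ == "__main__":
    # Example: check the working directory's filesystem and report the result.
    print("disk headroom ok:", disk_headroom_ok("."))
```

A check like this could be run from cron or an existing monitoring agent and wired to an alert, so a runner that is running low on disk is drained before jobs start failing.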
Configure Retry Settings
version: 2.1
workflows:
  my-workflow:
    max_auto_reruns: 3
    jobs:
      - build
      - test
      - deploy:
          requires:
            - build
            - test
What You See in the UI
When an infrastructure failure occurs:
- Job Status: Shows as INFRASTRUCTURE_FAILURE
- Error Message: Clear description of the failure type
- Retry Indicator: Shows retry attempts (e.g., "Attempt 2 of 3")
Getting Help
If you experience frequent infrastructure failures:
- For Cloud Executors: Contact CircleCI Support with job URLs
- For Self-Hosted Runners: Check runner logs and system resources first
- For Platform-Wide Issues: Review the CircleCI Status Page