CircleCI Infrastructure Failures: Detection and Handling

What Are Infrastructure Failures?

Infrastructure failures occur when a job cannot complete due to issues with the underlying execution environment rather than problems in your code or configuration. CircleCI automatically distinguishes these from user errors to provide better diagnostics and enable automatic recovery through retries.

Infrastructure Failure vs. User Error

Infrastructure Failures:

  • Problems with the execution environment (container, VM, or runner)
  • Network connectivity issues during job setup
  • Resource allocation or provisioning failures
  • Communication loss between CircleCI and the execution environment

User Errors:

  • Test failures or build errors in your code
  • Misconfigured commands or invalid syntax
  • Missing dependencies or incorrect environment variables
  • Out-of-memory errors from your application

How CircleCI Detects Infrastructure Failures

CircleCI uses four primary detection mechanisms:

Failure Type | Description | What You'll See | Typical Cause
Startup Timeout | Job doesn't start within 10 minutes | "Job failed to start within the expected timeframe" | Resource constraints, provisioning delays
Heartbeat Timeout | No communication for 5+ minutes | "Lost connection to job executor" | Network issues, executor crash
Task Infrastructure Failure | Executor reports infrastructure problem | "Infrastructure error detected during job execution" | Container/VM issues, disk failures
Explicit Infrastructure Failure | Direct failure signal from executor | Specific error message from the system | Critical system errors requiring immediate failure
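
The four mechanisms can be read as a small decision procedure. The Python sketch below is illustrative only: the JobObservation fields, the label strings, and the order of the checks are assumptions made for this example and do not describe CircleCI's actual implementation. Only the 10-minute and 5-minute limits come from the descriptions above.

```python
from dataclasses import dataclass
from typing import Optional

STARTUP_TIMEOUT_S = 10 * 60   # job must start within 10 minutes
HEARTBEAT_TIMEOUT_S = 5 * 60  # 5+ minutes of silence counts as a lost connection


@dataclass
class JobObservation:
    """Hypothetical snapshot of what is known about one job (names are assumptions)."""
    seconds_since_queued: float
    started: bool
    seconds_since_last_heartbeat: Optional[float] = None
    executor_reported_infra_problem: bool = False
    explicit_failure_message: Optional[str] = None


def classify(obs: JobObservation) -> Optional[str]:
    """Return a failure-type label, or None if nothing looks wrong yet."""
    # Direct signals from the executor are checked first (assumed ordering).
    if obs.explicit_failure_message is not None:
        return "explicit_infrastructure_failure"
    if obs.executor_reported_infra_problem:
        return "task_infrastructure_failure"
    # A job that has not started is only subject to the startup limit.
    if not obs.started:
        if obs.seconds_since_queued >= STARTUP_TIMEOUT_S:
            return "startup_timeout"
        return None
    # A running job is subject to the heartbeat limit.
    silence = obs.seconds_since_last_heartbeat
    if silence is not None and silence >= HEARTBEAT_TIMEOUT_S:
        return "heartbeat_timeout"
    return None
```

For example, a job still waiting to start after 700 seconds would be labeled a startup timeout in this sketch, while a running job that last reported in 10 seconds ago would return None.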

Executor-Specific Behaviors

Docker Executor

Common Infrastructure Failures:

  • Container fails to start or pull image
  • Docker daemon becomes unresponsive
  • Network connectivity lost during execution

Example Scenario:

# Your job configuration
docker:
  - image: node:14

If the Docker daemon cannot start this container within 10 minutes, the job fails with a startup timeout infrastructure failure.

Machine Executor (Linux/Windows)

Common Infrastructure Failures:

  • VM fails to provision or boot
  • Network configuration fails during setup
  • Machine becomes unresponsive after starting

Example Scenario:

machine:
  image: ubuntu-2204:current

A machine executor job that loses network connectivity after provisioning will be marked as an infrastructure failure.

Remote Docker

Common Infrastructure Failures:

  • Cannot establish connection to remote Docker environment
  • Docker layer pull timeouts
  • Remote Docker daemon crashes

Example Scenario: When using setup_remote_docker, if the remote Docker environment fails to become available, the job fails with an infrastructure error.

Self-Hosted Runners (Container & Machine)

Common Infrastructure Failures:

  • Runner goes offline during job execution
  • Insufficient resources on self-hosted infrastructure
  • Network connectivity issues between runner and CircleCI

Important: For self-hosted runners, infrastructure failures often indicate issues with your infrastructure rather than CircleCI's platform. Check your runner logs and resource availability.

Troubleshooting Guide: Self-Hosted Runner Troubleshooting

Automatic Retry Behavior

How Retries Work

Infrastructure failures trigger automatic retries with these characteristics:

  1. Retry Limit: Maximum of 2 automatic retries (3 total attempts, including the original)
  2. Smart Retry Logic: Only retries when safe and likely to succeed

When Retries Occur

Always Retried (when under limit):

  • Startup timeouts on cloud executors
  • Pre-user-step failures (before your commands run)

Conditionally Retried:

  • Heartbeat timeouts (only if user steps haven't started)
  • Task infrastructure failures (only if user steps haven't started)

Never Retried:

  • Self-hosted runner failures (requires manual intervention)
  • Explicit infrastructure failures with critical errors
  • Jobs that have already used both automatic retries
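
The retry rules above can be summarized as a single decision function. This is a minimal Python sketch of the rules as this article states them, not CircleCI's implementation: the Failure fields and the kind labels are hypothetical names chosen for the example, and any case the article does not cover is assumed not to be retried.

```python
from dataclasses import dataclass

MAX_AUTO_RETRIES = 2  # 3 total attempts, including the original


@dataclass
class Failure:
    """Hypothetical description of one infrastructure failure."""
    kind: str                 # e.g. "startup_timeout", "heartbeat_timeout"
    self_hosted: bool         # True if the job ran on a self-hosted runner
    user_steps_started: bool  # True once any of your commands has begun
    retries_used: int         # automatic retries already consumed
    critical: bool = False    # only meaningful for explicit failures


def should_retry(f: Failure) -> bool:
    # "Never Retried" rules are checked first.
    if f.retries_used >= MAX_AUTO_RETRIES:
        return False
    if f.self_hosted:
        return False
    if f.kind == "explicit_infrastructure_failure" and f.critical:
        return False
    # "Always Retried" (when under the limit).
    if f.kind in ("startup_timeout", "pre_user_step_failure"):
        return True
    # "Conditionally Retried": only if user steps have not started.
    if f.kind in ("heartbeat_timeout", "task_infrastructure_failure"):
        return not f.user_steps_started
    # Anything else is not covered by the article; assume no retry.
    return False
```

Under these assumptions, a heartbeat timeout on a cloud executor is retried only while user steps have not started, and any failure on a self-hosted runner returns False regardless of its type.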

 

Common Real-World Scenarios

Scenario 1: Docker Image Pull Timeout

What happens: Large Docker image takes too long to download
Detection: Startup timeout after 10 minutes
Result: Automatic retry with potential success on second attempt

Scenario 2: Spot Instance Termination

NOTE: This only applies to self-hosted runner customers.

What happens: AWS spot instance terminated during job
Detection: Heartbeat timeout
Result: Automatic retry on different instance

Scenario 3: Runner Out of Disk Space

What happens: Self-hosted runner runs out of disk space
Detection: Task infrastructure failure
Result: Job marked as failed, manual intervention required
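
One way to catch this condition before it fails a job is a periodic free-space check on the runner host. The following Python sketch uses the standard library's shutil.disk_usage; the function name, the default path, and the 10% threshold are assumptions for illustration and are not part of the CircleCI runner.

```python
import shutil
from typing import Tuple


def disk_headroom(path: str = ".", min_free_fraction: float = 0.10) -> Tuple[bool, float]:
    """Return (ok, free_fraction) for the filesystem that contains `path`.

    `ok` is True when at least `min_free_fraction` of the disk is free.
    The 10% default is an arbitrary example threshold.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return free_fraction >= min_free_fraction, free_fraction


if __name__ == "__main__":
    ok, free = disk_headroom()
    status = "OK" if ok else "LOW DISK SPACE"
    print(f"{status}: {free:.1%} free")
```

A script like this could run from a scheduler on the runner host and feed whatever alerting you already use; the path should point at the volume the runner uses for its working directory.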

Scenario 4: Network Partition

What happens: Temporary network issue during job execution
Detection: Heartbeat timeout if user steps haven't started
Result: Automatic retry once connectivity is restored

Best Practices

Minimize Infrastructure Failures

  1. For Cloud Executors:
    • Use smaller Docker images when possible
    • Implement proper timeout configurations
    • Monitor job duration trends
  2. For Self-Hosted Runners:
    • Ensure adequate resources (CPU, memory, disk)
    • Implement monitoring and alerting
    • Keep runners updated to latest versions
    • Maintain stable network connectivity
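
As one possible way to monitor job duration trends, the sketch below compares the average of the most recent runs with the average of the runs just before them. It is a hypothetical helper written for this article: the function name, the window size, and the 25% threshold are assumptions, and the durations are expected to come from whatever job history you already collect.

```python
from statistics import mean
from typing import Sequence


def duration_is_trending_up(
    durations_s: Sequence[float],
    window: int = 5,
    threshold: float = 1.25,
) -> bool:
    """Return True if the last `window` runs are notably slower than the previous `window`.

    `durations_s` is a chronological list of job durations in seconds.
    """
    if len(durations_s) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = mean(durations_s[-2 * window:-window])
    recent = mean(durations_s[-window:])
    return baseline > 0 and recent > threshold * baseline
```

With the defaults, five runs of 100 seconds followed by five runs of 150 seconds would be flagged, while ten runs of 100 seconds would not.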

Configure Retry Settings

version: 2.1

workflows:
  my-workflow:
    max_auto_reruns: 3
    jobs:
      - build
      - test
      - deploy:
          requires:
            - build
            - test

What You See in the UI

When an infrastructure failure occurs:

  1. Job Status: Shows as INFRASTRUCTURE_FAILURE
  2. Error Message: Clear description of the failure type
  3. Retry Indicator: Shows retry attempts (e.g., "Attempt 2 of 3")

 

Getting Help

If you experience frequent infrastructure failures:

  1. For Cloud Executors: Contact CircleCI Support with job URLs
  2. For Self-Hosted Runners: Check runner logs and system resources first
  3. For Platform-Wide Issues: Review the Status Page