Nomad Autoscaler 404 Error for GCE Managed Instance Group After Upgrade to CircleCI Server 4.9.x

Overview

After upgrading CircleCI Server to 4.9.x, the following errors appear in the nomad-autoscaler pod logs:

 
[WARN] policy_manager.policy_handler: failed to get target status: policy_id=<policy_id> error="failed to describe GCE Managed Instance Group: googleapi: Error 404: The resource '<resource_name>/prod-nomad' was not found, notFound"

[ERROR] policy_manager.policy_handler: failed to describe GCE Managed Instance Group: googleapi: Error 404: The resource '<resource_name>/prod-nomad' was not found, notFound: policy_id=<policy_id>
    

The autoscaler fails to locate the GCE Managed Instance Group (MIG), and Nomad client scaling stops functioning.

Root Cause

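The issue is caused by a change to the google_compute_instance_group_manager Terraform resource in the server-terraform module, which renamed the MIG. The autoscaler configuration still references the old name, which no longer exists, so the lookup returns a 404.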

Prior to 4.9.0, the MIG name was:

name = "${var.name}-nomad" 
# e.g. "prod-nomad"

From 4.9.0, the MIG name changed to:

name = "${var.name}-nomad-client-group"
# e.g. "prod-nomad-client-group"
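
The naming change above can be sketched as a small shell helper that returns the MIG name to expect for a given base name and Server version. The function name expected_mig_name is hypothetical, and the sketch assumes that releases after 4.9.0 keep the new suffix and that the version is given in numeric MAJOR.MINOR[.PATCH] form. Always confirm the real name against your Terraform state.

```shell
#!/bin/sh
# Hypothetical helper: print the MIG name expected for a base name and version.
# Assumption: every release from 4.9.0 onward uses the "-nomad-client-group" suffix.
expected_mig_name() {
  base_name="$1"        # value of var.name, e.g. "prod"
  server_version="$2"   # e.g. "4.8.3" or "4.9.0"

  major="${server_version%%.*}"   # text before the first dot
  rest="${server_version#*.}"     # text after the first dot
  minor="${rest%%.*}"             # text before the next dot

  if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]; }; then
    printf '%s-nomad-client-group\n' "$base_name"
  else
    printf '%s-nomad\n' "$base_name"
  fi
}
```

For example, expected_mig_name prod 4.9.0 prints the new-style name, which is the value the autoscaler needs to be configured with.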

Solution

  1. Confirm the current MIG name in Terraform state

    terraform state show google_compute_instance_group_manager.nomad
    # or
    terraform show | grep -A5 "nomad_client_group"
    
  2. Update values.yaml with the correct MIG name

    nomad:
      auto_scaler:
        gcp:
          mig_name: "prod-nomad-client-group"
        
  3. Apply the Helm upgrade

    helm upgrade <release_name> <chart> -n <circleci_namespace> -f values.yaml

You may also need to restart the nomad-autoscaler deployment so that the pod picks up the newly mounted policy:

    kubectl rollout restart deployment/nomad-autoscaler -n <circleci_namespace>
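
To check whether the fix took effect, one option is to count how many recent autoscaler log lines still contain the 404 message. The sketch below is not an official procedure. The function name count_mig_404 is hypothetical, and the match string is taken from the log excerpts in the Overview.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print the number of lines
# that contain the MIG 404 message shown in the Overview.
count_mig_404() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the function's exit status at success in that case.
  grep -c "failed to describe GCE Managed Instance Group: googleapi: Error 404" || true
}
```

A possible usage, assuming the deployment name used elsewhere in this article, is: kubectl logs deployment/nomad-autoscaler -n <circleci_namespace> --since=10m | count_mig_404. A result of 0 suggests the autoscaler no longer hits the 404 within that window.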
