Overview
After upgrading CircleCI Server to 4.9.x, the following errors appear in the nomad-autoscaler pod logs:
[WARN] policy_manager.policy_handler: failed to get target status: policy_id=<policy_id> error="failed to describe GCE Managed Instance Group: googleapi: Error 404: The resource '<resource_name>/prod-nomad' was not found, notFound"
[ERROR] policy_manager.policy_handler: failed to describe GCE Managed Instance Group: googleapi: Error 404: The resource '<resource_name>/prod-nomad' was not found, notFound: policy_id=<policy_id>
The autoscaler fails to locate the GCE Managed Instance Group (MIG), and Nomad client scaling stops functioning.
Root Cause
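The issue is caused by a change to the google_compute_instance_group_manager Terraform resource in the server-terraform module, which renamed the MIG. An autoscaler configuration that still references the old name requests a MIG that no longer exists, so the lookup returns a 404.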
Prior to 4.9.0, the MIG name was:
name = "${var.name}-nomad"
# e.g. "prod-nomad"
From 4.9.0, the MIG name changed to:
name = "${var.name}-nomad-client-group"
# e.g. "prod-nomad-client-group"
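The naming rule above can be sketched as a small shell helper that prints the MIG name to expect for a given base name and version. The function name expected_mig_name is hypothetical and not part of any CircleCI tooling. It also assumes that the module version matches the CircleCI Server version and that releases after 4.9.x keep the new name, neither of which is confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: print the MIG name expected for a base name
# (the value of var.name) and a version string such as "4.9.0".
# Assumption: versions later than 4.9.x keep the "-nomad-client-group" suffix.
expected_mig_name() {
  base="$1"
  version="$2"
  major="${version%%.*}"   # text before the first dot
  rest="${version#*.}"     # text after the first dot
  minor="${rest%%.*}"      # second version component
  if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]; }; then
    printf '%s-nomad-client-group\n' "$base"
  else
    printf '%s-nomad\n' "$base"
  fi
}

expected_mig_name prod 4.8.3   # prints: prod-nomad
expected_mig_name prod 4.9.0   # prints: prod-nomad-client-group
```

Comparing this output with the mig_name currently set in values.yaml is a quick way to spot a stale value.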
Solution
- Confirm the current MIG name in the Terraform state:
  terraform state show google_compute_instance_group_manager.nomad
  # or
  terraform show | grep -A5 "nomad_client_group"
- Update values.yaml with the correct MIG name:
  nomad:
    auto_scaler:
      gcp:
        mig_name: "prod-nomad-client-group"
- Apply the Helm upgrade:
  helm upgrade
You may also need to run kubectl rollout restart deployment/nomad-autoscaler -n <circleci_namespace> so that the pod picks up the newly mounted policy.
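To check whether the fix took effect, the autoscaler logs can be scanned for the 404 message quoted in the Overview. The following is a minimal sketch: the function name has_mig_not_found is hypothetical, and the match string is taken from the log lines shown above, so it may need adjusting if the log format differs in your version.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and exit 0 if the
# "MIG not found" error from the Overview appears, non-zero otherwise.
has_mig_not_found() {
  grep -q 'failed to describe GCE Managed Instance Group: googleapi: Error 404'
}

# Example usage (replace the namespace placeholder with your own):
#   kubectl logs deployment/nomad-autoscaler -n <circleci_namespace> --since=5m \
#     | has_mig_not_found && echo "autoscaler still cannot find the MIG"
```

Limiting the logs with --since keeps errors from before the restart from triggering a false alarm.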