Problem Description
You are seeing parallel jobs that have finished but are not being marked as complete. Their timers on the workflow page keep incrementing, and they hold back the rest of the workflow.
Solution
By default, AWS rebalances an Auto Scaling group (ASG) to keep a similar number of instances in each Availability Zone. See the AWS documentation listed under Additional Resources.
This rebalancing hard-stops a Nomad client without any drain delay, which kills all actively running jobs. Each rebalance is logged as an activity on the ASG.
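Because each rebalance is logged as an activity on the ASG, the activity history can be checked to confirm whether a rebalance coincides with the stuck jobs. The following is a minimal sketch, not an official procedure. It assumes boto3 is installed and AWS credentials are configured, and it assumes that rebalancing activities can be recognised by the text "balanc" in their Cause or Description fields. That keyword is a guess, so compare it against the wording in your own activity history. The function names are hypothetical.

```python
def find_rebalance_activities(activities, keyword="balanc"):
    """Return the activities whose Cause or Description mentions balancing.

    `activities` is a list of dicts shaped like the "Activities" entries
    returned by the Auto Scaling DescribeScalingActivities API.
    The keyword match is an assumption, not a documented AWS contract.
    """
    matches = []
    for activity in activities:
        text = "{} {}".format(
            activity.get("Cause", ""), activity.get("Description", "")
        ).lower()
        if keyword in text:
            matches.append(activity)
    return matches


def fetch_activities(asg_name, max_records=100):
    """Fetch recent scaling activities for one ASG (requires boto3)."""
    import boto3  # imported here so the filter above works without boto3

    client = boto3.client("autoscaling")
    response = client.describe_scaling_activities(
        AutoScalingGroupName=asg_name,
        MaxRecords=max_records,
    )
    return response["Activities"]
```

For example, `find_rebalance_activities(fetch_activities("my-nomad-clients"))` would list the suspected rebalance events for a group named `my-nomad-clients` (a placeholder name). The `StartTime` of each match can then be compared with the time the jobs stopped updating.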
Please try running your clients in a single Availability Zone (AZ) to limit the impact of these rebalancing events. If that succeeds, you may look into using multiple ASGs, one per AZ, to eliminate those rebalances.
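Before changing anything, it can help to check whether a group actually spans more than one AZ and to plan what a one-ASG-per-AZ layout would look like. The sketch below is illustrative only. It works on a dict shaped like an entry of "AutoScalingGroups" from the DescribeAutoScalingGroups API, and the helper names and the `<base>-<zone>` naming scheme are assumptions, not a convention from AWS or from this article.

```python
def spans_multiple_zones(group):
    """Return True if the ASG description lists more than one AZ.

    `group` is a dict with an "AvailabilityZones" list, as found in the
    DescribeAutoScalingGroups response.
    """
    return len(set(group.get("AvailabilityZones", []))) > 1


def per_az_group_names(base_name, zones):
    """Map each AZ to a proposed single-AZ ASG name.

    The "<base_name>-<zone>" pattern is a hypothetical naming scheme.
    """
    return {zone: "{}-{}".format(base_name, zone) for zone in sorted(set(zones))}
```

A group for which `spans_multiple_zones` returns True is a candidate for either moving to a single AZ or splitting into the per-AZ groups suggested by `per_az_group_names`.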
Additional Resources
Amazon EC2 Auto Scaling benefits