Clusters deployed with CfnCluster are elastic in several ways. The first is by
simply setting the initial_queue_size
and max_queue_size
parameters of a cluster
settings. The initial_queue_size
sets minimum size value of the ComputeFleet
Auto Scaling Group (ASG) and also the desired capacity value . The max_queue_size
sets maximum size value of the ComputeFleet ASG.
Every minute, a process called jobwatcher runs on the master instance and evaluates
the current number of instances requested in the queue. If this number is greater than the
current autoscaling desired, it adds more instances. If you submit more jobs,
the queue will get re-evaluated and the ASG updated up to the max_queue_size
.
On each compute node, a process called nodewatcher runs and evaluates the
work left in the queue. If an instance has had no jobs for longer than scaledown_idletime
(which defaults to 10 minutes), the instance is terminated.
It specifically calls the TerminateInstanceInAutoScalingGroup API call, which will remove an instance as long as the size of the ASG is at least the minimum ASG size. That handles scale down of the cluster, without affecting running jobs and also enables an elastic cluster with a fixed base number of instances.
The value of the auto scaling is the same for HPC as with any other workloads,
the only difference here is CfnCluster has code to specifically make it interact
in a more intelligent manner. If a static cluster is required, this can be
achieved by setting initial_queue_size
and max_queue_size
parameters to the size
of cluster required and also setting the maintain_initial_size
parameter to
true. This will cause the ComputeFleet ASG to have the same value for minimum,
maximum and desired capacity.