CfnCluster

CfnCluster (“cloud formation cluster”) is a framework that deploys and maintains high performance computing clusters on Amazon Web Services (AWS). Developed by AWS, CfnCluster facilitates both quick start proof of concepts (POCs) and production deployments. CfnCluster supports many different types of clustered applications and can easily be extended to support different frameworks. Download CfnCluster today to see how CfnCluster’s command line interface leverages AWS CloudFormation templates and other AWS cloud services.

Getting started with CfnCluster

Installing CfnCluster

The current working version is CfnCluster-1.4. The CLI is written in Python and uses Boto for AWS actions. You can install the CLI with the following commands, depending on your OS.

Linux/OSX

$ sudo pip install cfncluster

or:

$ sudo easy_install cfncluster

Windows

Windows support is experimental.

Install the following packages: Python 2.7 and setuptools.

Once installed, update the PATH environment variable to include the Python install directory and the Python Scripts directory, for example: C:\Python27;C:\Python27\Scripts
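
For example, assuming a default Python 2.7 installation, the PATH can be extended from a command prompt with setx (new command prompt windows pick up the change; the current one does not):

C:\> setx PATH "%PATH%;C:\Python27;C:\Python27\Scripts"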

Now it should be possible to run the following within a command prompt window:

C:\> easy_install CfnCluster

Upgrading

To upgrade an older version of CfnCluster, you can use either of the following commands, depending on how it was originally installed:

$ sudo pip install --upgrade cfncluster

or

$ sudo easy_install -U cfncluster

Remember when upgrading to check that the existing config is compatible with the latest version installed.
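
For example, you can confirm which version is now installed before touching any running clusters:

$ pip show cfncluster
$ cfncluster version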

Configuring CfnCluster

Once installed, you will need to set up some initial configuration. The easiest way to do this is with the configure wizard:

$ cfncluster configure

This configure wizard will prompt you for everything you need to create your cluster. You will first be prompted for your cluster template name, which is the logical name of the template you will create a cluster from.

Cluster Template [mycluster]:

Next, you will be prompted for your AWS Access & Secret Keys. Enter the keys for an IAM user with administrative privileges. These can also be read from your environment variables or the aws CLI config.

AWS Access Key ID []:
AWS Secret Access Key ID []:

Now, you will be presented with a list of valid AWS region identifiers. Choose the region in which you’d like your cluster to run.

Acceptable Values for AWS Region ID:
    us-east-1
    cn-north-1
    ap-northeast-1
    eu-west-1
    ap-southeast-1
    ap-southeast-2
    us-west-2
    us-gov-west-1
    us-west-1
    eu-central-1
    sa-east-1
AWS Region ID []:

Choose a descriptive name for your VPC. Typically, this will be something like production or test.

VPC Name [myvpc]:

Next, you will need to choose a keypair that already exists in EC2 in order to log into your master instance. If you do not already have a keypair, refer to the EC2 documentation on EC2 Key Pairs.

Acceptable Values for Key Name:
    keypair1
    keypair-test
    production-key
Key Name []:

Choose the VPC ID into which you’d like your cluster launched.

Acceptable Values for VPC ID:
    vpc-1kd24879
    vpc-blk4982d
VPC ID []:

Finally, choose the subnet in which you’d like your master server to run.

Acceptable Values for Master Subnet ID:
    subnet-9k284a6f
    subnet-1k01g357
    subnet-b921nv04
Master Subnet ID []:

A simple cluster launches into a VPC and uses an existing subnet that supports public IPs, i.e. the route table for the subnet is 0.0.0.0/0 => igw-xxxxxx. The VPC must have DNS Resolution = yes and DNS Hostnames = yes. It should also have DHCP options with the correct domain-name for the region, as defined in the docs: VPC DHCP Options.
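
If you have the aws CLI installed, you can verify both DNS attributes before launching; the VPC ID below is a placeholder:

$ aws ec2 describe-vpc-attribute --vpc-id vpc-a1b2c3d4 --attribute enableDnsSupport
$ aws ec2 describe-vpc-attribute --vpc-id vpc-a1b2c3d4 --attribute enableDnsHostnames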

Once all of those settings contain valid values, you can launch the cluster by running the create command:

$ cfncluster create mycluster

Once the cluster reaches the “CREATE_COMPLETE” status, you can connect using your normal SSH client/settings. For more details on connecting to EC2 instances, check the EC2 User Guide.
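
For example, for a cluster using the default Amazon Linux base OS (see base_os in the Configuration section for the username used by other OS types), the connection looks like:

$ ssh -i /path/to/keyfile.pem ec2-user@<master-public-ip>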

Working with CfnCluster

CfnCluster Commands

Most commands provided are just wrappers around CloudFormation functions.

Note

When a command is called and it starts polling for the status of that call, it is safe to Ctrl-C out. You can always return to that status by calling cfncluster status mycluster.

create

Creates a CloudFormation stack with the name cfncluster-[stack_name]. To read more about CloudFormation see AWS CloudFormation.

positional arguments:
  cluster_name          create a cfncluster with the provided name.

optional arguments:
  -h, --help            show this help message and exit
  --norollback, -nr     disable stack rollback on error
  --template-url TEMPLATE_URL, -u TEMPLATE_URL
                        specify a URL for a custom cloudformation template
  --cluster-template CLUSTER_TEMPLATE, -t CLUSTER_TEMPLATE
                        specify a specific cluster template to use
  --extra-parameters EXTRA_PARAMETERS, -p EXTRA_PARAMETERS
                        add extra parameters to stack create
  --tags TAGS, -g TAGS  tags to be added to the stack; TAGS is a JSON
                        formatted string encapsulated by single quotes

$ cfncluster create mycluster

Create a cluster with tags:

$ cfncluster create mycluster --tags '{ "Key1" : "Value1" , "Key2" : "Value2" }'

update

Updates the CloudFormation stack using the values in the config file or a TEMPLATE_URL provided. For more information see AWS CloudFormation Stacks Updates.

positional arguments:
  cluster_name          update a cfncluster with the provided name.

optional arguments:
  -h, --help            show this help message and exit
  --norollback, -nr     disable stack rollback on error
  --template-url TEMPLATE_URL, -u TEMPLATE_URL
                        specify a URL for a custom cloudformation template
  --cluster-template CLUSTER_TEMPLATE, -t CLUSTER_TEMPLATE
                        specify a specific cluster template to use
  --extra-parameters EXTRA_PARAMETERS, -p EXTRA_PARAMETERS
                        add extra parameters to stack update
  --reset-desired, -rd  reset the current ASG desired capacity to initial
                        config values

$ cfncluster update mycluster

stop

Sets the Auto Scaling group parameters to min/max/desired = 0/0/0.

Note

A stopped cluster will only terminate the compute fleet.

Previous versions of CfnCluster stopped the master node after terminating the compute fleet. Due to a number of challenges with the implementation of that feature, the current version only terminates the compute fleet. The master will remain running. To terminate all EC2 resources and avoid EC2 charges, consider deleting the cluster.

positional arguments:
  cluster_name          stops the compute-fleet of the provided cluster name.

optional arguments:
  -h, --help            show this help message and exit

$ cfncluster stop mycluster

start

Starts a cluster. This sets the Auto Scaling group parameters to min/max/desired = 0/max_queue_size/0, where max_queue_size defaults to 10. If you specify the --reset-desired flag, the min/desired values will be set to the initial_queue_size.

positional arguments:
  cluster_name          starts the compute-fleet of the provided cluster name.

optional arguments:
  -h, --help            show this help message and exit
  --reset-desired, -rd  set the ASG desired capacity to initial config values

$ cfncluster start mycluster

delete

Deletes a cluster. This issues a CloudFormation delete call, which deletes all the resources associated with that stack.

positional arguments:
  cluster_name          delete a cfncluster with the provided name.

optional arguments:
  -h, --help            show this help message and exit

$ cfncluster delete mycluster

status

Pulls the current status of the cluster. Polls if the status is not CREATE_COMPLETE or UPDATE_COMPLETE. For more info on possible statuses, see the Stack Status Codes page.

positional arguments:
  cluster_name          show the status of cfncluster with the provided name.

optional arguments:
  -h, --help            show this help message and exit

$ cfncluster status mycluster

list

Lists clusters currently running or stopped. Lists the stack_name of the CloudFormation stacks with the name cfncluster-[stack_name].

optional arguments:
  -h, --help            show this help message and exit

$ cfncluster list

instances

Shows EC2 instances currently running on the given cluster.

positional arguments:
  cluster_name          show the status of cfncluster with the provided name.

optional arguments:
  -h, --help            show this help message and exit

$ cfncluster instances mycluster

configure

Configures the cluster. See Configuring CfnCluster.

optional arguments:
  -h, --help            show this help message and exit

$ cfncluster configure mycluster

version

Displays CfnCluster version.

optional arguments:
  -h, --help            show this help message and exit

$ cfncluster version

Network Configurations

CfnCluster leverages Amazon Virtual Private Cloud (VPC) for networking. This provides a very flexible and configurable networking platform in which to deploy clusters. CfnCluster supports the following high-level configurations:

  • Single subnet, master and compute in the same subnet
  • Two subnets, master in one subnet and compute in a new private subnet
  • Two subnets, master in one subnet and compute in existing private subnet

All of these configurations can operate with or without public IP addressing. Clusters can also be deployed to use an HTTP proxy for all AWS requests. The combinations of these configurations result in many deployment scenarios, ranging from a single public subnet with all access over the Internet to fully private via AWS Direct Connect and an HTTP proxy for all traffic.

Below are some architecture diagrams for some of those scenarios:

CfnCluster single subnet

CfnCluster in a single public subnet

The configuration for this architecture requires the following settings:

Note that all values are examples only.

[vpc public]
vpc_id = vpc-a1b2c3d4
master_subnet_id = subnet-a1b2c3d4

CfnCluster two subnets

CfnCluster using two subnets (new private)

The configuration for this architecture requires the following settings:

Note that all values are examples only.

[vpc public-private]
vpc_id = vpc-a1b2c3d4
master_subnet_id = subnet-a1b2c3d4
compute_subnet_cidr = 10.0.1.0/24

CfnCluster private with DX

CfnCluster in a private subnet connected using Direct Connect

The configuration for this architecture requires the following settings:

Note that all values are examples only.

[vpc private-dx]
vpc_id = vpc-a1b2c3d4
master_subnet_id = subnet-a1b2c3d4
proxy_server = http://proxy.corp.net:8080
use_public_ips = false

Custom Bootstrap Actions

CfnCluster can execute arbitrary code either before (pre-install) or after (post-install) the main bootstrap action during cluster creation. This code is typically stored in S3 and accessed via HTTP(S) during cluster creation. It runs as root and can be in any script language supported by the cluster OS, typically bash or python.

Pre-install actions are called before any cluster deployment bootstrap, such as configuring NAT, EBS, and the scheduler. Typical pre-install actions may include modifying storage, adding extra users, or adding packages.

Post-install actions are called after cluster bootstrap is complete, as the last action before an instance is considered complete. Typical post-install actions may include changing scheduler settings, modifying storage, or modifying packages.

Arguments can be passed to scripts by specifying them in the config. These are passed double-quoted to the pre/post-install actions.

If a pre- or post-install action fails, the instance bootstrap is considered failed and will not continue. Success is signalled with an exit code of 0; any other exit code is considered a failure.
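
To illustrate both points, below is a minimal hypothetical post-install script; the yum line is borrowed from the example later in this section, and echoing "$@" is simply a safe way to inspect whatever arguments were passed through from the config:

#!/bin/bash
# Print the arguments passed through from post_install_args.
echo "post-install args: $@"
# A non-zero exit fails the instance bootstrap, so fail loudly...
yum -y install --enablerepo=epel R || exit 1
# ...and signal success explicitly.
exit 0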

Configuration

The following config settings are used to define pre/post-install actions and arguments. All options are optional and are not required for a basic cluster install.

# URL to a preinstall script. This is executed before any of the boot_as_* scripts are run
# (defaults to NONE for the default template)
pre_install = NONE
# Arguments to be passed to preinstall script
# (defaults to NONE for the default template)
pre_install_args = NONE
# URL to a postinstall script. This is executed after any of the boot_as_* scripts are run
# (defaults to NONE for the default template)
post_install = NONE
# Arguments to be passed to postinstall script
# (defaults to NONE for the default template)
post_install_args = NONE

Example

The following are the steps to create a simple post-install script that installs R on a cluster.

  1. Create a script. For the R example, see below:

    #!/bin/bash

    yum -y install --enablerepo=epel R

  2. Upload the script with the correct permissions to S3:

    aws s3 cp --acl public-read /path/to/myscript.sh s3://<bucket-name>/myscript.sh

  3. Update the CfnCluster config to include the new post-install action:

    [cluster default]
    ...
    post_install = https://<bucket-name>.s3.amazonaws.com/myscript.sh

  4. Launch a cluster:

    $ cfncluster create mycluster

Working with S3

Accessing S3 within CfnCluster can be controlled through two parameters in the CfnCluster config.

# Specify S3 resource to which cfncluster nodes will be granted read-only access
# (defaults to NONE for the default template)
s3_read_resource = NONE
# Specify S3 resource to which cfncluster nodes will be granted read-write access
# (defaults to NONE for the default template)
s3_read_write_resource = NONE

Both parameters accept either * or a valid S3 ARN. For details on how to specify S3 ARNs, see http://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-s3
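
For example, to grant the cluster read-only access to everything in one bucket and read-write access to a single prefix (the bucket name is a placeholder):

s3_read_resource = arn:aws:s3:::my_corporate_bucket/*
s3_read_write_resource = arn:aws:s3:::my_corporate_bucket/Development/*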

IAM in CfnCluster

CfnCluster utilizes multiple AWS services to deploy and operate a cluster. The services used are listed in the AWS Services used in CfnCluster section of the documentation.

CfnCluster uses EC2 IAM roles to give instances access to the AWS services needed for the deployment and operation of the cluster. By default, the EC2 IAM role is created as part of the cluster creation by CloudFormation. This means that the user creating the cluster must have the appropriate level of permissions.

Defaults

When using defaults, an EC2 IAM role is created during cluster launch, along with all the other resources required to launch the cluster. The user making the create call must have the right level of permissions to create all of these resources, including the EC2 IAM role. This level of permissions is typically an IAM user with the AdministratorAccess managed policy. More details on managed policies can be found here: http://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html#aws-managed-policies
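
For example, assuming an IAM user named cluster-admin (a placeholder), the managed policy could be attached with the aws CLI:

$ aws iam attach-user-policy --user-name cluster-admin \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess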

Using an existing EC2 IAM role

When using CfnCluster with an existing EC2 IAM role, you must define the IAM policy and role before attempting to launch the cluster. Typically, the reason for using an existing EC2 IAM role with CfnCluster is to reduce the permissions granted to users launching clusters. Below is an example IAM policy for both the EC2 IAM role and the CfnCluster IAM user. You should create both as individual policies in IAM and then attach them to the appropriate resources. In both policies, replace REGION and AWS ACCOUNT ID with the appropriate values.

CfnClusterInstancePolicy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "ec2:AttachVolume",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeInstanceStatus",
                "ec2:DescribeInstances"
            ],
            "Sid": "EC2",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "dynamodb:ListTables"
            ],
            "Sid": "DynamoDBList",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:cfncluster-*"
            ],
            "Action": [
                "sqs:SendMessage",
                "sqs:ReceiveMessage",
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueUrl"
            ],
            "Sid": "SQSQueue",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "autoscaling:SetDesiredCapacity"
            ],
            "Sid": "Autoscaling",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Sid": "CloudWatch",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:dynamodb:<REGION>:<AWS ACCOUNT ID>:table/cfncluster-*"
            ],
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:Query",
                "dynamodb:GetItem",
                "dynamodb:DeleteItem",
                "dynamodb:DescribeTable"
            ],
            "Sid": "DynamoDBTable",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "sqs:ListQueues"
            ],
            "Sid": "SQSList",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:logs:*:*:*"
            ],
            "Action": [
                "logs:*"
            ],
            "Sid": "CloudWatchLogs",
            "Effect": "Allow"
        }
    ]
}

CfnClusterUserPolicy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EC2Describe",
            "Action": [
                "ec2:DescribeKeyPairs",
                "ec2:DescribeVpcs",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribePlacementGroups",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeSnapshots",
                "ec2:DescribeVolumes",
                "ec2:DescribeVpcAttribute",
                "ec2:DescribeAddresses",
                "ec2:CreateTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeAvailabilityZones"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "EC2Modify",
            "Action": [
                "ec2:CreateVolume",
                "ec2:RunInstances",
                "ec2:AllocateAddress",
                "ec2:AssociateAddress",
                "ec2:AttachNetworkInterface",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CreateNetworkInterface",
                "ec2:CreateSecurityGroup",
                "ec2:ModifyVolumeAttribute",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteVolume",
                "ec2:TerminateInstances",
                "ec2:DeleteSecurityGroup",
                "ec2:DisassociateAddress",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:ReleaseAddress"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "AutoScalingDescribe",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeAutoScalingInstances"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "AutoScalingModify",
            "Action": [
                "autoscaling:CreateAutoScalingGroup",
                "autoscaling:CreateLaunchConfiguration",
                "autoscaling:PutNotificationConfiguration",
                "autoscaling:UpdateAutoScalingGroup",
                "autoscaling:PutScalingPolicy",
                "autoscaling:DeleteLaunchConfiguration",
                "autoscaling:DescribeScalingActivities",
                "autoscaling:DeleteAutoScalingGroup",
                "autoscaling:DeletePolicy"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "DynamoDBDescribe",
            "Action": [
                "dynamodb:DescribeTable"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "DynamoDBModify",
            "Action": [
            "dynamodb:CreateTable",
            "dynamodb:DeleteTable"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "CloudWatchModify",
            "Action": [
                "cloudwatch:PutMetricAlarm",
                "cloudwatch:DeleteAlarms"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SQSDescribe",
            "Action": [
                "sqs:GetQueueAttributes"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SQSModify",
            "Action": [
                "sqs:CreateQueue",
                "sqs:SetQueueAttributes",
                "sqs:DeleteQueue"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SNSDescribe",
            "Action": [
            "sns:ListTopics",
            "sns:GetTopicAttributes"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SNSModify",
            "Action": [
                "sns:CreateTopic",
                "sns:Subscribe",
                "sns:DeleteTopic"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "CloudFormationDescribe",
            "Action": [
                "cloudformation:DescribeStackEvents",
                "cloudformation:DescribeStackResources",
                "cloudformation:DescribeStacks",
                "cloudformation:ListStacks"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "CloudFormationModify",
            "Action": [
                "cloudformation:CreateStack",
                "cloudformation:DeleteStack",
                "cloudformation:UpdateStack"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "S3CfnClusterReadOnly",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<REGION>-cfncluster*"
            ]
        },
        {
            "Sid": "IAMModify",
            "Action": [
                "iam:PassRole"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:iam::<AWS ACCOUNT ID>:role/<CFNCLUSTER EC2 ROLE NAME>"
        }
    ]
}
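
With both policies saved as files, the EC2 role and its instance profile can be created with the aws CLI. The following is a sketch only: the role name and file names are placeholders, and ec2-trust-policy.json must contain a trust policy allowing ec2.amazonaws.com to assume the role.

$ aws iam create-role --role-name CfnClusterInstanceRole \
    --assume-role-policy-document file://ec2-trust-policy.json
$ aws iam put-role-policy --role-name CfnClusterInstanceRole \
    --policy-name CfnClusterInstancePolicy \
    --policy-document file://cfncluster-instance-policy.json
$ aws iam create-instance-profile --instance-profile-name CfnClusterInstanceRole
$ aws iam add-role-to-instance-profile --instance-profile-name CfnClusterInstanceRole \
    --role-name CfnClusterInstanceRole

The role is then referenced from the cluster config via the ec2_iam_role setting described in the Configuration section.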

Configuration

cfncluster uses the file ~/.cfncluster/config by default for all configuration parameters.

You can see an example configuration file at site-packages/cfncluster/examples/config.

Layout

Configuration is defined in multiple sections. Required sections are “global”, “aws”, one “cluster”, and one “vpc”.

A section starts with the section name in brackets, followed by parameters and configuration.

[global]
cluster_template = default
update_check = true
sanity_check = true
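
Pulling the required sections together, a minimal working config might look like the following; all IDs and names are placeholders, and each setting is described in the sections below:

[global]
cluster_template = default

[aws]
aws_region_name = us-east-1

[cluster default]
key_name = mykey
vpc_settings = public

[vpc public]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-xxxxxx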

Configuration Options

global

Global configuration options related to cfncluster.

[global]

cluster_template

The name of the cluster section used for the cluster.

See the Cluster Definition.

cluster_template = default

update_check

Whether or not to check for updates to cfncluster.

update_check = true

sanity_check

Attempts to validate that resources defined in parameters actually exist.

sanity_check = true

aws

This is the AWS credentials section (required). These settings apply to all clusters.

If not defined, boto will attempt to use a) the environment or b) the EC2 IAM role.

[aws]
aws_access_key_id = #your_aws_access_key_id
aws_secret_access_key = #your_secret_access_key

# Defaults to us-east-1 if not defined in the environment or below
aws_region_name = #region

cluster

You can define one or more clusters for different types of jobs or workloads.

Each cluster has its own configuration based on your needs.

The format is [cluster <clustername>].

[cluster default]

key_name

Name of an existing EC2 KeyPair to enable SSH access to the instances.

key_name = mykey

template_url

Overrides the path to the CloudFormation template used to create the cluster.

Defaults to https://s3.amazonaws.com/cfncluster-<aws_region_name>/templates/cfncluster-<version>.cfn.json.

template_url = https://s3.amazonaws.com/cfncluster-us-east-1/templates/cfncluster.cfn.json

compute_instance_type

The EC2 instance type used for the cluster compute nodes.

Defaults to t2.micro for the default template.

compute_instance_type = t2.micro

master_instance_type

The EC2 instance type used for the master node.

Defaults to t2.micro for the default template.

master_instance_type = t2.micro

initial_queue_size

The initial number of EC2 instances to launch as compute nodes in the cluster.

Defaults to 2 for the default template.

initial_queue_size = 2

max_queue_size

The maximum number of EC2 instances that can be launched in the cluster.

Defaults to 10 for the default template.

max_queue_size = 10

maintain_initial_size

Boolean flag to set the Auto Scaling group to maintain its initial size.

If set to true, the Auto Scaling group will never have fewer members than the value of initial_queue_size. It will still allow the cluster to scale up to the value of max_queue_size.

Setting it to false allows the Auto Scaling group to scale down to 0 members, so resources will not sit idle when they aren’t needed.

Defaults to false for the default template.

maintain_initial_size = false

scheduler

Scheduler to be used with the cluster. Valid options are sge, torque, or slurm.

Defaults to sge for the default template.

scheduler = sge

cluster_type

Type of cluster to launch, i.e. ondemand or spot.

Defaults to ondemand for the default template.

cluster_type = ondemand

spot_price

If cluster_type is set to spot, the maximum spot price for the ComputeFleet. See the Spot Bid Advisor for assistance finding a bid price that meets your needs.

spot_price = 0.00

custom_ami

ID of a custom AMI to use instead of the default published AMIs.

custom_ami = NONE

s3_read_resource

Specify an S3 resource to which cfncluster nodes will be granted read-only access.

For example, ‘arn:aws:s3:::my_corporate_bucket/*’ would provide read-only access to all objects in the my_corporate_bucket bucket.

See Working with S3 for details on the format.

Defaults to NONE for the default template.

s3_read_resource = NONE

s3_read_write_resource

Specify an S3 resource to which cfncluster nodes will be granted read-write access.

For example, ‘arn:aws:s3:::my_corporate_bucket/Development/*’ would provide read-write access to all objects in the Development folder of the my_corporate_bucket bucket.

See Working with S3 for details on the format.

Defaults to NONE for the default template.

s3_read_write_resource = NONE

pre_install

URL to a pre-install script. This is executed before any of the boot_as_* scripts are run.

Can be specified in “http://hostname/path/to/script.sh” or “s3://bucketname/path/to/script.sh” format.

Defaults to NONE for the default template.

pre_install = NONE

pre_install_args

Quoted list of arguments to be passed to the pre-install script.

Defaults to NONE for the default template.

pre_install_args = NONE

post_install

URL to a post-install script. This is executed after any of the boot_as_* scripts are run.

Can be specified in “http://hostname/path/to/script.sh” or “s3://bucketname/path/to/script.sh” format.

Defaults to NONE for the default template.

post_install = NONE

post_install_args

Arguments to be passed to the post-install script.

Defaults to NONE for the default template.

post_install_args = NONE

proxy_server

HTTP(S) proxy server, typically http://x.x.x.x:8080.

Defaults to NONE for the default template.

proxy_server = NONE

placement_group

Cluster placement group. It can be one of three values: NONE, DYNAMIC, or an existing placement group name. When DYNAMIC is set, a unique placement group will be created as part of the cluster and deleted when the cluster is deleted.

Defaults to NONE for the default template. More information on placement groups can be found in the EC2 documentation.

placement_group = NONE

placement

Cluster placement logic. This enables either the whole cluster or only the compute fleet to use the placement group.

Defaults to cluster in the default template.

placement = cluster

ephemeral_dir

If instance store volumes exist, this is the path/mountpoint on which they will be mounted.

Defaults to /scratch in the default template.

ephemeral_dir = /scratch

shared_dir

Path/mountpoint for the shared EBS volume.

Defaults to /shared in the default template. See the EBS Section for details on working with EBS volumes.

shared_dir = /shared

encrypted_ephemeral

Encrypt the ephemeral drives. Keys are held in memory and are non-recoverable. If true, CfnCluster will generate an ephemeral encryption key in memory and, using LUKS encryption, encrypt your instance store volumes.

Defaults to false in the default template.

encrypted_ephemeral = false

master_root_volume_size

MasterServer root volume size in GB. (The AMI must support growroot.)

Defaults to 15 in the default template.

master_root_volume_size = 15

compute_root_volume_size

ComputeFleet root volume size in GB. (The AMI must support growroot.)

Defaults to 15 in the default template.

compute_root_volume_size = 15

base_os

OS type used in the cluster.

Defaults to alinux in the default template. Available options are: alinux, centos6, centos7, ubuntu1404.

Note: The base_os determines the username used to log into the cluster.

  • CentOS 6 & 7: centos

  • Ubuntu: ubuntu

  • Amazon Linux: ec2-user

base_os = alinux
    
cwl_region

The CloudWatch Logs region.

Defaults to NONE in the default template.

cwl_region = NONE

cwl_log_group

The CloudWatch Logs log group name.

Defaults to NONE in the default template.

cwl_log_group = NONE

ec2_iam_role

Existing EC2 IAM role that will be attached to all instances in the cluster.

Defaults to NONE in the default template.

ec2_iam_role = NONE

extra_json

Extra JSON that will be merged into the dna.json used by Chef.

Defaults to {} in the default template.

extra_json = {}

additional_cfn_template

An additional CloudFormation template to launch along with the cluster. This allows you to create resources that exist outside of the cluster but are part of the cluster’s lifecycle.

Must be an HTTP URL to a public template with all parameters provided.

Defaults to NONE in the default template.

additional_cfn_template = NONE

vpc_settings

Settings section relating to the VPC to be used.

See the VPC Section.

vpc_settings = public

ebs_settings

Settings section relating to the EBS volume mounted on the master.

See the EBS Section.

ebs_settings = custom

scaling_settings

Settings section relating to scaling.

See the Scaling Section.

scaling_settings = custom

tags

Defines tags to be used in CloudFormation.

If command line tags are specified via --tags, they get merged with config tags.

Command line tags overwrite config tags that have the same key.

Tags are JSON formatted and should not have quotes outside the curly braces.

See AWS CloudFormation Resource Tags Type.

tags = {"key" : "value", "key2" : "value2"}

vpc

VPC configuration settings:

[vpc public]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-xxxxxx

vpc_id

ID of the VPC you want to provision the cluster into.

vpc_id = vpc-xxxxxx

master_subnet_id

ID of an existing subnet you want to provision the Master server into.

master_subnet_id = subnet-xxxxxx

ssh_from

CIDR-formatted IP range from which to allow SSH access.

This is only used when cfncluster creates the security group.

Defaults to 0.0.0.0/0 in the default template.

ssh_from = 0.0.0.0/0

additional_sg

Additional VPC security group ID for all instances.

Defaults to NONE in the default template.

additional_sg = sg-xxxxxx

compute_subnet_id

ID of an existing subnet you want to provision the compute nodes into.

compute_subnet_id = subnet-xxxxxx

compute_subnet_cidr

If you wish for cfncluster to create a compute subnet, this is the CIDR it will use.

compute_subnet_cidr = 10.0.100.0/24

use_public_ips

Defines whether or not to assign public IP addresses to EC2 instances.

Set to false if operating in a private VPC.

Defaults to true.

use_public_ips = true

vpc_security_group_id

Use an existing security group for all instances.

Defaults to NONE in the default template.

vpc_security_group_id = sg-xxxxxx

ebs

EBS volume configuration settings for the volume mounted on the master node and shared via NFS to the compute nodes.

[ebs custom]
ebs_snapshot_id = snap-xxxxx
volume_type = io1
volume_iops = 200

ebs_snapshot_id

ID of the EBS snapshot, if using a snapshot as the source for the volume.

Defaults to NONE for the default template.

ebs_snapshot_id = snap-xxxxx

volume_type

The API name for the type of volume you wish to launch.

Defaults to gp2 for the default template.

volume_type = io1

volume_size

Size of the volume to be created (if not using a snapshot).

Defaults to 20GB for the default template.

volume_size = 20

volume_iops

Number of IOPS for io1 type volumes.

volume_iops = 200

encrypted

Whether or not the volume should be encrypted (should not be used with snapshots).

Defaults to false for the default template.

encrypted = false

ebs_volume_id

EBS volume ID of an existing volume that will be attached to the MasterServer.

Defaults to NONE for the default template.

ebs_volume_id = vol-xxxxxx

scaling

Settings which define how the compute nodes scale.

[scaling custom]
scaling_period = 60
scaling_cooldown = 300

scaling_threshold

Threshold for triggering the CloudWatch ScaleUp action.

Defaults to 1 for the default template.

scaling_threshold = 1

scaling_adjustment

Number of instances to add when the CloudWatch ScaleUp action is called.

Defaults to 1 for the default template.

scaling_adjustment = 1

scaling_threshold2

Threshold for triggering the CloudWatch ScaleUp2 action.

Defaults to 200 for the default template.

scaling_threshold2 = 200

scaling_adjustment2

Number of instances to add when the CloudWatch ScaleUp2 action is called.

Defaults to 20 for the default template.

scaling_adjustment2 = 20

scaling_period

Period over which to measure the ScalingThreshold.

Defaults to 60 for the default template.

scaling_period = 60

scaling_evaluation_periods

Number of periods over which to measure the ScalingThreshold.

Defaults to 2 for the default template.

scaling_evaluation_periods = 2

scaling_cooldown

Amount of time in seconds to wait before attempting further scaling actions.

Defaults to 300 for the default template.

scaling_cooldown = 300

How CfnCluster Works

CfnCluster was built not only as a way to manage clusters, but as a reference on how to use AWS services to build your HPC environment.

CfnCluster Processes

There are a number of processes running within CfnCluster which are used to manage its behavior.

General Overview

A cluster’s lifecycle begins after it is created by a user. Typically, this is done from the Command Line Interface (CLI). Once created, a cluster will exist until it’s deleted.

publish_pending_jobs

Once a cluster is running, a cron job owned by the root user monitors the configured scheduler (SGE, Torque, Openlava, etc.) and publishes the number of pending jobs to CloudWatch. This is the metric used by Auto Scaling to add more nodes to the cluster.

Auto Scaling

Auto Scaling, along with CloudWatch alarms, is used to manage the number of running nodes in the cluster.

The number of instances added, along with the thresholds in which to add them are all configurable via the Scaling configuration section.

sqswatcher

The sqswatcher process monitors SQS messages emitted by Auto Scaling, which report state changes within the cluster. When an instance comes online, it submits an “instance ready” message to SQS, which is picked up by sqswatcher running on the master server. These messages are used to notify the queue manager when new instances come online or are terminated, so they can be added to or removed from the queue accordingly.

nodewatcher

The nodewatcher process runs on each node in the compute fleet. This process is used to determine when an idle instance can be terminated. Because EC2 is billed by the instance hour, it waits until an instance has been running for 95% of an instance hour before terminating it.
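
The real nodewatcher ships with CfnCluster itself; purely to illustrate the idea, a minimal shell sketch of the “95% of the hour” check might look like this:

#!/bin/bash
# Illustration only -- not the actual nodewatcher implementation.
uptime_sec=$(cut -d. -f1 /proc/uptime)
# Minutes into the current (billing) hour of uptime:
min_into_hour=$(( (uptime_sec / 60) % 60 ))
# 95% of an hour is 57 minutes.
if [ "$min_into_hour" -ge 57 ]; then
    echo "near the end of the billing hour; terminate if idle"
fi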

AWS Services used in CfnCluster

The following Amazon Web Services (AWS) services are used in CfnCluster.

  • AWS CloudFormation
  • AWS Identity and Access Management (IAM)
  • Amazon SNS
  • Amazon SQS
  • Amazon EC2
  • Auto Scaling
  • Amazon EBS
  • Amazon CloudWatch
  • Amazon S3
  • Amazon DynamoDB

AWS CloudFormation

AWS CloudFormation is the core service used by CfnCluster. Each cluster is represented as a stack. All resources required by the cluster are defined within the CfnCluster CloudFormation template. CfnCluster CLI commands typically map to CloudFormation stack commands, such as create, update, and delete. Instances launched within a cluster make HTTPS calls to the CloudFormation endpoint for the region in which the cluster is launched.

For more details about AWS CloudFormation, see http://aws.amazon.com/cloudformation/

AWS Identity and Access Management (IAM)

IAM is used within CfnCluster to provide an Amazon EC2 IAM role for the instances. This role is a least-privileged role specifically created for each cluster. CfnCluster instances are given access only to the specific API calls that are required to deploy and manage the cluster.

For more details about AWS Identity and Access Management, see http://aws.amazon.com/iam/

Amazon SNS

Amazon Simple Notification Service is used to receive notifications from Auto Scaling. These events are called lifecycle events, and are generated when an instance launches or terminates in an Auto Scaling group. Within CfnCluster, the Amazon SNS topic for the Auto Scaling group is subscribed to an Amazon SQS queue.

For more details about Amazon SNS, see http://aws.amazon.com/sns/

Amazon SQS

Amazon Simple Queue Service is used to hold notifications (messages) from Auto Scaling, sent through Amazon SNS, and notifications from the ComputeFleet instances. This decouples the sending of notifications from the receiving, and allows the master to handle them through polling. The MasterServer runs sqswatcher and polls the queue. Auto Scaling and the ComputeFleet instances post messages to the queue.

For more details about Amazon SQS, see http://aws.amazon.com/sqs/

Amazon EC2

Amazon EC2 provides the compute for CfnCluster. The MasterServer and ComputeFleet are EC2 instances. Any instance type that supports HVM can be selected. The MasterServer and ComputeFleet can be different instance types, and the ComputeFleet can also be launched as Spot instances. Instance store volumes found on the instances are mounted as a striped LVM volume.

For more details about Amazon EC2, see http://aws.amazon.com/ec2/

Auto Scaling

Auto Scaling is used to manage the ComputeFleet instances. These instances are managed as an Auto Scaling group and can either be elastic, driven by the workload, or static, driven by the config.

For more details about Auto Scaling, see http://aws.amazon.com/autoscaling/

Amazon EBS

Amazon EBS provides the persistent storage for the shared volume. Any EBS settings can be passed through the config. EBS volumes can be initialized either empty or from an existing EBS snapshot.

For more details about Amazon EBS, see http://aws.amazon.com/ebs/

Amazon CloudWatch

Amazon CloudWatch provides metric collection and alarms for CfnCluster. The MasterServer publishes the number of pending tasks (jobs) for each cluster. Two alarms are defined that, based on parameters defined in the config, automatically increase the size of the ComputeFleet Auto Scaling group.

For more details, see http://aws.amazon.com/cloudwatch/

Amazon S3

Amazon S3 is used to store the CfnCluster templates. Each region has a bucket with all templates. CfnCluster can also be configured to allow CLI/SDK tools to use S3.

For more details, see http://aws.amazon.com/s3/

Amazon DynamoDB

Amazon DynamoDB is used to store minimal state of the cluster. The MasterServer tracks provisioned instances in a DynamoDB table.

For more details, see http://aws.amazon.com/dynamodb/

CfnCluster auto-scaling

Clusters deployed with CfnCluster are elastic in several ways. The first is by simply setting the initial_queue_size and max_queue_size parameters of a cluster’s settings. initial_queue_size sets the minimum size of the ComputeFleet Auto Scaling group (ASG) and also its desired capacity. max_queue_size sets the maximum size of the ComputeFleet ASG. As part of each cluster, two Amazon CloudWatch alarms are created. These alarms monitor a custom Amazon CloudWatch metric[1] published by the MasterServer of each cluster; this is the second elastic element of CfnCluster. The metric is called pending, is created per stack, and is unique to each cluster. The CloudWatch alarms call ScaleUp policies associated with the ComputeFleet ASG, and this is what handles the automatic addition of compute nodes when there are pending tasks in the cluster. The cluster can even start with zero compute nodes and scale up until the alarms no longer trigger or max_queue_size is reached.

Within Auto Scaling, there is typically an Amazon CloudWatch alarm to remove instances when they are no longer needed. Such an alarm operates on an aggregate metric such as CPU or network: when the aggregate metric falls below a certain level, it calls a ScaleDown policy. The decision of which instance to remove is complex[2] and is not aware of individual instance utilization. For that reason, each of the instances in the ComputeFleet ASG runs a process called nodewatcher[3]. Its purpose is to monitor the instance and, if it is idle AND close to the end of the current instance hour, remove it from the ComputeFleet ASG. It specifically calls the TerminateInstanceInAutoScalingGroup[4] API, which removes an instance as long as the size of the ASG is larger than the desired capacity. This is what handles the scale-down of the cluster without affecting running jobs, and it also enables an elastic cluster with a fixed base number of instances.
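
For illustration only, the equivalent call with the aws CLI would look like the following (the instance ID is a placeholder; nodewatcher makes this call through the API rather than the CLI, and the flag shown leaves the ASG desired capacity untouched, matching the behavior described above):

$ aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0123456789abcdef0 \
    --no-should-decrement-desired-capacity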

The value of auto scaling is the same for HPC as for any other workload; the only difference here is that CfnCluster has code to make it interact in a more intelligent manner. If a static cluster is required, set the initial_queue_size and max_queue_size parameters to the required cluster size, and also set the maintain_initial_size parameter to true. This causes the ComputeFleet ASG to have the same value for its minimum, maximum, and desired capacity, as shown in the example below.
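
For example, a fixed five-node compute fleet would be configured like this:

[cluster default]
initial_queue_size = 5
max_queue_size = 5
maintain_initial_size = true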

Tutorials

Here you can find tutorials and best-practice guides for getting started with CfnCluster.

Running your first job on cfncluster

This tutorial will walk you through running your first “Hello World” job on cfncluster.

If you haven’t yet, you will need to follow the getting started guide to install cfncluster and configure your CLI.

Verifying your installation

First, we’ll verify that cfncluster is correctly installed and configured.

$ cfncluster version

This should return the running version of cfncluster. If it gives you a message about configuration, you will need to run the following to configure cfncluster.

$ cfncluster configure

Creating your First Cluster

Now it’s time to create our first cluster. Because our workload isn’t performance intensive, we will use the default instance size of t2.micro. For production workloads, you’ll want to choose an instance size that better fits your needs.

We’re going to call our cluster “hello-world”.

$ cfncluster create hello-world

You’ll see some messages on your screen about the cluster being created. When it’s finished, it will provide the following output:

Starting: hello-world
Status: cfncluster-hello-world - CREATE_COMPLETE
Output:"MasterPrivateIP"="192.168.x.x"
Output:"MasterPublicIP"="54.148.x.x"
Output:"GangliaPrivateURL"="http://192.168.x.x/ganglia/"
Output:"GangliaPublicURL"="http://54.148.x.x/ganglia/"

The message “CREATE_COMPLETE” shows that the cluster was created successfully. It also provided us with the public and private IP addresses of our master node. We’ll need the public IP to log in.

Logging into your Master instance

You’ll use your OpenSSH pem file and the ec2-user account to log into your master instance.

ssh -i /path/to/keyfile.pem ec2-user@54.148.x.x

Once logged in, run the command “qhost” to ensure that your compute nodes are set up and configured.

[ec2-user@ip-192-168-1-86 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-192-168-1-125        lx-amd64        2    1    2    2  0.15    3.7G  130.8M 1024.0M     0.0
ip-192-168-1-126        lx-amd64        2    1    2    2  0.15    3.7G  130.8M 1024.0M     0.0

As you can see, we have two compute nodes in our cluster, each with 2 threads available.

Running your first job

Now we’ll create a simple job which sleeps for a little while and then outputs its own hostname.

Create a file called “hellojob.sh” with the following contents.

#!/bin/bash
sleep 30
echo "Hello World from $(hostname)"

Next, submit the job using “qsub” and ensure it runs.

$ qsub hellojob.sh
Your job 1 ("hellojob.sh") has been submitted

Now, you can view your queue and check the status of the job.

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.55500 hellojob.s ec2-user     r     03/24/2015 22:23:48 all.q@ip-192-168-1-125.us-west     1

The job is currently in a running state. Wait 30 seconds for the job to finish and run qstat again.

$ qstat
$

Now that there are no jobs in the queue, we can check for output in our current directory.

$ ls -l
total 8
-rw-rw-r-- 1 ec2-user ec2-user 48 Mar 24 22:34 hellojob.sh
-rw-r--r-- 1 ec2-user ec2-user  0 Mar 24 22:34 hellojob.sh.e1
-rw-r--r-- 1 ec2-user ec2-user 34 Mar 24 22:34 hellojob.sh.o1

Here, we see our job script, an “e1” file, and an “o1” file. Since the e1 file is empty, there was no output to stderr. If we view the .o1 file, we can see the output from our job.

$ cat hellojob.sh.o1
Hello World from ip-192-168-1-125

We can see that our job ran successfully on instance “ip-192-168-1-125”.

Building a custom CfnCluster AMI

Warning

Building a custom AMI is not the recommended approach for customizing CfnCluster.

Once you build your own AMI, you will no longer receive updates or bug fixes with future releases of CfnCluster. You will need to repeat the steps used to create your custom AMI with each new CfnCluster release.

Before reading any further, take a look at the Custom Bootstrap Actions section of the documentation to determine if the modifications you wish to make can be scripted and supported with future CfnCluster releases.

While not ideal, there are a number of scenarios where building a custom AMI for CfnCluster is necessary. This tutorial will guide you through the process.

How to customize the CfnCluster AMI

The base CfnCluster AMI is often updated with new releases. This AMI has all of the components required for CfnCluster to function installed and configured. If you wish to customize an AMI for CfnCluster, you must start with this as the base.

  1. Find the AMI which corresponds with the region you will be utilizing in the list here: https://github.com/awslabs/cfncluster/blob/master/amis.txt.

  2. Within the EC2 Console, choose “Launch Instance”.

  3. Navigate to “Community AMIs”, and enter the AMI id for your region into the search box.

  4. Select the AMI, choose your instance type and properties, and launch your instance.

  5. Log into your instance using the ec2-user and your SSH key.

  6. Customize your instance as required.

  7. Run the following command to prepare your instance for AMI creation:

    sudo /usr/local/sbin/ami_cleanup.sh
    
  8. Stop the instance

  9. Create a new AMI from the instance

  10. Enter the AMI id in the custom_ami field within your cluster configuration.

Setting Up an AMI Development Environment

Warning

Building a custom AMI is not the recommended approach for customizing CfnCluster.

Once you build your own AMI, you will no longer receive updates or bug fixes with future releases of CfnCluster. You will need to repeat the steps used to create your custom AMI with each new CfnCluster release.

Before reading any further, take a look at the Custom Bootstrap Actions section of the documentation to determine if the modifications you wish to make can be scripted and supported with future CfnCluster releases.

Steps

This guide is written assuming your OS is Ubuntu 14.04. If you don’t have an Ubuntu machine you can easily get an EC2 instance running Ubuntu.

  1. sudo apt-get -y install build-essential git

  2. Go to https://downloads.chef.io/chef-dk, grab the latest version for your OS and install.

    For example:

    wget https://packages.chef.io/stable/ubuntu/12.04/chefdk_0.17.17-1_amd64.deb
    sudo dpkg -i chefdk_0.17.17-1_amd64.deb
    
  3. git clone https://github.com/awslabs/cfncluster-cookbook

  4. Grab the latest go-lang link from https://golang.org/dl/

  5. Run the following:

    wget https://storage.googleapis.com/golang/go1.7.linux-amd64.tar.gz
    cd /usr/local
    sudo tar xf ~/go1.7.linux-amd64.tar.gz
    echo 'export GOPATH=~/work' >> ~/.bashrc
    echo 'export PATH=$GOPATH/bin:/usr/local/go/bin:$PATH' >> ~/.bashrc
    . ~/.bashrc
    
  6. Install packer from source

    go get github.com/mitchellh/packer
    

The next part of setting up your environment involves setting a number of environment variables. You can either set them as each one is explained below, or use the script provided at the bottom.

  1. Set your AWS key pair name and path. If you don’t have a key pair, create one.

    export AWS_KEYPAIR_NAME=your-aws-keypair                # Name of your key pair
    export EC2_SSH_KEY_PATH=~/.ssh/your-aws-keypair # Path to your key pair
    
  2. Set the AWS instance type you’d like to launch.

    export AWS_FLAVOR_ID=c3.4xlarge
    
  3. Set the availability zone and region:

    export AWS_AVAILABILITY_ZONE=us-east-1c
    export AWS_DEFAULT_REGION=us-east-1
    
  4. Create an AWS VPC in that region:

    export AWS_VPC_ID=vpc-XXXXXXXXX
    
  5. Create a subnet in that region and set it below:

    export AWS_SUBNET_ID=subnet-XXXXXXXX
    
  6. Create a security group and set it:

    export AWS_SECURITY_GROUP_ID=sg-XXXXXXXX
    
  7. Create an IAM Profile from the template here.

    export AWS_IAM_PROFILE=CfnClusterEC2IAMRole             # IAM Role name
    
  8. Set the path to your kitchen yaml file. Note that this file comes with cfncluster-cookbook.

    export KITCHEN_LOCAL_YAML=.kitchen.cloud.yml
    
  9. Create a 10G EBS-backed volume in the same availability zone:

    export CFN_VOLUME=vol-XXXXXXXX  # create 10G EBS volume in same AZ
    
  10. Set the stack name.

    export AWS_STACK_NAME=cfncluster-test-kitchen
    
  11. Create an SQS queue:

    export CFN_SQS_QUEUE=cfncluster-chef                    # create an SQS queue
    
  12. Create a DynamoDB table with a hash key instanceId of type String, name it cfncluster-chef, and then export the following:

    export CFN_DDB_TABLE=cfncluster-chef  # setup table as cfncluster-chef
    
  13. You should now be able to run the following:

    kitchen list
    
  14. If something isn’t working, you can run:

    kitchen diagnose all
    

Here’s a script to do all of the above; just fill out the fields and source it like: . ~/path/to/script

export AWS_KEYPAIR_NAME=your-aws-keypair                # Name of your key pair
export EC2_SSH_KEY_PATH=~/.ssh/your-aws-keypair.pem     # Path to your key pair
export AWS_FLAVOR_ID=c3.4xlarge
export AWS_DEFAULT_REGION=us-east-1
export AWS_AVAILABILITY_ZONE=us-east-1c
export AWS_VPC_ID=vpc-XXXXXXXX
export AWS_SUBNET_ID=subnet-XXXXXXXX
export AWS_SECURITY_GROUP_ID=sg-XXXXXXXX
export AWS_IAM_PROFILE=CfnClusterEC2IAMRole     # create role using IAM docs for CfnCluster
export KITCHEN_LOCAL_YAML=.kitchen.cloud.yml
export CFN_VOLUME=vol-XXXXXXXX                                  # create 10G EBS volume in same AZ
export AWS_STACK_NAME=cfncluster-test-kitchen
export CFN_SQS_QUEUE=cfncluster-chef                    # create an SQS queue
export CFN_DDB_TABLE=cfncluster-chef                    # setup table as cfncluster-chef

Getting Started

If you’ve never used CfnCluster before, you should read the Getting Started with CfnCluster guide to get familiar with cfncluster and its usage.