AWS ParallelCluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool that makes it easy for you to deploy and manage High Performance Computing (HPC) clusters in the AWS cloud. Built on the Open Source CfnCluster project, AWS ParallelCluster enables you to quickly build an HPC compute environment in AWS. It automatically sets up the required compute resources and a shared filesystem and offers a variety of batch schedulers such as AWS Batch, SGE, Torque, and Slurm. AWS ParallelCluster facilitates both quick start proof of concepts (POCs) and production deployments. You can build higher level workflows, such as a Genomics portal that automates the entire DNA sequencing workflow, on top of AWS ParallelCluster.

Getting started with AWS ParallelCluster

Installing AWS ParallelCluster

The current working version is aws-parallelcluster-2.1. The CLI is written in Python and uses Boto for AWS actions. You can install the CLI with the following commands, depending on your OS.

Linux/OSX

$ sudo pip install aws-parallelcluster

Windows

Note: Windows support is experimental.

Install Python for Windows (the paths below assume Python 3.6), which includes pip.

Once installed, update the Environment Variables so that the Python install directory and the Python Scripts directory are in the PATH, for example: C:\Python36-32;C:\Python36-32\Scripts

Now it should be possible to run the following within a command prompt window:

C:\> pip install aws-parallelcluster

Upgrading

To upgrade an older version of AWS ParallelCluster, use the following command (matching the way it was originally installed):

$ sudo pip install --upgrade aws-parallelcluster

Remember when upgrading to check that the existing config is compatible with the latest version installed.
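
After upgrading, you can confirm the installed version with:

$ pcluster version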

Configuring AWS ParallelCluster

Once installed, you will need to set up some initial configuration. The easiest way to do this is to run the configure wizard:

$ pcluster configure

This configure wizard will prompt you for everything you need to create your cluster. You will first be prompted for your cluster template name, which is the logical name of the template you will create a cluster from.

Cluster Template [mycluster]:

Next, you will be prompted for your AWS Access & Secret Keys. Enter the keys for an IAM user with administrative privileges. These can also be read from your environment variables or the AWS CLI config.

AWS Access Key ID []:
AWS Secret Access Key ID []:

Now, you will be presented with a list of valid AWS region identifiers. Choose the region in which you’d like your cluster to run.

Acceptable Values for AWS Region ID:
    us-east-1
    cn-north-1
    ap-northeast-1
    eu-west-1
    ap-southeast-1
    ap-southeast-2
    us-west-2
    us-gov-west-1
    us-gov-east-1
    us-west-1
    eu-central-1
    sa-east-1
AWS Region ID []:

Choose a descriptive name for your VPC. Typically, this will be something like production or test.

VPC Name [myvpc]:

Next, you will need to choose a key pair that already exists in EC2 in order to log into your master instance. If you do not already have a key pair, refer to the EC2 documentation on EC2 Key Pairs.

Acceptable Values for Key Name:
    keypair1
    keypair-test
    production-key
Key Name []:

Choose the VPC ID into which you’d like your cluster launched.

Acceptable Values for VPC ID:
    vpc-1kd24879
    vpc-blk4982d
VPC ID []:

Finally, choose the subnet in which you’d like your master server to run.

Acceptable Values for Master Subnet ID:
    subnet-9k284a6f
    subnet-1k01g357
    subnet-b921nv04
Master Subnet ID []:

The simple cluster launches into a VPC and uses an existing subnet that supports public IP addresses, i.e. the route table for the subnet is 0.0.0.0/0 => igw-xxxxxx. The VPC must have DNS Resolution = yes and DNS Hostnames = yes. It should also have DHCP options with the correct domain-name for the region, as defined in the docs: VPC DHCP Options.

Once all of those settings contain valid values, you can launch the cluster by running the create command:

$ pcluster create mycluster

Once the cluster reaches the “CREATE_COMPLETE” status, you can connect using your normal SSH client/settings. For more details on connecting to EC2 instances, check the EC2 User Guide.
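
For example, once the cluster is up you can connect with the built-in ssh wrapper described later in this guide:

$ pcluster ssh mycluster -i ~/.ssh/id_rsa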

Moving from CfnCluster to AWS ParallelCluster

AWS ParallelCluster is an enhanced and productized version of CfnCluster.

If you are a previous CfnCluster user, we encourage you to start using and creating new clusters only with AWS ParallelCluster. Although you can still use CfnCluster, it will no longer be developed.

The main differences between CfnCluster and AWS ParallelCluster are listed below.


AWS ParallelCluster CLI manages a different set of clusters

Clusters created by the cfncluster CLI cannot be managed with the pcluster CLI. The following commands will no longer work on clusters created by CfnCluster:

pcluster list
pcluster update cluster_name
pcluster start cluster_name
pcluster status cluster_name

You need to use the cfncluster CLI to manage your old clusters.

If you need an old CfnCluster package to manage your old clusters, we recommend you install and use it from a Python Virtual Environment.
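
As a minimal sketch (assuming the virtualenv tool is installed and the legacy package is still published on PyPI as cfncluster), the old CLI can be kept isolated like this:

$ virtualenv ~/cfncluster-venv
$ source ~/cfncluster-venv/bin/activate
(cfncluster-venv) $ pip install cfncluster
(cfncluster-venv) $ cfncluster list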


Distinct IAM Custom Policies

Custom IAM Policies, previously used for CfnCluster cluster creation, cannot be used with AWS ParallelCluster. If you require custom policies, you need to create new ones by following the IAM in AWS ParallelCluster guide.


Different configuration files

The AWS ParallelCluster configuration file resides in the ~/.parallelcluster folder, unlike the CfnCluster one that was created in the ~/.cfncluster folder.

You can still use your existing configuration file but this needs to be moved from ~/.cfncluster/config to ~/.parallelcluster/config.
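
For example, on Linux/OSX the move can be done with:

$ mkdir -p ~/.parallelcluster
$ cp ~/.cfncluster/config ~/.parallelcluster/config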

If you use the extra_json configuration parameter, it must be changed as described below:

extra_json = { "cfncluster" : { } }

has been changed to

extra_json = { "cluster" : { } }


Ganglia disabled by default

Ganglia is disabled by default. You can enable it by setting the extra_json parameter as described below:

extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }

and changing the Master Security Group to allow connections to port 80. The parallelcluster-<CLUSTER_NAME>-MasterSecurityGroup-<xxx> Security Group has to be modified by adding a new Security Group Rule that allows inbound connections to port 80 from your public IP.
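
As a sketch (the Security Group ID and IP address are placeholders), the rule can be added with the AWS CLI:

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxx \
    --protocol tcp --port 80 \
    --cidr <YOUR_PUBLIC_IP>/32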

Working with AWS ParallelCluster

AWS ParallelCluster CLI commands

pcluster is the AWS ParallelCluster CLI. It lets you launch and manage HPC clusters in the AWS cloud.

usage: pcluster [-h]
                {create,update,delete,start,stop,status,list,instances,ssh,createami,configure,version}
                ...

Positional Arguments

command Possible choices: create, update, delete, start, stop, status, list, instances, ssh, createami, configure, version

Sub-commands:

create

Creates a new cluster.

pcluster create [-h] [-c CONFIG_FILE] [-r REGION] [-nw] [-nr]
                [-u TEMPLATE_URL] [-t CLUSTER_TEMPLATE] [-p EXTRA_PARAMETERS]
                [-g TAGS]
                cluster_name
Positional Arguments
cluster_name name for the cluster. The CloudFormation Stack name will be parallelcluster-[cluster_name]
Named Arguments
-c, --config  alternative config file
-r, --region  region to connect to
-nw, --nowait  do not wait for stack events after executing stack command. Default: False
-nr, --norollback  disable stack rollback on error. Default: False
-u, --template-url  specify URL for the custom CloudFormation template, if it has been used at creation time
-t, --cluster-template  cluster template to use
-p, --extra-parameters  add extra parameters to stack create
-g, --tags  tags to be added to the stack

When the command is called and it starts polling for the status of that call, it is safe to "Ctrl-C" out. You can always return to that status by calling "pcluster status mycluster".

Examples:

$ pcluster create mycluster
$ pcluster create mycluster --tags '{ "Key1" : "Value1" , "Key2" : "Value2" }'
update

Updates a running cluster by using the values in the config file or a provided TEMPLATE_URL.

pcluster update [-h] [-c CONFIG_FILE] [-r REGION] [-nw] [-nr]
                [-u TEMPLATE_URL] [-t CLUSTER_TEMPLATE] [-p EXTRA_PARAMETERS]
                [-rd]
                cluster_name
Positional Arguments
cluster_name name of the cluster to update
Named Arguments
-c, --config  alternative config file
-r, --region  region to connect to
-nw, --nowait  do not wait for stack events after executing stack command. Default: False
-nr, --norollback  disable CloudFormation Stack rollback on error. Default: False
-u, --template-url  URL for a custom CloudFormation template
-t, --cluster-template  specific cluster template to use
-p, --extra-parameters  add extra parameters to stack update
-rd, --reset-desired  reset the current ASG desired capacity to initial config values. Default: False

When the command is called and it starts polling for the status of that call, it is safe to "Ctrl-C" out. You can always return to that status by calling "pcluster status mycluster".

delete

Deletes a cluster.

pcluster delete [-h] [-c CONFIG_FILE] [-r REGION] [-nw] cluster_name
Positional Arguments
cluster_name name of the cluster to delete
Named Arguments
-c, --config  alternative config file
-r, --region  region to connect to
-nw, --nowait  do not wait for stack events after executing stack command. Default: False

When the command is called and it starts polling for the status of that call, it is safe to "Ctrl-C" out. You can always return to that status by calling "pcluster status mycluster".

start

Starts the compute fleet for a cluster that has been stopped.

pcluster start [-h] [-c CONFIG_FILE] [-r REGION] cluster_name
Positional Arguments
cluster_name starts the compute fleet of the provided cluster name
Named Arguments
-c, --config alternative config file
-r, --region region to connect to

This command sets the Auto Scaling Group parameters to either the initial configuration values (max_queue_size and initial_queue_size) from the template that was used to create the cluster or to the configuration values that were used to update the cluster since creation.

stop

Stops the compute fleet, leaving the master server running.

pcluster stop [-h] [-c CONFIG_FILE] [-r REGION] cluster_name
Positional Arguments
cluster_name stops the compute fleet of the provided cluster name
Named Arguments
-c, --config alternative config file
-r, --region region to connect to

Sets the Auto Scaling Group parameters to min/max/desired = 0/0/0 and terminates the compute fleet. The master will remain running. To terminate all EC2 resources and avoid EC2 charges, consider deleting the cluster.
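
For example, to release compute capacity when it is not needed and bring it back later:

$ pcluster stop mycluster
$ pcluster start mycluster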

status

Pulls the current status of the cluster.

pcluster status [-h] [-c CONFIG_FILE] [-r REGION] [-nw] cluster_name
Positional Arguments
cluster_name Shows the status of the cluster with the provided name.
Named Arguments
-c, --config  alternative config file
-r, --region  region to connect to
-nw, --nowait  do not wait for stack events after executing stack command. Default: False

list

Displays a list of stacks associated with AWS ParallelCluster.

pcluster list [-h] [-c CONFIG_FILE] [-r REGION]
Named Arguments
-c, --config alternative config file
-r, --region region to connect to

Lists the Stack Name of the CloudFormation stacks named parallelcluster-*

instances

Displays a list of all instances in a cluster.

pcluster instances [-h] [-c CONFIG_FILE] [-r REGION] cluster_name
Positional Arguments
cluster_name Display the instances for the cluster with the provided name.
Named Arguments
-c, --config alternative config file
-r, --region region to connect to
ssh

Runs an ssh command to the master node with the username and IP address pre-filled. Arbitrary arguments are appended to the end of the ssh command. This command may be customized in the aliases section of the config file.

pcluster ssh [-h] [-d] cluster_name
Positional Arguments
cluster_name name of the cluster to connect to
Named Arguments
-d, --dryrun  print command and exit. Default: False

Example:

$ pcluster ssh mycluster -i ~/.ssh/id_rsa

results in an ssh command with username and IP address pre-filled:

$ ssh ec2-user@1.1.1.1 -i ~/.ssh/id_rsa

The ssh command is defined in the global config file, under the aliases section, and can be customized:

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Variables substituted:

{CFN_USER}
{MASTER_IP}
{ARGS} (only if specified on the cli)
createami

(Linux/OSX) Creates a custom AMI to use with AWS ParallelCluster.

pcluster createami [-h] -ai BASE_AMI_ID -os BASE_AMI_OS
                   [-ap CUSTOM_AMI_NAME_PREFIX] [-cc CUSTOM_AMI_COOKBOOK]
                   [-c CONFIG_FILE] [-r REGION]
Named Arguments
-ai, --ami-id  specify the base AMI to use for building the AWS ParallelCluster AMI
-os, --os  specify the OS of the base AMI. Valid values are alinux, ubuntu1404, ubuntu1604, centos6 or centos7
-ap, --ami-name-prefix  specify the prefix name of the resulting AWS ParallelCluster AMI. Default: "custom-ami-"
-cc, --custom-cookbook  specify the cookbook to use to build the AWS ParallelCluster AMI
-c, --config  alternative config file
-r, --region  region to connect to
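
For example, the following builds a custom Amazon Linux AMI from a base AMI (the AMI ID shown is a placeholder):

$ pcluster createami -ai ami-xxxxxxxx -os alinux
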
configure

Start initial AWS ParallelCluster configuration.

pcluster configure [-h] [-c CONFIG_FILE]
Named Arguments
-c, --config alternative config file
version

Display version of AWS ParallelCluster.

pcluster version [-h]

For command-specific flags, run "pcluster [command] --help".

Network Configurations

AWS ParallelCluster leverages Amazon Virtual Private Cloud (VPC) for networking. This provides a very flexible and configurable networking platform in which to deploy clusters. AWS ParallelCluster supports the following high-level configurations:

  • Single subnet for both master and compute instances
  • Two subnets, master in one public subnet and compute instances in a private subnet (new or already existing)

All of these configurations can operate with or without public IP addressing. They can also be deployed to leverage an HTTP proxy for all AWS requests. The combinations of these configurations result in many different deployment scenarios, ranging from a single public subnet with all access over the Internet, to fully private via AWS Direct Connect and an HTTP proxy for all traffic.

Below are some architecture diagrams for some of those scenarios:

AWS ParallelCluster in a single public subnet

AWS ParallelCluster single subnet

The configuration for this architecture requires the following settings:

[vpc public]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-<public>

AWS ParallelCluster using two subnets

AWS ParallelCluster two subnets

The configuration to create a new private subnet for compute instances requires the following settings:

note that all values are examples only

[vpc public-private-new]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-<public>
compute_subnet_cidr = 10.0.1.0/24

The configuration to use an existing private network requires the following settings:

[vpc public-private-existing]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-<public>
compute_subnet_id = subnet-<private>

Both of these configurations require a NAT Gateway or an internal proxy to enable web access for the compute instances.
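
As a rough sketch (all resource IDs are placeholders), a NAT Gateway for the private compute subnet can be created with the AWS CLI and used as the default route in the private route table:

$ aws ec2 allocate-address --domain vpc
$ aws ec2 create-nat-gateway --subnet-id subnet-<public> --allocation-id eipalloc-xxxxxxxx
$ aws ec2 create-route --route-table-id rtb-<private> --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-xxxxxxxxxxxxxxxxx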

AWS ParallelCluster in a single private subnet connected using Direct Connect

AWS ParallelCluster private with DX

The configuration for this architecture requires the following settings:

[cluster private-proxy]
proxy_server = http://proxy.corp.net:8080

[vpc private-proxy]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-<private>
use_public_ips = false

With use_public_ips set to false, the VPC must be correctly set up to use the proxy for all traffic. Web access is required for both master and compute instances.

AWS ParallelCluster with awsbatch scheduler

When using awsbatch as the scheduler type, ParallelCluster creates an AWS Batch Managed Compute Environment, which takes care of managing the ECS container instances launched in the compute_subnet. In order for AWS Batch to function correctly, Amazon ECS container instances need external network access to communicate with the Amazon ECS service endpoint. This translates into the following scenarios:

  • The compute_subnet uses a NAT Gateway to access the Internet. (Recommended approach)
  • Instances launched in the compute_subnet have public IP addresses and can reach the Internet through an Internet Gateway.

Additionally, if you are interested in Multi-node Parallel jobs (according to AWS Batch docs):

AWS Batch multi-node parallel jobs use the Amazon ECS awsvpc network mode, which gives your multi-node parallel job containers the same networking properties as Amazon EC2 instances. Each multi-node parallel job container gets its own elastic network interface, a primary private IP address, and an internal DNS hostname. The network interface is created in the same VPC subnet as its host compute resource. Any security groups that are applied to your compute resources are also applied to it.

When using ECS Task Networking, the awsvpc network mode does not provide task elastic network interfaces with public IP addresses for tasks that use the EC2 launch type. To access the Internet, tasks that use the EC2 launch type must be launched in a private subnet that is configured to use a NAT Gateway.

This leaves us with the only option of configuring a NAT Gateway in order to enable the cluster to execute Multi-node Parallel Jobs.

AWS ParallelCluster networking with awsbatch scheduler

Additional details can be found in the official AWS Batch documentation.

Custom Bootstrap Actions

AWS ParallelCluster can execute arbitrary code either before (pre-install) or after (post-install) the main bootstrap action during cluster creation. This code is typically stored in S3 and accessed via HTTP(S) during cluster creation. The code will be executed as root and can be in any script language supported by the cluster OS, typically bash or python.

pre-install actions are called before any cluster deployment bootstrap such as configuring NAT, EBS and the scheduler. Typical pre-install actions may include modifying storage, adding extra users or packages.

post-install actions are called after cluster bootstrap is complete, as the last action before an instance is considered complete. Typical post-install actions may include changing scheduler settings, modifying storage or packages.

Arguments can be passed to scripts by specifying them in the config. These will be passed double-quoted to the pre/post-install actions.

If a pre/post-install action fails, the instance bootstrap is considered failed and will not continue. Success is signalled with an exit code of 0; any other exit code is considered a failure.

Configuration

The following config settings are used to define pre/post-install actions and arguments. All options are optional and are not required for basic cluster install.

# URL to a preinstall script. This is executed before any of the boot_as_* scripts are run
# (defaults to NONE)
pre_install = NONE
# Arguments to be passed to preinstall script
# (defaults to NONE)
pre_install_args = NONE
# URL to a postinstall script. This is executed after any of the boot_as_* scripts are run
# (defaults to NONE)
post_install = NONE
# Arguments to be passed to postinstall script
# (defaults to NONE)
post_install_args = NONE

Arguments

The first two arguments $0 and $1 are reserved for the script name and url.

$0 => the script name
$1 => s3 url
$n => args set by pre/post_install_args
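
As a sketch only (the package names here are hypothetical and would come from pre/post_install_args), a script can consume its custom arguments starting at $2:

#!/bin/bash
# $1 is the URL this script was fetched from; user arguments start at $2
echo "script fetched from: $1"
for pkg in "${@:2}"; do
    yum -y install "$pkg"
done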

Example

The following steps create a simple post-install script that installs R on the cluster's instances.

  1. Create a script. For the R example, see below:

#!/bin/bash

yum -y install --enablerepo=epel R

  2. Upload the script with the correct permissions to S3:

aws s3 cp --acl public-read /path/to/myscript.sh s3://<bucket-name>/myscript.sh

  3. Update the AWS ParallelCluster config to include the new post-install action:

[cluster default]
...
post_install = https://<bucket-name>.s3.amazonaws.com/myscript.sh

If the bucket does not have public-read permission, use s3 as the URL scheme:

[cluster default]
...
post_install = s3://<bucket-name>/myscript.sh

  4. Launch a cluster:

pcluster create mycluster

Working with S3

Accessing S3 within AWS ParallelCluster can be controlled through two parameters in the AWS ParallelCluster config.

# Specify the S3 resource to which AWS ParallelCluster nodes will be granted read-only access
# (defaults to NONE)
s3_read_resource = NONE
# Specify the S3 resource to which AWS ParallelCluster nodes will be granted read-write access
# (defaults to NONE)
s3_read_write_resource = NONE

Both parameters accept either * or a valid S3 ARN. For details of how to specify S3 ARNs, please see http://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html#arn-syntax-s3

Examples

The following example gives you read access to any object in the bucket my_corporate_bucket.

s3_read_resource = arn:aws:s3:::my_corporate_bucket/*

This next example gives you read access to the bucket itself (for example, to list its contents), but does not let you read the objects stored in the bucket.

s3_read_resource = arn:aws:s3:::my_corporate_bucket
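
Similarly, read-write access can be scoped to a specific prefix within a bucket, for example:

s3_read_write_resource = arn:aws:s3:::my_corporate_bucket/Development/*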

IAM in AWS ParallelCluster

AWS ParallelCluster utilizes multiple AWS services to deploy and operate a cluster. The services used are listed in the AWS Services used in AWS ParallelCluster section of the documentation.

AWS ParallelCluster uses EC2 IAM roles to enable instance access to AWS services for the deployment and operation of the cluster. By default, the EC2 IAM role is created as part of the cluster creation by CloudFormation. This means that the user creating the cluster must have the appropriate level of permissions.

Defaults

When using defaults, during cluster launch an EC2 IAM Role is created by the cluster, as well as all the resources required to launch the cluster. The user calling the create call must have the right level of permissions to create all the resources including an EC2 IAM Role. This level of permissions is typically an IAM user with the AdministratorAccess managed policy. More details on managed policies can be found here: http://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html#aws-managed-policies

Using an existing EC2 IAM role

When using AWS ParallelCluster with an existing EC2 IAM role, you must first define the IAM policy and role before attempting to launch the cluster. Typically the reason for using an existing EC2 IAM role within AWS ParallelCluster is to reduce the permissions granted to users launching clusters. Below are example IAM policies for both the EC2 IAM role and the AWS ParallelCluster IAM user. You should create both as individual policies in IAM and then attach them to the appropriate resources. In both policies, you should replace REGION and AWS ACCOUNT ID with the appropriate values.
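
As a sketch (the role, policy, and user names below are examples only, and the trust policy file is assumed to allow ec2.amazonaws.com to assume the role), the policies can be created and attached with the AWS CLI:

$ aws iam create-role --role-name ParallelClusterInstanceRole \
    --assume-role-policy-document file://ec2-trust-policy.json
$ aws iam put-role-policy --role-name ParallelClusterInstanceRole \
    --policy-name ParallelClusterInstancePolicy \
    --policy-document file://ParallelClusterInstancePolicy.json
$ aws iam create-policy --policy-name ParallelClusterUserPolicy \
    --policy-document file://ParallelClusterUserPolicy.json
$ aws iam attach-user-policy --user-name cluster-admin \
    --policy-arn arn:aws:iam::<AWS ACCOUNT ID>:policy/ParallelClusterUserPolicy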

ParallelClusterInstancePolicy

In case you are using SGE, Slurm or Torque as a scheduler:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "ec2:DescribeVolumes",
                "ec2:AttachVolume",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeInstanceStatus",
                "ec2:DescribeInstances",
                "ec2:DescribeRegions"
            ],
            "Sid": "EC2",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "dynamodb:ListTables"
            ],
            "Sid": "DynamoDBList",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:parallelcluster-*"
            ],
            "Action": [
                "sqs:SendMessage",
                "sqs:ReceiveMessage",
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueUrl"
            ],
            "Sid": "SQSQueue",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:DescribeTags",
                "autoScaling:UpdateAutoScalingGroup"
            ],
            "Sid": "Autoscaling",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:dynamodb:<REGION>:<AWS ACCOUNT ID>:table/parallelcluster-*"
            ],
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:Query",
                "dynamodb:GetItem",
                "dynamodb:DeleteItem",
                "dynamodb:DescribeTable"
            ],
            "Sid": "DynamoDBTable",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:s3:::<REGION>-aws-parallelcluster/*"
            ],
            "Action": [
                "s3:GetObject"
            ],
            "Sid": "S3GetObj",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/parallelcluster-*"
            ],
            "Action": [
                "cloudformation:DescribeStacks"
            ],
            "Sid": "CloudFormationDescribe",
            "Effect": "Allow"
        },
        {
            "Resource": [
                "*"
            ],
            "Action": [
                "sqs:ListQueues"
            ],
            "Sid": "SQSList",
            "Effect": "Allow"
        }
    ]
}

In case you are using awsbatch as a scheduler, you need to include the same policies as the ones assigned to the BatchUserRole that is defined in the Batch CloudFormation nested stack. The BatchUserRole ARN is provided as a stack output. Here is an overview of the required permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "batch:SubmitJob",
                "batch:RegisterJobDefinition",
                "cloudformation:DescribeStacks",
                "ecs:ListContainerInstances",
                "ecs:DescribeContainerInstances",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "s3:PutObject",
                "s3:Get*",
                "s3:DeleteObject",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:batch:<REGION>:<AWS ACCOUNT ID>:job-definition/<AWS_BATCH_STACK - JOB_DEFINITION_SERIAL_NAME>:1",
                "arn:aws:batch:<REGION>:<AWS ACCOUNT ID>:job-definition/<AWS_BATCH_STACK - JOB_DEFINITION_MNP_NAME>*",
                "arn:aws:batch:<REGION>:<AWS ACCOUNT ID>:job-queue/<AWS_BATCH_STACK - JOB_QUEUE_NAME>",
                "arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/<STACK NAME>/*",
                "arn:aws:s3:::<RESOURCES S3 BUCKET>/batch/*",
                "arn:aws:iam::<AWS ACCOUNT ID>:role/<AWS_BATCH_STACK - JOB_ROLE>",
                "arn:aws:ecs:<REGION>:<AWS ACCOUNT ID>:cluster/<ECS COMPUTE ENVIRONMENT>",
                "arn:aws:ecs:<REGION>:<AWS ACCOUNT ID>:container-instance/*",
                "arn:aws:logs:<REGION>:<AWS ACCOUNT ID>:log-group:/aws/batch/job:log-stream:*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::<RESOURCES S3 BUCKET>"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "batch:DescribeJobQueues",
                "batch:TerminateJob",
                "batch:DescribeJobs",
                "batch:CancelJob",
                "batch:DescribeJobDefinitions",
                "batch:ListJobs",
                "batch:DescribeComputeEnvironments",
                "ec2:DescribeInstances"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

ParallelClusterUserPolicy

In case you are using SGE, Slurm or Torque as a scheduler:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EC2Describe",
            "Action": [
                "ec2:DescribeKeyPairs",
                "ec2:DescribeVpcs",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribePlacementGroups",
                "ec2:DescribeImages",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceStatus",
                "ec2:DescribeSnapshots",
                "ec2:DescribeVolumes",
                "ec2:DescribeVpcAttribute",
                "ec2:DescribeAddresses",
                "ec2:CreateTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeAvailabilityZones"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "EC2Modify",
            "Action": [
                "ec2:CreateVolume",
                "ec2:RunInstances",
                "ec2:AllocateAddress",
                "ec2:AssociateAddress",
                "ec2:AttachNetworkInterface",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CreateNetworkInterface",
                "ec2:CreateSecurityGroup",
                "ec2:ModifyVolumeAttribute",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteVolume",
                "ec2:TerminateInstances",
                "ec2:DeleteSecurityGroup",
                "ec2:DisassociateAddress",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:ReleaseAddress",
                "ec2:CreatePlacementGroup",
                "ec2:DeletePlacementGroup"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "AutoScalingDescribe",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeAutoScalingInstances"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "AutoScalingModify",
            "Action": [
                "autoscaling:CreateAutoScalingGroup",
                "autoscaling:CreateLaunchConfiguration",
                "ec2:CreateLaunchTemplate",
                "ec2:ModifyLaunchTemplate",
                "ec2:DeleteLaunchTemplate",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeLaunchTemplateVersions",
                "autoscaling:PutNotificationConfiguration",
                "autoscaling:UpdateAutoScalingGroup",
                "autoscaling:PutScalingPolicy",
                "autoscaling:DeleteLaunchConfiguration",
                "autoscaling:DescribeScalingActivities",
                "autoscaling:DeleteAutoScalingGroup",
                "autoscaling:DeletePolicy"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "DynamoDBDescribe",
            "Action": [
                "dynamodb:DescribeTable"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "DynamoDBModify",
            "Action": [
              "dynamodb:CreateTable",
              "dynamodb:DeleteTable"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SQSDescribe",
            "Action": [
                "sqs:GetQueueAttributes"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SQSModify",
            "Action": [
                "sqs:CreateQueue",
                "sqs:SetQueueAttributes",
                "sqs:DeleteQueue"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SNSDescribe",
            "Action": [
              "sns:ListTopics",
              "sns:GetTopicAttributes"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "SNSModify",
            "Action": [
                "sns:CreateTopic",
                "sns:Subscribe",
                "sns:DeleteTopic"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "CloudFormationDescribe",
            "Action": [
                "cloudformation:DescribeStackEvents",
                "cloudformation:DescribeStackResource",
                "cloudformation:DescribeStackResources",
                "cloudformation:DescribeStacks",
                "cloudformation:ListStacks",
                "cloudformation:GetTemplate"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "CloudFormationModify",
            "Action": [
                "cloudformation:CreateStack",
                "cloudformation:DeleteStack",
                "cloudformation:UpdateStack"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "S3ParallelClusterReadOnly",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<REGION>-aws-parallelcluster*"
            ]
        },
        {
            "Sid": "IAMModify",
            "Action": [
                "iam:PassRole",
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:GetRole",
                "iam:SimulatePrincipalPolicy"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:iam::<AWS ACCOUNT ID>:role/<PARALLELCLUSTER EC2 ROLE NAME>"
        },
        {
            "Sid": "IAMCreateInstanceProfile",
            "Action": [
                "iam:CreateInstanceProfile",
                "iam:DeleteInstanceProfile"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:iam::<AWS ACCOUNT ID>:instance-profile/*"
        },
        {
            "Sid": "IAMInstanceProfile",
            "Action": [
                "iam:AddRoleToInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:PutRolePolicy",
                "iam:DeleteRolePolicy"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Sid": "EFSDescribe",
            "Action": [
                "efs:DescribeMountTargets",
                "efs:DescribeMountTargetSecurityGroups",
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

In case you are using awsbatch as a scheduler:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EC2Describe",
      "Action": [
        "ec2:DescribeKeyPairs",
        "ec2:DescribeVpcs",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribePlacementGroups",
        "ec2:DescribeImages",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeSnapshots",
        "ec2:DescribeVolumes",
        "ec2:DescribeVpcAttribute",
        "ec2:DescribeAddresses",
        "ec2:CreateTags",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeAvailabilityZones"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "EC2Modify",
      "Action": [
        "ec2:CreateVolume",
        "ec2:RunInstances",
        "ec2:AllocateAddress",
        "ec2:AssociateAddress",
        "ec2:AttachNetworkInterface",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CreateNetworkInterface",
        "ec2:CreateSecurityGroup",
        "ec2:ModifyVolumeAttribute",
        "ec2:ModifyNetworkInterfaceAttribute",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteVolume",
        "ec2:TerminateInstances",
        "ec2:DeleteSecurityGroup",
        "ec2:DisassociateAddress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:ReleaseAddress",
        "ec2:CreatePlacementGroup",
        "ec2:DeletePlacementGroup"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "DynamoDB",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:CreateTable",
        "dynamodb:DeleteTable"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:dynamodb:<REGION>:<AWS ACCOUNT ID>:table/parallelcluster-*"
    },
    {
      "Sid": "CloudFormation",
      "Action": [
        "cloudformation:DescribeStackEvents",
        "cloudformation:DescribeStackResource",
        "cloudformation:DescribeStackResources",
        "cloudformation:DescribeStacks",
        "cloudformation:ListStacks",
        "cloudformation:GetTemplate",
        "cloudformation:CreateStack",
        "cloudformation:DeleteStack",
        "cloudformation:UpdateStack"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:cloudformation:<REGION>:<AWS ACCOUNT ID>:stack/parallelcluster-*"
    },
    {
      "Sid": "SQS",
      "Action": [
        "sqs:GetQueueAttributes",
        "sqs:CreateQueue",
        "sqs:SetQueueAttributes",
        "sqs:DeleteQueue"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "SQSQueue",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:ChangeMessageVisibility",
        "sqs:DeleteMessage",
        "sqs:GetQueueUrl"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:sqs:<REGION>:<AWS ACCOUNT ID>:parallelcluster-*"
    },
    {
      "Sid": "SNS",
      "Action": [
        "sns:ListTopics",
        "sns:GetTopicAttributes",
        "sns:CreateTopic",
        "sns:Subscribe",
        "sns:DeleteTopic"],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "IAMRole",
      "Action": [
        "iam:PassRole",
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:GetRole",
        "iam:SimulatePrincipalPolicy"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:iam::<AWS ACCOUNT ID>:role/parallelcluster-*"
    },
    {
      "Sid": "IAMInstanceProfile",
      "Action": [
        "iam:CreateInstanceProfile",
        "iam:DeleteInstanceProfile",
        "iam:GetInstanceProfile",
        "iam:PassRole"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:iam::<AWS ACCOUNT ID>:instance-profile/*"
    },
    {
      "Sid": "IAM",
      "Action": [
        "iam:AddRoleToInstanceProfile",
        "iam:RemoveRoleFromInstanceProfile",
        "iam:PutRolePolicy",
        "iam:DeleteRolePolicy",
        "iam:AttachRolePolicy",
        "iam:DetachRolePolicy"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "S3ResourcesBucket",
      "Action": ["s3:*"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::parallelcluster-*"]
    },
    {
      "Sid": "S3ParallelClusterReadOnly",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::<REGION>-aws-parallelcluster/*"]
    },
    {
      "Sid": "Lambda",
      "Action": [
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:GetFunctionConfiguration",
        "lambda:InvokeFunction",
        "lambda:AddPermission",
        "lambda:RemovePermission"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:lambda:<REGION>:<AWS ACCOUNT ID>:function:parallelcluster-*"
    },
    {
      "Sid": "Logs",
      "Effect": "Allow",
      "Action": ["logs:*"],
      "Resource": "arn:aws:logs:<REGION>:<AWS ACCOUNT ID>:*"
    },
    {
      "Sid": "CodeBuild",
      "Effect": "Allow",
      "Action": ["codebuild:*"],
      "Resource": "arn:aws:codebuild:<REGION>:<AWS ACCOUNT ID>:project/parallelcluster-*"
    },
    {
      "Sid": "ECR",
      "Effect": "Allow",
      "Action": ["ecr:*"],
      "Resource": "*"
    },
    {
      "Sid": "Batch",
      "Effect": "Allow",
      "Action": ["batch:*"],
      "Resource": "*"
    },
    {
      "Sid": "AmazonCloudWatchEvents",
      "Effect": "Allow",
      "Action": ["events:*"],
      "Resource": "*"
    }
  ]
}

AWS ParallelCluster Batch CLI Commands

The AWS ParallelCluster Batch CLI commands are automatically installed on the AWS ParallelCluster Master Node when the selected scheduler is awsbatch. The CLI uses the AWS Batch APIs and lets you submit and manage jobs and monitor jobs, queues, and hosts, mirroring traditional scheduler commands.

awsbsub

Submits jobs to the cluster’s Job Queue.

usage: awsbsub [-h] [-jn JOB_NAME] [-c CLUSTER] [-cf] [-w WORKING_DIR]
               [-pw PARENT_WORKING_DIR] [-if INPUT_FILE] [-p VCPUS]
               [-m MEMORY] [-e ENV] [-eb ENV_BLACKLIST] [-r RETRY_ATTEMPTS]
               [-t TIMEOUT] [-n NODES] [-a ARRAY_SIZE] [-d DEPENDS_ON]
               [command] [arguments [arguments ...]]
Positional Arguments
command  The command to submit (it must be available on the compute instances) or the file name to be transferred (see --command-file option). Default: read from standard input
arguments  Arguments for the command or the command-file (optional)
Named Arguments
-jn, --job-name  The name of the job. The first character must be alphanumeric, and up to 128 letters (uppercase and lowercase), numbers, hyphens, and underscores are allowed
-c, --cluster  Cluster to use
-cf, --command-file  Identifies that the command is a file to be transferred to the compute instances. Default: False
-w, --working-dir  The folder to use as the job working directory. If not specified, the job will be executed in the job-<AWS_BATCH_JOB_ID> subfolder of the user's home
-pw, --parent-working-dir  Parent folder for the job working directory. If not specified, it is the user's home. A subfolder named job-<AWS_BATCH_JOB_ID> will be created in it. Alternative to the --working-dir parameter
-if, --input-file  File to be transferred to the compute instances, in the job working directory. It can be expressed multiple times
-p, --vcpus  The number of vCPUs to reserve for the container. When used in conjunction with --nodes it identifies the number of vCPUs per node. Default: 1
-m, --memory  The hard limit (in MiB) of memory to present to the job. If your job attempts to exceed the memory specified here, the job is killed. Default: 128
-e, --env  Comma separated list of environment variable names to export to the Job environment. Use 'all' to export all the environment variables, except those listed in the --env-blacklist parameter and variables starting with the PCLUSTER_* and AWS_* prefixes.
-eb, --env-blacklist  Comma separated list of environment variable names to NOT export to the Job environment. Default: HOME, PWD, USER, PATH, LD_LIBRARY_PATH, TERM, TERMCAP
-r, --retry-attempts  The number of times to move a job to the RUNNABLE status. You may specify between 1 and 10 attempts. If the value of attempts is greater than one, the job is retried on failure until it has moved to RUNNABLE that many times. Default: 1
-t, --timeout  The time duration in seconds (measured from the job attempt's startedAt timestamp) after which AWS Batch terminates your jobs if they have not finished. It must be at least 60 seconds
-n, --nodes  The number of nodes to reserve for the job. It enables Multi-Node Parallel submission
-a, --array-size  The size of the array. It can be between 2 and 10,000. If you specify array properties for a job, it becomes an array job
-d, --depends-on  A semicolon separated list of dependencies for the job. A job can depend upon a maximum of 20 jobs. You can specify a SEQUENTIAL type dependency without specifying a job ID for array jobs so that each child array job completes sequentially, starting at index 0. You can also specify an N_TO_N type dependency with a job ID for array jobs so that each index child of this job must wait for the corresponding index child of each dependency to complete before it can begin. Syntax: jobId=<string>,type=<string>;...

awsbstat

Shows the jobs submitted in the cluster’s Job Queue.

usage: awsbstat [-h] [-c CLUSTER] [-s STATUS] [-e] [-d]
                [job_ids [job_ids ...]]
Positional Arguments
job_ids  A space separated list of job IDs to show in the output. If the job is a job array, all the children will be displayed. If a single job is requested, it will be shown in a detailed version
Named Arguments
-c, --cluster  Cluster to use
-s, --status  Comma separated list of job statuses to ask for; defaults to "active" jobs. Accepted values are: SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, FAILED, ALL. Default: "SUBMITTED,PENDING,RUNNABLE,STARTING,RUNNING"
-e, --expand-children  Expand jobs with children (array and MNP). Default: False
-d, --details  Show jobs details. Default: False

awsbout

Shows the output of the given Job.

usage: awsbout [-h] [-c CLUSTER] [-hd HEAD] [-t TAIL] [-s] [-sp STREAM_PERIOD]
               job_id
Positional Arguments
job_id  The job ID
Named Arguments
-c, --cluster  Cluster to use
-hd, --head  Gets the first <head> lines of the job output
-t, --tail  Gets the last <tail> lines of the job output
-s, --stream  Gets the job output and waits for additional output to be produced. It can be used in conjunction with --tail to start from the latest <tail> lines of the job output. Default: False
-sp, --stream-period  Sets the streaming period. Default: 5

awsbkill

Cancels/terminates jobs submitted in the cluster.

usage: awsbkill [-h] [-c CLUSTER] [-r REASON] job_ids [job_ids ...]
Positional Arguments
job_ids  A space separated list of job IDs to cancel/terminate
Named Arguments
-c, --cluster  Cluster to use
-r, --reason  A message to attach to the job that explains the reason for canceling it. Default: "Terminated by the user"

awsbqueues

Shows the Job Queue associated to the cluster.

usage: awsbqueues [-h] [-c CLUSTER] [-d] [job_queues [job_queues ...]]
Positional Arguments
job_queues  A space separated list of queue names to show. If a single queue is requested, it will be shown in a detailed version
Named Arguments
-c, --cluster  Cluster to use
-d, --details  Show queues details. Default: False

awsbhosts

Shows the hosts belonging to the cluster’s Compute Environment.

usage: awsbhosts [-h] [-c CLUSTER] [-d] [instance_ids [instance_ids ...]]
Positional Arguments
instance_ids  A space separated list of instance IDs. If a single instance is requested, it will be shown in a detailed version
Named Arguments
-c, --cluster  Cluster to use
-d, --details  Show hosts details. Default: False

Configuration

pcluster uses the file ~/.parallelcluster/config by default for all configuration parameters.

You can see an example configuration file at site-packages/aws-parallelcluster/examples/config.

Layout

Configuration is defined in multiple sections. Required sections are “global”, “aws”, one “cluster”, and one “subnet”.

A section starts with the section name in brackets, followed by parameters and configuration.

[global]
cluster_template = default
update_check = true
sanity_check = true

Configuration Options

global

Global configuration options related to pcluster.

[global]
cluster_template

The name of the cluster section used for the cluster.

See the Cluster Definition.

cluster_template = default
update_check

Whether or not to check for updates to pcluster.

update_check = true
sanity_check

Attempts to validate that resources defined in parameters actually exist.

sanity_check = true

aws

This is the AWS credentials/region section (required). These settings apply to all clusters.

We highly recommend using environment variables, EC2 IAM Roles, or the AWS CLI credential store to provide credentials, rather than storing them in the AWS ParallelCluster config file.

[aws]
aws_access_key_id = #your_aws_access_key_id
aws_secret_access_key = #your_secret_access_key

# Defaults to us-east-1 if not defined in environment or below
aws_region_name = #region

aliases

This is the aliases section. Use this section to customize the ssh command.

CFN_USER is set to the default username for the OS. MASTER_IP is set to the IP address of the master instance. ARGS is set to whatever arguments the user provides after pcluster ssh cluster_name.

[aliases]
# This is the aliases section, you can configure
# ssh alias here
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

cluster

You can define one or more clusters for different types of jobs or workloads.

Each cluster has its own configuration based on your needs.

The format is [cluster <clustername>].

[cluster default]
key_name

Name of an existing EC2 KeyPair to enable SSH access to the instances.

key_name = mykey
template_url

Overrides the path to the CloudFormation template used to create the cluster

Defaults to https://s3.amazonaws.com/<aws_region_name>-aws-parallelcluster/templates/aws-parallelcluster-<version>.cfn.json.

template_url = https://s3.amazonaws.com/us-east-1-aws-parallelcluster/templates/aws-parallelcluster.cfn.json
compute_instance_type

The EC2 instance type used for the cluster compute nodes.

If you’re using awsbatch, please refer to the Compute Environments creation in the AWS Batch UI for the list of the supported instance types.

Defaults to t2.micro for traditional schedulers, and to optimal when the scheduler is awsbatch.

compute_instance_type = t2.micro
master_instance_type

The EC2 instance type used for the master node.

This defaults to t2.micro.

master_instance_type = t2.micro
initial_queue_size

The initial number of EC2 instances to launch as compute nodes in the cluster for traditional schedulers.

If you’re using awsbatch, use min_vcpus.

The default is 2.

initial_queue_size = 2
max_queue_size

The maximum number of EC2 instances that can be launched in the cluster for traditional schedulers.

If you’re using awsbatch, use max_vcpus.

This defaults to 10.

max_queue_size = 10
maintain_initial_size

Boolean flag to set autoscaling group to maintain initial size for traditional schedulers.

If you’re using awsbatch, use desired_vcpus.

If set to true, the Auto Scaling group will never have fewer members than the value of initial_queue_size. It will still allow the cluster to scale up to the value of max_queue_size.

Setting to false allows the Auto Scaling group to scale down to 0 members, so resources will not sit idle when they aren’t needed.

Defaults to false.

maintain_initial_size = false
min_vcpus

If scheduler is awsbatch, the compute environment won’t have fewer than min_vcpus.

Defaults to 0.

min_vcpus = 0
desired_vcpus

If scheduler is awsbatch, the compute environment will initially have desired_vcpus

Defaults to 4.

desired_vcpus = 4
max_vcpus

If scheduler is awsbatch, the compute environment will at most have max_vcpus.

Defaults to 20.

max_vcpus = 20
scheduler

Scheduler to be used with the cluster. Valid options are sge, torque, slurm, or awsbatch.

If you’re using awsbatch, please take a look at the networking setup.

Defaults to sge.

scheduler = sge
cluster_type

Type of cluster to launch, i.e. ondemand or spot.

Defaults to ondemand.

cluster_type = ondemand
spot_price

If cluster_type is set to spot, you can optionally set the maximum spot price for the ComputeFleet on traditional schedulers. If you do not specify a value, you are charged the Spot price, capped at the On-Demand price.

If you’re using awsbatch, use spot_bid_percentage.

See the Spot Bid Advisor for assistance finding a bid price that meets your needs:

spot_price = 1.50
spot_bid_percentage

If you’re using awsbatch as your scheduler, this optional parameter is the on-demand bid percentage. If not specified you’ll get the current spot market price, capped at the on-demand price.

spot_bid_percentage = 85
custom_ami

ID of a Custom AMI to use instead of the default published AMIs.

custom_ami = NONE
s3_read_resource

Specify S3 resource for which AWS ParallelCluster nodes will be granted read-only access

For example, ‘arn:aws:s3:::my_corporate_bucket/*’ would provide read-only access to all objects in the my_corporate_bucket bucket.

See working with S3 for details on format.

Defaults to NONE.

s3_read_resource = NONE
s3_read_write_resource

Specify S3 resource for which AWS ParallelCluster nodes will be granted read-write access

For example, ‘arn:aws:s3:::my_corporate_bucket/Development/*’ would provide read-write access to all objects in the Development folder of the my_corporate_bucket bucket.

See working with S3 for details on format.

Defaults to NONE.

s3_read_write_resource = NONE
pre_install

URL to a preinstall script. This is executed before any of the boot_as_* scripts are run

This only gets executed on the master node when using awsbatch as your scheduler.

Can be specified in “http://hostname/path/to/script.sh” or “s3://bucketname/path/to/script.sh” format.

Defaults to NONE.

pre_install = NONE
pre_install_args

Quoted list of arguments to be passed to preinstall script

Defaults to NONE.

pre_install_args = NONE
post_install

URL to a postinstall script. This is executed after any of the boot_as_* scripts are run

This only gets executed on the master node when using awsbatch as your scheduler.

Can be specified in “http://hostname/path/to/script.sh” or “s3://bucketname/path/to/script.sh” format.

Defaults to NONE.

post_install = NONE
post_install_args

Arguments to be passed to postinstall script

Defaults to NONE.

post_install_args = NONE
proxy_server

HTTP(S) proxy server, typically http://x.x.x.x:8080

Defaults to NONE.

proxy_server = NONE
placement_group

Cluster placement group. It can be one of three values: NONE, DYNAMIC, or an existing placement group name. When DYNAMIC is set, a unique placement group will be created as part of the cluster and deleted when the cluster is deleted.

This does not apply to awsbatch.

Defaults to NONE. More information on placement groups can be found here:

placement_group = NONE
placement

Cluster placement logic. This enables the whole cluster or only compute to use the placement group.

Can be cluster or compute.

This does not apply to awsbatch.

Defaults to cluster.

placement = cluster
ephemeral_dir

If instance store volumes exist, this is the path/mountpoint they will be mounted on.

Defaults to /scratch.

ephemeral_dir = /scratch
shared_dir

Path/mountpoint for shared EBS volume. Do not use this option when using multiple EBS volumes; provide shared_dir under each EBS section instead

Defaults to /shared. The example below mounts to /myshared. See EBS Section for details on working with multiple EBS volumes:

shared_dir = myshared
encrypted_ephemeral

Encrypted ephemeral drives. In-memory keys, non-recoverable. If true, AWS ParallelCluster will generate an ephemeral encryption key in memory and use LUKS encryption to encrypt your instance store volumes.

Defaults to false.

encrypted_ephemeral = false
master_root_volume_size

MasterServer root volume size in GB. (AMI must support growroot)

Defaults to 15.

master_root_volume_size = 15
compute_root_volume_size

ComputeFleet root volume size in GB. (AMI must support growroot)

Defaults to 15.

compute_root_volume_size = 15
base_os

OS type used in the cluster

Defaults to alinux. Available options are: alinux, centos6, centos7, ubuntu1404 and ubuntu1604

Note: The base_os determines the username used to log into the cluster (see the login example below):

  • CentOS 6 & 7: centos

  • Ubuntu: ubuntu

  • Amazon Linux: ec2-user

Supported OSes by region are listed below. Note that commercial refers to all supported commercial regions, such as us-east-1 and us-west-2.

============== ======  ============ ============ ============= ============
region         alinux    centos6       centos7     ubuntu1404   ubuntu1604
============== ======  ============ ============ ============= ============
commercial      True     True          True          True        True
us-gov-west-1   True     False         False         True        True
us-gov-east-1   True     False         False         True        True
cn-north-1      True     False         False         True        True
cn-northwest-1  True     False         False         False       False
============== ======  ============ ============ ============= ============

    base_os = alinux
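
For example, with base_os = centos7 the cluster user is centos, so a direct SSH login to the master node (IP and key path are placeholders) would look like:

ssh -i /path/to/keyfile.pem centos@<MasterPublicIP>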
    
ec2_iam_role

The given name of an existing EC2 IAM Role that will be attached to all instances in the cluster. Note that the given name of a role and its Amazon Resource Name (ARN) are different; the latter cannot be used as an argument to ec2_iam_role.

Defaults to NONE.

ec2_iam_role = NONE
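
For example, referencing a pre-created role by its friendly name (the role name below is a placeholder), not by its ARN:

ec2_iam_role = MyClusterInstanceRole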
extra_json

Extra JSON that will be merged into the dna.json used by Chef.

Defaults to {}.

extra_json = {}
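
For example, the custom node package setup described later in this document passes its settings through extra_json (the URL is shown with placeholders):

extra_json = { "cluster" : { "custom_node_package" : "https://s3.<region>.amazonaws.com/<bucket>/node/aws-parallelcluster-node-<version>.tgz" } }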
additional_cfn_template

An additional CloudFormation template to launch along with the cluster. This allows you to create resources that exist outside of the cluster but are part of the cluster’s life cycle.

Must be an HTTP URL to a public template with all parameters provided.

Defaults to NONE.

additional_cfn_template = NONE
vpc_settings

Settings section relating to VPC to be used

See VPC Section.

vpc_settings = public
ebs_settings

Settings section relating to EBS volume mounted on the master. When using multiple EBS volumes, enter multiple settings as a comma separated list. Up to 5 EBS volumes are supported.

See EBS Section.

ebs_settings = custom1, custom2, ...
scaling_settings

Settings section relating to scaling

See Scaling Section.

scaling_settings = custom
efs_settings

Settings section relating to EFS filesystem

See EFS Section.

efs_settings = customfs
raid_settings

Settings section relating to RAID drive configuration.

See RAID Section.

raid_settings = rs
tags

Defines tags to be used in CloudFormation.

If command line tags are specified via --tags, they are merged with the config tags.

Command line tags overwrite config tags that have the same key.

Tags are JSON formatted and should not have quotes outside the curly braces.

See AWS CloudFormation Resource Tags Type.

tags = {"key" : "value", "key2" : "value2"}

vpc

VPC Configuration Settings:

[vpc public]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-xxxxxx
vpc_id

ID of the VPC you want to provision cluster into.

vpc_id = vpc-xxxxxx
master_subnet_id

ID of an existing subnet you want to provision the Master server into.

master_subnet_id = subnet-xxxxxx
ssh_from

CIDR-formatted IP range from which to allow SSH access.

This is only used when AWS ParallelCluster creates the security group.

Defaults to 0.0.0.0/0.

ssh_from = 0.0.0.0/0
additional_sg

Additional VPC security group Id for all instances.

Defaults to NONE.

additional_sg = sg-xxxxxx
compute_subnet_id

ID of an existing subnet you want to provision the compute nodes into.

If it is private, you need to setup NAT for web access.

compute_subnet_id = subnet-xxxxxx
compute_subnet_cidr

If you want AWS ParallelCluster to create a compute subnet, this is the CIDR block that will be used for it.

compute_subnet_cidr = 10.0.100.0/24
use_public_ips

Define whether or not to assign public IP addresses to Compute EC2 instances.

If true, an Elastic IP will be associated to the Master instance. If false, the Master instance will have a Public IP or not according to the value of the “Auto-assign Public IP” subnet configuration parameter.

See networking configuration for some examples.

Defaults to true.

use_public_ips = true
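
For example, a sketch of a vpc section that keeps the compute nodes on private addresses (all IDs are placeholders; a NAT gateway is assumed for outbound access, as noted under compute_subnet_id):

[vpc private-compute]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-xxxxxx
compute_subnet_id = subnet-yyyyyy
use_public_ips = false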
vpc_security_group_id

Use an existing security group for all instances.

Defaults to NONE.

vpc_security_group_id = sg-xxxxxx

ebs

EBS Volume configuration settings for the volumes mounted on the master node and shared via NFS to compute nodes.

[ebs custom1]
shared_dir = vol1
ebs_snapshot_id = snap-xxxxx
volume_type = io1
volume_iops = 200
...

[ebs custom2]
shared_dir = vol2
...

...
shared_dir

Path/mountpoint for the shared EBS volume. Required when using multiple EBS volumes. When using a single EBS volume, this option overwrites the shared_dir specified under the cluster section. The example below mounts to /vol1.

shared_dir = vol1
ebs_snapshot_id

Id of EBS snapshot if using snapshot as source for volume.

Defaults to NONE.

ebs_snapshot_id = snap-xxxxx
volume_type

The API name for the type of volume you wish to launch.

Defaults to gp2.

volume_type = io1
volume_size

Size of volume to be created (if not using a snapshot).

Defaults to 20GB.

volume_size = 20
volume_iops

Number of IOPS for io1 type volumes.

volume_iops = 200
encrypted

Whether or not the volume should be encrypted (should not be used with snapshots).

Defaults to false.

encrypted = false
ebs_volume_id

EBS Volume Id of an existing volume that will be attached to the MasterServer.

Defaults to NONE.

ebs_volume_id = vol-xxxxxx

scaling

Settings which define how the compute nodes scale.

[scaling custom]
scaledown_idletime = 10
scaledown_idletime

Amount of time in minutes without a job after which the compute node will terminate.

This does not apply to awsbatch.

Defaults to 10.

scaledown_idletime = 10

examples

Let’s say you want to launch a cluster with the awsbatch scheduler and let AWS Batch pick the optimal instance type based on your jobs’ resource needs.

The following allows a maximum of 40 concurrent vCPUs, and scales down to zero when you have no jobs running for 10 minutes.

[global]
update_check = true
sanity_check = true
cluster_template = awsbatch

[aws]
aws_region_name = [your_aws_region]

[cluster awsbatch]
scheduler = awsbatch
compute_instance_type = optimal # optional, defaults to optimal
min_vcpus = 0                   # optional, defaults to 0
desired_vcpus = 0               # optional, defaults to 4
max_vcpus = 40                  # optional, defaults to 20
base_os = alinux                # optional, defaults to alinux, controls the base_os of the master instance and the docker image for the compute fleet
key_name = [your_ec2_keypair]
vpc_settings = public

[vpc public]
master_subnet_id = [your_subnet]
vpc_id = [your_vpc]

EFS

EFS file system configuration settings for the EFS mounted on the master node and compute nodes via nfs4.

[efs customfs]
shared_dir = efs
encrypted = false
performance_mode = generalPurpose
shared_dir

Shared directory that the file system will be mounted to on the master and compute nodes.

This parameter is REQUIRED; the EFS section is only used if this parameter is specified. The example below mounts to /efs. Do not use NONE or /NONE as the shared directory:

shared_dir = efs
encrypted

Whether or not the file system will be encrypted.

Defaults to false.

encrypted = false
performance_mode

Performance Mode of the file system. We recommend generalPurpose performance mode for most file systems. File systems using the maxIO performance mode can scale to higher levels of aggregate throughput and operations per second with a trade-off of slightly higher latencies for most file operations. This can’t be changed after the file system has been created.

Defaults to generalPurpose. Valid values are generalPurpose | maxIO (case sensitive).

performance_mode = generalPurpose
throughput_mode

The throughput mode for the file system to be created. There are two throughput modes to choose from for your file system: bursting and provisioned.

Valid Values are provisioned | bursting

throughput_mode = provisioned
provisioned_throughput

The throughput, measured in MiB/s, that you want to provision for a file system that you’re creating. The limit on throughput is 1024 MiB/s. You can get these limits increased by contacting AWS Support.

Valid range: minimum of 0.0. To use this option, throughput_mode must be set to provisioned.

provisioned_throughput = 1024
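
For example, an EFS section requesting provisioned throughput (the throughput value is illustrative):

[efs customfs]
shared_dir = efs
throughput_mode = provisioned
provisioned_throughput = 200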
efs_fs_id

File system ID for an existing file system. Specifying this option will void all other EFS options but shared_dir. Config sanity will only allow file systems that: have no mount target in the stack’s availability zone OR have existing mount target in stack’s availability zone with inbound and outbound NFS traffic allowed from 0.0.0.0/0.

Note: sanity check for validating efs_fs_id requires the IAM role to have permission for the following actions: efs:DescribeMountTargets, efs:DescribeMountTargetSecurityGroups, ec2:DescribeSubnets, ec2:DescribeSecurityGroups. Please add these permissions to your IAM role, or set sanity_check = false to avoid errors.

CAUTION: having a mount target with inbound and outbound NFS traffic allowed from 0.0.0.0/0 exposes the file system to NFS mount requests from anywhere in the mount target’s availability zone. We recommend not having a mount target in the stack’s availability zone and letting AWS ParallelCluster create the mount target. If you must have a mount target in the stack’s availability zone, consider using a custom security group by providing a vpc_security_group_id option under the vpc section, adding that security group to the mount target, and turning off config sanity to create the cluster (see the sketch below).

Defaults to NONE. Needs to be an available EFS file system:

efs_fs_id = fs-12345
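
For example, a sketch of the approach suggested above for an existing file system with a mount target in the stack’s availability zone (all IDs are placeholders; the security group is assumed to already be attached to the mount target):

[global]
sanity_check = false

[vpc public]
vpc_id = vpc-xxxxxx
master_subnet_id = subnet-xxxxxx
vpc_security_group_id = sg-xxxxxx

[efs customfs]
shared_dir = efs
efs_fs_id = fs-12345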

RAID

RAID drive configuration settings for creating a RAID array from a number of identical EBS volumes. The RAID drive is mounted on the master node, and exported to compute nodes via nfs.

[raid rs]
shared_dir = raid
raid_type = 1
num_of_raid_volumes = 2
encrypted = true
shared_dir

Shared directory that the RAID drive will be mounted to on the master and compute nodes.

This parameter is REQUIRED; the RAID drive is only created if this parameter is specified. The example below mounts to /raid. Do not use NONE or /NONE as the shared directory:

shared_dir = raid
raid_type

RAID type for the RAID array. Currently only RAID 0 and RAID 1 are supported. For more information on RAID types, see: RAID info

This parameter is REQUIRED; the RAID drive is only created if this parameter is specified. The example below creates a RAID 0 array:

raid_type = 0
num_of_raid_volumes

The number of EBS volumes to assemble the RAID array from. A maximum of 5 volumes and a minimum of 2 are currently supported.

Defaults to 2.

num_of_raid_volumes = 2
volume_type

The type of volume you wish to launch. See: Volume type for details

Defaults to gp2.

volume_type = io1
volume_size

Size of volume to be created.

Defaults to 20GB.

volume_size = 20
volume_iops

Number of IOPS for io1 type volumes.

volume_iops = 500
encrypted

Whether or not the file system will be encrypted.

Defaults to false.

encrypted = false

How AWS ParallelCluster Works

AWS ParallelCluster was built not only as a way to manage clusters, but also as a reference on how to use AWS services to build your HPC environment.

AWS ParallelCluster Processes

This section applies only to HPC clusters deployed with one of the supported traditional job schedulers (SGE, Slurm, or Torque). In these cases, AWS ParallelCluster manages compute node provisioning and removal by interacting with both the Auto Scaling Group (ASG) and the underlying job scheduler. For HPC clusters based on AWS Batch, ParallelCluster relies entirely on the capabilities provided by AWS Batch for compute node management.

General Overview

A cluster’s life cycle begins after it is created by a user, typically from the Command Line Interface (CLI). Once created, a cluster exists until it is deleted. AWS ParallelCluster daemons run on the cluster nodes, mainly to manage the HPC cluster elasticity. The diagram below represents the user’s workflow and the cluster life cycle, while the next sections describe the AWS ParallelCluster daemons used to manage the cluster.

[Diagram: user workflow and cluster life cycle (workflow.svg)]

jobwatcher

Once a cluster is running, a process owned by the root user monitors the configured scheduler (SGE, Torque, or Slurm) and, each minute, evaluates the queue in order to decide when to scale up.

[Diagram: jobwatcher (jobwatcher.svg)]

sqswatcher

The sqswatcher process monitors SQS for messages emitted by Auto Scaling notifying of state changes within the cluster. When an instance comes online, it submits an “instance ready” message to SQS, which is picked up by sqswatcher running on the master server. These messages are used to notify the queue manager when new instances come online or are terminated, so they can be added to or removed from the queue accordingly.

[Diagram: sqswatcher (sqswatcher.svg)]

nodewatcher

The nodewatcher process runs on each node in the compute fleet. If a node stays idle for longer than the user-defined scaledown_idletime period, the instance is terminated.

[Diagram: nodewatcher (nodewatcher.svg)]

AWS Services used in AWS ParallelCluster

The following Amazon Web Services (AWS) services are used in AWS ParallelCluster.

  • AWS CloudFormation
  • AWS Identity and Access Management (IAM)
  • Amazon SNS
  • Amazon SQS
  • Amazon EC2
  • Auto Scaling
  • Amazon EBS
  • Amazon S3
  • Amazon DynamoDB

AWS CloudFormation

AWS CloudFormation is the core service used by AWS ParallelCluster. Each cluster is represented as a stack. All resources required by the cluster are defined within the AWS ParallelCluster CloudFormation template. AWS ParallelCluster CLI commands typically map to CloudFormation stack commands, such as create, update and delete. Instances launched within a cluster make HTTPS calls to the CloudFormation Endpoint for the region the cluster is launched in.

For more details about AWS CloudFormation, see http://aws.amazon.com/cloudformation/

AWS Identity and Access Management (IAM)

AWS IAM is used within AWS ParallelCluster to provide an Amazon EC2 IAM Role for the instances. This role is a least privileged role specifically created for each cluster. AWS ParallelCluster instances are given access only to the specific API calls that are required to deploy and manage the cluster.

With AWS Batch clusters, IAM Roles are also created for the components involved with the Docker image building process at cluster creation time. These components include the Lambda functions allowed to add and delete Docker images to/from the ECR repository and to delete the S3 bucket created for the cluster, as well as the CodeBuild project. There are also roles for the AWS Batch resources, the instances, and the jobs.

For more details about AWS Identity and Access Management, see http://aws.amazon.com/iam/

Amazon Simple Notification Service (SNS)

Amazon SNS is used to receive notifications from Auto Scaling. These events are called life cycle events, and are generated when an instance launches or terminates in an Autoscaling Group. Within AWS ParallelCluster, the Amazon SNS topic for the Autoscaling Group is subscribed to an Amazon SQS queue.

The service is not used with AWS Batch clusters.

For more details about Amazon SNS, see http://aws.amazon.com/sns/

Amazon Simple Queuing Service (SQS)

Amazon SQS is used to hold notifications (messages) from Auto Scaling, sent through Amazon SNS, and notifications from the ComputeFleet instances. This decouples the sending of notifications from the receiving and allows the Master to handle them through polling. The MasterServer runs the sqswatcher process, which polls the queue. Auto Scaling and the ComputeFleet instances post messages to the queue.

The service is not used with AWS Batch clusters.

For more details about Amazon SQS, see http://aws.amazon.com/sqs/

Amazon EC2

Amazon EC2 provides the compute for AWS ParallelCluster. The MasterServer and ComputeFleet are EC2 instances. Any instance type that supports HVM can be selected. The MasterServer and ComputeFleet can be different instance types, and the ComputeFleet can also be launched as Spot Instances. Instance store volumes found on the instances are mounted as a striped LVM volume.

For more details about Amazon EC2, see http://aws.amazon.com/ec2/

AWS Auto Scaling

AWS Auto Scaling is used to manage the ComputeFleet instances. These instances are managed as an AutoScaling Group and can either be elastically driven by workload or static and driven by the config.

The service is not used with AWS Batch clusters.

For more details about Auto Scaling, see http://aws.amazon.com/autoscaling/

Amazon Elastic Block Store (EBS)

Amazon EBS provides the persistent storage for the shared volumes. Any EBS settings can be passed through the config. EBS volumes can either be initialized empty or from an existing EBS snapshot.

For more details about Amazon EBS, see http://aws.amazon.com/ebs/

Amazon S3

Amazon S3 is used to store the AWS ParallelCluster templates. Each region has a bucket with all templates. AWS ParallelCluster can be configured to allow CLI/SDK tools to use S3.

With an AWS Batch cluster, an S3 bucket in the customer’s account is created to store artifacts used by the Docker image creation and the jobs scripts when submitting jobs from the user’s machine.

For more details, see http://aws.amazon.com/s3/

Amazon DynamoDB

Amazon DynamoDB is used to store minimal state of the cluster. The MasterServer tracks provisioned instances in a DynamoDB table.

The service is not used with AWS Batch clusters.

For more details, see http://aws.amazon.com/dynamodb/

AWS Batch

AWS Batch is the AWS managed job scheduler that dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs.

The service is only used with AWS Batch clusters.

For more details, see https://aws.amazon.com/batch/

AWS CodeBuild

AWS CodeBuild is used to automatically and transparently build Docker images at cluster creation time.

The service is only used with AWS Batch clusters.

For more details, see https://aws.amazon.com/codebuild/

AWS Lambda

The AWS Lambda service runs the functions that orchestrate the Docker image creation and manage the cleanup of custom cluster resources, namely the Docker images stored in the ECR repository and the S3 bucket for the cluster.

The service is only used with AWS Batch clusters.

For more details, see https://aws.amazon.com/lambda/

Amazon Elastic Container Registry (ECR)

Amazon ECR stores the Docker images built at cluster creation time. The Docker images are then used by AWS Batch to run the containers for the submitted jobs.

The service is only used with AWS Batch clusters.

For more details, see https://aws.amazon.com/ecr/

Amazon CloudWatch

Amazon CloudWatch is used to log Docker image build steps and the standard output and error of the AWS Batch jobs.

The service is only used with AWS Batch clusters.

For more details, see https://aws.amazon.com/cloudwatch/

AWS ParallelCluster Auto Scaling

The auto scaling strategy described here applies to HPC clusters deployed with one of the supported traditional job schedulers (SGE, Slurm, or Torque). In these cases, AWS ParallelCluster directly implements the scaling capabilities by managing the Auto Scaling Group (ASG) of the compute nodes and changing the scheduler configuration accordingly. For HPC clusters based on AWS Batch, ParallelCluster relies on the elastic scaling capabilities provided by the AWS-managed job scheduler.

Clusters deployed with AWS ParallelCluster are elastic in several ways. The first is by simply setting the initial_queue_size and max_queue_size parameters in the cluster settings. The initial_queue_size sets the minimum size and desired capacity of the ComputeFleet ASG, while max_queue_size sets its maximum size.

[Diagram: basic auto scaling setup (as-basic-diagram.png)]

Scaling Up

Every minute, a process called jobwatcher runs on the master instance and evaluates the current number of instances required by the pending jobs in the queue. If the total number of busy nodes and requested nodes is greater than the current desired value of the ASG, the desired capacity is increased to add more instances. If you submit more jobs, the queue is re-evaluated and the ASG updated, up to the max_queue_size.

With SGE, each job requires a number of slots to run (one slot corresponds to one processing unit, e.g. a vCPU). When evaluating the number of instances required to serve the currently pending jobs, the jobwatcher divides the total number of requested slots by the capacity of a single compute node. The capacity of a compute node, that is, the number of available vCPUs, depends on the EC2 instance type selected in the cluster configuration.
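
As an illustrative calculation: with c5.2xlarge compute nodes (8 vCPUs, so 8 slots each) and 20 slots requested by pending jobs, the jobwatcher would request ceil(20 / 8) = 3 compute instances from the ASG.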

With the Slurm and Torque schedulers, each job can require both a number of nodes and a number of slots per node. The jobwatcher takes into account the request of each job and determines the number of compute nodes needed to fulfill the new computational requirements. For example, assuming a cluster with c5.2xlarge (8 vCPU) as the compute instance type, and three queued pending jobs with the following requirements: job1 requiring 2 nodes / 4 slots each, job2 requiring 3 nodes / 2 slots each, and job3 requiring 1 node / 4 slots, the jobwatcher will request three new compute instances from the ASG to serve the three jobs.

Current limitation: the auto scale up logic does not consider partially loaded busy nodes, i.e. each node running a job is considered busy even if there are empty slots.

Scaling Down

On each compute node, a process called nodewatcher runs and evaluates the idle time of the node. If an instance had no jobs for longer than scaledown_idletime (which defaults to 10 minutes) and currently there are no pending jobs in the cluster, the instance is terminated.

Specifically, it calls the TerminateInstanceInAutoScalingGroup API, which removes an instance as long as the size of the ASG is at least the minimum ASG size. This handles scaling the cluster down without affecting running jobs, and also enables an elastic cluster with a fixed base number of instances.

Static Cluster

The value of auto scaling is the same for HPC as for any other workload; the only difference is that AWS ParallelCluster has code specifically making it interact with the scheduler in a more intelligent manner. If a static cluster is required, set the initial_queue_size and max_queue_size parameters to the required cluster size and also set the maintain_initial_size parameter to true. This causes the ComputeFleet ASG to have the same value for minimum, maximum, and desired capacity.
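
For example, a minimal sketch of the cluster settings for a static fleet of 10 nodes:

[cluster mycluster]
initial_queue_size = 10
max_queue_size = 10
maintain_initial_size = true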

Tutorials

Here you can find tutorials and best practice guides for getting started with AWS ParallelCluster.

Running your first job on AWS ParallelCluster

This tutorial will walk you through running your first “Hello World” job on aws-parallelcluster.

If you haven’t yet, you will need to follow the getting started guide to install AWS ParallelCluster and configure your CLI.

Verifying your installation

First, we’ll verify that AWS ParallelCluster is correctly installed and configured.

$ pcluster version

This should return the running version of AWS ParallelCluster. If it gives you a message about configuration, you will need to run the following to configure AWS ParallelCluster.

$ pcluster configure

Creating your First Cluster

Now it’s time to create our first cluster. Because our workload isn’t performance intensive, we will use the default instance sizes of t2.micro. For production workloads, you’ll want to choose an instance size which better fits your needs.

We’re going to call our cluster “hello-world”.

$ pcluster create hello-world

You’ll see some messages on your screen about the cluster creating. When it’s finished, it will provide the following output:

Starting: hello-world
Status: parallelcluster-hello-world - CREATE_COMPLETE
MasterPublicIP = 54.148.x.x
ClusterUser: ec2-user
MasterPrivateIP = 192.168.x.x
GangliaPrivateURL = http://192.168.x.x/ganglia/
GangliaPublicURL = http://54.148.x.x/ganglia/

The message “CREATE_COMPLETE” shows that the cluster created successfully. It also provided us with the public and private IP addresses of our master node. We’ll need this IP to log in.

Logging into your Master instance

You’ll use your OpenSSH pem file to log into your master instance.

pcluster ssh hello-world -i /path/to/keyfile.pem

Once logged in, run the command “qhost” to ensure that your compute nodes are setup and configured.

[ec2-user@ip-192-168-1-86 ~]$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-192-168-1-125        lx-amd64        2    1    2    2  0.15    3.7G  130.8M 1024.0M     0.0
ip-192-168-1-126        lx-amd64        2    1    2    2  0.15    3.7G  130.8M 1024.0M     0.0

As you can see, we have two compute nodes in our cluster, both with 2 threads available to them.

Running your first job using SGE

Now we’ll create a simple job which sleeps for a little while and then outputs its own hostname.

Create a file called “hellojob.sh” with the following contents.

#!/bin/bash
sleep 30
echo "Hello World from $(hostname)"

Next, submit the job using “qsub” and ensure it runs.

$ qsub hellojob.sh
Your job 1 ("hellojob.sh") has been submitted

Now, you can view your queue and check the status of the job.

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      1 0.55500 hellojob.s ec2-user     r     03/24/2015 22:23:48 all.q@ip-192-168-1-125.us-west     1

The job is currently in a running state. Wait 30 seconds for the job to finish and run qstat again.

$ qstat
$

Now that there are no jobs in the queue, we can check for output in our current directory.

$ ls -l
total 8
-rw-rw-r-- 1 ec2-user ec2-user 48 Mar 24 22:34 hellojob.sh
-rw-r--r-- 1 ec2-user ec2-user  0 Mar 24 22:34 hellojob.sh.e1
-rw-r--r-- 1 ec2-user ec2-user 34 Mar 24 22:34 hellojob.sh.o1

Here, we see our job script, an “e1” and “o1” file. Since the e1 file is empty, there was no output to stderr. If we view the .o1 file, we can see any output from our job.

$ cat hellojob.sh.o1
Hello World from ip-192-168-1-125

We can see that our job successfully ran on instance “ip-192-168-1-125”.

Building a custom AWS ParallelCluster AMI

Warning

Building a custom AMI is not the recommended approach for customizing AWS ParallelCluster.

Once you build your own AMI, you will no longer receive updates or bug fixes with future releases of AWS ParallelCluster. You will need to repeat the steps used to create your custom AMI with each new AWS ParallelCluster release.

Before reading any further, take a look at the Custom Bootstrap Actions section of the documentation to determine if the modifications you wish to make can be scripted and supported with future AWS ParallelCluster releases.

While not ideal, there are a number of scenarios where building a custom AMI for AWS ParallelCluster is necessary. This tutorial will guide you through the process.

How to customize the AWS ParallelCluster AMI

There are three ways to use a custom AWS ParallelCluster AMI. Two of them require you to build a new AMI that will be available under your AWS account, and one does not require building anything in advance. Feel free to select the appropriate method based on your needs.

Modify an AWS ParallelCluster AMI

This is the safest method as the base AWS ParallelCluster AMI is often updated with new releases. This AMI has all of the components required for AWS ParallelCluster to function installed and configured and you can start with this as the base.

  1. Find the AMI which corresponds to the region you will be using from the AMI list. The AMI list must match the version of the product you are running.

  2. Within the EC2 Console, choose “Launch Instance”.

  3. Navigate to “Community AMIs”, and enter the AMI id for your region into the search box.

  4. Select the AMI, choose your instance type and properties, and launch your instance.

  5. Log into your instance using the OS user and your SSH key.

  6. Customize your instance as required

  7. Run the following command to prepare your instance for AMI creation:

    sudo /usr/local/sbin/ami_cleanup.sh
    
  8. Stop the instance

  9. Create a new AMI from the instance

  10. Enter the AMI id in the custom_ami field within your cluster configuration.
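
For example, the cluster configuration would then contain an entry like the following (the AMI ID is a placeholder):

custom_ami = ami-xxxxxxxx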

Build a Custom AWS ParallelCluster AMI

If you have an AMI with a lot of customization and software already in place, you can apply the changes needed by AWS ParallelCluster on top of it.

For this method you have to install the following tools in your local system, together with the AWS ParallelCluster CLI:

  1. Packer: grab the latest version for your OS from Packer website and install it
  2. ChefDK: grab the latest version for your OS from ChefDK website and install it

Verify that the commands ‘packer’ and ‘berks’ are available in your PATH after installing the above tools.

You need to configure your AWS account credentials so that Packer can make calls to AWS API operations on your behalf. The minimal set of required permissions necessary for Packer to work are documented in the Packer doc.

Now you can use the command ‘createami’ of the AWS ParallelCluster CLI in order to build an AWS ParallelCluster AMI starting from the one you provide as base:

pcluster createami --ami-id <BASE AMI> --os <BASE OS AMI>

For other parameters, please consult the command help:

pcluster createami -h

The command executes Packer, which does the following steps:

  1. Launches an instance using the base AMI provided.
  2. Applies the AWS ParallelCluster cookbook to the instance, in order to install software and perform other necessary configuration tasks.
  3. Stops the instance.
  4. Creates a new AMI from the instance.
  5. Terminates the instance after the AMI is created.
  6. Outputs the new AMI ID string to use to create your cluster.

To create your cluster enter the AMI id in the custom_ami field within your cluster configuration.

Note

The instance type to build a custom AWS ParallelCluster AMI is a t2.xlarge and does not qualify for the AWS free tier. You are charged for any instances created when building this AMI.

Use a Custom AMI at runtime

If you don’t want to create anything in advance, you can use your AMI and create an AWS ParallelCluster cluster from it.

Note that in this case cluster creation will take longer, because all of the software needed by AWS ParallelCluster is installed at cluster creation time. Scaling up for every new node will also take more time.

  1. Enter the AMI id in the custom_ami field within your cluster configuration.

Running an MPI job with ParallelCluster and awsbatch scheduler

This tutorial will walk you through running a simple MPI job with awsbatch as a scheduler.

If you haven’t yet, you will need to follow the getting started guide to install AWS ParallelCluster and configure your CLI. Also, make sure to read through the awsbatch networking setup documentation before moving to the next step.

Creating the cluster

As a first step, let’s create a simple configuration for a cluster that uses awsbatch as the scheduler. Make sure to replace the missing data in the vpc section and the key_name field with the resources you created at configuration time.

[global]
sanity_check = true

[aws]
aws_region_name = us-east-1

[cluster awsbatch]
base_os = alinux
# Replace with the name of the key you intend to use.
key_name = key-#######
vpc_settings = my-vpc
scheduler = awsbatch
compute_instance_type = optimal
min_vcpus = 2
desired_vcpus = 2
max_vcpus = 24

[vpc  my-vpc]
# Replace with the id of the vpc you intend to use.
vpc_id = vpc-#######
# Replace with id of the subnet for the Master node.
master_subnet_id = subnet-#######
# Replace with id of the subnet for the Compute nodes.
# A NAT Gateway is required for MNP.
compute_subnet_id = subnet-#######

You can now start the creation of the cluster. We’re going to call our cluster “awsbatch-tutorial”:

$ pcluster create -c /path/to/the/created/config/aws_batch.config -t awsbatch awsbatch-tutorial

You’ll see some messages on your screen about the cluster creating. When it’s finished, it will provide the following output:

Beginning cluster creation for cluster: awsbatch-tutorial
Creating stack named: parallelcluster-awsbatch-tutorial
Status: parallelcluster-awsbatch-tutorial - CREATE_COMPLETE
MasterPublicIP: 54.160.xxx.xxx
ClusterUser: ec2-user
MasterPrivateIP: 10.0.0.15

Logging into your Master instance

Although the AWS ParallelCluster Batch CLI commands are all available on the client machine where ParallelCluster is installed, we are going to ssh into the Master node and submit the jobs from there, so that we can take advantage of the NFS volume that is shared between the Master and all Docker instances that run Batch jobs.

You’ll use your SSH pem file to log into your master instance

$ pcluster ssh awsbatch-tutorial -i /path/to/keyfile.pem

Once logged in, run the commands awsbqueues and awsbhosts to show the configured AWS Batch queue and the running ECS instances.

[ec2-user@ip-10-0-0-111 ~]$ awsbqueues
jobQueueName                       status
---------------------------------  --------
parallelcluster-awsbatch-tutorial  VALID

[ec2-user@ip-10-0-0-111 ~]$ awsbhosts
ec2InstanceId        instanceType    privateIpAddress    publicIpAddress      runningJobs
-------------------  --------------  ------------------  -----------------  -------------
i-0d6a0c8c560cd5bed  m4.large        10.0.0.235          34.239.174.236                 0

As you can see, we have one single running host. This is due to the value we chose for min_vcpus in the config. If you want to display additional details about the AWS Batch queue and hosts you can simply add the -d flag to the command.

Running your first job using AWS Batch

Before moving to MPI, let’s create a simple dummy job which sleeps for a little while and then outputs its own hostname, greeting the name passed as a parameter.

Create a file called “hellojob.sh” with the following content.

#!/bin/bash

sleep 30
echo "Hello $1 from $(hostname)"
echo "Hello $1 from $(hostname)" > "/shared/secret_message_for_${1}_by_${AWS_BATCH_JOB_ID}"

Next, submit the job using awsbsub and ensure it runs.

$ awsbsub -jn hello -cf hellojob.sh Luca
Job 6efe6c7c-4943-4c1a-baf5-edbfeccab5d2 (hello) has been submitted.

Now, you can view your queue and check the status of the job.

$ awsbstat
jobId                                 jobName      status    startedAt            stoppedAt    exitCode
------------------------------------  -----------  --------  -------------------  -----------  ----------
6efe6c7c-4943-4c1a-baf5-edbfeccab5d2  hello        RUNNING   2018-11-12 09:41:29  -            -

You can even see the detailed information for the job.

$ awsbstat 6efe6c7c-4943-4c1a-baf5-edbfeccab5d2
jobId                    : 6efe6c7c-4943-4c1a-baf5-edbfeccab5d2
jobName                  : hello
createdAt                : 2018-11-12 09:41:21
startedAt                : 2018-11-12 09:41:29
stoppedAt                : -
status                   : RUNNING
statusReason             : -
jobDefinition            : parallelcluster-myBatch:1
jobQueue                 : parallelcluster-myBatch
command                  : /bin/bash -c 'aws s3 --region us-east-1 cp s3://parallelcluster-mybatch-lui1ftboklhpns95/batch/job-hellojob_sh-1542015680924.sh /tmp/batch/job-hellojob_sh-1542015680924.sh; bash /tmp/batch/job-hellojob_sh-1542015680924.sh Luca'
exitCode                 : -
reason                   : -
vcpus                    : 1
memory[MB]               : 128
nodes                    : 1
logStream                : parallelcluster-myBatch/default/c75dac4a-5aca-4238-a4dd-078037453554
log                      : https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/batch/job;stream=parallelcluster-myBatch/default/c75dac4a-5aca-4238-a4dd-078037453554
-------------------------

The job is currently in a RUNNING state. Wait 30 seconds for the job to finish and run awsbstat again.

$ awsbstat
jobId                                 jobName      status    startedAt            stoppedAt    exitCode
------------------------------------  -----------  --------  -------------------  -----------  ----------

You can see that the job is in the SUCCEEDED status.

$ awsbstat -s SUCCEEDED
jobId                                 jobName      status     startedAt            stoppedAt              exitCode
------------------------------------  -----------  ---------  -------------------  -------------------  ----------
6efe6c7c-4943-4c1a-baf5-edbfeccab5d2  hello        SUCCEEDED  2018-11-12 09:41:29  2018-11-12 09:42:00           0

Now that there are no jobs in the queue, we can check for output through the awsbout command.

$ awsbout 6efe6c7c-4943-4c1a-baf5-edbfeccab5d2
2018-11-12 09:41:29: Starting Job 6efe6c7c-4943-4c1a-baf5-edbfeccab5d2
download: s3://parallelcluster-mybatch-lui1ftboklhpns95/batch/job-hellojob_sh-1542015680924.sh to tmp/batch/job-hellojob_sh-1542015680924.sh
2018-11-12 09:42:00: Hello Luca from ip-172-31-4-234

We can see that our job successfully ran on instance “ip-172-31-4-234”.

Also if you look into the /shared directory you will find a secret message for you :)

Feel free to take a look at the AWS ParallelCluster Batch CLI documentation in order to explore all the available features that are not part of this demo (How about running an array job?). Once you are ready let’s move on and see how to submit an MPI job!

Running an MPI job in a multi-node parallel environment

In this section you’ll learn how to submit a simple MPI job which gets executed in an AWS Batch multi-node parallel environment.

First of all, while still logged into the Master node, let’s create a file in the /shared directory, named mpi_hello_world.c, that contains the following MPI program:

// Copyright 2011 www.mpitutorial.com
//
// An intro MPI hello world program that uses MPI_Init, MPI_Comm_size,
// MPI_Comm_rank, MPI_Finalize, and MPI_Get_processor_name.
//
#include <mpi.h>
#include <stdio.h>
#include <stddef.h>

int main(int argc, char** argv) {
  // Initialize the MPI environment. The two arguments to MPI Init are not
  // currently used by MPI implementations, but are there in case future
  // implementations might need the arguments.
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print off a hello world message
  printf("Hello world from processor %s, rank %d out of %d processors\n",
         processor_name, world_rank, world_size);

  // Finalize the MPI environment. No more MPI calls can be made after this
  MPI_Finalize();
}

Now save the following code as submit_mpi.sh:

#!/bin/bash
echo "ip container: $(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)"
echo "ip host: $(curl -s "http://169.254.169.254/latest/meta-data/local-ipv4")"

# get shared dir
IFS=',' _shared_dirs=(${PCLUSTER_SHARED_DIRS})
_shared_dir=${_shared_dirs[0]}
_job_dir="${_shared_dir}/${AWS_BATCH_JOB_ID%#*}-${AWS_BATCH_JOB_ATTEMPT}"
_exit_code_file="${_job_dir}/batch-exit-code"

if [[ "${AWS_BATCH_JOB_NODE_INDEX}" -eq  "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" ]]; then
    echo "Hello I'm the main node $(hostname)! I run the mpi job!"

    mkdir -p "${_job_dir}"

    echo "Compiling..."
    /usr/lib64/openmpi/bin/mpicc -o "${_job_dir}/mpi_hello_world" "${_shared_dir}/mpi_hello_world.c"

    echo "Running..."
    /usr/lib64/openmpi/bin/mpirun --mca btl_tcp_if_include eth0 --allow-run-as-root --machinefile "${HOME}/hostfile" "${_job_dir}/mpi_hello_world"

    # Write exit status code
    echo "0" > "${_exit_code_file}"
    # Waiting for compute nodes to terminate
    sleep 30
else
    echo "Hello I'm the compute node $(hostname)! I let the main node orchestrate the mpi execution!"
    # Since mpi orchestration happens on the main node, we need to make sure the containers representing the compute
    # nodes are not terminated. A simple trick is to wait for a file containing the status code to be created.
    # All compute nodes are terminated by Batch if the main node exits abruptly.
    while [ ! -f "${_exit_code_file}" ]; do
        sleep 2
    done
    exit $(cat "${_exit_code_file}")
fi

And that’s all. We are now ready to submit our first MPI job and make it run concurrently on 3 nodes:

$ awsbsub -n 3 -cf submit_mpi.sh

Let’s now monitor the job status and wait for it to enter the RUNNING status:

$ watch awsbstat -d

Once the job enters the RUNNING status we can look at its output. Append #0 to the job id to show the output of the main node, while #1 and #2 display the output of the compute nodes:

[ec2-user@ip-10-0-0-111 ~]$ awsbout -s 5b4d50f8-1060-4ebf-ba2d-1ae868bbd92d#0
2018-11-27 15:50:10: Job id: 5b4d50f8-1060-4ebf-ba2d-1ae868bbd92d#0
2018-11-27 15:50:10: Initializing the environment...
2018-11-27 15:50:10: Starting ssh agents...
2018-11-27 15:50:11: Agent pid 7
2018-11-27 15:50:11: Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
2018-11-27 15:50:11: Mounting shared file system...
2018-11-27 15:50:11: Generating hostfile...
2018-11-27 15:50:11: Detected 1/3 compute nodes. Waiting for all compute nodes to start.
2018-11-27 15:50:26: Detected 1/3 compute nodes. Waiting for all compute nodes to start.
2018-11-27 15:50:41: Detected 1/3 compute nodes. Waiting for all compute nodes to start.
2018-11-27 15:50:56: Detected 3/3 compute nodes. Waiting for all compute nodes to start.
2018-11-27 15:51:11: Starting the job...
download: s3://parallelcluster-awsbatch-tutorial-iwyl4458saiwgwvg/batch/job-submit_mpi_sh-1543333713772.sh to tmp/batch/job-submit_mpi_sh-1543333713772.sh
2018-11-27 15:51:12: ip container: 10.0.0.180
2018-11-27 15:51:12: ip host: 10.0.0.245
2018-11-27 15:51:12: Compiling...
2018-11-27 15:51:12: Running...
2018-11-27 15:51:12: Hello I'm the main node! I run the mpi job!
2018-11-27 15:51:12: Warning: Permanently added '10.0.0.199' (RSA) to the list of known hosts.
2018-11-27 15:51:12: Warning: Permanently added '10.0.0.147' (RSA) to the list of known hosts.
2018-11-27 15:51:13: Hello world from processor ip-10-0-0-180.ec2.internal, rank 1 out of 6 processors
2018-11-27 15:51:13: Hello world from processor ip-10-0-0-199.ec2.internal, rank 5 out of 6 processors
2018-11-27 15:51:13: Hello world from processor ip-10-0-0-180.ec2.internal, rank 0 out of 6 processors
2018-11-27 15:51:13: Hello world from processor ip-10-0-0-199.ec2.internal, rank 4 out of 6 processors
2018-11-27 15:51:13: Hello world from processor ip-10-0-0-147.ec2.internal, rank 2 out of 6 processors
2018-11-27 15:51:13: Hello world from processor ip-10-0-0-147.ec2.internal, rank 3 out of 6 processors

[ec2-user@ip-10-0-0-111 ~]$ awsbout -s 5b4d50f8-1060-4ebf-ba2d-1ae868bbd92d#1
2018-11-27 15:50:52: Job id: 5b4d50f8-1060-4ebf-ba2d-1ae868bbd92d#1
2018-11-27 15:50:52: Initializing the environment...
2018-11-27 15:50:52: Starting ssh agents...
2018-11-27 15:50:52: Agent pid 7
2018-11-27 15:50:52: Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
2018-11-27 15:50:52: Mounting shared file system...
2018-11-27 15:50:52: Generating hostfile...
2018-11-27 15:50:52: Starting the job...
download: s3://parallelcluster-awsbatch-tutorial-iwyl4458saiwgwvg/batch/job-submit_mpi_sh-1543333713772.sh to tmp/batch/job-submit_mpi_sh-1543333713772.sh
2018-11-27 15:50:53: ip container: 10.0.0.199
2018-11-27 15:50:53: ip host: 10.0.0.227
2018-11-27 15:50:53: Compiling...
2018-11-27 15:50:53: Running...
2018-11-27 15:50:53: Hello I'm a compute node! I let the main node orchestrate the mpi execution!

We can now confirm that the job completed successfully:

[ec2-user@ip-10-0-0-111 ~]$ awsbstat -s ALL
jobId                                 jobName        status     startedAt            stoppedAt            exitCode
------------------------------------  -------------  ---------  -------------------  -------------------  ----------
5b4d50f8-1060-4ebf-ba2d-1ae868bbd92d  submit_mpi_sh  SUCCEEDED  2018-11-27 15:50:10  2018-11-27 15:51:26  -

In case you want to terminate a job before it ends you can use the awsbkill command.

Development

Here you can find guides for getting started with the development of AWS ParallelCluster.

Warning

The following guides are instructions for using a custom version of the cookbook recipes or a custom AWS ParallelCluster Node package. These are advanced methods of customizing AWS ParallelCluster, with many hard-to-debug pitfalls. The AWS ParallelCluster team highly recommends using Custom Bootstrap Actions scripts for customization, as post-install hooks are generally easier to debug and more portable across releases of AWS ParallelCluster.

Setting Up a Custom AWS ParallelCluster Cookbook

Warning

The following are instructions for using a custom version of the AWS ParallelCluster cookbook recipes. This is an advanced method of customizing AWS ParallelCluster, with many hard-to-debug pitfalls. The AWS ParallelCluster team highly recommends using Custom Bootstrap Actions scripts for customization, as post-install hooks are generally easier to debug and more portable across releases of AWS ParallelCluster.

Steps

  1. Identify the AWS ParallelCluster Cookbook working directory where you have cloned the AWS ParallelCluster Cookbook code

    _cookbookDir=<path to cookbook>
    
  2. Detect the current version of the AWS ParallelCluster Cookbook

    _version=$(grep version ${_cookbookDir}/metadata.rb|awk '{print $2}'| tr -d \')
    
  3. Create an archive of the AWS ParallelCluster Cookbook and calculate its md5

    cd "${_cookbookDir}"
    _stashName=$(git stash create)
    git archive --format tar --prefix="aws-parallelcluster-cookbook-${_version}/" "${_stashName:-HEAD}" | gzip > "aws-parallelcluster-cookbook-${_version}.tgz"
    md5sum "aws-parallelcluster-cookbook-${_version}.tgz" > "aws-parallelcluster-cookbook-${_version}.md5"
    
  4. Create an S3 bucket and upload the archive, its md5 and its last modified date into the bucket, giving public readable permission through a public-read ACL

    _bucket=<the bucket name>
    aws s3 cp --acl public-read aws-parallelcluster-cookbook-${_version}.tgz s3://${_bucket}/cookbooks/aws-parallelcluster-cookbook-${_version}.tgz
    aws s3 cp --acl public-read aws-parallelcluster-cookbook-${_version}.md5 s3://${_bucket}/cookbooks/aws-parallelcluster-cookbook-${_version}.md5
    aws s3api head-object --bucket ${_bucket} --key cookbooks/aws-parallelcluster-cookbook-${_version}.tgz --output text --query LastModified > aws-parallelcluster-cookbook-${_version}.tgz.date
    aws s3 cp --acl public-read aws-parallelcluster-cookbook-${_version}.tgz.date s3://${_bucket}/cookbooks/aws-parallelcluster-cookbook-${_version}.tgz.date
    
  5. Add the following variable to the AWS ParallelCluster config file, under the [cluster …] section

    custom_chef_cookbook = https://s3.<the bucket region>.amazonaws.com/${_bucket}/cookbooks/aws-parallelcluster-cookbook-${_version}.tgz
    

Setting Up a Custom AWS ParallelCluster Node Package

Warning

The following are instructions for using a custom version of the AWS ParallelCluster Node package. This is an advanced method of customizing AWS ParallelCluster, with many hard-to-debug pitfalls. The AWS ParallelCluster team highly recommends using Custom Bootstrap Actions scripts for customization, as post-install hooks are generally easier to debug and more portable across releases of AWS ParallelCluster.

Steps

  1. Identify the AWS ParallelCluster Node working directory where you have cloned the AWS ParallelCluster Node code

    _nodeDir=<path to node package>
    
  2. Detect the current version of the AWS ParallelCluster Node

    _version=$(grep "version = \"" ${_nodeDir}/setup.py |awk '{print $3}' | tr -d \")
    
  3. Create an archive of the AWS ParallelCluster Node

    cd "${_nodeDir}"
    _stashName=$(git stash create)
    git archive --format tar --prefix="aws-parallelcluster-node-${_version}/" "${_stashName:-HEAD}" | gzip > "aws-parallelcluster-node-${_version}.tgz"
    
  4. Create an S3 bucket and upload the archive into the bucket, giving public readable permission through a public-read ACL

    _bucket=<the bucket name>
    aws s3 cp --acl public-read aws-parallelcluster-node-${_version}.tgz s3://${_bucket}/node/aws-parallelcluster-node-${_version}.tgz
    
  5. Add the following variable to the AWS ParallelCluster config file, under the [cluster …] section

    extra_json = { "cluster" : { "custom_node_package" : "https://s3.<the bucket region>.amazonaws.com/${_bucket}/node/aws-parallelcluster-node-${_version}.tgz" } }
    

Getting Started

If you’ve never used AWS ParallelCluster before, you should read the Getting Started with AWS ParallelCluster guide to get familiar with pcluster & its usage.