Start by cloning this repository.

- Edit `terraform.tfvars` and update it with your settings
- Edit `terraform.tf` if you want to use remote state
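A minimal first-run workflow might look like the following; the repository URL and directory name are placeholders, and the settings mentioned in the comments are examples rather than the project's exact variable names:

```sh
# Placeholder URL and directory name; substitute the real repository.
git clone <repository-url> swarm-infra
cd swarm-infra

# Update terraform.tfvars with your settings (region, SSH key name, instance
# counts, ...) and, if desired, add a remote state backend to terraform.tf.
# Then initialise the working directory (downloads providers, configures state):
terraform init
```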
The makefile included from `makefiles/terraform.tf` has some helpers for applying your Terraform infrastructure.
Plan your infrastructure first (a dry run):
make plan
When you are happy, execute the plan generated above:
make apply
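Under the hood these targets will be doing something roughly equivalent to the plain Terraform CLI calls below (a sketch, not the actual makefile contents):

```sh
# Roughly what `make plan` / `make apply` wrap (illustrative only).
terraform plan -out=tfplan   # dry run, saved as an execution plan
terraform apply tfplan       # apply exactly the plan reviewed above
```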
Provided you have added an SSH key, you will be able to access an available swarm manager using the command:
make swarm-ssh
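Once connected to a manager, you can check the cluster state with standard Docker commands, for example:

```sh
docker node ls      # list managers and workers and their status
docker service ls   # list services currently running on the swarm
```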
An example app is included in `docker/docker-compose.yml`.

The makefile included from `makefiles/swarm.tf` has some helpers for deploying to and managing your swarm.
make swarm-deploy
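For reference, deploying the example app by hand would use Docker's stack deployment from the compose file; a rough equivalent of the make target (with a hypothetical stack name) is:

```sh
# Run on (or against) a swarm manager; "example" is a hypothetical stack name.
docker stack deploy --compose-file docker/docker-compose.yml example
docker stack services example   # check the services came up
```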
Note: the `make` commands shown will only work once you have created your swarm using the steps above.
The swarm is composed of multiple EC2 autoscaling groups performing various roles.
You can show all available instances and the groups to which they belong using:
make swarm-instances
For a functioning cluster, you must run a manager group which by default consists of 3 swarm manager instances, one in each availability zone.
make swarm-managers
You can have as many or as few worker groups as you wish, running in as many different configurations as you choose. Instances in worker groups join the cluster as swarm workers. By default this terraform config creates a single worker group running 1 instance.
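For illustration, `make swarm-instances` could be approximated with the AWS CLI by listing running instances together with the autoscaling group tag AWS attaches at launch (a sketch, not the project's actual implementation):

```sh
# List running instances and the autoscaling group each belongs to.
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{Id:InstanceId,Group:Tags[?Key==`aws:autoscaling:groupName`]|[0].Value}' \
  --output table
```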
To provide automatic swarm initialization, we run a one-shot Docker container on instance launch, which uses an S3 bucket to find active managers and join tokens.
See here for more information on how this works.
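Conceptually, the bootstrap container does something like the following on each instance (a simplified sketch; the real container, bucket name, and key layout will differ):

```sh
# Simplified sketch of the bootstrap flow; bucket and key names are hypothetical.
BUCKET=my-swarm-state   # hypothetical S3 bucket holding manager address and tokens

if aws s3 cp "s3://$BUCKET/worker-token" /tmp/token 2>/dev/null; then
  # An initialised swarm exists: join it using the stored manager address.
  aws s3 cp "s3://$BUCKET/manager-addr" /tmp/manager
  docker swarm join --token "$(cat /tmp/token)" "$(cat /tmp/manager):2377"
else
  # No swarm yet: initialise one and publish the join details for other instances.
  docker swarm init
  docker swarm join-token -q worker | aws s3 cp - "s3://$BUCKET/worker-token"
  hostname -i | aws s3 cp - "s3://$BUCKET/manager-addr"
fi
```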
TODO: Look into https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-route-53-releases-auto-naming-api-name-service-management/ to see if it can replace this requirement.
To allow external addressing of nodes in the cluster, you can configure an autoscaling group to automatically maintain a Route 53 DNS record. By default, only the manager group has a DNS record configured.
This record will be updated on the following autoscaling events:
- Instance Launched
- Instance Terminated
- Autoscaling Group Scale Down*
*NOTE: An autoscaling lifecycle hook is configured on scale-down events to delay termination of the instance until (DNS TTL + 120) seconds have elapsed from the time of the event.
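The hook is configured for you by the Terraform in this repository; purely for illustration, an equivalent hook expressed with the AWS CLI (hypothetical names, assuming a 300-second DNS TTL) would look like:

```sh
# Illustration only: hypothetical names; heartbeat = DNS TTL (300s) + 120s.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name delay-termination-for-dns \
  --auto-scaling-group-name swarm-managers \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 420
```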
In the case of groups with DNS records attached or groups executing long-running tasks, you probably want to decommission hosts in a more graceful fashion.
The steps to do this are:
- Set the docker node to DRAIN state, to prevent new tasks being allocated
- Stop all the containers on the node
- Set the host to unhealthy in the autoscaling group*
*This will automatically trigger the notification to update any associated DNS records. In that case, the instance will remain in the group until (DNS TTL + 120) seconds have elapsed.
make swarm-remove-instance ID=<instance-id>
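Manually, the same sequence looks roughly like this (node and instance IDs are placeholders, and this is a sketch rather than the task's actual implementation):

```sh
# Placeholders: <node> is the swarm node name/ID, <instance-id> the EC2 instance.
docker node update --availability drain <node>   # stop new tasks being scheduled

# Wait for the node's tasks to be rescheduled elsewhere, then on the node itself:
docker ps -q | xargs -r docker stop              # stop any remaining containers

# Mark the instance unhealthy so the autoscaling group replaces it
# (this also triggers the DNS update notification described above).
aws autoscaling set-instance-health \
  --instance-id <instance-id> \
  --health-status Unhealthy
```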
If for any reason you need to force a node out of the cluster, you can simply terminate it. The autoscaling group will automatically provision a new host and the swarm will reschedule the containers the node was running.
Once an instance has been removed from the swarm, the node is shown in a "down" state in the `docker node ls`
output. You can remove these nodes using the make task:
make swarm-tidy
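A manual equivalent, run from a manager, might look like this (a sketch rather than the actual task definition):

```sh
# Remove every node that `docker node ls` reports as Down (run on a manager).
docker node ls --format '{{.ID}} {{.Status}}' \
  | awk '$2 == "Down" {print $1}' \
  | xargs -r docker node rm
```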
WARNING: this will destroy ALL infrastructure elements with no method of retrieving data or configuration.
make clean
- Send EC2 logs to CloudWatch
- Set up CloudWatch Alarms
  - Lambda failures
  - EC2 Health
- Docker Registry in example app
- CI in example docker-compose