In 2022, pretty much every software product that might ever run on cloud infrastructure supports some form of High Availability deployment.

In the best scenarios these depend on an external data store, and we can keep the application servers truly immutable (or even give them read-only disks!). Sometimes, though, the application we are trying to set up is the data store itself, like MongoDB, Influx, or Prometheus, or some poorly designed (ehm, “not cloud native”) software that just wants a disk to write stuff on. (How about a Bitcoin full node?)

I’m looking at you, Jenkins and WordPress.

The solution, depending on the specific software we are using, can be an NFS volume (EFS, on Amazon) or some more advanced system with an Active/Passive setup and automatic failover.
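If EFS fits your software, that setup is pleasantly small: a file system plus one mount target per AZ. A minimal sketch, assuming a pre-existing subnet (resource names here are just examples):

resource "aws_efs_file_system" "data" {}

resource "aws_efs_mount_target" "data_1a" {
  file_system_id = aws_efs_file_system.data.id
  subnet_id      = aws_subnet.private_1a.id
}

Instances can then mount it over NFS (for example with amazon-efs-utils) from any AZ in the VPC, which is exactly what makes it viable for multi-AZ setups.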

These setups, however, get expensive quickly: not only the cost of running the actual servers, but also the pricing of the Enterprise offerings. This doesn’t fare well with your short-on-cash quasi-unicorn startup.

So, at the beginning, and for a while, we will have to make do with single-node configurations: Mostly-Available™ services.

Anatomy of a Mostly-Available service

You know that Jenkins instance that your developers (or you, the full-stack 10x developer also doing DevOps) need, but that you could lose for a few minutes, or even hours, without most people even noticing?

That’s your MA service: something you do want online, at least 99% of the time. But a clustered setup is overkill, and so is paying ~$25 every month for an ALB.

It obstinately writes to a disk, so we will have to preserve that disk. But if the instance dies, we want it to recover automatically, thanks to an Autoscaling Group.

Setting up the EBS volume

EBS volumes are AZ-local: if a volume is created in eu-west-1a, it cannot be attached to an instance in eu-west-1b or 1c.

So, choose an AZ and remember it, because we will have to configure the ASG to launch instances in a subnet that lives in the same AZ as our Volume.

resource "aws_ebs_volume" "jenkins" {
  availability_zone = "eu-west-1a"
  size              = 32
}
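For reference, the subnet the ASG will use further down could be declared like this (a minimal sketch: the VPC reference and CIDR block are assumptions):

resource "aws_subnet" "private_1a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "eu-west-1a" # Must match the Volume's AZ
}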

Setting up the Autoscaling Group

Provided that you have already built an Amazon Machine Image (AMI) for your service, we are going to create the ASG, specifying a Userdata script that will attach and mount the volume at boot. For this purpose you can use this Terraform code:

data "aws_ami" "jenkins" {
  most_recent = true

  owners = ["self"]

  filter {
    name   = "name"
    values = ["jenkins-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_launch_template" "jenkins" {
  name          = "jenkins"
  image_id      = data.aws_ami.jenkins.id
  instance_type = var.instance_type
  user_data     = base64encode(data.template_file.jenkins_userdata.rendered)

  update_default_version = true
}

resource "aws_autoscaling_group" "jenkins" {
  name                      = "jenkins"
  max_size                  = 1
  min_size                  = 1
  desired_capacity          = 1

  # Make sure the subnet specified here lives in the same AZ as the Volume.
  vpc_zone_identifier       = [aws_subnet.private_1a.id]

  launch_template {
    id      = aws_launch_template.jenkins.id
    version = "$Latest"
  }
}

data "template_file" "jenkins_userdata" {
  template = file("${path.module}/userdata.sh")

  vars = {
    region    = var.region
    volume_id = aws_ebs_volume.jenkins.id
  }
}
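A side note: the template_file data source comes from the separate (and now archived) hashicorp/template provider. On any recent Terraform you can drop the data source entirely and use the built-in templatefile() function in the launch template instead:

  user_data = base64encode(templatefile("${path.module}/userdata.sh", {
    region    = var.region
    volume_id = aws_ebs_volume.jenkins.id
  }))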

You will also need to give the instances a role that allows the ec2:AttachVolume action.
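Something along these lines will do; the names are just examples, and for tighter security you can scope the Resource down to the ARNs of your volume and instances:

resource "aws_iam_role" "jenkins" {
  name = "jenkins"

  # Let EC2 instances assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "jenkins_attach_volume" {
  name = "attach-volume"
  role = aws_iam_role.jenkins.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action   = "ec2:AttachVolume"
      Effect   = "Allow"
      Resource = "*"
    }]
  })
}

# This is the profile referenced by the launch template above.
resource "aws_iam_instance_profile" "jenkins" {
  name = "jenkins"
  role = aws_iam_role.jenkins.name
}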

Finally, the Userdata script (this is tuned for Debian 11, feel free to adapt):

#!/bin/bash

# Configure awscli to use the current region
export AWS_REGION="${region}"

mkdir -p /root/.aws
cat <<EOF > /root/.aws/config
[default]
region = $AWS_REGION
EOF

# Attach the EBS volume
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 attach-volume --volume-id ${volume_id} --instance-id $INSTANCE_ID --device /dev/xvdb

# attach-volume is asynchronous, so wait for the device to actually show up.
# Note: on Nitro-based instances the device can appear as /dev/nvme1n1 instead.
while [ ! -b /dev/xvdb ]; do
  sleep 3
done

# Ensure an ext4 filesystem exists on the volume.
# We use blkid to detect the current filesystem, and if none is found we run
# mkfs. This has to happen *before* the mount, or the very first boot (with a
# blank volume) would fail.
VOLUME_FS=$(blkid -o value -s TYPE /dev/xvdb)
if [ "$VOLUME_FS" != "ext4" ]; then
  mkfs -t ext4 /dev/xvdb
fi

# Mount the Volume to the Service data directory
mkdir -p /opt/jenkins_data
mount /dev/xvdb /opt/jenkins_data

# Allow Jenkins to use the volume (or it will default to root:root)
chown -R jenkins: /opt/jenkins_data

# And, run!
systemctl enable jenkins
systemctl start jenkins

Yeah, I know, it looks more complicated than it actually is. There are a few lines in that script that are effectively only needed the very first time a volume is used.

If your Volume has already been used for a while, or you have initialized it on another instance, you can remove the filesystem check and the chown. The wait loop has to stay, though, since attach-volume is asynchronous.