Using Amazon S3 to Store Database Backups

Using S3 to store your MySQL or PostgreSQL Backups

"I'd love to restore a Database Backup" said no one ever. When you're forced to do so, that means that the production system you're maintaining is going legs up and your company (or the company you're working for) is probably losing an outrageous amount of money for every second of downtime. If can happen for a variety of reasons:

  • Someone running a DELETE query with a WHERE clause broader than intended, or worse, with no WHERE clause at all.
  • TRUNCATE and DROP queries being executed on the wrong tables.
  • A faulty Hard Disk that suddenly stops working.
  • A hacker decided to ruin your company and deleted everything, or worse, you were hit by ransomware that is now demanding some Bitcoin to give your data back.

It should be clear that storing your Database Backups properly is extremely important. We live in the hope that we won't need them, but when the emergency comes, you'll be glad you went the extra mile to protect your data and to have that database backup ready to be restored.

In these situations you, the DevOps engineer, or SysAdmin, or Cloud expert, or simply "the only developer in the company", can either be the hero that saved the day or the one who didn't care about taking backups.

Now:

If you are using Amazon RDS you are pretty much covered, as RDS will automatically create and store backups for you every night.

But if you are not in the Cloud, or you are running a database server on an EC2 instance to retain more control, or on a third-party VPS to save a few cents, you will have to manage this yourself.

In this article, we will use Amazon S3 as a safe haven for one of our most precious assets.

AWS Cloud: a safe place for your backups

Amazon S3 has many nice features that make it a great place to save your backups. In this tutorial we are going to use:

  • The AWS CLI: a command line interface to the AWS API, to manage the bucket and to upload the backups;
  • Object Lifecycle Management: reduce costs by moving old backups to Infrequent Access Storage Class;
  • Server-Side Encryption: make sure that data is encrypted at rest;
  • MFA Delete: avoid accidental or malicious backup deletion with a Multi Factor Authentication token;
  • Logging: keep track of who uploads and downloads what, and from where.

Having some experience with AWS, even a tiny bit, will make your life much easier. This guide, however, should be easy to follow even if this is your first time in the Cloud.

Create an AWS account

If you already have an AWS account, you can skip this section.

Open the AWS Homepage and click on "Create an AWS Account":

Create an AWS Account

Insert your email address, select "I am a new user" and click on "Sign in using our secure server":

AWS Login

At this point you will be asked to fill in some information about you or your company, and a telephone number that will be verified with a phone call.

Finally they will ask you for credit card details. But don't worry, you don't pay a penny just for having an AWS account. All AWS services are priced by the hour or by the GB of storage, and an empty AWS account is free. By the way, the first year of AWS comes with plenty of freebies to let you experiment.

Get API Credentials

If you do have some experience with AWS you should already know how this works. Avoid using Root Credentials and prefer using an IAM user with restricted access. Keep the keys as secret as your bank account credentials.

For everyone else yet to be initiated into the amazing world of Amazon Web Services, here are some more details:

When you create an AWS account you log in using what is called the "Root Account". The Root Account is pretty much the Unix root user: almighty, powerful, and it pays the bills. Its credentials should be kept with maximum secrecy, as they have the power to start multiple $9,600/month instances on your card.

Aside from the login credentials you just created, you can also generate a pair of API credentials that applications and CLIs use to interact with AWS. Needless to say, API credentials for the Root Account are just as dangerous as the login ones, and you should avoid generating them entirely.

Countless botnets have been built using stolen AWS credentials leaked by someone naïve enough to push them to GitHub. Save yourself a world of pain and stay away from root user API credentials.

Instead learn to use AWS Identity and Access Management (IAM) and IAM Users.

You can think of IAM Users as subaccounts you would typically hand out to employees in your company, with well defined powers and capabilities, letting them do only what they strictly need to do. By default a new IAM User has no permissions at all, and we use policies to declare what they should be allowed to do.

As the root user, we can generate login credentials and/or API credentials for them. In this case we will only need the latter. These keys can still do damage, but with a much more limited scope.

If your database server is running on an AWS EC2 instance, you may not need an IAM User at all. You can assign a credential-less IAM Role to the EC2 instance with the permissions needed to save backups to S3. This is by far the safest solution, as it doesn't involve handling dangerous secrets. IAM Roles can be a bit hard to grasp at first; ask your AWS expert for more details, or see the sketch below.
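
If you want to try the role-based route, here is a rough sketch of the CLI steps under a few assumptions of mine: the role name backup-role and the instance id are placeholders I made up, and I'm attaching the broad AmazonS3FullAccess managed policy only for brevity (a narrower custom policy would be better):

cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role and let EC2 assume it
aws iam create-role --role-name backup-role \
    --assume-role-policy-document file://trust.json
aws iam attach-role-policy --role-name backup-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Wrap the role in an instance profile and attach it to the instance
aws iam create-instance-profile --instance-profile-name backup-role
aws iam add-role-to-instance-profile --instance-profile-name backup-role \
    --role-name backup-role
aws ec2 associate-iam-instance-profile --instance-id <instance_id> \
    --iam-instance-profile Name=backup-role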

Back to the original route. Open the IAM Console. On the left sidebar select "Users", then, "New User":

Create an IAM User

Give the user a name (for example backups), and check the "Programmatic Access" checkbox. This will enable API Access for the user and generate its credentials. Proceed to the next page.

Here you have to choose permissions for the user. If you are an experienced AWS user, you may want to write your own policy to grant only the exact permissions needed for the job. In this tutorial we will use the AWS managed AmazonS3FullAccess policy, which grants S3 superpowers to the user:

IAM User - S3FullAccess

Review the settings and on the final page you will receive your secret credentials. Please, remember to keep them safe! The safety of your backups, and of your AWS account in general, depends on it.

First rule: do not push them to repositories, whether public or private.

From now on, I will assume the user executing the commands has all the permissions needed to do so. In a production environment remember to follow the Principle of Least Privilege, which in this case means that the user creating the bucket should not be the same one uploading the backups. To do that you'll have to learn to write IAM Policies; a minimal example for the uploading user is sketched below.
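
For reference, a minimal inline policy for the uploading user might look like this sketch. It is an assumption of mine, not a prescription: the policy name backup-writer is made up, and you should adjust the resource ARN to your own bucket and prefix:

cat > backup-writer-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<bucketname>/backups/*"
    }
  ]
}
EOF

aws iam put-user-policy --user-name backups \
    --policy-name backup-writer \
    --policy-document file://backup-writer-policy.json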

Install the AWS CLI

The Command Line Interface is a great tool to experiment and to interface with the AWS Platform. Most of the steps listed in this tutorial could be manually applied from the AWS Console.

But you don't want to upload backups manually every day. You will need the CLI anyway.

If you are lucky enough to work with a Mac, and already use Homebrew, install the CLI using:

brew install awscli

On Linux and on Windows, ensure that you have Python and pip installed, then run:

pip install awscli

Before running any commands, you must set the access credentials that the CLI uses to authenticate its API requests. Run aws configure and set the appropriate values for Access Key Id and Secret Access Key.

You will also be asked for a region. You should probably pick the region that is geographically closest to you. You can find the list of available regions in the AWS Documentation.
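
To give you an idea of what to expect, this is roughly how the configuration session looks (the values are placeholders, and eu-west-1 is just an example region), followed by a quick sanity check that asks AWS "who am I?" and should print the ARN of your backups user:

aws configure
# AWS Access Key ID [None]: <iam_user_access_key>
# AWS Secret Access Key [None]: <iam_user_secret_key>
# Default region name [None]: eu-west-1
# Default output format [None]: json

aws sts get-caller-identity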

Setting up the Bucket

At this point you have to create a bucket in which to save the backups. You could reuse one that you already own, but the purpose of buckets is to separate objects of a different nature, and some of the customizations we are about to apply may not play well with your other objects.

Time to choose your bucket name. Remember that bucket names must be globally unique, so "database-backups" is not going to work for every one of you. My suggestion for non-public buckets is to suffix them with a UUID to ensure uniqueness (like backups-b3bd1643-8cbf-4927-a64a-f0cf9b58dfab). Once you have the name:

aws s3api create-bucket --bucket <bucketname>
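
One caveat worth mentioning: as written, that command creates the bucket in us-east-1. If your region is a different one, S3 wants an explicit location constraint, something along these lines (eu-west-1 is just an example):

aws s3api create-bucket --bucket <bucketname> \
          --create-bucket-configuration LocationConstraint=eu-west-1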

Since we are going to enable logging, we'll use the backups/ key prefix (or directory if you insist) to store the actual database dumps, and the logs/ prefix for S3 access logs.

If you prefer using your mouse and a UI, open the S3 Console and click on "Create Bucket":

Create a new S3 bucket

When the popup appears, insert the name of your bucket and the region. Then leave all the other settings at their defaults and continue. By default your bucket is accessible only by the Root Account and authorized IAM Users.

Containing backup costs

Using Object Lifecycle Management we are going to move objects older than 30 days to the Infrequent Access Storage Class (pay less for storage, a bit more for downloads). After 6 months the backups are probably going to be so old that they have no real use, so we are going to expire them.

Copy the following JSON Lifecycle Configuration to a file (I will name mine lifecycle.json) and feel free to make the appropriate edits for your case:

{
  "Rules": [
    {
      "ID": "Backups Lifecycle Configuration",
      "Status": "Enabled",
      "Prefix": "backups/",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        }
      ],
      "Expiration": {
        "Days": 180
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 2
      }
    }
  ]
}

This configuration contains a single "Rule" definition, which is applied to objects prefixed with backups/. We are instructing the bucket to move objects to the STANDARD_IA Storage Class after 30 days and to expire them after 180 days (roughly 6 months). Finally, we are making sure that incomplete Multipart Uploads are aborted after 2 days, which is a best practice for any bucket.

Run the following command to apply this configuration to your newly created bucket:

aws s3api put-bucket-lifecycle-configuration \
          --bucket <bucketname> \
          --lifecycle-configuration file://lifecycle.json
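
If you want to double check that the rules were applied, you can read the configuration back:

aws s3api get-bucket-lifecycle-configuration --bucket <bucketname>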

The same can easily be achieved from the S3 Console too. Select your bucket's Management tab and click on the "Add Lifecycle rule" button. Get comfortable with the wizard and fill in all the required fields.

Protecting against accidental deletion

The last thing you want is to accidentally delete your precious backups. Or even worse to have some malicious actor trying to ruin the company.

MFA Delete protects against this scenario: all delete requests must be further authenticated with a Two Factor Authentication token, like the one you use to protect your Gmail, Facebook or bank account. There are two requirements to enable it:

  • First, your AWS Root account must have MFA enabled. Head over to your IAM Dashboard and enable it with either a Virtual or Physical device.
  • Second, MFA Delete requires Bucket Versioning to be enabled. We don't really need versioning, since we are not going to overwrite our objects, but keep in mind it may add some costs should that happen.

Note: this step is optional. If it feels like overkill for your Kitten Blog WordPress database, feel free to skip to the next section.

Once MFA is enabled on your account, proceed to enable MFA Delete on the bucket. This is done with the same command used to enable versioning, which comes in handy:

aws s3api put-bucket-versioning \
          --mfa "<mfa_device_serial> <otp>" \
          --bucket <bucketname> \
          --versioning-configuration Status=Enabled,MFADelete=Enabled

You will notice that this command requires a One Time Password, preceded by the serial number (or ARN) of the root account's MFA device. Yes, AWS authenticates the request to enable MFA Delete with MFA: this way they are sure that your root account has MFA enabled and that you are authorized to make such a change. Note that this is one of the rare operations that must be performed with the Root Account's credentials.

Unfortunately this operation cannot be completed with the S3 Console. You'll have to use the CLI to enable MFA Delete.
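
To verify that both Versioning and MFA Delete ended up enabled, you can read the bucket's versioning status back:

aws s3api get-bucket-versioning --bucket <bucketname>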

Logging

By enabling Logging on the bucket we keep track of all uploads and downloads, authorized or malicious. This can be required for compliance reasons, or just to have an IP trail in case of a data leak. This step is optional too.

We are going to prefix log objects with logs/.

Create a second JSON file (logging.json) with this content:

{
    "LoggingEnabled": {
        "TargetBucket": "<bucketname>",
        "TargetPrefix": "logs/"
    }
}

And execute:

aws s3api put-bucket-logging \
          --bucket <bucketname> \
          --bucket-logging-status file://logging.json
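
As usual, you can read the setting back to make sure it took effect:

aws s3api get-bucket-logging --bucket <bucketname>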

Again, to do the same using the friendlier AWS Console, select your bucket from the S3 console, click on the Properties tab then on the Logging box:

Enabling Logging on an S3 Bucket

Select the current bucket as target and logs/ as prefix and Save.

Great! Your bucket is all set to receive your DB Backups!

Generating a backup

The first step to saving backups is of course creating them.

If you are running a MySQL server, you can back up everything with this single command:

mysqldump -u [user] \
          -p[password] \
          -h [host] \
          --single-transaction \
          --routines --triggers \
          --all-databases

That will write a huge blob of SQL to stdout, so you may want to compress it on the fly and save it to a file:

mysqldump -u [user] [...] | gzip > mysql_backup.sql.gz
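
And since the whole point is being able to restore, here is, roughly, the reverse pipe for the day you'll need it (same placeholders as above):

gunzip < mysql_backup.sql.gz | mysql -u [user] -p -h [host]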

If you are instead using PostgreSQL you can use pg_dumpall:

pg_dumpall -h [host] \
           -U [user] \
           --file=postgresql_backup.sql
gzip postgresql_backup.sql

If you are backing up a single database, you can exploit the Postgres "Custom" dump format, which is an already compressed and optimized backup format:

pg_dump -U [user] \
        -h [host] -Fc \
        --file=postgres_db.custom [database_name]
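
A custom-format dump is restored with pg_restore rather than psql. A minimal sketch, assuming the target database already exists:

pg_restore -U [user] \
           -h [host] \
           --dbname=[database_name] postgres_db.custom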

Naturally, you can follow this tutorial with any other database engine, like Oracle or SQL Server, but you'll have to figure out how to take a database snapshot yourself.

Storing the Backup in the Bucket

We're almost there. We now have a snapshot of the whole database in a single file. The last step is to actually upload that file to the bucket.

If you can use standard uploads, the next command will do the job:

S3_KEY=<bucketname>/backups/$(date "+%Y-%m-%d")-backup.gz
aws s3 cp <backupfile> s3://$S3_KEY --sse AES256

For the first time in this tutorial we used aws s3 instead of aws s3api. The latter is the low-level client that maps directly to the S3 API operations, while the s3 client is a higher-level abstraction on top of it, supporting fewer operations and options. In this case, though, it makes our life easier: if your backups are larger than 5GB you are forced to use the Multipart Upload process, and AWS actually suggests using it for any file larger than 100MB.

Using Multipart Uploads with the s3api is a real pain. The s3 client takes care of all the nitty-gritty details for us and it just works nicely.
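
If you want, you can also tune when and how the s3 client switches to Multipart Uploads. The thresholds below are just example values, not recommendations:

aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 50MB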

By using --sse AES256 we are asking S3 to perform encryption for data at rest. This is usually only needed for compliance reasons, unless you're scared that an AWS employee may steal your data.
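
If your compliance people want encryption with a key you manage, a variant of the same command can use SSE-KMS instead of the S3 managed keys; <kms_key_id> is a placeholder for your own key:

aws s3 cp <backupfile> s3://$S3_KEY --sse aws:kms --sse-kms-key-id <kms_key_id>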

So you're looking for a script to automate this?

Once you have set up the bucket, it's very easy to script this and run it daily:

#!/bin/bash

# Export the credentials so the AWS CLI can pick them up
export AWS_ACCESS_KEY_ID=<iam_user_access_key>
export AWS_SECRET_ACCESS_KEY=<iam_user_secret_key>
BUCKET=<bucketname>

MYSQL_USER=<user>
MYSQL_PASSWORD=<password>
MYSQL_HOST=<host>

# Dump everything and compress on the fly
mysqldump -u "$MYSQL_USER" \
          -p"$MYSQL_PASSWORD" \
          -h "$MYSQL_HOST" \
          --single-transaction \
          --routines --triggers \
          --all-databases | gzip > backup.gz

# Upload with a date-stamped key, encrypted at rest
S3_KEY=$BUCKET/backups/$(date "+%Y-%m-%d")-backup.gz
aws s3 cp backup.gz "s3://$S3_KEY" --sse AES256

rm -f backup.gz

Save this to a file somewhere on your server, for example in your home, and make it executable:

chmod +x ./backup.sh

Of course, replace the <placeholders> with your actual values. Also, if you're not using MySQL, replace the dump line with the appropriate command.
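
A small variation, if you'd rather not hardcode the keys in the script: store them in a named CLI profile once, drop the two exported keys from backup.sh, and pass the profile to the upload command. The profile name backups is just my example:

aws configure --profile backups

# ...then in backup.sh:
aws s3 cp backup.gz "s3://$S3_KEY" --sse AES256 --profile backups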

Setting up a cron job

To run this every day, at 12pm for example, run crontab -e and add the following line:

0 12 * * * /home/<youruser>/backup.sh
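
If you also want a trace of each run, redirect the script output to a log file (the path is just an example):

0 12 * * * /home/<youruser>/backup.sh >> /home/<youruser>/backup.log 2>&1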

Save and celebrate. 🎉

Bonus: infrastructure as code

For you, loyal reader, who got this far, here's a nice CloudFormation stack to create your bucket in an automated and repeatable fashion: download.

Using the S3 Console or the CLI is a great way to get comfortable, and to be fair, a lot of the infrastructure I've seen was built just like this: by hand. When you're ready to grow up, move to automation, whether it is CloudFormation, Terraform or anything else... they're life savers.

Unfortunately MFA Delete and Incomplete Multipart Upload Expiration cannot be enabled with CloudFormation and you will have to resort to the CLI for these two.

Or using Terraform

Lately I've fallen in love with Terraform and I don't use the AWS Console at all anymore. The following script is all you need to create a bucket configured exactly as we discussed so far.

variable "bucket_name" {}
variable "region" {}

provider "aws" {
    version = "~> 1.2"
    region = "${var.region}"
}

resource "aws_s3_bucket" "backup" {
  bucket = "${var.bucket_name}"
  acl    = "private"

  versioning {
    enabled = true
    mfa_delete = true
  }

  logging {
    target_bucket = "${var.bucket_name}"
    target_prefix = "logs/"
  }

  lifecycle_rule {
    id      = "backups"
    enabled = true

    prefix  = "backups/"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    expiration {
      days = 180
    }

    abort_incomplete_multipart_upload_days = 2
  }
}
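
To apply it, something like the following should do (eu-west-1 is just an example region):

terraform init
terraform apply -var 'bucket_name=<bucketname>' -var 'region=eu-west-1'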

Congratulations

If you made it this far, you are now generating and storing backups in a proper and secure way.

If your job is to maintain the infrastructure for a company (or your own company), you probably cannot just copy and paste whatever I did here. Your legal or compliance requirements will affect what you actually do and how you do most of this. You may not need Server Side Encryption or Logging, for example, or you may be asked to never expire database backups; in that case, have a look at Amazon Glacier for long term storage of cold data. It's your job as the "AWS expert" to find and customize the solution that best fits your use case.