Reducing Downtime While Backing Up AWS EC2 Instances

We’re big fans of AWS and use several of its services.  One service we use is EC2: it costs much less than having a dedicated server and still gives you full control of the server.

 

One issue is being sure we can bring the server back up in case it crashes, or the instance is somehow corrupted (I’ve read and heard about this anecdotally enough to believe it must happen on rare occasion).  That means we need backups.

 

Backing up an EC2 instance is really just backing up the full drives in the instance.  There are two ways to do this:

 

 

1. Create an AMI of the server.  For a consistent image the EC2 instance should be stopped (by default AWS reboots a running instance while it creates the image).  The AMI copies all drives and the EC2 instance configuration into a single entity, making it easy to spin up another identical server.

 

2. Back up the individual volumes of the EC2 instance.  We use this method because downtime is much shorter.

 

Taking a snapshot of a volume is the way you back it up.  Snapshots are incremental, which means only data changed since the last snapshot needs to be copied.  So the very first snapshot takes the longest, as the most data (the full drive) needs to be copied.

 

Snapshots are done asynchronously, and are low priority.  They can take a while.  If you’re looking to reduce downtime, you should be aware of a couple of points:

 

1. It’s safe to back up a volume that is in use.  However, you’ll end up with whatever the state of the disk was at the instant the snapshot starts.  That means it’s safest to either have the disks unmounted from the operating system (so the disk is in a consistent state), or if that’s not possible (the boot drive is being backed up), stop the instance.

 

2. I’ve not been able to find any documentation that describes exactly when the snapshot starts.  You can ‘create a snapshot’, but it appears that really just creates a snapshot request that will get handled at some point soon.

 

#2 is the kicker.  You don’t know when the snapshot has started.  Because of that, we wait until the status is “pending” and the progress is something other than “0%”: once a snapshot reports 1% complete, it has definitely started.
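Here’s a minimal sketch of that check in Python.  It assumes the snapshot is a dict shaped like the entries boto3’s `describe_snapshots` returns, with `State` and `Progress` keys.  The function name `snapshot_has_started` is our own, and treating a missing or empty `Progress` as 0% is an assumption, not documented behavior.

```python
def snapshot_has_started(snapshot: dict) -> bool:
    """Return True once a snapshot reports progress beyond 0% (or has completed)."""
    state = snapshot.get("State", "")
    if state == "error":
        raise RuntimeError(f"snapshot {snapshot.get('SnapshotId')} failed")
    if state == "completed":
        return True

    # Progress arrives as a string such as "1%".
    # Assumption: a missing, empty, or unparseable value counts as 0%.
    raw = (snapshot.get("Progress") or "0").rstrip("%")
    try:
        percent = float(raw)
    except ValueError:
        percent = 0.0

    return state == "pending" and percent > 0
```

A snapshot at “0%” keeps you waiting; anything above that (or “completed”) lets you move on.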

 

With that in mind, this procedure (coded using any of the AWS APIs or done manually) will give you the shortest possible downtime while creating a backup of your server:

 

1. Stop EC2 instance

 

 

2. Start a snapshot of each volume in the instance

 

3. For each snapshot, wait until the status is more than 0% complete

 

4. Start the EC2 instance

 

You can start the EC2 instance as soon as all snapshots have made progress.  This is completely safe.  The beauty, and the time savings, is in not having to wait for the snapshots to finish.
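The four steps above could be scripted roughly like this.  It’s a sketch, not our production script: it assumes you pass in a boto3 EC2 client as `ec2`, and the function name `backup_instance` and its `poll_seconds` and `sleep` parameters are made up for illustration.  The `finally` block is our own addition so the instance comes back up even if a snapshot fails.

```python
import time


def backup_instance(ec2, instance_id, poll_seconds=5, sleep=time.sleep):
    """Stop the instance, snapshot its volumes, restart once every snapshot has begun."""
    # 1. Stop the instance and block until it is fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    try:
        # 2. Request a snapshot of every EBS volume attached to the instance.
        reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
        instance = reservations[0]["Instances"][0]
        volume_ids = [
            mapping["Ebs"]["VolumeId"]
            for mapping in instance["BlockDeviceMappings"]
            if "Ebs" in mapping
        ]
        snapshot_ids = [
            ec2.create_snapshot(
                VolumeId=volume_id,
                Description=f"backup of {instance_id} {volume_id}",
            )["SnapshotId"]
            for volume_id in volume_ids
        ]

        # 3. Poll until each snapshot reports more than 0% (or has completed).
        waiting = set(snapshot_ids)
        while waiting:
            response = ec2.describe_snapshots(SnapshotIds=sorted(waiting))
            for snap in response["Snapshots"]:
                if snap["State"] == "error":
                    raise RuntimeError(f"snapshot {snap['SnapshotId']} failed")
                # Assumption: an empty Progress string counts as 0%.
                percent = float((snap.get("Progress") or "0").rstrip("%") or 0)
                if snap["State"] == "completed" or percent > 0:
                    waiting.discard(snap["SnapshotId"])
            if waiting:
                sleep(poll_seconds)
    finally:
        # 4. Start the instance again, whether or not the snapshots succeeded.
        ec2.start_instances(InstanceIds=[instance_id])

    return snapshot_ids
```

With boto3 installed and credentials configured, a call would look something like `backup_instance(boto3.client("ec2"), "i-0123456789abcdef0")`, where the instance ID is a placeholder.  The function returns as soon as the instance has been asked to start; the snapshots keep copying in the background.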

 

Using this method, we find we can back up a server with two 150 GB drives in 12-15 minutes.  If we waited for the snapshots to complete, it could be an hour or more.

 

If you back up once a week, with 15 minutes of downtime for the backup, that’s 99.85% uptime, or about “three nines”.
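To check that figure, here’s the arithmetic as a quick Python sketch (the helper name `uptime_percent` is just for illustration).  A week has 7 × 24 × 60 = 10,080 minutes, so 15 minutes of downtime is about 0.15% of the week.

```python
def uptime_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Percentage of the period during which the server was up."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes

# One 15-minute backup window per week.
print(f"{uptime_percent(15, MINUTES_PER_WEEK):.2f}%")  # prints 99.85%
```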

 

 

