Tanga uses PostgreSQL. Before the Crash Of 2011, we used 8.3, now we use 9.0.
Our database server became slower and non-responsive. We used EBS for the database storage. Up until that fateful day, we had 600+ days of uptime on that server, so it was a pretty stable (read: lucky) setup. As EBS became unresponsive, I figured that I would have to shutdown the machine to mount the EBS drive elsewhere. That’s the point of EBS - the storage is separate from the machine, so if the machine goes down, you can use the database stored on the storage elsewhere.
However, the server/instance wouldn’t shutdown. The EBS drive would not unmount (disconnect). The database server was unreachable, but I could not disconnect the EBS storage and use it elsewhere.
Shit. See http://twitter.com/joevandyk/status/61207619540500480 for my thoughts at the time.
At this point, Amazon was not providing much information about when they expected things to come back. So, I looked at my backups.
Unfortunately, the last backup I had was about two hours old. This isn’t too bad, but we would have lost some information and annoyed some people.
My backup method was to do a complete dump of the database every so often. Then, another machine on a different provider would copy the backup (using rsnapshot) off the database machine. rsnapshot was configured to keep hourly, daily, weekly, and monthly copies of the database.
Obviously, “every so often” wasn’t often enough if the latest usable backup copy was two hours old. This might be acceptable for some sites, but I really didn’t want to lose that two hour period of data.
Our database isn’t too big (compressed, the database backup is about 1 gigabyte). But that does take some time to copy over the internet (remember we’re doing complete backups). Also, doing more frequent complete dumps of the production database was causing the database to slow down (turns out this was probably a EBS problem as well).
This is the first thing we need to address: fixing our backups. If Amazon went “poof” and all our data disappeared instantly, we should still have access to an up to date version of it. If only our database server at Amazon went “poof”, we should have a VERY recent copy of the data. This isn’t specific to Amazon. Servers die. If you don’t have a backup and replication strategy, you will be screwed eventually.
Unfortunately, with postgresql, the documentation for replication and backups always confused me. Postgresql 9 has some neat replication features, and 9.1 is supposed to be even better.
So what are we going to do? Doing a complete database backup with pg_dump will take too long, and our database gets bigger, it’ll become impossible to do. Remember that we want to transfer our data off Amazon as soon as possible. Doing a transfer of a large database will be difficult once it reaches a certain size.
Luckily for us, Postgresql has something called “Log Shipping”. As postgres saves stuff, it writes the data out to log files (WAL records, in their terminology). By default, these log files are 16 megabytes each. Once the log file reaches the 16 meg limit (or a time limit expires - usually around 60 seconds), an “archive_command” is ran, which can copy the log file to somewhere else. Using the archive_command, we can copy the changes to the database to someplace else relatively soon after the changes happened.
Keep in mind that these log files just contain the changes to the database over some small period of time, it doesn’t contain all of the database data. To do a complete restore of the database, we will need to apply the log files mentioned above to a “base backup”.
So, the procedure is:
- Create a base backup.
- Copy that base backup somewhere safe.
- Tell postgresql to copy the WAL records somewhere safe.
- Ensure you can restore the database.