Joe's Nerd Party
The perfect web framework
I’m hoping something that meets all the following is developed in a few years.
  • The data access layer is separated out completely.
  • Security built-in.  Easy to prevent csrf and xss without having to think much about it.
  • Packages css/js for you.  Would be awesome if it worked with Sass/Coffeescript (or had something similar).
  • Easy to write acceptance / unit tests.
  • The language the framework is in has a good DSL for constructing SQL queries (like korma, sequel, etc).  I don’t really need a full-fledged ORM — I like using postgresql features like views and triggers — but I’m not hand-writing SQL all the time.
  • The compiler can catch typing or missing method errors.  Computers should do my work for me, damn it.  I should know on compilation that a route/url was generated somewhere in my application without the correct parameters.
  • Views probably will get complex.  There should be a good solution for complicated views.  HTML generation code often shares lots of things with tiny variations.
  • Comes with a project skeleton.
  • Deploys on heroku.
  • Live code-reloading in development mode.
  • Compiling / packaging is easy.
  • Installing the app to a server doesn’t require installing a butt-load of dependencies managed separately from the application (I’m looking at you Ruby).
  • Has some sort of a CRUD admin interface that can be plugged in and customized.  active_admin for Rails is pretty good.  http://activeadmin.info/
  • Has a sane way of managing 3rd party dependencies.
  • The application boots fast (for minimal downtime during deployments) and doesn’t use tons of ram.
  • Has a way to stuff long running tasks in the background and report the progress of the task to the user.  (note: I’m apparently not smart enough to understand amqp, at least in Ruby).
  • Solutions for form validations and ajax.
  • Support for i18n/localization.

I’m sorta thinking Scala and Haskell are the only options here.  Haskell’s cabal still sorta sucks, and Scala’s complexity sorta scares me.

Fancy HTML Emails with Rails 3.1

Getting HTML emails to look nice is a pain.  Most email clients can’t use stylesheets, so you have to embed all the styles inline in the HTML.  You also have to write a separate plain-text version of the email.  And popular email clients (Outlook, Windows Live Mail, etc) render html email using some very weird rules.

Here’s what our order email looks like in Gmail: 

Here’s what it looks like on the iPhone:

Not too shabby.

We found this to be a great article on how to make mobile email great looking: http://webdesignerwall.com/general/make-your-html-email-5-times-more-mobile-friendly

We also used the Premailer gem to automatically inline the linked stylesheet in the email views.

Our email layout looks something like:

We include a stylesheet in the HTML. Premailer downloads it, processes it, and inserts the css rules inline in the HTML.

The @media rules need to be inline in the email layout, since Premailer can’t handle those being in a separate css file yet.

We use premailer-rails3 to integrate Premailer into Rails 3.  Unfortunately, we found a bunch of bugs in premailer and premailer-rails3. Our forks of the projects are at https://github.com/joevandyk/premailer and https://github.com/joevandyk/premailer-rails3.  The forks fix some encoding bugs, remove some weird css processing stuff done by premailer-rails3, allow premailer to not strip out embedded <style> rules in the email layouts, and some other things.  

We also found a bug in sass-rails, where you can’t embed image-urls in inline sass code.  See https://github.com/rails/sass-rails/issues/71

Premailer-rails3 hooks into ActionMailer when the email is actually being delivered, not just generated.  When running tests, email is not actually sent, so the premailer-rails3 hooks don’t get run during tests.  I haven’t spent the time to see if it’s possible to get the premailer processing to run during tests, but that would be a nice thing to do.

Also, our forks on premailer-rails3 assume that you want premailer to go out and actually download the linked CSS files.  It should be possible to use the Rails 3.1 asset pipeline to get the processed css without downloading it.

A very special thanks goes to Jordan Isip who did the super annoying job of making sure the emails look great in all the different clients out there.  Writing that CSS/HTML did not look fun.

WebOps #4

Your continuous integration tests, if they use a database (or other shared resource), should use a unique database name for each run.

For example, use tanga_test_[git-hash] for the test database name.

This way, you can have multiple tests going without conflicts.

Same thing if you use rabbitmq or memcached or whatever — segment the namespaces by some unique identifier.
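
Here's a rough sketch of how the naming could work. The `namespaced` and `current_git_hash` helpers are made up for illustration (they aren't from our build scripts), and the sketch assumes the tests run inside a git checkout. The same name could be reused as a memcached key prefix or a rabbitmq vhost.

```python
import re
import subprocess


def current_git_hash():
    """Ask git for the short hash of the checked-out commit."""
    out = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
    return out.decode().strip()


def namespaced(prefix, unique_id):
    """Build a name like tanga_test_3f9c2ab from a prefix and an identifier.

    Anything that isn't a lowercase letter, digit, or underscore is replaced
    with an underscore, and the result is cut to 63 characters, which is
    PostgreSQL's default limit for identifiers.
    """
    safe = re.sub(r"[^a-z0-9_]", "_", unique_id.lower())
    return f"{prefix}_{safe}"[:63]


if __name__ == "__main__":
    # Hypothetical usage: print the database name for this checkout.
    print(namespaced("tanga_test", current_git_hash()))
```

Note that two builds of the same commit running at the same time would still collide. If your CI server does that, append a build number or process id as well.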

WebOps #3

It’s amazing how unicorn and nginx let you easily do zero-downtime deploys. More software should handle signals correctly.

I purchased the Linux Programming Interface book and it’s stuffed with information on how to design things like this.

WebOps #2

You should probably store your images/uploads on S3. Are you going to build a cluster of synchronized fileservers? Probably not.

Unsolved question: How do you properly back up the S3 files? Backups would need to be stored somewhere off S3, in case your S3 account mysteriously disappears. And a simple copy/replace/rsync wouldn’t do it: what if all your files were truncated to zero length? You’d want some sort of incremental backup that takes a snapshot of how your S3 account looks at a given point in time. You also need to be able to do this without doing a zillion GET/PUT/LIST requests, since those are expensive.

WebOps #1

This is a series of posts. I’ll add one each time I realize or read something useful.

In Rails, don’t set cookies for all domains (e.g. .tanga.com). Restrict cookies to ‘www.tanga.com’. Otherwise, the cookies will be sent when doing requests for images, javascript, css, etc, even though they are hosted on assets.tanga.com.
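
To see why, here's a simplified sketch of how a browser decides whether to attach a cookie to a request. It only approximates the domain-matching rule from RFC 6265 (it ignores paths, the Secure flag, IP addresses, and public suffixes), and `cookie_is_sent` is a made-up helper, not anything from Rails.

```python
def cookie_is_sent(request_host, set_by_host, domain_attr=None):
    """Return True if a browser would send the cookie to request_host.

    set_by_host is the host that set the cookie. domain_attr is the
    cookie's Domain attribute, or None if the cookie had no Domain.
    """
    request_host = request_host.lower()
    if domain_attr is None:
        # No Domain attribute: the cookie is "host-only" and goes back
        # only to the exact host that set it.
        return request_host == set_by_host.lower()
    # With a Domain attribute, the cookie goes to that domain and to
    # every subdomain of it. A leading dot is ignored.
    domain = domain_attr.lower().lstrip(".")
    return request_host == domain or request_host.endswith("." + domain)


# A cookie scoped to .tanga.com rides along on every asset request:
print(cookie_is_sent("assets.tanga.com", "www.tanga.com", ".tanga.com"))
# A cookie restricted to www.tanga.com does not:
print(cookie_is_sent("assets.tanga.com", "www.tanga.com"))
```

The first call returns True and the second returns False, which is the whole point: every image and stylesheet request to assets.tanga.com carries the cookie bytes unless the cookie is scoped to www.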

PostgreSQL Backups

Tanga uses PostgreSQL.  Before the Crash Of 2011, we used 8.3, now we use 9.0.

Our database server became slow and then non-responsive. We used EBS for the database storage. Up until that fateful day, we had 600+ days of uptime on that server, so it was a pretty stable (read: lucky) setup.  As EBS became unresponsive, I figured that I would have to shut down the machine to mount the EBS drive elsewhere.  That’s the point of EBS - the storage is separate from the machine, so if the machine goes down, you can use the database stored on the storage elsewhere.

However, the server/instance wouldn’t shut down.  The EBS drive would not unmount (disconnect).  The database server was unreachable, but I could not disconnect the EBS storage and use it elsewhere.

Shit.  See http://twitter.com/joevandyk/status/61207619540500480 for my thoughts at the time.

At this point, Amazon was not providing much information about when they expected things to come back.  So, I looked at my backups.

Unfortunately, the last backup I had was about two hours old.  This isn’t too bad, but we would have lost some information and annoyed some people.

My backup method was to do a complete dump of the database every so often. Then, another machine on a different provider would copy the backup (using rsnapshot) off the database machine. rsnapshot was configured to keep hourly, daily, weekly, and monthly copies of the database.

Obviously, “every so often” wasn’t often enough if the latest usable backup copy was two hours old. This might be acceptable for some sites, but I really didn’t want to lose that two hour period of data.

Our database isn’t too big (compressed, the database backup is about 1 gigabyte).  But that does take some time to copy over the internet (remember we’re doing complete backups). Also, doing more frequent complete dumps of the production database was causing the database to slow down (turns out this was probably an EBS problem as well).

This is the first thing we need to address: fixing our backups.  If Amazon went “poof” and all our data disappeared instantly, we should still have access to an up to date version of it.  If only our database server at Amazon went “poof”, we should have a VERY recent copy of the data. This isn’t specific to Amazon.  Servers die.  If you don’t have a backup and replication strategy, you will be screwed eventually.

Unfortunately, with postgresql, the documentation for replication and backups always confused me.  Postgresql 9 has some neat replication features, and 9.1 is supposed to be even better.  

So what are we going to do?  Doing a complete database backup with pg_dump will take too long, and as our database gets bigger, it’ll become impossible to do.  Remember that we want to transfer our data off Amazon as soon as possible. Doing a transfer of a large database will be difficult once it reaches a certain size.

Luckily for us, Postgresql has something called “Log Shipping”. As postgres saves stuff, it writes the changes out to log files (WAL segments, in their terminology). By default, these log files are 16 megabytes each. Once the log file reaches the 16 meg limit (or a time limit expires, if you set archive_timeout; around 60 seconds is a common choice), an “archive_command” is run, which can copy the log file to somewhere else.  Using the archive_command, we can copy the changes to the database to someplace else relatively soon after the changes happened.

Keep in mind that these log files just contain the changes to the database over some small period of time; they don’t contain all of the database data.  To do a complete restore of the database, we will need to apply the log files mentioned above to a “base backup”.  

So, the procedure is:

  1. Create a base backup.
  2. Copy that base backup somewhere safe.
  3. Tell postgresql to copy the WAL records somewhere safe.
  4. Ensure you can restore the database.
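
Step 3 could be handled by a small script like the sketch below. This is an illustration, not the script we run: the name `archive_wal` and all paths are made up, and it only copies into a local directory (which a separate job would then ship off the machine). The two things it leans on are real PostgreSQL behavior: archive_command substitutes %p with the path of the log file and %f with its file name, and the command must exit with zero only if the file was really archived. The PostgreSQL docs also recommend refusing to overwrite a file that is already in the archive.

```python
import os
import shutil
import sys


def archive_wal(wal_path, wal_name, archive_dir):
    """Copy one finished WAL file into archive_dir.

    Returns 0 on success and 1 if the file is already archived, so the
    value can be used directly as the process exit status.
    """
    dest = os.path.join(archive_dir, wal_name)
    if os.path.exists(dest):
        # Never overwrite an archived file; a non-zero exit makes
        # postgres keep the segment and retry later.
        return 1
    # Copy under a temporary name, then rename, so a half-written file
    # is never mistaken for a complete segment.
    tmp = dest + ".tmp"
    shutil.copyfile(wal_path, tmp)
    os.rename(tmp, dest)
    return 0


# Hypothetical postgresql.conf entry (paths are placeholders):
#   archive_command = 'python /path/to/archive_wal.py %p %f /mnt/wal_archive'
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(archive_wal(sys.argv[1], sys.argv[2], sys.argv[3]))
```

A real version would also want to fsync the copied file and alert someone when archiving keeps failing, since postgres holds on to unarchived log files and the disk will eventually fill up.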
This is me. I live in the Matrix.


Motivations

I develop and run a website called http://www.tanga.com.

Last week, my website was offline for 36+ hours because of problems at Amazon. See http://highscalability.com/blog/2011/4/25/the-big-list-of-articles-on-the-amazon-outage.html for more details.

There were multiple reasons why our website was so vulnerable to a glitch in Amazon service.  This blog will be an attempt to investigate and solve these issues.  I’m hoping this information will be useful to other people.

Let’s start with some assumptions:

  • We’re running a webservice or website - something http-based.
  • The dataset involved is a “reasonable” size.  We aren’t dealing with Google, Reddit, or Amazon-sized loads.  Most discussion online about high availability and scalability assumes you are dealing with huge amounts of traffic and data. If you follow recommendations geared for large sites, you will be wasting effort that should be spent in other areas.
  • You want your service to be up as much as possible, while keeping costs and complexity down.  You don’t want an army of operations people, you don’t want a byzantine system, you don’t want overly-complex code.
  • You know the basics of *nix, databases, filesystems, http, etc.
  • You run on some *nix.  I use Ubuntu server (10.04).

The next post will deal with database backups! You don’t think your database server will last forever, do you…?

Beginnings…

What this blog will be about:

  • Services!  Monolithic applications are for losers.
  • Postgresql: backups, replication, performance, storage
  • Rails: nothing too Rails specific, but some general thoughts, especially about integrating with services and background processes and forms.
  • Load balancing 
  • Backups & Recovery
  • Failures: thinking of them, thinking of recovery options.
  • Chef
  • Logging
  • Visualization & transparency
  • Alerting
  • Daemons
  • Security