We’ll start with the non-nerd version. Last week a massive security hole was announced in some very standard software used by most sites on the internet, including Beeminder. Let us first quickly reassure you that your credit card info is handled by our much savvier payment processor, Stripe, and never even touches our servers.
We updated our server the same day the vulnerability was announced, thanks to our friend Joe Rayhawk, but, being really awful sysadmins ourselves, failed to notice that we needed to restart the web server for the fix to take effect. We did that the next morning when users started letting us know we were still vulnerable (thank you!). But then our server failed to boot back up normally and we spent a couple of frantic hours, with help again from Joe, getting it operational.
Then four days later, presumably entirely by coincidence, there was a power outage that took out the whole data center where Beeminder is hosted. That one lasted 3.5 hours.
In other embarrassing server-related news, for a couple of days last week many of you let us know that a lot of your Beeminder emails were being marked as spam. We asked our email service provider, MailGun, for help and they quickly fixed it. Definitely let us know if you ever see a Beeminder email marked as spam; that could be pretty devastating for us. And of course we always reverse derailments caused by any kind of technicality like that.
So this has not been a happy week at the beehive. Looking back over previous blog posts about our crashes of ineptitude, we realized we had failed to blog about our biggest crash of ineptitude of all time, at least as measured in downtime (there are some other doozies in those previous posts, like nearly destroying our database and causing spurious derailments). That was in November, when we had a completely unacceptable 12 hours of downtime as we scrambled to recover from a hard drive failure, wasting precious hours thinking we had the original server repaired only to finally realize we had no choice but to start from scratch on a fresh server. (Bright side: zero data loss!)
Suffice it to say, we are hacking our guts out making things more robust. In the short term that may mean more downtime (as brief as possible!) which we’ll continue to live-tweet. Anything major we’ll first announce on our main twitter account, @bmndr, and point to @beemstat for details. (So no need to follow both of them.)
We also want to say how sorry we are about all this. Especially since the downtime in November should really have been a sufficient kick in the pants to improve our infrastructure, at the very least to the point of being able to quickly redeploy from scratch on a new server in case of a total meltdown at wherever Beeminder happens to be hosted. And we do appreciate how much you rely on us. In fact, Bethany’s reaction to our most recent downtime was pretty funny. At some point there was nothing we could do but wait for our hosting provider to restore power, and Bethany said, with zero intended irony, “I guess I’ll check my beeminders to see what else I need to do tonight while we’re wait- Oh. Right.” [1]
——— Non-Nerds Stop Reading Here ———
For more of the technical details, I’ll turn this over to our fearless CTO, Bethany Soule.
Spamboxing
The IP address our mail was coming from had gotten blacklisted, presumably due to someone else’s shenanigans. MailGun switched our IP and it seems to have solved the problem.
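If you’re curious how you’d even check for something like that: receiving mail servers consult DNS-based blacklists (DNSBLs), which you can query yourself by reversing the IP’s octets and looking the result up under the blacklist’s zone. Here’s a minimal sketch in Python; the IP address is a documentation placeholder rather than our real sending IP, Spamhaus’s zen list is just one of many such lists, and some lists restrict queries made through big public resolvers.

```python
#!/usr/bin/env python3
"""Sketch: checking whether a sending IP is on a common DNS blacklist."""
import socket

def is_blacklisted(ip, dnsbl="zen.spamhaus.org"):
    # DNSBLs are queried by reversing the IP's octets and appending the zone:
    # 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org
    query = ".".join(reversed(ip.split("."))) + "." + dnsbl
    try:
        socket.gethostbyname(query)  # resolves (to a 127.0.0.x code) iff listed
        return True
    except socket.gaierror:
        return False  # NXDOMAIN: not on this particular list

if __name__ == "__main__":
    ip = "203.0.113.7"  # documentation address, stand-in for a real mail IP
    print(f"{ip} listed on zen.spamhaus.org: {is_blacklisted(ip)}")
```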
Power Outage
Saturday night a meteor full of zombies hit Newark and they gnawed through the intertubes connecting our servers to the world wide web!
That didn’t really happen. But Linode’s entire Newark data center had a power outage and apparently a UPS failure along with it. Because power was totally out to the entire data center, and because of our current deployment architecture (ha! that’s just a fancy way to say “we’re running all of Beeminder’s components off the same physical (virtual) machine”), there was little we could do to recover without losing data back to our latest backup, which was over 12 hours old. [UPDATE: To clarify, we suffered through the downtime so as not to have to restore from backup, so no data was actually lost!]
Heartbleed
Our servers run Ubuntu Linux and indeed were vulnerable to the Heartbleed bug for however long it’s been out there in the wild. We patched our libraries within several hours of Ubuntu releasing their updated packages, but failed to restart nginx to reload the updated libraries. We went to bed blissful in our ignorance of our ignorance. When you guys started emailing us in the morning we were no longer ignorant of our ignorance, but we were still ignorant of what in the world was wrong with our filesystem after updating the packages the day before. The server is running, and not vulnerable to Heartbleed, but it’s somewhat hobbled at the moment. Which brings me to my next point.
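But first, for posterity: a check like the following would have caught the “patched but not restarted” mistake. This is just a sketch (Linux-specific, written after the fact, not something we actually ran): once the OpenSSL packages are upgraded on disk, a long-running daemon like nginx keeps the old, vulnerable copy of libssl mapped in memory until it’s restarted, and /proc flags those stale mappings as deleted.

```python
#!/usr/bin/env python3
"""Sketch: find processes still using a deleted (i.e. since-upgraded) libssl."""
import glob
import re

def stale_ssl_processes():
    stale = {}
    for maps_path in glob.glob("/proc/[0-9]*/maps"):
        pid = maps_path.split("/")[2]
        try:
            with open(maps_path) as f:
                contents = f.read()
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or we lack privileges to inspect it
        # A mapping of the old libssl/libcrypto shows up marked "(deleted)".
        if re.search(r"lib(ssl|crypto)\S*\s+\(deleted\)", contents):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    stale[pid] = f.read().strip()
            except (PermissionError, FileNotFoundError):
                stale[pid] = "?"
    return stale

if __name__ == "__main__":
    for pid, name in sorted(stale_ssl_processes().items()):
        print(f"restart needed: pid {pid} ({name})")
```

Run as root right after the upgrade, that would have printed a line for nginx and saved us a morning.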
Getting Better
Aside from the immediate clean-up from this latest failure, we are (with help from our savvier friends) making Beeminder robust, at the very least, against this exact failure happening again. We’re hard at work writing and testing an actual deploy script for setting up a new copy of the Beeminder web server seamlessly and much more quickly than before. We’re also setting up database redundancy using Mongo’s replica sets, so that if our server’s data center goes AWOL we can quickly fail over to a replica.
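To give a concrete flavor of what the replica set buys us, here’s a minimal sketch of what the application side looks like with the PyMongo driver. The host names, replica set name, database, and collection are made up for illustration and aren’t our actual topology; the point is that the driver, given more than one seed host and the set’s name, tracks which member is primary and reconnects on its own after a failover.

```python
#!/usr/bin/env python3
"""Sketch: connecting to a MongoDB replica set so the driver handles failover."""
from pymongo import MongoClient

# List a couple of seed members; the driver discovers the rest of the set
# and keeps track of which member is currently primary.
client = MongoClient(
    "mongodb://db-newark.example.com:27017,db-fremont.example.com:27017",
    replicaSet="beeminder-rs",            # hypothetical replica set name
    readPreference="primaryPreferred",    # fall back to a secondary for reads
    serverSelectionTimeoutMS=5000,        # fail fast instead of hanging forever
)

db = client.beeminder  # hypothetical database name
# Writes always go to the current primary; if the primary's data center goes
# dark, the surviving members elect a new primary and the driver reconnects.
db.goals.update_one({"slug": "example-goal"}, {"$set": {"lane": 1}}, upsert=True)
```

If the data center hosting the primary goes dark, the surviving members hold an election and the application picks up the new primary automatically, which is exactly the safety net we were missing on Saturday.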
Footnotes
[1] Actually, thanks to the Android app’s robustness (normally useful when your phone doesn’t have data coverage), it was no problem to see what beemergencies we had left. One of them, for Bethany, was running, which the two of us did together at 2:30am after the server chaos settled down.