Anyone remember our old blog post from 2012 about accidentally running a query that started deleting our whole database? It’s pretty entertaining and helpfully demarcates the parts that non-nerds should skip. If you’re a non-nerd I’d stick with (the non-nerd parts of) that post. The executive summary of this post is that we upgraded back-end stuff and it caused everything to break and we didn’t sleep very much for 48 hours but now we think things are back to normal, sort of, but please tell us if not.
OK nerds, backing up, Beeminder is a Ruby on Rails app with Mongo as the database. In between Rails and Mongo is the Object-Relational Mapper, or ORM. [1] It’s how Rails and Mongo talk to each other. Back in 2011 there were two choices for this: MongoMapper and Mongoid. MongoMapper was written by the obviously awesome John Nunemaker of GitHub so that’s what we went with. Google suggests that was even Mongo’s official endorsement at the time, so I guess we don’t need to feel dumb about that choice. Except for the part where, some months after we publicly launched, we found an outrageous bug that made MongoMapper try to delete our database (see aforementioned blog post from 2012) and that didn’t set off enough alarm bells to make us flee from it in terror.
But whether or not we should’ve known better, we bet on the wrong horse. MongoMapper went moribund over the last few years and, due to various incompatibilities with newer versions of other libraries we’re using, we were eventually forced to switch to Mongoid.
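To give a flavor of what that switch actually involves (a schematic example with made-up fields, not our real Goal model): every model declaration changes from MongoMapper’s key syntax to Mongoid’s field syntax, and every query and callback that leaned on MongoMapper quirks gets re-audited along the way.

```ruby
# Before: MongoMapper
class Goal
  include MongoMapper::Document

  key :slug,     String
  key :deadline, Integer, default: 0
end

# After: Mongoid
class Goal
  include Mongoid::Document

  field :slug,     type: String
  field :deadline, type: Integer, default: 0
end
```

Multiply that by every model and every query in the app and you can see how this turned into a monster.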
“Deploying this new model of modernity meant a mind-numbingly meticulous monster merge.”
Back in October, Bethany embarked on the project of making that switch, along with other needed refactoring and upgrades of Ruby and Rails. That was all in a separate branch of the code, of course, and the main branch of Beeminder wasn’t staying still. So when the upgrades and ORM switch were ready, deploying this new model of modernity meant a mind-numbingly meticulous monster merge. Mmmm. I mean, gack. It was super hard and error-prone and there were many things related to autodata integrations that we didn’t have a good way to test outside of production.
But finally we got impatient and bit the bullet and deployed the new hotness Monday night. And then everything broke. Here’s a sampling of the live-tweeted blow-by-blow as we stayed up all night fixing things (mostly Bethany, while Alys and I staffed the support inbox and the Twitters and whatnot):
We’ve avoided literal downtime so far but the database upgrade is causing other woes: Might’ve temporarily lost everyone’s deadline settings
— Beeminder Status (@beemstat) February 23, 2016
…followed by the tedious process of reconstructing clobbered settings on hundreds of people’s graphs. At some point we debated reverting everything so we could sleep but decided we were past the point of no return…
Aand the DB server went down again for unknown reasons. We may need to revert everything to old server and try again after sleeping :(
— Beeminder Status (@beemstat) February 23, 2016
And the dramatic conclusion (or at least when there were few enough fires and explosions that we stopped tweeting about it):
Hallelujah! Signing in w/ Google’s fixed. Some remaining cleanup & babysitting but no news = good news at this point. Thx so mch 4 patience!
— Beeminder Status (@beemstat) February 23, 2016
It was another 24 hours before we cleaned up the rest of the messes. Frustratingly, the one issue we haven’t resolved yet is that the whole site seems to be slower now. We may have messed up some database indices and are still profiling and investigating.
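For the nerds still reading, an illustrative suspect (not a diagnosis): Mongoid wants indexes declared in the model and then explicitly built, so if any index declarations got dropped or mangled in the migration, the affected queries quietly degrade to full collection scans. Schematically, with made-up fields rather than our actual schema:

```ruby
class Goal
  include Mongoid::Document

  belongs_to :user                # adds the user_id foreign key field
  field :deadline, type: Integer

  # Declaring the index is just metadata; it still has to be built, e.g. via
  #   bundle exec rake db:mongoid:create_indexes
  index({ user_id: 1 }, { background: true })
end
```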
For posterity, here’s the list of things that broke (and that we now believe to be fixed):
- IFTTT triggers broke in a couple different ways
- The SMS bot and some emails started spewing broken HTML
- Several autodata integrations broke in various ways
- Every time a goal was saved we overwrote various settings, most noticeably the goal’s deadline (fixing this involved painstakingly parsing our transaction log to reconstruct things and make sure no data was lost; there’s a sketch of that after this list)
- UI icons disappeared briefly
- Everyone’s session expired, which was especially bad because the password reset function was briefly broken (thanks to Andy Brett for quickly solving that one)
- Signing in with Google broke (and led us on a wild goose chase and maybe was just a matter of Google needing confirmation of the change of IP address or something)
- Changing rate units silently failed
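About that deadline-reconstruction item above: here’s roughly the shape of it as a heavily simplified sketch. It assumes the “transaction log” means Mongo’s replica-set oplog, and the database/collection names and deploy timestamp are made up; the real cleanup involved a lot more sanity-checking (and the oplog only reaches back so far, which is part of why it was painstaking).

```ruby
require 'mongo'

# Hypothetical sketch: recover the last pre-deploy value of each goal's
# deadline from the oplog, then put it back.
DEPLOY_TIME = Time.utc(2016, 2, 22, 23, 0)  # roughly when the bad code shipped

local = Mongo::Client.new(['127.0.0.1:27017'], database: 'local')
oplog = local['oplog.rs']

last_good = {}  # goal _id => last deadline value set before the deploy

oplog.find(ns: 'beeminder.goals', op: 'u').each do |entry|
  next if entry['ts'].seconds >= DEPLOY_TIME.to_i
  set = entry['o'] && entry['o']['$set']      # only handle $set-style updates
  next unless set && set.key?('deadline')
  last_good[entry['o2']['_id']] = set['deadline']
end

goals = Mongo::Client.new(['127.0.0.1:27017'], database: 'beeminder')['goals']
last_good.each do |goal_id, deadline|
  # In real life: eyeball each one against the current document before writing.
  goals.update_one({ _id: goal_id }, { '$set' => { deadline: deadline } })
end
```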
Thanks so much for all the patience and encouragement! Keep telling us about problems you notice, even when you figure we probably already know. Getting multiple reports is a great gauge of priority.
PS: This was all written with not enough input from Bee, who’s still recovering from the trauma herein described. Since this post is due at 6am (see that dogfood graph in the sidebar?) and she went to sleep shortly after midnight, I asked her for her disutility for being woken up at 5am to vet this post before it went live. She said she was too tired to introspect on her utility function properly but the bounds were $20 to $800 of disutility. That’s pretty high (in expectation) so I’m going to publish this without her vetting and she can chime in in the comments if I got anything wrong!
And a huge thank you, Bee, for the hacking heroics! Despite all these things breaking, and lots of people being sad, we avoided more than a few minutes of downtime and I believe most users didn’t actually notice any problems at all. Which is pretty amazing.
PPS: Another huge thanks to Alice Harris, aka Alys, for staying up half the night with us talking to distressed users and helping with triage. Our Support Czar, Chelsea Miller, has been brilliant as ever in the aftermath as well. It doesn’t hurt that our users apparently like us so much that they stay super nice even when we break everything. So thanks to you all too!
Footnotes
[1] I guess it’s technically an ODM — Object-Document Mapper — since Mongo is not a relational database. Now stop with your quibbling and let us finish this story.