Non-Nerd Version of This Harrowing Tale
Bethany and I woke up at 7am on March 7th to a text message from Jill that some graphs seemed to be fubar. Panic ensued as we hacked away nonstop till 5pm or so, never getting dressed or leaving the house. What was the problem? The night before we had run a seemingly innocuous command to destroy some data points that were part of meta/users in such-and-such time range. I think we were cleaning up test users so as not to inadvertently inflate that metric.
We used a command called “destroy_all” on the dataset we wanted to get rid of, but it turns out destroy_all can’t be qualified that way and, much worse, it silently ignores the error. When we ran that command it just started destroying all the data points! We canceled the command when it seemed to be taking too long, not suspecting how drastically awry things had just gone. Until the fateful text from Jill, which triggered a frantic scramble to piece everything back together from the last backup and the logs of new data that came in since then.
Long story short: Bethany saves the day and all is well again in Beeminder land.
Nerd Version
We hate to pile on poor Mongo but the recent spate of Hacker News hate for Mongo reminded us that we should really blog about a little experience we had six months ago.
So here’s what not to do!
First, fire up the Rails console on your production machine and write a little query to find the objects you want:
Datapoints.where(:conditions=>blahblah)
Inspect your results. Confirm that is totally the set of objects you want to blow to smithereens. “Yes, yes it is. I feel totally confident (well, maybe slightly anxious, let’s be honest now) in blowing away that results set.”
Then:
Datapoints.where(:conditions=>blahblah).destroy_all
Our datapoints query was scoped within the domain of a single Goal instance (some meta goal), and it still went through destroying ALL the datapoints. Fuck scope.
If you have a lot of Things (and we do, going on half a million of them now), it turns out this will take for frakkin’ ever, because it is DESTROYING EVERYTHING. If you wait for it to actually return, you get a handy error along the lines of “Can’t do destroy on a query object. (But I just destroyed everything anyway. Y’know, just in case.)” [1]
First off, the above is just not how destroy_all is supposed to work. RTFM, etc. The destroy_all method is a class method that actually takes the query conditions as arguments, performs the query for you, and destroys each document returned. So calling destroy_all on a collection with no arguments in fact destroys everything. Where we went wrong was assuming that the scope of the previously chained query would apply. The problem may stem from the where() method actually returning a Plucky::Query object, not a result set from the database. This is for some very nice reasons, like being able to chain together queries, and not actually hitting the database until you need the results, etc. Still, it’s pretty crazy that the result of calling destroy_all illegally on a query object is to destroy the entire collection!
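For the curious, here is a toy model of how that kind of failure can happen. To be clear, this is a sketch and not MongoMapper's or Plucky's actual code: ToyModel and ToyQuery are made-up names, and the assumption (ours, for illustration) is that the query object forwards methods it doesn't know to the model class without passing its conditions along, and only complains afterwards.

```ruby
# Toy model only: ToyModel and ToyQuery are hypothetical stand-ins,
# not MongoMapper or Plucky classes.
class ToyModel
  @docs = []

  class << self
    attr_reader :docs

    def create(attrs)
      @docs << attrs
      attrs
    end

    # Returns a lazy query object; nothing is filtered yet.
    def where(conditions)
      ToyQuery.new(self, conditions)
    end

    # Class method: conditions come in as arguments.
    # With no arguments, every document matches and is removed.
    def destroy_all(conditions = {})
      before = @docs.size
      @docs.reject! { |doc| conditions.all? { |key, value| doc[key] == value } }
      before - @docs.size
    end
  end
end

class ToyQuery
  def initialize(model, conditions)
    @model = model
    @conditions = conditions
  end

  # The scope is honored when reading results.
  def to_a
    @model.docs.select { |doc| @conditions.all? { |key, value| doc[key] == value } }
  end

  # Assumed failure mode: unknown methods are forwarded to the model
  # WITHOUT the query's conditions, and only then does the query
  # report that it cannot handle the call.
  def method_missing(name, *args, &block)
    return super unless @model.respond_to?(name)

    @model.public_send(name, *args, &block)
    raise NoMethodError, "undefined method `#{name}' for #{self.class}"
  end
end

ToyModel.create(goal: "meta/users", value: 1)
ToyModel.create(goal: "meta/users", value: 2)
ToyModel.create(goal: "other", value: 3)

begin
  ToyModel.where(goal: "meta/users").destroy_all
rescue NoMethodError => e
  puts e.message
end
puts ToyModel.docs.size # => 0, the "other" goal's datapoint is gone too
```

In this toy, the call that does what we wanted is the class method with the conditions passed as arguments, e.g. ToyModel.destroy_all(goal: "meta/users"), which removes only the matching documents.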
Anyway, we didn’t wait for that point, so we never got that helpful error. We noticed that it was taking too long so we just killed it. Nothing was obviously amiss yet and we fell asleep, blissfully unaware of the chaos we would soon wake up to.
Lessons
Holy cow is Bethany an amazing hacker. She sprang to action, fingers flying, writing Ruby code from scratch to parse and piece together everything from our transaction logs and merge it with our last uncorrupted backup from the day before. I fell in love all over again.
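For flavor, the general shape of that kind of recovery looks something like the sketch below. This is not Bethany's actual code, and the log format here (one JSON object per line with "at", "op", "id", and "value" fields) is entirely hypothetical, as is the restore helper. The idea is just: start from the backup, then replay every log entry newer than the backup.

```ruby
require "json"
require "time"

# Hypothetical log format, one JSON object per line, e.g.
#   {"at":"2012-03-06T23:00:00Z","op":"create","id":"d2","value":5}
# The real log format is not described in this post.
#
# backup:      hash of id => attributes, as of backup_time
# log_lines:   array of JSON strings, oldest first
# backup_time: Time of the last good backup
def restore(backup, log_lines, backup_time)
  docs = backup.dup # shallow copy so the backup itself is left alone
  log_lines.each do |line|
    entry = JSON.parse(line)
    # Entries at or before the backup time are already in the backup.
    next if Time.iso8601(entry["at"]) <= backup_time

    case entry["op"]
    when "create", "update"
      docs[entry["id"]] = (docs[entry["id"]] || {}).merge("value" => entry["value"])
    when "delete"
      docs.delete(entry["id"])
    end
  end
  docs
end
```

Calling restore(backup, File.readlines("transactions.log"), backup_time) with a hypothetical log file would return the merged dataset without touching the backup hash, which is handy when you need a few attempts to get it right.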
Patrick Jordan got the told-you-so award for the day: Never run ad hoc commands in the production console! I’ll claim 2nd prize for insisting on a solid, machine-parsable transaction log, which made it possible to piece everything back together. But Bethany was still the hero of the day for doing the heavy lifting on the actual piecing together. We won’t talk about whose fault the whole thing was in the first place (since the real answer is Fucking Mongo’s Fault!).
Or maybe it’s MongoMapper’s fault. We just raised the issue there in hopes of getting this fixed. We’re embarrassed that we sat on this for so long after solving our own immediate problem. We kept meaning to help get it fixed for posterity but failed to get a round tuit.
UPDATE 2012.12.27: This was just fixed in the master branch of the MongoMapper repository on GitHub. Phew!
PS: Oh, Irony! While converting the above from a post-mortem email amongst ourselves into a blog post, Miëtek Bak kindly alerted us that when we deployed our shiny new goal creation wizard we introduced a bug that set all new goals as temporary test goals, meaning they magically delete themselves after a number of days. Oy. So we have another fun night of putting Humpty Dumpty back together again. And a bunch of apologetic emails to send. Note that the Mongo thing happened six months ago so I guess we were overdue for another fit of ineptitude…
Footnotes
[1] We should include the exact error for anyone searching on this: NoMethodError: undefined method `destroy_all' for #<Plucky::Query>