Earlier this month some dozens of you got a delightful little email from us like this:
Subject: beeminder error in your favor
this is super embarrassing but there’s a chance you were affected by a bug where we thought we canceled a charge but then it went through anyway. partly as self-punishment we’re just refunding all the charges where that could possibly have happened. so either you’re rightfully getting a refund or you’re getting a random reprieve on a past derailment! either way a refund of $___ should show up on your credit card statement soon! sorry about this! (or possibly: you’re welcome!)
the beeminder team
ps: not to make excuses but this was caused by a ridiculous bug in resque/redis (like cron for ruby on rails) which we’ll blog about soon at blog.beeminder.com
Not to get superstitious but every time we blog about some shameful bout of ineptitude, something equally embarrassing happens while we’re writing it up. Last time, while writing about how we accidentally deleted a random slice of our database, we added a footnote about accidentally marking all goals as temporary test goals that automatically delete themselves.
This time, with uncanny parallelism, we found that, weeks ago, when adding more fanciness to our goal creation wizard, we started silently ignoring the “temporary test goal” setting. So not nearly as bad as the other way around, but especially embarrassing this time because one of our favorite beemindees alerted us over a week ago and we just shrugged it off as PEBKAC and deleted their goal manually. We even had our own suspicions when meaning to create test goals (which we do a lot) but always assumed we must’ve forgotten to check that box!
So now it’s fixed. If you’re reading this and were (even possibly) affected by the bug, let us know and we’ll delete any goals you like, no questions asked. (Well, one question asked: “Do you promise you meant it to be a temporary test goal?”)
Nerds Only Beyond This Point
As promised, here are the gory details of that money-eating bug. First, as background, we delay charges for 24 hours after derailment for logistical reasons. If there’s cause to cancel the charge (and sometimes there is!) we save hassle/fees if we haven’t actually charged the credit card yet. We implemented this using resque_scheduler’s delayed jobs: at the time of derailment we queue up a job to process the charge in 24 hours. Then if the derailment turns out not to be legit, we remove the job from the delayed queue.
Resque relies on string comparison of JSON-encoded hashes to find and delete a delayed item. It uses a hash of the parameters you passed in when you queued it, but in Ruby 1.8 hashes are unordered. So sometimes the string that Redis (Resque’s keystore) gives back has the parameters in a different order than the string that Resque constructs when you call remove_delayed. Despite knowing the timestamp the job was delayed for and its exact parameters, you get a false negative on the comparison, and Resque won’t remove it!
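To make the failure concrete, here is a minimal sketch of the comparison problem in plain Ruby. It is not Resque's actual code: the argument names (`pledge_id`, `amount`) and the helper methods are made up for illustration, and since modern Ruby preserves insertion order, the Ruby 1.8 reordering is simulated by building the same hash in two different orders.

```ruby
require "json"

# Same contents, different key order. This stands in for what Ruby 1.8 could
# produce when the same parameters were serialized at two different times.
# (Hypothetical argument names, not our real job parameters.)
STORED_ARGS    = { "pledge_id" => 42, "amount" => 5 }
REQUESTED_ARGS = { "amount" => 5, "pledge_id" => 42 }

# The fragile approach: compare the JSON strings directly.
def naive_match?(stored_args, requested_args)
  JSON.generate(stored_args) == JSON.generate(requested_args)
end

# A sturdier approach: decode both sides and compare as hashes,
# which ignores key order.
def order_insensitive_match?(stored_json, requested_args)
  JSON.parse(stored_json) == JSON.parse(JSON.generate(requested_args))
end

# Another option: sort the keys before encoding so that equal hashes
# always produce the same string.
def canonical_json(hash)
  JSON.generate(hash.sort.to_h)
end
```

With these definitions, `STORED_ARGS == REQUESTED_ARGS` is true while `naive_match?(STORED_ARGS, REQUESTED_ARGS)` is false, which is exactly the false negative described above. Either of the other two helpers treats the two as the same job.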
Now, Resque returns the number of deleted jobs, but we weren’t checking the return value, so sometimes the charge job would remain on the queue and go through despite our intent to cancel it.
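The check we should have had looks roughly like the sketch below. The `FakeScheduler` class is a hypothetical in-memory stand-in so the example runs on its own. It models only the "returns the number of deleted jobs" behavior described above, and `cancel_charge!` and `ChargeNotCanceled` are names invented for this example.

```ruby
# Hypothetical stand-in for the scheduler. It only models the one behavior
# we care about here: remove_delayed returns how many jobs it removed.
class FakeScheduler
  def initialize(jobs)
    @jobs = jobs # each job is a [job_class, args] pair
  end

  def remove_delayed(job_class, *args)
    before = @jobs.size
    @jobs.reject! { |job| job == [job_class, args] }
    before - @jobs.size
  end
end

class ChargeNotCanceled < StandardError; end

# Cancel a pending charge and complain loudly if nothing was removed,
# instead of silently assuming the cancellation worked.
def cancel_charge!(scheduler, job_class, *args)
  removed = scheduler.remove_delayed(job_class, *args)
  if removed.zero?
    raise ChargeNotCanceled, "no delayed #{job_class} job matched #{args.inspect}"
  end
  removed
end
```

Raising here is one design choice among several. Logging and alerting a human would also work, as long as a zero doesn't pass unnoticed.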
We ultimately worked around this problem by removing Resque from the equation and adding a key to our pledge model for the timestamp of the pending charge. Then we have a sweeper that runs regularly and looks for pledges whose pending charges have come due. This also makes it simpler to know if a particular charge has been canceled already without digging through Redis and introspecting on the queues.
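In spirit, the workaround looks something like the following sketch. It uses a plain Struct rather than our actual pledge model, and the field and method names (`charge_due_at`, `schedule_charge`, `sweep`) are assumptions for illustration, not our real schema.

```ruby
# Hypothetical, simplified pledge: charge_due_at is nil when no charge is pending.
Pledge = Struct.new(:id, :amount, :charge_due_at)

ONE_DAY = 24 * 60 * 60 # seconds

# At derailment: record when the charge should go through.
def schedule_charge(pledge, now)
  pledge.charge_due_at = now + ONE_DAY
end

# If the derailment wasn't legit: clearing the timestamp cancels the charge.
def cancel_charge(pledge)
  pledge.charge_due_at = nil
end

# Answering "is a charge pending?" no longer requires digging through any queue.
def charge_pending?(pledge)
  !pledge.charge_due_at.nil?
end

# The sweeper: find pledges whose charges have come due, hand each one to the
# caller's block (where the real charging would happen), then clear the timestamp.
def sweep(pledges, now)
  due = pledges.select { |pledge| pledge.charge_due_at && pledge.charge_due_at <= now }
  due.each do |pledge|
    yield pledge if block_given?
    pledge.charge_due_at = nil
  end
  due
end
```

Because cancellation is just setting a field to nil, there is no comparison step that can quietly miss.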
Here’s the GitHub issue (https://github.com/bvandenbos/resque-scheduler/issues/47) and we’ll update this post when it’s fixed. (UPDATE: That GitHub issue says it was fixed a year later.) Of course, even if we hadn’t worked around it, it should be moot in Ruby 1.9, where hashes preserve insertion order.
And we wrote some damn tests to catch it if we break this again!