Kavka’s Toxin Puzzle and the Superpower of Commitment Devices

Friday, December 16, 2022
By dreeves

DALL-E image of a cartoon bee looking ill drinking a vial of a toxin

There’s a famous philosophy thought experiment about how the concept of forming an intention might not be as coherent as it seems. Suppose some magic mind-reading aliens (you know the type) offer you a million dollars to drink a nasty but ultimately harmless toxin. Let’s say it makes you puke your guts out for a week but then you’ll fully recover. And say that that’s a no-brainer for you: you’d absolutely take the deal. But the aliens’ actual offer is much more attractive: you get the million dollars just by honestly intending to drink the toxin. If you do, they pay up and fly away. Whether you actually follow through and drink the toxin doesn’t matter anymore.

Can you get the million dollars now? You are predictably going to chicken out on drinking the toxin later, with no money at stake anymore. Knowing that, maybe you can’t form an honest intention to drink it? Of course if you employ a commitment device then it’s trivial, so the thought experiment generally stipulates that those aren’t allowed.

I think maybe a hardcore rationalist would intend to drink the toxin and then actually do it on the grounds that that’s the only way to retroactively cause the million dollars to have (in the past!) been given to them. This is counterintuitive. Drinking the toxin is unambiguously idiotic unless you somehow think that doing it can change the past. People do bite that bullet.

Or maybe it’s enough to place a very high value on a sense of self-consistency? But of course violating that sense of self-consistency is a consequence of sorts. Which makes it a little bit like a commitment device?

(Aside: I love how a commitment device dispatches the whole paradox so neatly that the puzzle has to be amended to specifically disallow it.)

As we learned in our blog post on incentive alignment, there isn’t a bright-line definition of a commitment device. What if you solemnly promise to your best friend that you’ll drink the toxin? Is that a commitment device or just voicing the intention? Breaking a promise has real consequences (future ones, not just past), so I’d say it’s on the spectrum.

What if you choose to drink the toxin because it establishes a precedent that’s almost like a superpower: merely uttering “I hereby intend to X” becomes itself a magic commitment device ensuring you actually do X! Maybe that’s worth enduring the toxin for. But this again is incentive alignment. By intending to drink the toxin in order to establish a superhuman follow-through precedent, you’ve created real consequences for yourself in the future. You’ve created a genuine incentive for your future self to follow through.

The very act of forming an intention can and should imply future consequences for flaking out.

What if I just intend to have ice cream after dinner?

In that case the consequences are built right in. Ice cream is delicious. Your incentives are perfectly aligned. That intention to have ice cream is the same as a prediction that you’ll be having ice cream. If you change your mind for any reason — even due to a random burst of willpower — it doesn’t matter. You still had the honest intention qua prediction that you would have ice cream. But that obviously doesn’t work when you replace the ice cream with the toxin!

So, again, if the intention isn’t already incentive-aligned then it has to involve a commitment in order to be an honest intention. If the aliens want to get out of paying you your million dollars by calling this conception of intention a commitment device, that means defining away the concept altogether. If we’re talking about pure intention with no consequences then it’s strictly a prediction of your own future behavior. When faced with the choice of drinking the toxin and having zero incentive to do so, of course you won’t. So that’s your honest prediction. Predicting otherwise would be dishonest and so you have no way to get the million dollars.

So that’s my solution to Kavka’s toxin puzzle. Pure intention stripped of all consequences is nothing more nor less than a prediction of your future behavior. The only way to form an honest prediction that future-you will do something you won’t have incentive to do is to create that incentive.
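To make that argument concrete, here is a toy sketch of it in Python. Everything in it is an assumption for illustration: the dollar-equivalent cost of the toxin, the function names, and the idea of boiling a commitment down to a single "flake penalty" number are all made up, not part of Kavka's puzzle. It treats an honest intention as nothing but an accurate prediction of what future-you will do.

```python
# Toy model of "intention = honest prediction of future behavior".
# All numbers and names are hypothetical, chosen only for illustration.

TOXIN_COST = 100_000   # assumed: how bad a week of puking is, in dollar-equivalents
PRIZE = 1_000_000      # paid by the aliens for an honest intention


def will_drink(flake_penalty: float) -> bool:
    """Future-you drinks only if flaking out would cost more than the toxin."""
    return flake_penalty > TOXIN_COST


def honest_intention_possible(flake_penalty: float) -> bool:
    """An intention is honest only if it matches the prediction of what you'll do."""
    return will_drink(flake_penalty)


def payoff(flake_penalty: float) -> int:
    """Net outcome of trying to intend to drink, given a penalty for flaking."""
    if honest_intention_possible(flake_penalty):
        return PRIZE - TOXIN_COST  # you get paid, then you follow through
    return 0                       # no honest intention: no prize, no toxin


print(payoff(0))        # pure intention, no consequences for flaking
print(payoff(250_000))  # a promise or precedent worth more than the toxin's cost
```

With no penalty for flaking, the only honest prediction is that you won't drink, so there's no prize. Once flaking costs more than the toxin does, the prediction flips and the intention becomes honest, which is the sense in which the incentive has to be created.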


Thanks to Mary Renaud, Bee Soule, Christopher Moravec, Nathan Young, Theo Spears, and others in the Beeminder community Discord for discussion that led to this post.
