I just finished reading Dave Bayer and Persi Diaconis’ paper Trailing the dovetail shuffle to its lair (link requires MathSciNet access). This paper is best known for the statement that you should shuffle a deck of cards seven times in order to thoroughly mix it. Actually, what they show is that, if you shuffle a deck of $n$ cards $k$ times, the leading order term in the formula for the quality of mixing is a function of $k - \tfrac{3}{2} \log_2 n$. They also give explicit numerical computations for a deck of 52 cards, and show that the most dramatic improvement is going from 6 shuffles to 7. It’s a beautiful paper, so I’m not going to spoil it by explaining the details here. What I want to try to explain is the motivation behind their formula for the “goodness” of a shuffle. They use a measure called “total variation distance”, which they describe as “standard in probability theory … [but] difficult to explain to nonspecialists”. I’m not a specialist, but I think I have a good explanation of what total variation distance measures. Here is the slogan: “Total variation distance is how much money you could make when betting against an opponent who is scared of big pots.”
Suppose I have a way of shuffling a deck of cards which looks random but, in fact, is not. (For example, doing only three riffle shuffles.) How can I make money off of this? Well, for example, let’s say that I start with the Ace of Spades on top of the deck. After three shuffles (under the model in the Bayer and Diaconis paper) the odds that the Ace will still be on top are 1/8. If I have a friend who believes that my shuffling method is fair, he’ll think that the odds of the top card being the Ace are only 1/52. So, if I offer him a bet at odds of 1:51 that the top card is the Ace of Spades, he’ll accept it and, on average, I’ll multiply my wealth by 52/8 = 6.5. Pretty sweet! In general, suppose that I can think of an event which, under a fair shuffle, has probability $p$ but, under my shuffling method, has probability $q$. Then I should be able to multiply my money by $q/p$ by finding people to bet with. Formally, by a bet I will mean a specification of a subset $A$ of all possible shuffles. I’ll write $P(A)$ for what my ~~mark~~ trusting friend would expect to be the probability that a shuffle in $A$ occurs — namely, $|A|/52!$. I’ll write $Q(A)$ for the actual probability. So, here is a trial definition:
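The 1/8 figure is easy to sanity-check by simulation. Here is a quick Monte Carlo sketch of the Gilbert–Shannon–Reeds riffle model used in the paper (cut the deck at a Binomial(52, 1/2) point, then drop cards with probability proportional to the size of each packet); the function names are mine, not from the paper.

```python
import random

def riffle(deck, rng):
    """One GSR riffle shuffle: binomial cut, then interleave the two
    packets, dropping from each with probability proportional to its size."""
    n = len(deck)
    cut = sum(rng.random() < 0.5 for _ in range(n))  # Binomial(n, 1/2)
    a, b = list(deck[:cut]), list(deck[cut:])
    out = []
    while a or b:
        if rng.random() < len(a) / (len(a) + len(b)):
            out.append(a.pop(0))
        else:
            out.append(b.pop(0))
    return out

def top_card_stays(trials=20000, shuffles=3, seed=0):
    """Estimate the chance that the original top card is still on top."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        deck = list(range(52))
        for _ in range(shuffles):
            deck = riffle(deck, rng)
        hits += (deck[0] == 0)
    return hits / trials

print(top_card_stays())  # should come out close to 1/8 = 0.125
```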
Bad Definition: The variation of a shuffling method is the maximum of $Q(A)/P(A)$, where $A$ ranges over all sets of shuffles. More generally, if $X$ is some set of possible outcomes of an event, and $P$ and $Q$ are two probability distributions on $X$, then the variation between $P$ and $Q$ is the maximum of $Q(A)/P(A)$, where $A$ ranges over all subsets of $X$.
Why is this a bad definition? Well, it might involve dividing zero by zero, but let’s ignore that point. More problematically, it might involve dividing a nonzero number by zero. If there is some event in $X$ that my trusting friend thinks is impossible, but actually occurs with some minuscule probability, then the Bad Definition claims that this situation is of infinite worth to me. But most people won’t take bets that end up with them getting an earful of cider. In the case of shuffling, there is no shuffle that my trusting friend thinks is impossible, but the possibility that $P(A)$ is very small still causes trouble. For example, suppose I bet that the deck will remain in its original order. My friend will think the odds of this happening are $1/52!$, or about $10^{-68}$. In reality, if I have done 7 shuffles, the results of the Bayer-Diaconis paper show that the true odds are almost four orders of magnitude higher. Does anyone seriously think I can multiply my money ten thousand-fold by betting someone that I can shuffle a deck seven times and have it stay in perfect order? (How cool is it that there is an exact formula for the odds of this? And that’s just the beginning! This is why you should read their paper.)
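If I have the paper’s exact formula right (an $a$-shuffle produces a given permutation with $r$ rising sequences with probability $\binom{n+a-r}{n}/a^n$, and seven riffle shuffles compose into a single $2^7$-shuffle), the “almost four orders of magnitude” claim can be checked in exact rational arithmetic. The identity permutation has one rising sequence:

```python
from fractions import Fraction
from math import comb, factorial

n, a, r = 52, 2 ** 7, 1  # 52 cards; 7 riffles = one 128-shuffle; identity has r = 1

# True probability of the identity permutation after seven GSR riffles,
# per the Bayer-Diaconis exact formula, kept as an exact fraction:
p_true = Fraction(comb(n + a - r, n), a ** n)

# What an honest shuffle would give: uniform over all 52! orderings.
p_fair = Fraction(1, factorial(n))

print(float(p_true / p_fair))  # a bit under 10**4
```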
The correct definition of total variation is, once again, a maximum over all possible bets $A$, but with $Q(A)/P(A)$ replaced by a more sensible quantity. What would be more sensible? Well, as I said above, let’s suppose that my trusting friend is scared of large pots. So he won’t place any wager where the total amount of money at stake is more than a hundred dollars. Suppose once again that there is some event which my trusting friend believes has probability $p$, but in reality has probability $q$. My friend will be willing to bet $1-p$, to my $p$, that this event won’t happen. My expected earnings are thus $q(1-p) - (1-q)p = q - p$. So, I’ll consider the worth of this situation to be $q - p$. (Note that I’ve switched from measuring my earnings multiplicatively to measuring them additively.) That gives us the right definition:
Right Definition: The total variation between two probability measures $P$ and $Q$ on a set $X$ is the supremum, over all subsets $A$ of $X$, of $Q(A) - P(A)$. In other words, if $Q$ describes the true probability of events, and $P$ describes my ~~victim~~ friend’s beliefs about probability, then the total variation between $P$ and $Q$ is the most money I could expect to win on a bet whose total pot contained one unit of money.
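A sanity check on this definition: a standard fact (not hard to prove) is that the supremum is achieved by the set $A = \{x : Q(x) > P(x)\}$, and equals half the sum of $|Q(x) - P(x)|$. Here is a brute-force comparison on a small made-up example (the two distributions are chosen arbitrarily):

```python
from itertools import chain, combinations

P = [0.25, 0.25, 0.25, 0.25]  # beliefs: uniform on four outcomes
Q = [0.40, 0.30, 0.20, 0.10]  # the true distribution

def subsets(n):
    """All subsets of {0, ..., n-1}, including the empty set."""
    return chain.from_iterable(combinations(range(n), k) for k in range(n + 1))

# The definition: supremum of Q(A) - P(A) over all events A.
tv_bruteforce = max(sum(Q[i] - P[i] for i in A) for A in subsets(len(P)))

# The closed form: half the L1 distance between the distributions.
tv_closed = sum(abs(q - p) for p, q in zip(P, Q)) / 2

print(tv_bruteforce, tv_closed)  # both approximately 0.2
```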
You are now prepared to understand Bayer and Diaconis’s Table One: shuffling a deck of cards 5, 6, 7 or 8 times respectively has a total variation of 0.924, 0.614, 0.334 and 0.167. Look how extreme that first number is. There are events which, under honest shuffling, would happen with odds less than 1 in 10 but such that, if I do five riffle shuffles, they will actually happen with odds of more than 9 in 10.
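Those Table One entries can, I believe, be reproduced from the exact formula: the probability of a permutation under an $a$-shuffle depends only on its number of rising sequences $r$, namely $\binom{n+a-r}{n}/a^n$, and the number of permutations of $n$ cards with $r$ rising sequences is an Eulerian number. A sketch in exact rational arithmetic (the function names are mine):

```python
from fractions import Fraction
from math import comb, factorial

def eulerian_row(n):
    """row[j] = number of permutations of n with j descents
    (equivalently, with j + 1 rising sequences)."""
    row = [1]  # n = 1
    for m in range(2, n + 1):
        row = [(j + 1) * (row[j] if j < len(row) else 0)
               + (m - j) * (row[j - 1] if j > 0 else 0)
               for j in range(m)]
    return row

def total_variation(n, k):
    """Exact total variation between k GSR riffle shuffles of n cards
    (i.e. one 2^k-shuffle) and the uniform distribution."""
    a = 2 ** k
    uniform = Fraction(1, factorial(n))
    tv = Fraction(0)
    for j, count in enumerate(eulerian_row(n)):  # j descents => r = j + 1
        p = Fraction(comb(n + a - (j + 1), n), a ** n)
        tv += count * abs(p - uniform)
    return tv / 2

for k in (5, 6, 7, 8):
    # should match Table One: 0.924, 0.614, 0.334, 0.167
    print(k, float(total_variation(52, k)))
```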
Exercises for the reader: (1) Suppose my friend is only willing to put up $100 of his own money, but is not frightened by me betting as much as I want. (In other words, he is the sort of person who plays the lottery.) What should I replace $Q(A) - P(A)$ by? (2) The Kelly Criterion suggests that I should bet, not in the way that would maximize my expected earnings but, rather, in the way that would (if done repeatedly, reinvesting the winnings) maximize the most probable outcome. This winds up meaning that I should bet $\frac{q-p}{1-p}$ times however much money I have. If I use this formula, what should I replace $Q(A) - P(A)$ by?