I learned this puzzle from Henry Cohn. Henry tells me it is probably due to Peter Glynn and Phillip Heidelberger (JSTOR).

I’ve just come back from trick-or-treating, and in my bag are two types of candy: peppermints and toffee. I’d like to know what fraction of the bag is peppermints. Not wanting to dump out that whole big bag, I’ll just draw a few random samples and make an estimate. (The bag is big enough that it doesn’t matter whether or not I return candies to the bag after sampling.) The responsible way to do this would be to decide ahead of time how many samples I will draw (say ten pieces) and stick to that plan. Instead, though, I decide to just sample for five seconds and count however much candy I can grab in that time. This would be fine too, if it took me the same amount of time to draw a toffee as to draw a peppermint. But it doesn’t: I can draw a peppermint from the bag in one second, while the toffee sticks to my fingers and takes three seconds to unwrap. This makes my sampling process flawed.

Suppose that the probability of drawing a peppermint is p, and the probability of drawing a toffee is q (with p + q = 1). So, for example, I might draw a toffee, a peppermint and then another toffee (which I would finish unwrapping after time expired). The probability of this sequence of draws is q p q = p q^2 and, if I encountered that series of draws, I would estimate that my bag was 1/3 peppermint. In the table below, I have listed all possible sequences of draws, the odds of obtaining that sample, and the estimate that sample produces:

| Draws | PPPPP | PPPPT | PPPT | PPT | PTP | PTT | TPP | TPT | TT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Odds | p^5 | p^4 q | p^3 q | p^2 q | p^2 q | p q^2 | p^2 q | p q^2 | q^2 |
| Estimate | 1 | 4/5 | 3/4 | 2/3 | 2/3 | 1/3 | 2/3 | 1/3 | 0 |

So the expected value of our estimate is

p^5 + (4/5) p^4 q + (3/4) p^3 q + 2 p^2 q + (2/3) p q^2.

Note that this is not equal to the true value, p. It is a slight underestimate, because those sticky toffees get counted more than their fair share. In statistical jargon, this is a biased estimator. Here is a graph of our estimate as a function of the true value of p.
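The bias is easy to see in simulation. Here is a minimal sketch in Python, assuming the setup above (one second per peppermint, three per toffee, a five-second window); the function name and trial count are mine:

```python
import random

def flawed_estimate(p, total_time=5, pep_time=1, toffee_time=3):
    """Draw candies until time runs out; the candy still being
    unwrapped when the clock expires is counted (the flawed rule)."""
    t = 0
    peppermints = 0
    draws = 0
    while t < total_time:
        if random.random() < p:   # drew a peppermint
            peppermints += 1
            t += pep_time
        else:                     # drew a toffee
            t += toffee_time
        draws += 1
    return peppermints / draws

random.seed(0)
p = 0.5
trials = 100_000
avg = sum(flawed_estimate(p) for _ in range(trials)) / trials
print(avg)  # noticeably below the true value p = 0.5
```

With p = 0.5 the average comes out near 0.44, an underestimate, just as the algebra above predicts.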

Incidentally, this has a “practical” application. Suppose a pollster comes to your door and asks which candidate you plan on voting for. If you suspect that he will poll for eight hours and then stop, you should stall as long as possible, as that will lead to a higher reported total for your candidate.

Ok, that was nifty, but it wasn’t the puzzle. Here is the puzzle. Suppose that I adjust my sampling procedure in just the smallest way: if I am still unwrapping a piece of toffee when time runs out, I don’t count it in my sample. That changes the bottom row of our table as follows:

| Draws | PPPPP | PPPPT | PPPT | PPT | PTP | PTT | TPP | TPT | TT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Odds | p^5 | p^4 q | p^3 q | p^2 q | p^2 q | p q^2 | p^2 q | p q^2 | q^2 |
| Old Estimate | 1 | 4/5 | 3/4 | 2/3 | 2/3 | 1/3 | 2/3 | 1/3 | 0 |
| New Estimate | 1 | 1 | 1 | 2/3 | 2/3 | 1/2 | 2/3 | 1/2 | 0 |

The new expected value of our estimate is

p^5 + p^4 q + p^3 q + 2 p^2 q + p q^2 = p.

That’s right, this slight change made our procedure accurate again! So the puzzle is: why? I have two proofs, but they both involve some messy computations. There must be an argument which makes this just jump out at us. (Of course, there is nothing important about the numbers 3 and 5, or about having two types of candy. This happens with any number of types of candy, any unwrapping times and any sampling time, as long as the sampling time is longer than the longest unwrapping time.)
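One can confirm the exact cancellation with a short script. The check below enumerates the nine sequences from the table in exact rational arithmetic (the helper names and the test value of p are mine):

```python
from fractions import Fraction

p = Fraction(1, 3)   # any rational value of p works here
q = 1 - p

# The nine possible sequences of draws from the table above.
sequences = ["PPPPP", "PPPPT", "PPPT", "PPT", "PTP", "PTT", "TPP", "TPT", "TT"]

def prob(seq):
    """Probability of drawing this exact sequence."""
    pr = Fraction(1)
    for c in seq:
        pr *= p if c == "P" else q
    return pr

old = new = Fraction(0)
for seq in sequences:
    # running finish times: peppermints take 1 second, toffees 3
    t, times = 0, []
    for c in seq:
        t += 1 if c == "P" else 3
        times.append(t)
    # old rule: every draw counts, even the overhanging one
    old += prob(seq) * Fraction(seq.count("P"), len(seq))
    # new rule: discard any candy not fully unwrapped by time 5
    counted = [c for c, t in zip(seq, times) if t <= 5]
    new += prob(seq) * Fraction(counted.count("P"), len(counted))

print(old == p, new == p)  # prints: False True
```

The old estimator misses p, while the new one hits it exactly.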

By the way, I was going to suggest that pollsters should use this modified sampling to defeat the attack mentioned above. But the modified method is vulnerable to a slightly more sophisticated attack: you should now answer slowly near the beginning of the day but rapidly at the end of the day. The modified method only works when the time to draw a particular candy is independent of past history, including being independent of how much history has passed.

**Update:** Terence Tao gives a very nice solution in the comments.

My first guess is that if you allow overspills, you’re effectively shortening the toffee time by a factor depending on the expected number of toffees, while still calculating as if the original value of the time were correct. But somehow this seems a mite simplistic.

The one who I think will know the right answer is Isabel.

This reminds me of the calculation of how many people are on a bus (with infinite capacity, of course) if they arrive by a Poisson process. Say you get N as the average. That is the average seen by the bus drivers.

What is the average seen by the passengers? It is N+1.

The reasoning is also similar to this old canard. Suppose that half of children are male and half female. Suppose that everybody stops having children as soon as they get their first girl. Will this increase or decrease the ratio of boy to girl babies born?
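The answer is neither, and a quick simulation bears it out (a sketch under the comment’s 50/50 assumption; the family count is arbitrary):

```python
import random

random.seed(1)
boys = girls = 0
for _ in range(100_000):
    # each family keeps having children until the first girl
    while random.random() < 0.5:   # a boy, with probability 1/2
        boys += 1
    girls += 1                     # the first girl; the family stops

print(boys / girls)  # close to 1: the stopping rule does not skew the ratio
```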

With the a priori knowledge that determining that a candy is a peppermint takes only one second, and unwrapping a toffee takes three seconds, why wouldn’t you simply spend one second unwrapping each candy? If it is fully unwrapped in one second, it is a peppermint, and if it is not fully unwrapped, then it is a toffee.

Nothing in the original statement said that you had to EAT the candy, only that you didn’t want to dump out the whole bag, and you were presuming that each piece was wrapped!

In order to avoid division by zero issues, one should adopt the convention that if no candy has been fully unwrapped when the time runs out, then one uses the partially unwrapped candy as the sample (thus the fraction would be 1 if it was a peppermint and 0 if it was a toffee).

One can see why the algorithm works by the following “martingale” argument. For each n=0,1,2,…, let X_n be the random variable defined as the fraction of peppermints one sees when one has unwrapped n candies or when the time runs out, whichever occurs sooner. For instance, in the TPP case mentioned above, one has X_0=X_1=0, X_2=1/2, X_3=2/3, and X_n=2/3 for all n > 3. The objective is to show that X_n has expectation p for all sufficiently large n.

On the other hand, it is clear (with our above conventions) that X_0 has expectation p. So it suffices to show that X_n - X_{n-1} has expectation 0 for every n; informally, this means that each new candy one gets to unwrap does not affect the expected fraction.
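For the five-second example, this can be checked exhaustively over the nine sequences in the table. Here is a minimal Python check (the conventions for X_0 and for running out of time are the ones above; the rational test value is arbitrary):

```python
from fractions import Fraction

p = Fraction(2, 5)   # arbitrary rational test value
q = 1 - p
sequences = ["PPPPP", "PPPPT", "PPPT", "PPT", "PTP", "PTT", "TPP", "TPT", "TT"]

def prob(seq):
    """Probability of drawing this exact sequence."""
    pr = Fraction(1)
    for c in seq:
        pr *= p if c == "P" else q
    return pr

def X(seq, n):
    """X_n: fraction of peppermints once n candies are unwrapped or
    time runs out, whichever is sooner.  X_0 is judged from the
    (possibly partially unwrapped) first candy, per the convention."""
    t, finished = 0, []
    for c in seq:
        t += 1 if c == "P" else 3   # unwrapping times
        if t <= 5:                  # fully unwrapped before time expires
            finished.append(c)
    if n == 0 or not finished:
        return Fraction(1 if seq[0] == "P" else 0)
    k = min(n, len(finished))
    return Fraction(finished[:k].count("P"), k)

for n in range(7):
    assert sum(prob(s) * X(s, n) for s in sequences) == p
print("E[X_n] = p for every n")
```

In the TPP case, for instance, X("TPP", 2) returns 1/2 and X("TPP", 3) returns 2/3, matching the values quoted above.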

We can condition on the event that we actually get to unwrap n candies, since otherwise X_n-X_{n-1} vanishes. We can then condition on the distribution of candies one sees at this stage, e.g. m peppermints and n-m toffees (so X_n = m/n).

Now for the key observation: once one performs this conditioning, the _order_ in which the m peppermints and n-m toffees appear is completely random (each of the n!/(m!(n-m)!) possible orderings is equally likely).

We can now proceed in one of two ways. One is direct computation: the last candy was a peppermint with probability m/n, and a toffee with probability (n-m)/n. In the former case X_{n-1} = (m-1)/(n-1), and in the latter X_{n-1} = m/(n-1). Since (m/n)((m-1)/(n-1)) + ((n-m)/n)(m/(n-1)) = m/n = X_n, the claim follows. (The case n=1 has to be treated separately, of course.)

The other way is to view X_n = m/n as the probability that a randomly selected candy among the n candies already unwrapped is a peppermint. The expected value of X_{n-1}, on the other hand, is the probability that after randomly deleting one of those candies, a randomly selected candy among the remaining n-1 candies is a peppermint. It is then intuitively obvious that these probabilities are the same (they can be joined by the joint distribution of selecting two distinct candies at random among the n candies, deleting one, and inspecting the type of the other).
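The identity behind the direct computation can also be checked mechanically; a minimal sketch (the ranges are arbitrary):

```python
from fractions import Fraction

# Conditioned on m peppermints among n unwrapped candies, the last
# candy is a peppermint with probability m/n (then X_{n-1} = (m-1)/(n-1))
# and a toffee with probability (n-m)/n (then X_{n-1} = m/(n-1)).
for n in range(2, 40):
    for m in range(n + 1):
        e = (Fraction(m, n) * Fraction(m - 1, n - 1)
             + Fraction(n - m, n) * Fraction(m, n - 1))
        assert e == Fraction(m, n)   # E[X_{n-1}] = X_n = m/n

print("identity verified")
```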