World War II Statistics-and-Security Story

I’d like to know the derivation of that estimator. I recall a similar problem in a statistics class and the Maximum Likelihood Estimator turned out to be the maximum of the observed values — a biased estimator, indeed, but the MLE nonetheless.

TimmMurray • August 28, 2006 3:39 PM

I used a similar (though opposite) tactic once in the strategy game “Star Wars: Rebellion”. I know, I hear some of you sighing, but I think there is a lesson to be learned here about the art of deceiving your enemy.

Ships in the game were normally named with sequential numbers (“Star Destroyer 5”, “Star Destroyer 6”, and so on), but you could change them. As the Empire, I built a Death Star and changed its name to “Death Star 5”. Now, as you can imagine, a Death Star takes so many resources that you could build a whole fleet of Star Destroyers instead and do a much better job. Building even one Death Star doesn’t actually do that much for you.

But having 5 Death Stars around would mean that my opponent wouldn’t want to risk attacking any planets unless he was sure there wasn’t a Death Star there. He destroyed the only Death Star I actually built in its first battle, but his further movements were paralyzed while he tried to hunt for my other non-existent Death Stars.

The lesson here is that you can modify the information your enemy is getting, causing them to choose an incorrect strategy.

Carlo Graziani • August 28, 2006 3:40 PM

I just tried to reproduce the result, but I’m coming up with a different answer.

The likelihood is

P(M|N,S) ~ N^-S (dropping all but the N-dependence).

Assuming a flat prior in N, Bayes’ theorem gives that

P(N|M,S) ~ N^-S, N>=M.

Assuming N>=2, and approximating the normalization constant by the integral of N^-S over N>M,

P(N|M,S)=(S-1)M^{S-1} N^-S.

Suppose we want to know a 90% probability upper bound on N, that is, an N_0 such that

P(N<N_0)=0.9=p

Then using the same technique of replacing the summation by the integral, we get

N_0=M (1-p)^{-1/(S-1)}.

With p=0.9, M=93, and S=5, I get N_0=164, somewhat higher than the example in the article.
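As a rough check on the arithmetic above, here is a minimal Python sketch that evaluates the closed-form bound N_0=M (1-p)^{-1/(S-1)} and, for comparison, sums N^-S directly instead of replacing the sum by an integral. The function names and the truncation point n_max are illustrative assumptions, not anything from the original article. The closed form evaluates to about 165.4 for these inputs, in line with the figure quoted above; the small gap may come from rounding or from using the discrete sum.

```python
def upper_bound_integral(m, s, p):
    """Closed-form bound N_0 = M (1-p)^(-1/(S-1)) from the integral approximation."""
    return m * (1.0 - p) ** (-1.0 / (s - 1))


def upper_bound_discrete(m, s, p, n_max=100_000):
    """Smallest N_0 with P(N <= N_0) >= p under the posterior P(N) ~ N^-S, N >= M.

    n_max truncates the infinite normalizing sum; this is an assumption of the
    sketch and is harmless when S is well above 1 and n_max is far beyond M.
    """
    total = sum(n ** -s for n in range(m, n_max + 1))
    cumulative = 0.0
    for n in range(m, n_max + 1):
        cumulative += n ** -s
        if cumulative >= p * total:
            return n
    return n_max


if __name__ == "__main__":
    # Values from the comment above: p=0.9, M=93, S=5.
    print("integral approximation:", upper_bound_integral(93, 5, 0.9))
    print("direct summation:      ", upper_bound_discrete(93, 5, 0.9))
```

Both versions assume the flat prior in N used above; a different prior would change the bound.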

I’d be curious to know what the actual data was that the wartime statisticians had to work with.

Anonymous • August 28, 2006 4:12 PM

I ran across this story a long time ago. In that version the Germans had gaps in the serial numbers that also had to be figured out by the statisticians.

dano • August 28, 2006 5:51 PM

In reading Zhukov (1969, Harper and Row, translated by Shabad) it becomes dreadfully apparent that it almost did not matter how many tanks Germany was able to make. Granted, that takes the fun out of the proof of the statistics wonks being smarter than the intel jocks.

swiss connecction • August 28, 2006 6:06 PM

I always use random numbers on my client contracts and bills. I do not want them to be able to estimate how many contracts I have closed.

Carlo Graziani • August 28, 2006 7:32 PM

I understand where that estimator came from. According to

http://www.math.uah.edu/stat/urn/OrderStatistics.xhtml

it is an unbiased estimator based on the order statistic M. Essentially, the result is very close to the peak of the distribution P~N^-S, N>M, which is to say near N=M. However, it is a point estimate, and doesn’t really give a sense of the breadth of the distribution the way the Bayesian calculation of a 90% upper limit does.

Oh, in my previous post, I meant “assuming S>=2” rather than “assuming N>=2”.

This concludes the Bayes geek portion of our program. Sorry about that.

fitzroy • August 28, 2006 9:29 PM

This article is historically problematic. The Mark IV was introduced in 1937 and upgunned in 1942. The Mark V was introduced in late 1942. How are the Allies worrying about the Mark V and even the upgunned Mark IV in 1941-42? Also, the claim:
“emboldened, the allies attacked the western front in 1944 and overcame the Panzers on their way to Berlin. And so it was that statisticians won the war – in their own estimation, at any rate.” reduces the Allied decision to invade the continent in 1944 to a function of the quantity of German tanks, which is, charitably, ahistorical.

bogeyman • August 29, 2006 12:30 AM

Another conclusion is that intelligence services always overestimate the targeted enemy’s capability, because you can be fired (demoted, defunded) only for an underestimate.

RonK • August 29, 2006 1:46 AM

Funny coincidence, but ever since the post about “Bruce Schneier Facts” I’ve been thinking on-and-off about a similar problem.

You have a (secret) collection of N unique items and a “random” button which when pressed displays an item chosen at random, and you need to estimate N. You have two given costs, one for pressing the button, and another for (relative) inaccuracy in your estimate — what’s the algorithm which minimizes the expected cost?

For extra credit, redo the analysis including the possibility to use a “Next”, and/or “Previous” button, like the website provides.

Greg • August 29, 2006 2:22 AM

The story’s history is not quite right, but this sort of thing was done, and proved to be accurate. The problem is that at the time, you don’t know which estimate is accurate (but you do after the fact).

Anyway, it turns out that no military hardware uses simple number sequences for serial numbers anymore. What they use is, well, not that easy to tell from the outside, but they still want to be able to get “batch” numbers etc., so that QC can be managed.

Greg • August 29, 2006 2:33 AM

@ruidh and anyone else interested.

Here is a worked example on Bayes whatits… It’s from a grad paper, but well written and easy to follow. You can just skip down to the tram car problem on page 2. I will leave it on the site for at least a few months.

http://www.cibiv.univie.ac.at/~greg/chap5.pdf

Matthew Skala • August 29, 2006 9:10 AM

RonK – this isn’t a complete analysis, but the birthday paradox is relevant. As an order-of-magnitude first cut, I’d say press the button until you see a duplicate, and square the number of presses.
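The birthday-paradox suggestion can be tried out with a small simulation. The sketch below assumes the “random” button draws uniformly and independently from the N items, which the problem statement does not guarantee; the function names and the test value of N are hypothetical. Under that assumption the squared press count is of the same order as N, and its mean is roughly 2N for large N, so halving the square is one possible refinement of the first cut.

```python
import random


def presses_until_duplicate(n, rng):
    """Press the hypothetical 'random' button until some item repeats.

    Returns the total number of presses, including the one that repeated.
    """
    seen = set()
    presses = 0
    while True:
        item = rng.randrange(n)  # uniform draw from the n secret items
        presses += 1
        if item in seen:
            return presses
        seen.add(item)


def mean_squared_presses(n, trials, rng):
    """Average of (presses)^2 over many independent runs."""
    total = 0
    for _ in range(trials):
        k = presses_until_duplicate(n, rng)
        total += k * k
    return total / trials


if __name__ == "__main__":
    rng = random.Random(2006)  # fixed seed so the run is repeatable
    n_true = 10_000
    k = presses_until_duplicate(n_true, rng)
    print("single run: presses =", k, " squared =", k * k)
    ratio = mean_squared_presses(n_true, 2000, rng) / n_true
    print("mean of squared presses divided by N:", ratio)
```

A single run is very noisy, which fits the “order-of-magnitude first cut” framing; the cost trade-off RonK asks about is not addressed here.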

derf • August 29, 2006 10:12 AM

Maybe these statisticians could be resurrected to tell the TSA how many terrorists actually want to blow up an airplane with hair gel. Surely they have serial numbers on bottles of hair gel.

Andre LePlume • August 29, 2006 12:03 PM

@{Carlo, Greg}

Thanks for the pointers.

David Thornley • August 29, 2006 12:29 PM

The history is worse than fitzroy claimed.

Through 1942, the main German medium tank was the Pz III, with the Pz IV also in widespread use. The Pz V was not used in combat until July 1943, and wasn’t ready then. The Western Allies did not encounter the Pz V until March 1944 or so, and I really really doubt the Soviets gave the West any useful information. The Pz VI was introduced in late 1942, and was encountered then by British forces in Africa. So much for the concern about the Pz IV and Pz V in the middle war years.

The production figures are also off. The Germans produced well over 300 Pz Vs a month once they got going. It still wasn’t enough, but it was more than the article says.

And, as fitzroy reports, the estimates did not have a major effect on the war. On June 5, 1944, Eisenhower wanted a weather report, not the highest observed serial numbers on particular German tank models.

As a result, I don’t take the article seriously. The research was bad and the final conclusion ludicrous. Therefore, I have no confidence in the rest of the article.

Anybody looking for contributions of statisticians to WWII should look into the Battle of the Atlantic, against the U-boats (which did not have consecutive numbering, for what that’s worth).

Bill R • August 29, 2006 10:17 PM

While the Guardian’s Panzer version history is apocryphal (never trust pacifist military historians), the math story fits with one I heard in math circles decades ago.

Two math boffins for Brit intel were working on this problem or an equivalent one in Operations Research, and decided to get a test data-set. They sat at a cafe in London writing down the taxi license-number on each black London cab that passed by, to estimate the number of cabs in the city, a number that could be checked with the registrar through government channels.

The original punchline I’ve half forgotten – A civilian who’d read the security-through-fear posters either reported them as spies to the nearest bobby, or told them it was unpatriotic to be recording data with a war on, I forget which.

With respect to the Panzer models, it was easier to out-number PzIII’s than PzIV’s with Allied tanks. We needed to know how many it would take to out-number ALL of them, if we were going to have to replace/recover-repair some number of our tanks for each of theirs we eliminated. (As long as we push forward, we recover and repair some of our losses, and they don’t. Same goes for wounded/captured troops.) So the speed of production (especially compared to known losses) and ratio of PzIII:PzIV:PzV production were both crucial numbers in the calculation of WHEN and WHERE we wanted to face ALL of Hitler’s remaining reserves.

Ping-Che Chen • August 31, 2006 2:13 AM

When I read this story I also tried to derive the estimator. After some failed attempts, I derived an estimator by computing the expected value of the maximum serial number M when you capture S tanks from N tanks, and then solving that relation for N in terms of M and S. The estimator I derived is very similar to the one in the Guardian’s story, but slightly different. My estimator is N = M(S+1)/S – 1.
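The estimator N = M(S+1)/S – 1 follows from E[M] = S(N+1)/(S+1) for S serial numbers drawn without replacement from 1..N. A short simulation can compare it with the plain maximum M (the maximum-likelihood estimate mentioned earlier in the thread). This is only a sketch: it assumes serial numbers run from 1 to N with no gaps and that captures are uniform, and the function name and test values are made up for illustration.

```python
import random


def simulate(n_true, s, trials, rng):
    """Average two estimators over repeated captures of s tanks out of n_true.

    Returns (mean of the raw maximum M, mean of M(S+1)/S - 1).
    """
    sum_max = 0.0
    sum_adjusted = 0.0
    for _ in range(trials):
        # Capture s distinct serial numbers from 1..n_true, uniformly at random.
        m = max(rng.sample(range(1, n_true + 1), s))
        sum_max += m
        sum_adjusted += m * (s + 1) / s - 1
    return sum_max / trials, sum_adjusted / trials


if __name__ == "__main__":
    rng = random.Random(1944)  # fixed seed so the run is repeatable
    mean_max, mean_adjusted = simulate(n_true=300, s=5, trials=20_000, rng=rng)
    print("true N:                300")
    print("mean of raw maximum M:", round(mean_max, 1))
    print("mean of M(S+1)/S - 1: ", round(mean_adjusted, 1))
```

On average the raw maximum falls short of the true N, while the adjusted estimator centers on it. Individual estimates from only five captures still scatter widely, which is the point made above about a point estimate not conveying the breadth of the distribution.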

Schneier on Security
