Schneier on Security
A blog covering security and security technology.
July 27, 2009
Base Rate Fallacy
Nice description of the base rate fallacy.
Posted on July 27, 2009 at 6:48 AM
• 33 Comments
Percentages that are used in news stories and product information always remind me of the book "How to Lie with Statistics" by Darrell Huff.
I like the way he slides into Swine Flu...
In the current issue of satirical magazine Private Eye, they had the following (if I remember correctly) info under "Number Crunching" on the leader page,
* 29 Deaths in UK from Swine Flu.
* 3,000-4,000 deaths in UK each year from seasonal flu.
Spot the problems?
1, Swine flu is "out of season" in the UK (i.e. not like with like).
2, Swine flu is nowhere near a year on in the UK (so not like with like).
3, No age range information (swine flu tends to be most serious in the "economically productive", whereas seasonal flu tends not to touch the "economically productive", so again not like for like).
4, No underlying illness indicators (so again not like for like).
There are a whole bunch of other things missing but I guess 29:4000 makes a good sound bite.
Whereas "We Don't Know" is just way too scary for the masses. (However, we do get some pathetic contradictory advice from the UK Gov and a help line from the much maligned NHS Direct...).
Even this is wrong. There are several mistakes.
It's kind of a childish essay written by someone who learned to do some kindergarten arithmetic.
Bruce just loves everyone who points out that any test that claims x% true/false positives must be nuts, because there is no such thing to those caught on the wrong side of such a statistic (data).
It's worse than that. The author assumes no false negatives; that out of 301 positives, the terrorist must be in there. In fact there's a 10% chance that none of the 300 people you nabbed is the terrorist and you let him go. He also assumes that the population of 1000 contains exactly one terrorist.
Even assuming that, and even assuming that the 90% rate is correct for both positive and negative assertions, if the device picks 301 people, there's only a 90% probability that it correctly identified the terrorist and he's in that group of 301, so it's not 1/301, it's 0.9/301 - so 0.299%, not the 0.33% that 1/301 gives you.
It would likely be a bit different than this too, because it's a rare test that has the same rate of false positives as false negatives.
Oops, should have said "...population of 3000..." at the end of the first paragraph.
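The arithmetic above can be checked with a short calculation. This is only a minimal sketch, assuming exactly one terrorist among 3,000 people and a 10% error rate in both directions; the function name `posterior_given_flag` is made up for illustration.

```python
def posterior_given_flag(population, terrorists, fp_rate, fn_rate):
    """Chance that a flagged person is a terrorist, from expected counts."""
    innocents = population - terrorists
    expected_true_flags = terrorists * (1 - fn_rate)  # terrorists correctly flagged
    expected_false_flags = innocents * fp_rate        # innocents flagged by mistake
    return expected_true_flags / (expected_true_flags + expected_false_flags)


# Assumed scenario: 3,000 people, 1 terrorist, 10% errors each way.
p = posterior_given_flag(3000, 1, fp_rate=0.10, fn_rate=0.10)
print(f"{p:.3%}")  # roughly 0.3%, in line with the 0.9/301 estimate
```

Setting `fn_rate=0.0` reproduces the article's simpler assumption that the terrorist is always among those flagged.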
Also of interest is that it only deals with 90% which is only one side of the story...
If your machine is 90% accurate how did you get to that figure?
And therefore what does it really mean?
In his case of 3000 people you would get 300 positives, but you need to go on and ask: if there are 3000 people, are there actually any terrorists amongst them? And if so, do they ALL actually get picked up, or only some of them?
But importantly, how on earth do you actually test it?
For instance, if you used test subjects, how do you know that they are "fully independent" with respect to each other? That is, will the test pick up PIRA, RAF and Al Qaeda with the same accuracy or not, and if not, why not?
Also you need to know if the 90% is from a single test (has RPG on their back), from a chain of tests (has beard, olive skin, brown eyes and RPG on their back) or a tree of tests (If eyes blue then X, if Brown then Y, else Z).
Oh and I liked the comment of one of the posters to the page saying 90% is OK if you expect 10% in the population. Unfortunately that is not right (there's no right answer for it, but) it's nearer correct if you expect 20% in the population (oh for that level of confidence ;).
As was once noted,
"There are lies, damned lies, and statistics"
To which should be added,
"But real dishonesty hides in sound bites and spin"
You beat me to (some) of it 8)
The base rate fallacy reminds me of the Prosecutor's Fallacy. I think they're actually the same thing.
The Prosecutor's Fallacy is poignantly illustrated by the tragic story of Sally Clark (http://en.wikipedia.org/wiki/Sally_Clark)
In industrial Quality Control, the best test or inspection processes are considered to be effective 85% of the time. This is the reason that true quality assurance comes from controlling process, not from inspections and tests. Could we expect this test to be better? The 90% assumption must be questioned.
As JRR writes, giving an "accuracy" figure, and identifying that figure with the false-positive rate is an incomplete, even misleading characterization of this sort of system.
There are two variables that matter: the false positive rate FP, and the false negative rate FN. These sorts of systems generally have a "response curve", in which FP is plotted as a function of FN. This response curve completely characterizes the behavior of the system.
The response curve generally slopes downwards --- a lower FN corresponds to a higher FP. This is easy to understand: a system that identifies everyone as a terrorist (FP=1) never misidentifies a terrorist as a good guy (FN=0), whereas if the machine gives everyone a pass (FP=0) it will never ID a terrorist (FN=1).
There is generally a user-settable threshold that can be dialed up or down, depending on the desired "sensitivity". Depending on the setting, the system moves up or down its response curve. The name of the game is to claim that your system has a sweet spot: a portion of the curve that is sensitive (low FN) but has an acceptably low FP rate. That is, the response curve starts at FN=0, FP=1, as it must, but then immediately plummets down to a low FP (say FN=0.01, FP=0.01), levels out, and then gradually extends to FN=1, FP=0. The sweet spot is where the curve levels off.
It is absolutely crucial that such a sweet spot should exist for any system that is to be used for mass loyalty/criminal intent screening (as opposed to investigation of suspects), because any appreciable FP will manifest itself as tens, hundreds, or even thousands of false bad-guy IDs, depending on the number of people screened and on the sensitivity setting. Turning the sensitivity down to turn down the FP noise will result in a higher FN, which is to say a lower chance of catching a bad guy.
Insofar as I am aware, no system of any kind --- magic terrorist detectors, polygraphs, or anything else --- actually has this kind of response curve. The Congressionally-commissioned National Academies study of polygraphing found that the actual available calibration data was comically inadequate to support the efficacy claims made by securocrats and vendors, but that to the extent that data is available, no sweet spot exists. The only reason polygraph loyalty screening doesn't result in hundreds of US security employees losing their jobs every year is that the sensitivity is turned way down --- the tests couldn't detect a spy even if he'd just come back from meeting his Chinese controller (say). Ouija boards would be equally effective. But the securocracy finds polygraphs so familiar and reassuring that they reject any such criticism out of hand, preferring magical thinking to scientific uncertainty.
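The trade-off along a response curve can be illustrated with a toy model. This is a sketch only: it assumes the detector produces a numeric score, that innocent and guilty scores follow two overlapping normal distributions (the means and spreads below are invented), and that anyone scoring above the threshold is flagged. No real detector is being modelled.

```python
from statistics import NormalDist

# Hypothetical score distributions, chosen purely for illustration.
innocent_scores = NormalDist(mu=0.0, sigma=1.0)
guilty_scores = NormalDist(mu=1.5, sigma=1.0)


def rates_at(threshold):
    """Return (FP, FN) when everyone scoring above `threshold` is flagged."""
    fp = 1.0 - innocent_scores.cdf(threshold)  # innocents above the threshold
    fn = guilty_scores.cdf(threshold)          # guilty at or below the threshold
    return fp, fn


# Sweep the user-settable threshold from "flag everyone" to "flag no one".
for threshold in (-3.0, 0.0, 0.75, 1.5, 4.5):
    fp, fn = rates_at(threshold)
    print(f"threshold={threshold:5.2f}  FP={fp:.3f}  FN={fn:.3f}")
```

Under these assumed distributions no threshold gets both rates low at once: at the midpoint (0.75) both FP and FN are about 23%. A sweet spot would require the two score distributions to barely overlap.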
Misunderstanding tests in this and other ways has much greater implications in the medical context. Everyone wants to do screening tests but most screening tests aren't very good. There are very few screening tests that have decent analytic and clinical validity and clinical utility. Just watch how medical tests are dealt with in the health reform discussions.
Nice article here:
It has a great quote from the chairman of the United States Preventive Services Task Force: “There are five things that can happen as a result of screening tests, and four of them are bad.”
Accuracy here is a function of three independent factors:
1. Sensitivity is how good the test is at detecting the quality when it is present. This is the unitary complement of beta (the false-negative rate).
2. Specificity is how good the test is at reporting a negative when the quality is indeed absent. This is the unitary complement of alpha (the false-positive rate).
3. Prevalence is frequency of occurrence in the population under test.
Accuracy is (sensitivity)*(prevalence) + (specificity)*(1-prevalence).
Given a test with a=.1 and b=.1, accuracy is .9 at every prevalence; at p=.5, .1 of the positives are expected to be false and .1 of the negatives also false.
When the quality is more likely than not, a smaller share of the positives is false and a larger share of the negatives is false.
When the quality is less likely than even-money, a larger share of the positives is false and a smaller share of the negatives is false.
When the quality is exceedingly rare, virtually every positive will be a false positive, and it won't matter if the test is 99% accurate if the follow-up routinely identifies the negatives as false negatives.
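A minimal sketch of how the three factors combine, assuming sensitivity and specificity are both 0.9. The function name and the choice of prevalence values are illustrative, not taken from the article.

```python
def screening_metrics(sensitivity, specificity, prevalence):
    """Return (accuracy, share of positives that are false, share of negatives that are false)."""
    tp = sensitivity * prevalence              # quality present, test positive
    fn = (1 - sensitivity) * prevalence        # quality present, test negative
    tn = specificity * (1 - prevalence)        # quality absent, test negative
    fp = (1 - specificity) * (1 - prevalence)  # quality absent, test positive
    accuracy = tp + tn
    return accuracy, fp / (tp + fp), fn / (tn + fn)


for prevalence in (0.9, 0.5, 0.1, 0.001):
    accuracy, false_pos_share, false_neg_share = screening_metrics(0.9, 0.9, prevalence)
    print(f"p={prevalence:<6} accuracy={accuracy:.3f} "
          f"false among positives={false_pos_share:.3f} "
          f"false among negatives={false_neg_share:.4f}")
```

At a prevalence of 0.001, about 99% of the positives are false even though each individual test is right nine times out of ten.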
History is written by the winners, so all police need to do, if they want to improve their odds of having captured a real terrorist, is shoot the guy.
Nicely concise but...
You are assuming that everybody knows what ranges alpha and beta can have and what a unitary complement is.
Also what the difference between true and false positives/negatives is.
The point of the article Bruce pointed to was "a simple example" for the "lay person".
As always, language and its various meanings get in the way of comprehension ;)
@ Pete Austin,
"all police need to do, if they want to improve their odds of having captured a real terrorist, is shoot the guy."
Naw, it used to (supposedly) work in NY with "street punks", but when the UK Met Police tried it with a Brazilian electrician it came unstuck.
Sometimes people have greater power and influence once they are dead than they ever did when alive.
This is off-topic, but several previous commenters mentioned "swine flu".
Have you noticed how the advice about hand washing has changed? With other diseases, infected people and doctors had to wash their hands, for example:
I have frequently seen medical staff cleaning their hands, on TV, but never once seen a healthy patient doing so to protect against CD or MRSA.
But with swine flu, have you noticed how this advice about hand washing has been swapped around? Now it's mainly directed at healthy people who want to avoid catching the dread disease, for example:
I assume politicians think people won't notice the subtle reversal, and they are hoping to benefit from "Health Theater".
@ Pete Austin,
"But with swine flu, have you noticed how this advice about hand washing has been swapped around?"
I'm waiting for the dermatologists to pop up and say "too much washing of the hands is unhealthy" (which it is).
As "Hägar the Horrible" once advised,
"Son, the secret to life is moderation, but don't overdo it".
That's why I keep bringing up the three factors -- sensitivity, specificity, and prevalence -- and rail against attempts to collapse three dimensions into one.
An elephant detector would be 100% accurate when used on airline passengers at a boarding gate, even if the detector were made out of wood.
Our company created a tiger detector made of wood, and this discussion of statistics has convinced me we can also sell it as an elephant detector, rhinoceros detector, and several other species. Thanks to Bruce and to other commenters for bringing this new marketing opportunity to our attention.
When trying to explain the base rate fallacy to some of my cohorts, I run into the problem of saying that such a detection machine is 90% accurate vs. that the machine is right 90% of the time.
There IS a difference, right?
It seems that the confusion the general public have when something is 90% accurate, with the whole forgetting-the-false-positives thing, is that they're (inadvertently) assuming they've gotten past that. In their minds, the statement reads that 90% accurate means that when the machine beeps that it's a terrorist, nine times out of ten it's correct, and that was a terrorist.
Is this a fair statement? Can someone add something to this to make it clearer to the general schlomo?
Another mental trick I use for explaining the base rate fallacy is to suggest a screening system which always gives the more likely answer. Then I point out just how much higher the accuracy of the "always guess the more likely outcome" screening actually is.
Several others have mentioned the question of 90% accurate vs other methods of measuring effectiveness. The article jumps dramatically from a single percentage to failure rates, without questioning what the 90% means. If it's 0% false-positives and 10% false-negatives, that would be a 90% where the chance of that one guy being a terrorist is 90%. On the other side, a 10% false-positive rate and 0% false-negative rate also yields 90%, but the percentage chance of a positive test subject being a terrorist depends highly on the terrorist population at large.
I think the article assumes 10% false-negative and 10% false-positive. I don't know of a scanning machine out there which does that. Usually you pick your null hypothesis and adjust tweaking factors until your most damaging case has a lower rate.
From the article
"If 3,000 people are tested, and the test is 90% accurate, it is also 10% wrong. So it will probably identify 301 terrorists - about 300 by mistake and 1 correctly."
The story assumes there is for sure 1 terrorist in the population of 3000. OK.
This ignores the case where the terrorist did not get a positive signal.
Basically it's saying the false positive rate is 10%. You might ask, what would the false negative rate have to be for the 1 terrorist to "probably" be among the group testing positive? Exercise left to the reader.
I remember teaching this kind of thing re AIDS testing in an intro stat class back in the early '90s.
Roy's "accuracy" (see above) is in fact "the probability that the machine is right", irrespective of whether the subject is benign or malignant. As he points out, it is a function not only of system parameters (FP and FN rates), but also of the proportion of malignants to benigns in the tested population (Roy's "prevalence").
You have no control over that proportion, except to the extent that you pre-screen. What you do control is the FP and FN rates, which you can trade off according to the system response function. So you have some limited control over the "accuracy", assuming the population proportion of malefactors is fixed. In existing systems that I am aware of, this control is insufficient to make a satisfactory bad-guy screening detector.
However, as you say, the term "accuracy" is often, misleadingly, used to characterize the FP rate (as in the cited article) or the FN rate alone. This sort of usage is worse than useless, although it makes for great marketing copy and lazy journalist bait.
I've got the perfect terrorist detector for the scenario in the article.
It has a zero false positive rate, and a false negative rate of 0.03%: Just scan everyone with it, and it says "not a terrorist", unfailingly, every time.
Yes and no: 90% accuracy means that the system is right nine times out of ten. But as terrorists are incredibly rare in the general population, the real issue is not false negatives (a terrorist is falsely assumed to be clean) but false positives (a clean person is labeled terrorist).
If you have 3000 clean people and 1 suspected terrorist, and your detector works with 90% accuracy, it will name 300 "terrorists". Of them only 0.9 are dangerous; all others are false positives.
Do you get the numbers now?
For somebody who was trying to explain the base rate fallacy in less than 250 words he didn't do half bad. Remember, his target is not the techies and geekfolk of the world.
@Dylan: Yes, an accuracy of 90% means the machine is right 90% of the time and wrong 10% of the time.
The problem comes when you apply this machine to ten million people per year at airports. When the machine is wrong, either it will miss a guilty person, or it will wrongly flag an innocent person. Suppose there are 100 actual terrorists out of those ten million people (which probably vastly exaggerates the number of terrorists, but never mind). Then the machine is likely to miss at least 10 of the terrorists--oh well--but far more importantly, it's going to flag about *one million innocent people*.
So what you have is a test that flags 90 terrorists and about one million innocent people. Which is utterly useless!
Notice that even if the machine was 99.9% accurate, you'd still have the same problem, only slightly less severe. The machine flags one out of every thousand people incorrectly. If you're lucky, it will flag all 100 of the terrorists as bad guys, and that part is great. The not-so-great part is that it also flags around 10,000 innocent people out of the ten million innocents! So approximately 1 in 100 of the people flagged by the machine would be terrorists, and the rest would have to be processed (and harassed, and investigated, and held without bail, and have their rights trampled on in dozens of other ways, all at the taxpayer expense).
Of course there will never, ever be a test that is 99.9% accurate at detecting "terrorists". Even 90% accuracy sounds wildly optimistic to me, and is already so inaccurate as to be downright useless.
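The airport figures above can be reproduced in a few lines. This is a rough sketch using the commenter's assumed numbers (ten million travellers, 100 terrorists) and assuming the same error rate applies to innocent and guilty alike; the function name is hypothetical.

```python
def expected_outcomes(travellers, terrorists, accuracy):
    """Expected counts when one error rate applies to both groups."""
    error_rate = 1 - accuracy
    innocents = travellers - terrorists
    caught = terrorists * accuracy         # terrorists correctly flagged
    missed = terrorists * error_rate       # terrorists waved through
    false_alarms = innocents * error_rate  # innocents wrongly flagged
    return caught, missed, false_alarms


for accuracy in (0.90, 0.999):
    caught, missed, false_alarms = expected_outcomes(10_000_000, 100, accuracy)
    share = caught / (caught + false_alarms)
    print(f"accuracy={accuracy:.1%}: caught={caught:.1f}, missed={missed:.1f}, "
          f"false alarms={false_alarms:,.0f}, terrorists among flagged={share:.4%}")
```

Even at 99.9% accuracy, roughly 1 in 100 of the flagged travellers is a terrorist; at 90% it is closer to 1 in 11,000.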
Just one little thing people are forgetting here...
Terrorists are very very rare (low prevalence).
In the UK, current (in jail) ordinary criminals are hovering around 0.1% of the population.
Even when prevalence is very low (terrorists in general population) you need to be careful with your test group size and the likelihood of a terrorist being present in the test group.
You need to think through what each variable in the test you are doing has on the outcome.
To start with you have a box that indicates one thing (approximately) 90% of the time.
But what does it mean?
With the box you get a true or false output (not terrorist 90%, terrorist 10%) which in turn may be correct or incorrect.
So there are four not two possible outcomes from each use of the box,
A) Terrorist : who is (correct) [TP].
B) Non Terrorist : who is not (correct) [TN].
C) Terrorist : who is not (incorrect) [FP].
D) Non Terrorist : who is (incorrect) [FN]
The first incorrect (C) is known as "an error of the first type" or False Positive, usually due to the test sensitivity being too high. You could say the box (if it were human) was skeptical and saw fault where there was none.
The second incorrect (D) is known as "an error of the second type" or False Negative, due to the test specificity being too high. That is, you could say the box (if it were human) was complacent and had committed an oversight.
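The four outcomes A to D can be written down as a small classification routine. This is a sketch only; the function name and the handful of sample results are invented for illustration.

```python
from collections import Counter


def outcome(box_says_terrorist, is_terrorist):
    """Label one use of the box as TP, TN, FP or FN."""
    if box_says_terrorist and is_terrorist:
        return "TP"  # A) says terrorist, and is one
    if not box_says_terrorist and not is_terrorist:
        return "TN"  # B) says non terrorist, and is not one
    if box_says_terrorist:
        return "FP"  # C) says terrorist, but is not (error of the first type)
    return "FN"      # D) says non terrorist, but is (error of the second type)


# Made-up (box output, ground truth) pairs.
results = [(True, False), (False, False), (False, True), (True, True), (False, False)]
tally = Counter(outcome(flag, truth) for flag, truth in results)
print(dict(tally))
```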
Next you really need to consider not just one test but a number of tests taken on a subset or group of the general population. Each member of the "test group" is (supposedly) selected at random from the general population and therefore each is "independent" of each other (in reality this is almost never the case).
The acid question you need to ask is what is the likelihood of my very very rare target (terrorist) being in my test group, and importantly to what extent.
Which means of the test group of 3000 people one or more may or may not be a terrorist (after all intel can be wrong).
Which means you can have four cases,
In case 1 (no terrorists) you have,
1.A) 2700 not terrorist : who are not (correct),
300 terrorist : who are not (incorrect).
In case 2 (1 terrorist) you have (effectively) the same output from the device but two possibilities,
2.A) 2699 non terrorists : who are not (correct),
300 terrorist : who are not (incorrect),
1 non terrorist : who is (incorrect).
2.B) 2700 non terrorists : who are not (correct),
299 terrorists : who are not (incorrect),
1 terrorist : who is (correct).
In case 3 where you have 2 (or more) terrorists,
3.A) 2698 non terrorists : who are not (correct),
300 terrorist : who are not (incorrect),
2 non terrorist : who are (incorrect).
3.B) 2700 non terrorists : who are not (correct),
298 terrorists : who are not (incorrect),
2 terrorist : who are (correct).
3.C) 2699 non terrorists : who are not (correct),
299 terrorists : who are not (incorrect),
1 non terrorist : who is (incorrect),
1 terrorist : who is (correct).
Finally there is case 4 where your group is all terrorists (I'm leaving the numbers the same, unlikely as it is),
4.A) 2700 non terrorist : who are (incorrect),
300 terrorist : who are (correct).
As you can see, not only the prevalence of the target group "terrorists" affects the expected outcome, but also the test group size with respect to the prevalence within the general population.
Which shows just some of the issues (there are also other types of error that need to be considered as well) you can expect to come across.
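The "acid question" of whether a rare target is in the test group at all can be estimated as follows. This is a sketch under strong assumptions: group members are drawn independently at random (which, as noted, is almost never true in practice), and the one-in-a-million prevalence for terrorists is an invented figure used only for contrast with the 0.1% figure for jailed criminals.

```python
def p_at_least_one(group_size, prevalence):
    """Probability that an independently drawn group contains one or more targets."""
    return 1 - (1 - prevalence) ** group_size


# 0.1% comes from the comment; 1e-6 is a purely hypothetical terrorist prevalence.
for label, prevalence in (("jailed criminals, 0.1%", 0.001),
                          ("hypothetical terrorists, 1 in a million", 1e-6)):
    print(f"{label}: {p_at_least_one(3000, prevalence):.2%}")
```

With a group of 3000, a 0.1% prevalence makes at least one target about 95% likely, while a one-in-a-million prevalence makes it about 0.3% likely, so the "no terrorists" case dominates.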
@ Mat: I have that book, "How to Lie With Statistics". I too am reminded of it frequently. It's a good read.
As for the article, I like how the author reframes the question: Rather than looking at how 90% of bad guys will be caught, instead look at how 10% of good guys will be falsely suspected. (Assuming 90% successful identification.) Namely, take the additive inverse and apply it to the opposite group.
Makes things less impressive.
> I've got the perfect terrorist detector
> for the scenario in the article.
> It has a zero false positive rate, and a
> false negative rate of 0.03%: Just scan
> everyone with it, and it says "not a
> terrorist", unfailingly, every time.
Actually, that has a false negative rate of 100%. However, it has an accuracy of about 99.97%.
You have to be careful how you manipulate your information in your marketing strategy. Don't mention the false negative rate, and tout the high accuracy.
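The numbers for the "always says not a terrorist" detector can be checked directly. A minimal sketch, assuming the article's scenario of one terrorist in 3000 people; the function name is made up.

```python
def always_negative_metrics(population, terrorists):
    """Metrics for a detector that answers "not a terrorist" for everyone."""
    tp = 0                        # it never flags anyone, so no true positives
    fp = 0                        # ...and no false positives
    tn = population - terrorists  # every innocent is correctly passed
    fn = terrorists               # every terrorist is missed
    fp_rate = fp / (fp + tn)
    fn_rate = fn / (fn + tp)
    accuracy = (tp + tn) / population
    return fp_rate, fn_rate, accuracy


fp_rate, fn_rate, accuracy = always_negative_metrics(3000, 1)
print(f"FP rate={fp_rate:.0%}, FN rate={fn_rate:.0%}, accuracy={accuracy:.2%}")
```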