Bruce Schneier

# Schneier on Security

A blog covering security and security technology.

## July 10, 2006

### Terrorists, Data Mining, and the Base Rate Fallacy

I have already explained why NSA-style wholesale surveillance data-mining systems are useless for finding terrorists. Here's a more formal explanation:

Floyd Rudmin, a professor at a Norwegian university, applies the mathematics of conditional probability, known as Bayes' Theorem, to demonstrate that the NSA's surveillance cannot successfully detect terrorists unless both the percentage of terrorists in the population and the accuracy rate of their identification are far higher than they are. He correctly concludes that "NSA's surveillance system is useless for finding terrorists."

The surveillance is, however, useful for monitoring political opposition and stymieing the activities of those who do not believe the government's propaganda.

What is the probability that people are terrorists given that NSA's mass surveillance identifies them as terrorists? If the probability is zero (p=0.00), then they certainly are not terrorists, and NSA was wasting resources and damaging the lives of innocent citizens. If the probability is one (p=1.00), then they definitely are terrorists, and NSA has saved the day. If the probability is fifty-fifty (p=0.50), that is the same as guessing the flip of a coin. The conditional probability that people are terrorists given that the NSA surveillance system says they are, that had better be very near to one (p=1.00) and very far from zero (p=0.00).

The mathematics of conditional probability were figured out by the English logician Thomas Bayes. If you Google "Bayes' Theorem", you will get more than a million hits. Bayes' Theorem is taught in all elementary statistics classes. Everyone at NSA certainly knows Bayes' Theorem.

To know if mass surveillance will work, Bayes' theorem requires three estimations:

1. The base-rate for terrorists, i.e. what proportion of the population are terrorists;
2. The accuracy rate, i.e., the probability that real terrorists will be identified by NSA;
3. The misidentification rate, i.e., the probability that innocent citizens will be misidentified by NSA as terrorists.

No matter how sophisticated and super-duper are NSA's methods for identifying terrorists, no matter how big and fast are NSA's computers, NSA's accuracy rate will never be 100% and their misidentification rate will never be 0%. That fact, plus the extremely low base-rate for terrorists, means it is logically impossible for mass surveillance to be an effective way to find terrorists.

I will not put Bayes' computational formula here. It is available in all elementary statistics books and is on the web should any readers be interested. But I will compute some conditional probabilities that people are terrorists given that NSA's system of mass surveillance identifies them to be terrorists.
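For readers who do want the formula, here is a standard statement of Bayes' Theorem applied to the three estimates listed above. The symbols are labels chosen here for illustration, not notation from Rudmin's article: $T$ means "is a terrorist" and $F$ means "is flagged by the system".

```latex
P(T \mid F) \;=\;
\frac{P(F \mid T)\,P(T)}
     {P(F \mid T)\,P(T) \;+\; P(F \mid \neg T)\,\bigl(1 - P(T)\bigr)}
```

Here $P(T)$ is the base rate, $P(F \mid T)$ is the accuracy rate, and $P(F \mid \neg T)$ is the misidentification rate.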

The US Census shows that there are about 300 million people living in the USA.

Suppose that there are 1,000 terrorists there as well, which is probably a high estimate. The base-rate would be 1 terrorist per 300,000 people. In percentages, that is .00033%, which is way less than 1%. Suppose that NSA surveillance has an accuracy rate of .40, which means that 40% of real terrorists in the USA will be identified by NSA's monitoring of everyone's email and phone calls. This is probably a high estimate, considering that terrorists are doing their best to avoid detection. There is no evidence thus far that NSA has been so successful at finding terrorists. And suppose NSA's misidentification rate is .0001, which means that .01% of innocent people will be misidentified as terrorists, at least until they are investigated, detained and interrogated. Note that .01% of the US population is 30,000 people. With these suppositions, then the probability that people are terrorists given that NSA's system of surveillance identifies them as terrorists is only p=0.0132, which is near zero, very far from one. Ergo, NSA's surveillance system is useless for finding terrorists.
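A minimal sketch of the arithmetic behind this first scenario, in Python. The function name `posterior` and its parameter names are chosen here for illustration; the inputs are the suppositions stated in the paragraph above.

```python
def posterior(base_rate, hit_rate, false_alarm_rate):
    """P(terrorist | flagged), computed with Bayes' theorem."""
    true_pos = hit_rate * base_rate                   # flagged and a terrorist
    false_pos = false_alarm_rate * (1.0 - base_rate)  # flagged but innocent
    return true_pos / (true_pos + false_pos)


population = 300_000_000
terrorists = 1_000
base_rate = terrorists / population  # 1 in 300,000

p = posterior(base_rate, hit_rate=0.40, false_alarm_rate=0.0001)
print(f"{p:.4f}")  # 0.0132
```

In head counts, that is about 400 real terrorists among roughly 30,400 flagged people.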

Suppose that NSA's system is more accurate than .40, let's say, .70, which means that 70% of terrorists in the USA will be found by mass monitoring of phone calls and email messages. Then, by Bayes' Theorem, the probability that a person is a terrorist if targeted by NSA is still only p=0.0228, which is near zero, far from one, and useless.

Suppose that NSA's system is really, really, really good, really, really good, with an accuracy rate of .90, and a misidentification rate of .00001, which means that only 3,000 innocent people are misidentified as terrorists. With these suppositions, then the probability that people are terrorists given that NSA's system of surveillance identifies them as terrorists is only p=0.2308, which is far from one and well below flipping a coin. NSA's domestic monitoring of everyone's email and phone calls is useless for finding terrorists.
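The three scenarios can be checked side by side with a short script. This is a sketch under the article's own suppositions; the helper `required_false_alarm_rate` is an addition for illustration that inverts Bayes' formula to ask how low the misidentification rate would have to go to reach a given posterior.

```python
def posterior(base_rate, hit_rate, false_alarm_rate):
    """P(terrorist | flagged), computed with Bayes' theorem."""
    tp = hit_rate * base_rate
    fp = false_alarm_rate * (1.0 - base_rate)
    return tp / (tp + fp)


def required_false_alarm_rate(base_rate, hit_rate, target):
    """False-alarm rate at which the posterior equals `target`."""
    return hit_rate * base_rate * (1.0 - target) / (target * (1.0 - base_rate))


base_rate = 1_000 / 300_000_000

# (accuracy rate, misidentification rate) for the three scenarios in the text
scenarios = [(0.40, 0.0001), (0.70, 0.0001), (0.90, 0.00001)]
results = [posterior(base_rate, h, f) for h, f in scenarios]
for (h, f), p in zip(scenarios, results):
    print(f"accuracy {h:.2f}, misidentification {f:.5f}: p = {p:.4f}")

# How good would the misidentification rate need to be for a coin-flip posterior?
f_even = required_false_alarm_rate(base_rate, hit_rate=0.90, target=0.5)
print(f"misidentification rate needed for p = 0.5: {f_even:.2e}")
```

The script reproduces p=0.0132, p=0.0228 and p=0.2308. With a 90% accuracy rate, reaching p=0.50 would take a misidentification rate of about three in a million, or roughly 900 innocent people flagged nationwide.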

As an exercise to the reader, you can use the same analysis to show that data mining is an excellent tool for finding stolen credit cards, or stolen cell phones. Data mining is by no means useless; it's just useless for this particular application.

klassobanieras • July 10, 2006 7:48 AM

Isn't the point of wholesale surveillance like this to prune down the number of cases that need to be examined? If so, p=0.01 means that the NSA only has to look at 100 guys to find one terrorist, which seems rather useful to me.

And how is p=0.5 the same as flipping a coin? That's such a self-evidently wrong statement that I'm sure I'm missing something.

Lou the troll • July 10, 2006 7:53 AM

@klassobanieras: Let me guess, either you don't believe that coins are evenly weighted or you live somewhere that has coins with more than two sides...

Lou the troll

klassobanieras • July 10, 2006 8:07 AM

@Lou the troll: The flipping-a-coin analogy is used to suggest that a system with p=0.5 is somehow the same as blind luck. This is clearly not true - in such a system, half the guys to come to the NSA's attention would be terrorists. I'm sure they would find that somewhat more useful than flipping a coin.

lukem • July 10, 2006 8:08 AM

I hate to defend NSA surveillance but klassobanieras has made a fine point.

It all depends on how many false positives there are... If it's 3 million then that is a problem, but if it's 30,000 it seems like such a system could actually be useful.

Rubyin's assertion that 30,000 false positives would make "NSA's surveillance system useless for finding terrorists" seems a little bit unsubstantiated, don't you think?

Frank Ch. Eigler • July 10, 2006 8:10 AM

> NSA's accuracy rate will never be 100% and
> their misidentification rate will never be 0%.
> That fact, plus the extremely low base-rate for
> terrorists, means it is logically impossible for
> mass surveillance to be an effective way to find
> terrorists.

This claim of "logical impossibility" is itself a
faulty leap of logic. 100%/0% are straw man
standards.

wondering • July 10, 2006 8:13 AM

If this blog ever reported the US Government doing something right I'd take it more seriously. Alternatively, if there were more suggestions of what to do instead...

lukem • July 10, 2006 8:13 AM

Sorry, I meant Rudmin, not Rubyin!

Dude N. • July 10, 2006 8:14 AM

Hang on. This analysis seems to implicitly assume that there exists a fair coin somewhere such that if you flip it to assign heads or tails to each person, the probability that a head will be a terrorist is 0.5. But that's silly. The population of terrorists and non-terrorists gets split evenly by the coin, and the probability that someone labeled with a head is a terrorist is the same as the prior probability that someone is a terrorist -- 0.00033% in this example.

So now, the analysis tells us that if NSA can improve this percentage by two orders of magnitude to 0.02%, they have accomplished nothing, because they haven't increased the percentage to 50%. But as klassobanieras said, fair coins won't magically produce a set with half terrorists in it.

I don't mind being against wholesale Hoovering of our data; I'm against it myself. But this analysis seems fundamentally flawed, and I think it damages the credibility of opposition to NSA's data mining program.
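A quick check of the coin-flip point, as a sketch using the article's figures of 300 million people and 1,000 terrorists. The variable names are chosen here for illustration, and the coin is treated in expectation (it flags exactly half of each group).

```python
population = 300_000_000
terrorists = 1_000
base_rate = terrorists / population

# A fair coin flags half of the terrorists and half of the innocents.
flagged_terrorists = 0.5 * terrorists
flagged_innocents = 0.5 * (population - terrorists)
coin_share = flagged_terrorists / (flagged_terrorists + flagged_innocents)

# Share of terrorists among flagged people in the article's first scenario.
nsa_share = 0.0132
lift = nsa_share / base_rate

print(f"coin flip: {coin_share:.7%} of flagged people are terrorists")
print(f"article's first scenario: {nsa_share:.2%}, about {lift:,.0f}x the base rate")
```

The coin leaves the proportion at the base rate, while the article's own p=0.0132 is 1.32% of flagged people, roughly 4,000 times the base rate. Whether that improvement is enough is a separate question.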

Tamas • July 10, 2006 8:15 AM

@klassobanieras: I don't think you are following the analysis. You are making a statement about the posterior probability ("half the guys to come to the NSA's attention would be terrorists"), whereas the article calculates this probability by making assumptions about prior and conditional probabilities.

Generally, people are very bad at estimating probabilities, especially if they result from complex processes, or belong to rare events. That's why it is better to calculate them formally, which is exactly what the interesting part of statistics does.*

*For historical reasons, descriptive statistics is also called "statistics", even if it involves no inference, just summary, for example, think of the GDP.

I find it hard to believe (but not impossible) that the NSA lacks the skills in statistics to not understand the value (or lack thereof) in the use of data mining.

lukem • July 10, 2006 8:21 AM

I think we can stop bickering about the p=.5 coin flip stuff, the real question is: Can such systems reduce the number of false positives to something manageable?

Rudmin's analysis was fine until he started saying unsubstantiated things like 30k false positives make such a system useless.

Tamas • July 10, 2006 8:22 AM

@Dude N: I think that the coin in the article just serves as a benchmark for evaluating the effectiveness of the program. To argue whether the program is worthwhile, one would need to assign a loss function, i.e., somehow formalize the losses/gains associated with identifying a terrorist correctly (T), missing a terrorist (M), suspecting an innocent citizen (S), and identifying an innocent person correctly (I). Depending on these numbers (and the cost of the program itself), 0.02 might either be good or insufficient. E.g., if M is a much higher cost than S, then even very low posterior probabilities would be OK. The hard question is estimating these costs/benefits.
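To make the loss-function idea concrete, here is a minimal sketch in Python. The cost values are entirely made up for illustration (arbitrary units, not estimates of anything real), the program's own running cost is assumed to be zero, and the function name `expected_loss` is chosen here; only the labels T, M, S and I come from the comment above.

```python
def expected_loss(population, terrorists, hit_rate, false_alarm_rate,
                  costs, program_cost=0.0):
    """Total expected loss given per-person costs for the four outcomes."""
    innocents = population - terrorists
    tp = hit_rate * terrorists            # terrorists identified correctly (T)
    fn = terrorists - tp                  # terrorists missed (M)
    fp = false_alarm_rate * innocents     # innocents suspected (S)
    tn = innocents - fp                   # innocents correctly cleared (I)
    return (tp * costs["T"] + fn * costs["M"]
            + fp * costs["S"] + tn * costs["I"] + program_cost)


POP, TERR = 300_000_000, 1_000

# Hypothetical costs: missing a terrorist is set to 1000 units.
cheap_suspicion = {"T": 0.0, "M": 1000.0, "S": 1.0, "I": 0.0}
costly_suspicion = {"T": 0.0, "M": 1000.0, "S": 50.0, "I": 0.0}

# No program at all: nobody is flagged, every terrorist is missed.
baseline = expected_loss(POP, TERR, 0.0, 0.0, cheap_suspicion)
with_cheap = expected_loss(POP, TERR, 0.40, 0.0001, cheap_suspicion)
with_costly = expected_loss(POP, TERR, 0.40, 0.0001, costly_suspicion)

print(f"no program:               {baseline:,.0f}")
print(f"program, S = 1 per flag:  {with_cheap:,.0f}")
print(f"program, S = 50 per flag: {with_costly:,.0f}")
```

With the article's first-scenario rates, the same posterior of p=0.0132 makes the program look worthwhile when suspecting an innocent is cheap and harmful when it is expensive. Under these invented numbers the break-even point is a cost per false suspicion of about 1/75 of the cost of a missed terrorist.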

klassobanieras • July 10, 2006 8:31 AM

@Tamas: The analysis defines p as "What is the probability that people are terrorists given that NSA's mass surveillance identifies them as terrorists?"

If p=0.5, then a person brought to the NSA's attention by their system has a 50% probability of being a terrorist. The analysis suggests that this is somehow useless or random, which is what I don't understand.

Perhaps this is the coin explanation?

Method 1): Take a random person. Flip a coin to determine his terroristness, heads for OK, tails for terrorist.

Method 2): Run that same person through the NSA system. It will come back "yes/no" this person is a terrorist.

Rudmin is saying that the coin is going to be correct more often than the NSA.

Also, 30,000 might seem like an acceptably small number of false positives, but if you are one of the 30,000 people whose lives are ruined, I suspect you would disagree.

Government is supposed to protect you from terrorists, not provide you with surrogate terrorists.

Loyal Citizen • July 10, 2006 8:42 AM

@lukem

30k false positives *a day* could very well make the system useless.

What this analysis is saying is that the misidentification rate (Type I -- 'false positive') is THE MOST important part of this type of analysis -- even more so than the accuracy rate.

Bottom line, without further information either way regarding the actual scope of the system, it would appear that NSA has its work cut out for it -- the misidentification rate could very well be the most difficult part of the problem to solve.

Anonymous • July 10, 2006 8:46 AM

@bob, it's not clear that a person deemed as a terrorist by the system would have his or her life ruined. Maybe it would just mean they were subjected to additional surveillance or something. And yes that might be bad if that happened 30,000 times, and we might not want to make that tradeoff.

But then Rudmin goes on to assert that the NSA system would still be useless even if the system magically had only 3,000 false positives, while identifying 700 out of the hypothetical 1000 terrorists in the country. It just seems totally baffling to me that he could still be saying the system is useless.

Shura • July 10, 2006 8:49 AM

@arl: Of course the NSA knows about these things - they're not stupid. The whole thing just shows that the "we're doing this to catch terrorists" justification is a big, fat lie.

Tim Kirk • July 10, 2006 8:49 AM

From the article - using the most optimistic figures.

"then the probability that people are terrorists given that NSA's system of surveillance identifies them as terrorists is only p=0.2308"

So even with very optimistic assumptions the NSA system would produce 3 times as many false positives as correct identifications (0.7692 to 0.2308). Given the recent accidental shooting in the UK of an innocent man (who was then detained, with his brother, for several days) I can see a lot of problems with a system that pulls names out of great masses of data (too great for it to be easy for a human to double-check, especially quickly, with the fear of being too slow when searching for terrorists) and yet is wrong 3 times out of 4.

The number of false positives seems less important to me than the odds of any given positive being false - and unless I'm totally mis-remembering my maths from school that is the real problem here.

Anonymous • July 10, 2006 8:49 AM

Oops, that last post (@bob, it's not clear) was from me (lukem), not anonymous.

@arl
--I find it hard to believe (but not impossible) that the NSA lacks the skills in statistics to not understand the value (or lack thereof) in the use of data mining. --

The concern is not that they lack the skills, but lack the concern about the potential harmful side effects. To an agency whose whole focus is preserving a governmental structure, individuals lie pretty far down on the priority list.

Tamas • July 10, 2006 8:51 AM

@klassobanieras: There is little to understand in the article, because the argument the author makes doesn't allow him to evaluate how effective the program is.

To make claims like that, he would need a loss function, which he does not provide. If missing a terrorist has very high costs, but suspecting somebody (who possibly won't even know about it) has very low costs, then even a probability of 0.02 makes the program effective. For example, many medical tests have worse posterior probabilities than this (because of low incidence in the population), and they are still used, because preventing death from illness is worth the hassle.

Note that I am not saying that the vacuum-cleaner approach to surveillance is good, I think that sooner or later it will diminish civil liberties. I only want to argue that posterior probabilities are useless without a loss function for decision-making. This is such a trivial statement in decision theory that it should be implicit in any kind of analysis...

Couple of tangential points:
1. statisticians have tried to quantify "randomness", and the concept is called entropy. Indeed, a uniform distribution has the highest entropy, so a fair coin is the "least informative" distribution with only two outcomes. However, this is not relevant here, as we are trying to cast this problem in a decision-theoretic framework, for which you need a loss function.

2. arl made the argument which can be best summarized as "NSA wouldn't be doing this if it wasn't worthwhile". It is true that they have some of the best statisticians, but they don't have the same loss functions as the rest of society (e.g., they care less about suspecting innocent people), so they are not the best candidate for evaluating whether such programs are good.

If false positive (falsely identifying someone as a terrorist - not sure whether that's a "negative" or not :-)) is NOT a problem then merely declare everyone in the US as a terrorist and the problem is solved!

This is similar to the FAA only requiring an aircraft gas gauge to be correct when the tank is empty. So simply paint a nonmoving needle on the gauge at "E" and you will satisfy the regulation, but probably not solve the problem.

Anonymous • July 10, 2006 8:58 AM

@bob, ha :-)

But I still think the cost of a false positive has been somewhat poorly defined. If all the false positives were immediately locked away in secret government prisons, then we obviously would be willing to tolerate very few false positives. (Maybe none.) But say false positives were just subjected to a more intensive wiretapping? It's not great but it's definitely not the same as ruining their lives. And in some cases we as a society might be willing to make that tradeoff.

33K people are false positives out of what sized suspect population identified for follow-up?

If based on 1000 terrorists is the suspect list size = 34K?

If based on 40% misidentification is the list size = 82.5K? 52.8K? -- this might assume that the other 60% are either true terrorists or in some gray area of shady characters.

lukem • July 10, 2006 8:59 AM

aargh, I keep forgetting to type my name. that last @bob comment was mine also.

What Rudmin seems to be missing is that 30,000 false positives could be a manageable level in an automated first-pass system, particularly if they occur over the course of, say, a year. The reason is that you can get much more cutdown before you send around the agents in black suits.

A group of maybe 30 humans could cut the number down to maybe 3,000 by reviewing the data (perhaps 100 of which are actually terrorists -- say 90% precision, 30% accuracy), and then collect more data from wiretap.gov. For those of you keeping track at home, they're investigating maybe 4 or 5 people each per work day; not unreasonable for throwing out those "chicken soup is codeword for a bomb" things.

Second pass, in-depth review with more data. Say another pass with similar precision and slightly better accuracy, 10 times more work but with 10 times fewer suspects. Now it's 300 or 400 people, and 50 of them are terrorists.

It's not unreasonable to have the cops investigate 300 people across the US to catch 50 terrorists. That's like, what, one in each state every 2 months?

Of course, I'm making up numbers, and NSA may have trouble getting to 0.01% false positive rate in the first pass, and 10% in the next passes. Who knows whether finding these terrorists a year would be worth the cost of the program, or the abuse potential, or whatever. Still, it's not as ridiculous as this guy makes it sound.
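The multi-pass idea can be sketched as a cascade of filters. The stage rates here are assumptions for illustration: the first pass uses the article's 40% accuracy and 0.01% misidentification rates, and each later pass is assumed to keep half of the remaining terrorists and 10% of the remaining innocents. The sketch also assumes each pass errs independently of the previous ones, which is optimistic if every pass looks at the same data.

```python
def run_cascade(terrorists, innocents, stages):
    """Apply each (hit_rate, false_alarm_rate) stage to those still flagged.

    Returns a list of (terrorists_left, innocents_left, precision) per stage.
    """
    history = []
    for hit_rate, false_alarm_rate in stages:
        terrorists = terrorists * hit_rate
        innocents = innocents * false_alarm_rate
        precision = terrorists / (terrorists + innocents)
        history.append((terrorists, innocents, precision))
    return history


# Hypothetical stage rates: automated first pass, then two human reviews.
stages = [(0.40, 0.0001), (0.50, 0.10), (0.50, 0.10)]
history = run_cascade(1_000, 299_999_000, stages)

for i, (t, n, p) in enumerate(history, start=1):
    print(f"after pass {i}: {t:,.0f} terrorists, {n:,.0f} innocents, p = {p:.4f}")
```

Under these assumed rates the third pass leaves about 100 terrorists among roughly 400 suspects, but 900 of the original 1,000 terrorists have been dropped along the way. Each extra pass raises the posterior only by also lowering the overall hit rate.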

klassobanieras • July 10, 2006 9:05 AM

The population of interest is not necessarily everyone in the USA. Presumably the NSA gets intelligence from time-to-time that narrows this population to (say) a town, recent entrants to the country, or everyone taking a flight on a given date.

Whatever the fraction of suspects that the system can safely eliminate, there is some number of suspects where the NSA cannot cope without the system, but can cope with the system.

The worst piece of intellectually-dishonest claptrap I've seen in a long time.

I see that others have chimed in with all the reasons about why the math may be correct but the fundamental assumptions are fallacious. It reminds me of the arguments offered by snake-oil cryptography, in which an algorithm is offered as "unbreakable" because random guessing won't work.

I'll just contribute a few other comments. First, that the author could have written an interesting article about how the NSA, which is filled with very smart people, must be doing more than simple random data mining. Second, the statement that data mining is an efficient method of suppressing political dissent is (a) unproven and (b) revealing of the author's bias and (c) lacks any identification of the method by which the Big Bad NSA is implementing this policy of intimidation. Thirdly, if the system does not achieve results, the NSA hierarchy, which does not have an unlimited budget and a surfeit of analysts, will drop the program. Fourth, perhaps it's time to remind the author and others of his ilk that when the NSA or FBI computers kick out suspicious activity, the bureaucracy does not automatically dispatch teams of assassins.

Finally, the exact same arguments as given by the author can be applied to any police or intelligence activity. These arguments boil down to "it's not perfect, so we cannot implement it. See, here's the math." Perfect cannot be the enemy of the adequate -- unless your goal is to stymie any effort whatsoever, which is clearly the goal of this author.

I should mention that I'm also dismayed to see an article from LewRockwell.com in this blog, since that site has more than a slight tinge of anti-semitism about it.

" the exact same arguments as given by the author can be applied to any police or intelligence activity "
Njet, police activities should be based on evidence proposed by real humans.
Data-mining is not based on any evidence. It's just virtual shooting at every citizen.
As virtual-shooting and evidence have different characteristics there is different mathematics.

Brian • July 10, 2006 10:00 AM

History is full of examples of bureaucracies continuing to pursue completely useless programs because the programs justify the existence of the bureaucracy. National security bureaucracies are no exception. In fact, they are prone to indulging in security theater, wasting money and time (not to mention trampling civil liberties) without actually making anybody safer.

The best point to take away from this discussion is not that the NSA program is useless, or that it is useful. The point is that there are fundamental questions about how a program such as the NSA's could *ever* be practical, and that the NSA needs to answer those questions. A blanket statement like "we can't tell congress about the program for national security reasons" is more than likely a coverup for incompetence.

lukem • July 10, 2006 10:09 AM

@Brian, that's not been *my* point in this discussion, at least. I'm not defending the NSA program so much as attacking this faulty analysis, which in my view is oversimplistic and flat out wrong!

This appears to cover nothing that isn't discussed in Carnegie Mellon's introductory statistics class for humanities majors (many if not most of them take it as freshmen).

This is sad news considering the amount of funds allocated to NSA.

aburt • July 10, 2006 10:21 AM

I'm not defending the surveillance program (it's for Congress, the Judiciary, and the Executive branch to do the checks&balances dance to determine legality, with input [and votes] from the citizenry as is our right). However, just addressing the feasibility, I'll note that, from what I've seen, a universe of 30k potential suspects to wade through is not likely a problem for the government. I'm personally aware of a case ten years ago in which, to find a lesser criminal, the FBI was willing to investigate a universe of 10k suspects. They have the manpower to do such things.

Also, let's not forget that subsets of actual terrorists may form graphs of communication with each other. If you start finding links between individuals also on your 30k suspect list, they'll stand out and bear more scrutiny.

Gabriel • July 10, 2006 10:27 AM

I find the author very misleading. He wants you to believe that, even if the NSA were to improve their techniques by a tremendous amount, it would still be no better than randomly selecting people by flipping a coin. (???)

Let's see: If the proportion of terrorists in the general population is .00033%, and you select a subset of them by flipping a coin, then the proportion of terrorists in your subset will be .00033%, just like in the original.

But if the NSA selects people using very sophisticated surveillance, then the proportion of terrorists in their subset will be, say, 50%.

That sounds *much* better than when you flip a coin!

Carlo Graziani • July 10, 2006 10:28 AM

Moshe writes: "Thirdly, if the system does not achieve results, the NSA hierarchy, which does not have an unlimited budget and a surfeit of analysts, will drop the program."

This is so comically off-base that I don't know whether to laugh or cry.

The measure of "success" adopted by people who oversee such programs in government has very little to do with actual performance. It has a great deal more to do with conquering and defending budget.

Since this metric of performance is paramount, program managers have every incentive to talk up the reliability and success and (above all) the future promise of the program, and to bury any doubt that might result from a rigorous and intellectually-honest assessment of the challenges to be overcome.

To see this effect in action in a totally different program, one merely need cast one's eye over the Air Force's "reliability" assessment of the National Missile Defense shield, which has been repeatedly declared a success despite a total inability to discriminate warheads from dummies (a mission requirement), unless the warheads helpfully carry beacons. The point is, of course, that to the Air Force, NMD is a "success" irrespective of whether it could ever shoot down a warhead, because of its budgetary mass and momentum, and because it positions the Air Force to pre-empt the creation of a "Space Force" when (as they confidently expect) space becomes militarized during the course of the century.

And, just as the reliability of NMD is neither here nor there to the Air Force, to imagine that the NSA would let intellectual difficulties stand in the way of growing this kind of program is simply naive.

Most of the problems here come from the fact that he doesn't clearly state what he means by "identified as a terrorist"

As has been said before, this would indicate what kind of a loss function would be necessary to evaluate the cost.

For instance, if "identified as a terrorist" means that as far as they know, that person is *definitely* a terrorist planning horrific attacks and should most definitely be arrested or shot at the earliest convenient time, then *any* misidentification is unacceptable.

However if "identified as a terrorist" means that an automatic computer program has flagged that person for further attention from a person, then that's not so bad at all, and 30,000 false positives (manpower aside) are acceptable.

This is of course assuming nobody actively tries to abuse the system. Say you wanted to watch someone? Sign them up to jihadi mailing lists from an internet cafe, they get flagged, then voila, all the information about their personal life is accessible to a human...

David Thomas • July 10, 2006 10:37 AM

I am greatly disappointed in this article. Both of my main problems with it have been mentioned above, but I will nonetheless repeat them here as some were dismissive of them and I think they may need to be stated more clearly.

First, the coin flip analogy. While it is perfectly fair to say that a 50/50 chance is like flipping a coin, we have to be sure that we are talking about the same things. p = .50 means that picking a person at random from our group of suspected terrorists is equivalent to flipping a coin. It does NOT mean that the program is equivalent to going through the population and flipping that coin to determine suspicion, which seemed to be the implication.

Second, just because perfection is impossible does not mean that any other state arbitrarily close to perfection is impossible. While those numbers would indeed have to be awfully close to 1 and 0 to get meaningful data, the jump to impossibility is unsupported.

It is sad that we get such a cursory job when a well supported one would likely reach the same conclusion. One of the biggest problems is that most assessments I've seen - certainly those posted in this thread - neglect the fact that while we may be curbing one threat (namely terrorism), we may be enabling others which have the potential to affect far more people.

klassobanieras • July 10, 2006 10:41 AM

"As an exercise to the reader, you can use the same analysis to show that data mining is an excellent tool for finding stolen credit cards, or stolen cell phones."

I don't see the difference. Even though the base-rate is presumably higher for credit-card fraud, you'd still get a significant number of false-positives, and the analysis suggests that anything less than p=1 is unacceptable. As if everyone flagged by the system would go straight to jail without further investigation or a court-case.

lukem • July 10, 2006 10:43 AM

So Bruce!

I still can't believe you're endorsing an article that says a hypothetical system that identified 700 out of 1000 terrorists at a cost of 30,000 false positives would be "totally useless"!

Isn't some sort of clarification in order? Your blanket approval of this analysis completely ignores the fact that the feasibility of such a system depends on the false positive rate as well as the false positive cost. For some values of those quantities, such a system would very clearly make sense.

And as other commenters have said, it would be very nice if we could move past this poor analysis and discuss the deeper issues at hand here.

Commando • July 10, 2006 10:46 AM

This guy misses the whole point of data mining. It's not just to 'guess' or, more accurately, lessen the physical number; it's to create connections that in turn can identify patterns or social networks (in this case a terror ops and support cell). We need to lose the high math that obviously only works when you are measuring an analogous situation. Clearly here, we are not. I propose my simple Birds of a Feather Theorem: if a known terrorist eats lunch with Person X, then Person X is likely a terrorist...

Anonymous • July 10, 2006 11:12 AM

I'm curious how many people who are endorsing trading our privacy for this pie-in-the-sky notion of "automatic" surveillance have heard of the base rate fallacy before reading this post, and how many of them could correctly explain it to the layman.

Tim Vail • July 10, 2006 11:14 AM

I think the author was serious when he said that the people misidentified would probably have their lives turned upside down. In other words, for those of you concerned about the cost of false positives -- it is relatively high.

One terrorist is highly unlikely to kill 3,000 people. Whereas putting more than that number of people in prison for life because we think they are terrorists probably does more overall damage to society than the terrorists themselves could have done. And add to that the cost of the system itself...

Last comment -- regarding those who say that this is only to single out people for closer examination -- this "simplistic" computation is assuming all is said and done. Meaning, the NSA already filtered everyone out to the best of their ability, reexamined them again and again, and this is what they wound up with. Any further examination would have to look at completely different criteria from those used by the said exhaustive examination. It is simply unrealistic to assume that the further examination would not look at the same factors, and come to the same conclusion.

This is the difference between a coin flip and security checks. Coin flips have no memory: past results do not affect future results. Security checks, by their nature, tend to look at the same set of data, and therefore are likely to have a sort of "memory" about what happened before.

Carlo Graziani • July 10, 2006 11:18 AM

It is also depressing to see how easily people here dismiss the
false-positive aspect of this problem "because the government doesn't just
dispatch teams of assassins" upon getting an automatically-generated tip
from a system trigger.

This is totally beside the point. The problem with a system that
generates tens of thousands of false positives is that eventually, one of
them will point to some innocent person with enough suspicious connections
(travel to dodgy countries, unsavory friendships, unusual monetary
transactions, used a credit card at a time/place consistent with the
presence of other suspects, etc.) that the investigators will judge him
worth charging.

Anyone who thinks that everyone booked by the FBI and publicly declared a
terrorist is guilty as charged should Google "Richard Jewell", and read up
on the Atlanta Olympic bombing. If Mr. Jewell had found that bomb after
9/11 instead of before, he might be in a Navy Brig today, due process
being what it is these days.

Brian • July 10, 2006 11:20 AM

@klassobanieras

> I don't see the difference. Even though
> the base-rate is presumably higher for
> credit-card fraud, you'd still get a
> significant number of false-positives, and
> the analysis suggests that anything less
> than p=1 is unacceptable. As if everyone
> flagged by the system would go straight
> to jail without further investigation or a
> court-case.

There are two significant differences between data mining for terrorists and data mining for credit card fraud.

1) Credit card fraud is more common, so you are more likely to catch an actual criminal.

2) Weeding out false positives is cheap. You call up the card holder, and ask them about the transactions. The entire process can be automated and takes a couple of minutes.

Distinguishing between a false-positive terrorist detection and the real thing is significantly more difficult. And because the base-rate for terrorism is low, you are going to be wasting a lot of time and money on those false positives.

We cannot afford to be stupid about security. Right now a lot of people seem to assume that the NSA can put a crystal ball on top of a stack of phone bills and find a terrorist. Data mining is not magic. What the NSA is doing is not going to work.

Question for those of you who think the NSA program is a good thing: how many terrorists has it caught? Figure out that number, and that'll give you a good idea of how effective the program really is. Share the number with me and you might even convince me to support the program.

lukem • July 10, 2006 11:30 AM

Brian, I am not saying I think the program is a good thing, I just don't think you can categorically dismiss it without knowing something about the error rates involved.

I'm not exactly sure what I think about such programs, to be honest. I think preventing a terrorist nuclear attack would be a huge deal, but I don't know how to estimate the likelihood of such an attack in the first place, of course, and I don't know what the actual chance of a surveillance system foiling such an attack is either.

One thing that would make me somewhat more willing to consider these systems would be some sort of assurances that they would only be used to prevent large-scale terrorist attacks, as opposed to drug law violations or (just imagine!) copyright infringements.

I think that engineering comprehensive oversight policies would be an extremely productive thing for us to do as soon as possible, before these systems become too deeply entrenched.

Kevin • July 10, 2006 11:46 AM

Anyone want to guess how long before a mass-mailing worm that posts threats against the US government is written? Or, a botnet is used to "plot terror" via IRC by posting false plans?

Even if the system for electronic interception works, it's so vulnerable to flooding it with false positives as to make it useless.

Carlo Graziani • July 10, 2006 11:50 AM

On credit card fraud, it is worth pointing out the fact that the credit card companies have a huge database of fraudulent transactions to use as a training set for their software. They are also looking for clear signatures --- a spike in jewelry purchases on a card that's done nothing but grocery shopping, for example. This allows them to calibrate the (many, many) parameters in the system, as well as to estimate the error rates (by training on part of the data set and testing on the other part).

By contrast, there are very few reliable "signatures" of terrorist communication, at least ones that differ from other social networks. And, only a relatively small number of known terrorist communication trees must be used to calibrate many parameters. It's simply not the same kind of problem as the one the credit card companies have solved.

sidelobe • July 10, 2006 11:51 AM

How does the math change if the goal of the NSA is to identify *one* terrorist rather than *all* terrorists? It seems to me that this is a realistic goal, which would lead to finding additional criminals once you find the first one.

I, too, am against mass surveillance, but you can't use this logic to discredit it.

Lou the troll • July 10, 2006 11:53 AM

Hmm... I think there are several other issues at play with these types of systems. What defines a terrorist? A pattern of communication or an action? If it's a pattern, then we are all on a dangerous slope. One day that pattern is a series of calls to North Korea and then to Explosives-R-Us. A few days later it's a few phone calls to the Sierra Club. The next week it's posting on this website.

Tim Vail • July 10, 2006 12:01 PM

@sidelobe

The base rate for fraudulent transactions is significantly higher, and the misidentification rate lower, when evaluating credit card transactions.

Meaning -- the article supposes that 1 out of 300,000 people is a terrorist. However, the proportion of fraudulent transactions is significantly higher than 1 out of 300,000. As for misidentification rate -- like someone mentioned, credit card companies know the general purchase pattern on each card. They are more likely to be accurate, and coupled with the base rate (frequency of fraudulent transactions being higher) -- they thus have a lower misidentification rate.

To sum up -- the rate of terrorists to everyone else is what makes all the difference in whether this is feasible or not.

I'm with Moshe on this one. It seems to me inconceivable that the NSA is using this program in utter isolation and that none of its output is corroborated with other types of evidence against terror suspects.

It only takes a moment of thought, or a quick peek at this blog, to see that the odds of catching a terror suspect go up if you salt your pool with proven suspects and test the validity of results.

For example, we know that the intelligence services were able to break the Liberty City Cell with HUMINT. Certainly a network graph was established for the associates of that group. False positives? Considering that success, established by first person corroboration, why would they risk their funding for the 'Bayesless' surveillance? I'd more likely suggest that a devious NSA would say that their telephonic surveillance was a part of discovery when it was not.

But the suggestion that someone would try to misdirect the targets of these surveillance activities for political purposes is not as disciplined as the math attending that argument. A moment's reflection would show that it is far easier to rat out a politically motivated misuse of the terror trap than to actually hijack the program. So the very suggestion that outsiders are making would be very easily corroborated by an insider willing to leak. But such corroboration has not been made, and because of that we would have to assume that 100% of the insiders are conspiring to hijack the program. Fat chance.

Anonymous • July 10, 2006 12:06 PM

I don't understand the "I'm against mass surveillance but the article's logic doesn't hold" argument. I think that mass surveillance already has privacy costs we're not willing to pay because a) it's the type of thing that authoritarian governments do to justify other actions that limit liberty and b) even if the government is benevolent, the program may be subject to OTHER abuse.

The article's logic is that even if the government is benevolent AND has a 90% success rate, the lives of 3,000 will have been unduly harmed. Again, imagine being one of the 3,000 and your accuser touts a 90% success rate.

Sorry, I'm one of those idealists who believe that we've lost the war on terrorism if we believe that "9/11 changed everything" including the way we look at our personal privacy and the liberties we take for granted.

Tim Vail • July 10, 2006 12:10 PM

Hmm...I think I made a mistake. Misidentification rate doesn't have to change all that much (even though it is probably better). Suppose there are 300 million transactions -- how many would be fraudulent? Let's guess maybe 30,000 (which is probably more realistic than 1,000 terrorists out of 300 million).

Voilà, you have just multiplied the probability thirtyfold.
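
The thirtyfold claim can be checked with a short Bayes' theorem calculation. The sketch below is illustrative only: the 40% detection rate and 0.01% misidentification rate are assumptions chosen to match the figures quoted elsewhere in this thread, and the function names are made up for the example. Under those assumptions, raising the base rate from 1,000 to 30,000 per 300 million multiplies the odds that a flagged item is genuine by about 30, while the probability itself rises somewhat less, from roughly 1.3% to roughly 29%.

```python
def posterior(base_rate, hit_rate, false_alarm_rate):
    """P(genuine | flagged) from Bayes' theorem."""
    true_pos = base_rate * hit_rate
    false_pos = (1 - base_rate) * false_alarm_rate
    return true_pos / (true_pos + false_pos)


def odds(p):
    """Convert a probability into odds in favour."""
    return p / (1 - p)


POPULATION = 300_000_000
HIT_RATE = 0.40            # assumed detection rate
FALSE_ALARM_RATE = 0.0001  # assumed misidentification rate (0.01%)

# Same detector, two different base rates.
p_terror = posterior(1_000 / POPULATION, HIT_RATE, FALSE_ALARM_RATE)
p_fraud = posterior(30_000 / POPULATION, HIT_RATE, FALSE_ALARM_RATE)

print(f"terrorism base rate: P = {p_terror:.4f}")
print(f"fraud base rate:     P = {p_fraud:.4f}")
print(f"probability ratio:   {p_fraud / p_terror:.1f}")
print(f"odds ratio:          {odds(p_fraud) / odds(p_terror):.1f}")
```

The odds scale almost exactly with the base rate; the probability scales with it only while the probability stays small.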

JakeS • July 10, 2006 12:14 PM

There's a lot of loose language in this thread, starting with Prof Rudmin.

NSA's surveillance is indeed useless for finding terrorists, but that's not its primary purpose. It's very unlikely to find actual terrorists, for reasons noted by the Prof and contributors above. Its purpose is to find people with "links to terrorism" or "terrorist sympathisers" (as the media and politicians often say). Mass surveillance can be very effective at spotting links between people, or activity that can be seen as suspicious. The problem, as also noted, is the frequency of false positives. Whoever thought that each potential positive would be checked by a human is living in dreamland. Far more likely that the bureaucracy simply puts each apparent positive straight onto the no-fly list, the list of people to be harassed if they ever make the mistake of going out of the US and coming back, and so on.

Jean-Charles de Menezes (remember him?) was a false positive.  Surveillance showed that he lived at the same address as someone thought to be a terrorist.  Therefore he was "linked to terrorism".  No human bothered to find out that he lived in a separate apartment at that address.

klassobanieras • July 10, 2006 12:15 PM

@Brian:
> 2) Weeding out false positives is
> cheap. You call up the card holder,
> The entire process can be automated
> and takes a couple of minutes.

I know that the problem (in the terrorism case) is the civil-liberties cost, and I would be extremely uncomfortable about this in the real-world.

But the analysis under discussion contains really silly implicit assumptions about this, and that's what I take issue with. The author requires that p be "very near to one". From that, we can infer that he believes that the cost of investigating a single innocent outweighs the value of finding ~10 terrorists. I don't have any first-hand experience of NSA investigation, but is it really equivalent to the worst efforts of 10 terrorists, focused onto a single person? If so then you guys have much more urgent problems than data-mining.

If the article had been about credit-card fraud instead of terrorism, the author would probably omit any mention of second-level checks, and instead imply that false-positives go straight to debtors prison. And maybe throw in a spurious coin-flipping analogy for good measure ;)
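
One way to read the "~10 terrorists" inference is as a break-even calculation. The sketch below is an assumption-laden illustration, not anything taken from the article: it supposes that investigating a flagged person pays off in expectation when p × V exceeds (1 − p) × C, where V is a hypothetical value placed on finding one terrorist and C a hypothetical cost of investigating one innocent person. Demanding a probability of about 0.9 before acting then amounts to saying that C is about nine times V.

```python
def break_even_probability(cost_per_innocent, value_per_terrorist):
    """Smallest P(terrorist | flagged) at which investigating breaks even.

    Investigating is worthwhile when p * V >= (1 - p) * C,
    which rearranges to p >= C / (C + V).
    """
    return cost_per_innocent / (cost_per_innocent + value_per_terrorist)


def implied_cost_ratio(required_p):
    """Ratio C / V implied by insisting on a given minimum probability."""
    return required_p / (1 - required_p)


# Insisting on p of about 0.9 implies one wrongly investigated innocent
# "costs" roughly as much as nine terrorists found are "worth".
print(implied_cost_ratio(0.9))

# Conversely, accepting p = 0.0132 implies each innocent investigated
# is weighted at only about 1.3% of the value of one terrorist found.
print(implied_cost_ratio(0.0132))
```

Both C and V are value judgments, which is why the same arithmetic can be used to argue either side.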

Brian • July 10, 2006 12:19 PM

There can be no doubt that the NSA's program is backed up by human detective work. In fact, that is the number one reason we ought to be incredibly suspicious of this program. Every time the NSA program spits out a warning, FBI agents (or someone like them) need to go investigate that warning. If the false positive rate is high, then we are wasting valuable human resources on pointless investigations. And there are some fairly basic mathematics that indicate the false positive rate is going to be pretty high.

That kind of program does not help security. Just the opposite: we have a limited number of guardians of our safety, and we want to make sure they are working as effectively as possible. A surveillance system with a high false positive rate means that we are taking resources away from more productive avenues of investigation.

Maybe I'm wrong about this. Maybe the phone logs are correlated with other data that reduce the false positive rate to manageable levels. But I've seen absolutely no evidence of that. The terrorism cases that have been made public have all been the result of old-fashioned detective work, not the magic of data mining.

It took years before Daniel Ellsberg leaked the Pentagon Papers. I hope it doesn't take that long to find out the truth about the NSA's phone surveillance.

Pat Cahalan • July 10, 2006 12:29 PM

Remember, all, that Dr. Rudmin's numbers are exaggerated highly in an attempt to explain to the non-mathematically inclined what the difficulties are with this program.

@ lukem

If you have 30,000 false positives (remember, this is probably a low estimate!) and 400 correctly labelled terrorists (probably a *very* high estimate!), that gives you 30,400 people to investigate. Presumably these 30,400 people look "suspicious" from all of the data we've been able to gather (which is quite a bit!), which means just looking at their call records or bank statements (standard investigative procedures) isn't going to tell us anything new -> these 30,400 people have to be detained, put under HUMINT surveillance, or some other highly invasive investigative technique. That's a huge amount of manpower. We're not talking about giving 30,400 people a fingerprint test and checking them off a list.

Aburt pointed out an example of looking at 10K suspects to find one perpetrator, but that's a flawed comparison because that's 10K suspects to find one perpetrator who has already committed a crime, as opposed to one who hasn't *done* anything yet. In the first case, we can assume a finite amount of time required to find the actual perp... in the second case we're always going to be adding potential terrorists to the list -> the investigation goes on forever. Also, if you have a set of evidence from an existing crime to examine, it's pretty easy to eliminate a large volume of the 10K suspects very quickly (fingerprints don't match, DNA doesn't match, they have an alibi, etc.) We're not investigating a "normal" crime here, we're talking about investigating conspiracy to commit acts of terror, which is a completely different problem.

As some people have pointed out, the time frame is an issue (how many people you tag in a day/year/etc). This is true. And yes, we don't know what the actual costs are of a misidentification (either to the suspect, which certainly can't be discounted, or to the investigative team itself, which needs to expend resources to check the innocent out). We also can't really estimate the cost of unchecked terrorists, because we don't know what the plans are, how likely the plans are to succeed, and what the costs will be if the plans are executed properly. And finally, we can't estimate the cost to civil liberties, because we don't know what the level of imposition is to those incorrectly identified.

Some people are going to estimate those one way, and say this is a feasible technique. Others will estimate another way, and say this is unfeasible. I think that a majority of people who would delve thoroughly into this program *would* come away convinced it is severely flawed, but the ability to examine the program in depth is restricted to a very few people.

Personally, I look at it as a clear violation of just about every standing principle upon which this country is supposedly founded: the right to be presumed innocent, the right against unreasonable search and seizure, other civil liberties, etc. Economic analysis aside, this is not supposed to be something our government should be doing, period.

Kh3m1st • July 10, 2006 12:36 PM

I thought it was a great article. I am against the NSA domestic spying program for many reasons. One of which - the program reminds me too much of the former USSR. The government watching over its people "for their own good". No matter how noble the intention of the government - the old saying "absolute power corrupts absolutely" comes to mind. This program will be used somehow in a way it was not advertised to the US people. The US government has used the threat of terrorists and terrorism to undermine a host of civil liberties in the last few years. And the majority of the US population doesn't mind or care (in my opinion).

No matter how much debate there is over the statistics used to reach the author's conclusions, it does not change the fact that domestic spying is occurring, and your civil liberties are not the same as they were on 9/10/2001.

Bob O. • July 10, 2006 12:37 PM

Another item to consider is that the data mining is not the sole tool being used in the hunt for terrorists. When combined with other intel, it undoubtedly can contribute.

Sorry, but the analysis in the post is both misdirected and misleading.
Let's assume that the numbers as presented are correct - 3000 innocent people are investigated, and 400 terrorists are stopped. This does not constitute a failure but rather a success, since cost of (being investigated) is relatively low (an hour of one's time, or several hours of the NSA's time without telling you, etc - no real harm done) while the benefit of arresting 400 terrorists is high (each terrorist is a potential mass killer).

Plus, arresting 40% of the terrorists is likely to foil terrorist plans by other terrorists (as they will be more fearful of discovery, thus move slower, etc).

The same faulty logic can 'support' avoiding police investigation in general (as the police usually investigate multiple suspects per crime, and thus have a low conviction per investigation ratio). This is just a partisan red herring.

Xellos • July 10, 2006 12:46 PM

--"I just don't think you can categorically dismiss it without knowing something about the error rates involved."

Given the FBI has already publicly complained about the number of wild goose chases the NSA has sent them on, I think we can reasonably posit a rather significant false positive rate...

Brian • July 10, 2006 1:07 PM

@Xellos

Do you have a link for that?

Konstantin Surkov • July 10, 2006 1:17 PM

I can't believe this guy is really a professor. The article is unbelievably lame. And, if NSA can pinpoint a terrorist with "probability p=0.2308" (which means that approximately one of four people caught turns out to actually be a terrorist), it is a great success of the system.

Spaceship • July 10, 2006 1:24 PM

What if the NSA were to use their powerful systems to join in:

* The SETI@Home project?
* The Folding@Home project?

It would be interesting to see how much faster/further we could advance as a society.

klassobanieras • July 10, 2006 1:28 PM

@Spaceship: They'd probably have a higher detection rate with SETI.

quincunx • July 10, 2006 1:30 PM

"I should mention that I'm also dismayed to see an article from LewRockwell.com in this blog, since that site has more than a slight tinge of anti-semitism about it."

What does this have to do with the topic at hand?

The only "anti-semitic" thing about it is not having the US aid Israel or any other place for that matter. The site is anti-war, anti-state, pro-market. If 'semitism' is about channeling vast amounts of money from taxpayers into the Middle East and thereby creating religious tension, then lewrockwell.com is 'anti-semitic' in that sense.

The site has about 100 columnists, some of whom are Christian, some Jewish, some Muslim, and some Atheists, and one Congressman. The site has a very cosmopolitan perspective, with columnists from all over the world. Their only common link is 'anti-war, anti-state, pro-market'.

"Personally, I look at it as a clear violation of just about every standing principle upon which this country is supposedly founded: the right to be presumed innocent, the right against unreasonable search and seizure, other civil liberties, etc. Economic analysis aside, this is not supposed to be something our government should be doing, period."

I agree with you 100%, but this erosion of freedom is nothing new in the American republic. 95% of the activities of the government is not constitutional. The constitution is essentially a dead letter.

Every item in the Bill of Rights now has an * next to it, with a long list of exceptions.

"This does not constitute a failure but rather a success, since cost of (being investigated) is relatively low (an hour of one's time, or several hours of the NSA's time without telling you, etc - no real harm done) while the benefit of arresting 400 terrorists is high (each terrorist is a potential mass killer)."

Interesting. The psychic cost of being investigated to the innocent party is not 'relatively low' from their perspective.

Everyone is a potential mass killer, if driven to it.

Since the US continues to be in Iraq, the number of potential terrorists is constantly increasing. This great NSA program will likely be around for a long time, if not in public, in secret.

"Plus, arresting 40% of the terrorists is likely to foil terrorist plans by other terrorists (as they will be more fearful of discovery, thus move slower, etc)."

Ha ha ha. Yeah right. They will simply use better technology to get around it. The market supplies many ways to communicate - they will simply learn from the mistakes of others.

To think that is the case, one needs to look at the Drug War, and see how no matter how hard the gov tries to crack down, the profit motive keeps the drugs rolling in.

You can also see the CAN-SPAM act, and see how well that has worked. The spammers are shaking in their boots, I'm sure.

"it does not change the fact that domestic spying is occurring, and your civil liberties are not the same as they were on 9/10/2001."

Sorry to say this but they are pretty much the same. You have just as many civil liberties as back then. The only difference is that the gov has a much easier time violating them than before, but it always had the ability to.

Perhaps one needs to look back on the kind of civil liberties we had during WWI & WWII. The court case of "The United States vs. The Spirit of '76 (1776)" is a good example, and shows you precisely when we actually departed from our founding fathers.

lukem • July 10, 2006 1:39 PM

All I am saying is take all the statements from the article like:

"Ergo, NSA's surveillance system is useless for finding terrorists."

and replace them with things like

"Ergo, NSA's surveillance system could catch 700 out of 1,000 terrorists at a cost of 30,000 false positives."

and then we can all have a reasonable discussion about A) whether we believe those numbers, B) at what point (if any) those numbers would justify the operation of such a system, C) etc etc etc.

klassobanieras • July 10, 2006 1:55 PM

Each time the NSA confirmed or discarded a suspect, they could recalculate the set of remaining suspects, taking into account this new information. The list of suspects would likely shrink as individuals were investigated, and the NSA would only need to investigate a subset of the 30,000 initial results.

Disclaimer: I still think it's a terrible idea.

I think the author misses the point. While I don't agree with the NSA program for legal reasons (against the law), I don't see it as the author describes. From what I have read, the purpose isn't to find "a" terrorist, which is what the author's probability analysis attempts to demonstrate. I thought the purpose of the program was to build social network maps. Therefore, the probability analysis should be whether the program is good at identifying a map of X people (where X is a size of a given cell of collaborators) over a given time period of T, where they make Y number of calls to each other. The lower any of the numbers the lower the probability of success. For the brief write-up here, the analysis didn't look at all three factors.

@arl "I find it hard to believe (but not impossible) that the NSA lacks the skills in statistics to not understand the value (or lack thereof) in the use of data mining."

You assume that finding "terrorists" is their goal. Yes they understand all of this and still do it. Now ask yourself what they gain.

I'll be here when you come back from that rabbit hole.

Anonymous • July 10, 2006 2:40 PM

@wondering I've seen several stories about things the US gov does right here. Maybe the implications about the many stories about what they are doing wrong should make you think.

Also as to "what to do". Get a copy of Beyond Fear and Secrets and Lies. Read those. Then you will have the tools to answer those questions. Also those basic tools are covered in many posts here. But I think the info covered in those books could be safely assumed to be base knowledge here. Kind of pointless to repeat the obvious here.

"I think the author misses the point."

I think most of the commenters are missing the argument. READ IT AGAIN. It does not matter _how_ the system is implemented. It can be a guy with a coin, or some super-sophisticated math-model, a wily social-network machine of epic proportions, and all of this followed by a legion of checks and counter-checks.

IT DOESN'T MATTER.

The end result is that if the number of terrorists localized by the process is T and the number of people hassled is N, then "probability of a terrorist given all the data and all the analysis" P(T|data,hypothesis,etc) == T/N. Even wildly optimistic estimates of T and N show that this probability is a small number.

If you flipped a coin, then you'd get 150,000,000 people flagged as terrorists, of whom no more than 1,000 could possibly be terrorists. So it should be obvious that you wouldn't want to flip a coin.

In Bruce's scenarios, he shows that if you have something better than flipping a coin (something MUCH better than flipping a coin) you still don't get much help. The reason that you don't get much help is that you flip the "new and improved (tm)" coin for everybody and since virtually everybody you flip it for is not a terrorist, even a low error rate gives you a huge number of innocent people and not many terrorists. That is, there are better ways of finding terrorists.
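
The T/N argument can be made concrete with a few lines of arithmetic. This is only a sketch: the population of 300 million and the 1,000 terrorists are the figures used in this thread, the 90% detection rate and 0.001% misidentification rate are assumed because they reproduce the p=0.2308 figure quoted above, and the function name is invented for the example. A fair coin is modelled as a test that flags terrorists and innocents alike half the time.

```python
def flagged_counts(population, terrorists, hit_rate, false_alarm_rate):
    """Return (T, N, T/N): terrorists flagged, everyone flagged, their ratio."""
    t = terrorists * hit_rate
    n = t + (population - terrorists) * false_alarm_rate
    return t, n, t / n


POPULATION = 300_000_000
TERRORISTS = 1_000

# A fair coin flags half of each group, so T/N is just the base rate.
coin = flagged_counts(POPULATION, TERRORISTS, 0.5, 0.5)

# A far better "coin": assumed 90% detection, 0.001% misidentification.
better = flagged_counts(POPULATION, TERRORISTS, 0.90, 0.00001)

for name, (t, n, ratio) in (("coin", coin), ("better", better)):
    print(f"{name:>6}: T = {t:,.0f}  N = {n:,.0f}  T/N = {ratio:.6f}")
```

Even the much better test leaves roughly three innocent people flagged for every terrorist, because the innocents it is applied to outnumber the terrorists by 300,000 to 1.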

The scenario I see coming from the administration is that having a single terrorist reach his goal is so awful an outcome that anything (and I mean ANYTHING) is justified. As horrible as 9/11 was, more children died in the US of malnutrition in 2001 than died in the World Trade Center bombing. This should make it clear to everyone that there are many threats and that resources should be allocated appropriately. Security is trade offs.

The cardinality of T and N makes huge political differences. It does matter.

quincunx • July 10, 2006 3:59 PM

"The scenario I see coming from the administration is that having a single terrorist reach his goal is so awful an outcome that anything (and I mean ANYTHING) is justified"

I think there is some lack of thinking here. The action you seem to be justifying is just the sort of thing that CREATES terrorism in the first place, unless you have been brainwashed into thinking that the terrorists were attacking our "freedom".

" As horrible as 9/11 was, more children died in the US of malnutrition in 2001 than died in the World Trade Center bombing. "

As terrible as the war in Iraq is, it has not managed to effectively kill 500,000 children unlike the trade sanctions earlier.

Yes, and the best trade off is ceasing to be a target.

"I'll be here when you come back from that rabbit hole."

Their rabbit hole is their security blanket, without it they would have to actually put some serious thought into the matter.

Davino • July 10, 2006 4:03 PM

Finding 400 terrorists is a high number. If you look through the news of the number of terrorists that we were able to capture and actually prove that they were terrorists, the number is small -- smaller than the number of people we've turned into people we don't want to let loose from Guantanamo. Have we convicted even 40?

Setting the detection threshold in a datamining process is an exercise in tradeoffs -- if we set the level low enough so that we catch all 40 (or 400) terrorists, we'd probably have to investigate everybody. Even if we set the threshold higher so we only get 50% of the terrorists, we'd still probably have to investigate nearly 50% of our population. The ratio between the fraction of the target population we catch and the fraction we have to investigate is 'lift', and I bet that if we're only investigating 30K out of 300M people, or 0.01 percent, we're missing more than 99% of the terrorists, because surely we're not getting significant lift out of this model.

The only possible benefit I see from this program is that it maybe does help to identify people associated to a particular already-known terrorist: You might do a run and get a list of the ten, hundred, or thousand people most likely to be associated with Richard Reid. But his mom and dad and classmates are going to come in pretty high on those lists, and the pizza guy is going to start showing up before it gets too long. How useful are results like "Bush is associated with Saddam because we have pictures of Rumsfeld with both of them"? The program may be of use in already ongoing investigations, as in looking for surprises or more leads, but it is going to be nearly useless for picking the needles out of the haystack.

The problem with the article seems to be based on some unstated assumptions:

1. The "system" (whatever that is) tags people, in a binary fashion, as terrorists or not.

2. Bad things happen to people who are tagged as terrorists by this system, without any intervening activity, investigation, or filtering.

The claim "NSA's surveillance system is useless for finding terrorists" is hyperbole; a more accurate statement would be that someone the surveillance system thinks is a terrorist has a low probability of being one. Whether that's "useless" or not is an entirely separate question that depends on how the system's results are used and what the relative values are of finding terrorists and mistaking ordinary people for terrorists.

It's entirely possible that the answer is the same. But an article that pretends to be scientific should support its claims, not merely assert them based on unstated assumptions. The jump from "low probability" to "useless" deserves as much or more analysis than the basic Bayesian math we're given.

So, out of 300M, the system, applied once, gives 400 positives and 29600 false positives. How do real-life people use Bayes training? They apply it again. Let's apply it again, to the next batch of emails, blogs, phone calls, etc. Then we get this: 3 false positives and 160 terrorists.

That's the practice of the people that know what Bayes is about and know how to apply it to real-life data.

I wonder whether the author is a practicing professional or just a clueless "evangelist".
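
The "apply it again" figures follow only if the second pass is statistically independent of the first. The sketch below is a hedged illustration using this comment's own numbers (400 terrorists and 29,600 innocents flagged) with an assumed 40% detection rate and 0.01% misidentification rate; the function name is invented. It contrasts an independent second pass with the opposite extreme raised earlier in the thread, where the second check looks at the same data and so reaches the same verdict.

```python
def screen(terrorists, innocents, hit_rate, false_alarm_rate):
    """Expected (terrorists, innocents) still flagged after one pass."""
    return terrorists * hit_rate, innocents * false_alarm_rate


# Output of the first pass, as stated in the comment above.
flagged_terrorists, flagged_innocents = 400, 29_600

# Second pass on fresh, independent evidence: same assumed error rates.
independent = screen(flagged_terrorists, flagged_innocents, 0.40, 0.0001)

# Second pass that re-reads the same data: it agrees with the first pass,
# so everyone already flagged is flagged again.
correlated = screen(flagged_terrorists, flagged_innocents, 1.0, 1.0)

print("independent second pass:", independent)
print("correlated second pass: ", correlated)
```

Real second-stage checks presumably fall somewhere between the two extremes. Note also that the independent case keeps only 160 of the original 1,000 terrorists.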

Davino • July 10, 2006 4:42 PM

It is useless in the sense that if you are only going to investigate 30,000 people out of 300,000,000 people, the 99.99% of the people you ignore are probably going to include nearly 99.99% of the terrorists you are trying to find. If there are 1,000 terrorists, the model would be amazing (5 times more powerful than the 2x lift commercial dataminers consider a resounding success) if it caught even 1 of them.
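
The lift arithmetic here is easy to verify. In the sketch below, lift is taken to be the fraction of targets caught divided by the fraction of the population investigated; the function name is an assumption for the example, and the 40% detection figure is the one attributed to the article elsewhere in this thread.

```python
def lift(fraction_of_targets_caught, fraction_of_population_investigated):
    """How many times better than random selection the model performs."""
    return fraction_of_targets_caught / fraction_of_population_investigated


investigated = 30_000 / 300_000_000  # 0.01% of the population

# Catching 1 terrorist in 1,000 while investigating 0.01% of people.
print(lift(1 / 1_000, investigated))

# Catching 400 in 1,000, as the article's scenario assumes.
print(lift(400 / 1_000, investigated))
```

The first case works out to a lift of about 10, five times the 2x benchmark. The second works out to about 4,000, which is the size of the claim being doubted here.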

As Mark Twain said, "Facts are stubborn, but statistics are more pliable". What if they were simply monitoring calls originating in the US to known terrorist telephone numbers overseas? This would significantly increase the base rate, and would be considered illegal surveillance under the FISA statute.

BrianD • July 10, 2006 6:04 PM

Using numbers from the article, pick a person at random. There's a .00033% chance that they are a terrorist. Randomly pick a person flagged by this system, and there's a 1.32% chance that they are a terrorist.

It sounds like a great improvement. But, you're still missing 60% of the terrorists (again, using numbers from the article)! How exactly is this system helping out? It's mostly missing the bad guys, and most of the people it red flags are actually not terrorists.

If it caught 40% of the terrorists with very few false positives, or if it caught damn near 100% of them with bunches of false positives, it might be worth it. But to miss most of them, and then to have a ton of false positives? I really don't see how people can call this anything but an ineffective system.

Oh, wait -- "if it catches even one it's worth it", "if I have nothing to hide why should I care", and "I'm either with you or against you", etc.

I'm not at all concerned with what the Bush administration will do with the information they [illegally] collect. It's not as if they've illegally detained anyone... As for the facts, why let those get in the way of the truth.

Neighborcat July 10, 2006 6:33 PM

Gee, for a bunch of statistics whizzes, you sure are easy to distract. An alleged mathematical proof that an illegal data mining effort targeting US citizens is ineffective at its stated task is completely irrelevant.

What if the US government started searching 1000 homes a day at random in the US because it identifies criminals? I have no doubt such an effort would be effective at its stated task; criminals would be found every day. Why isn't the government doing this? IT'S ILLEGAL UNDER OUR CONSTITUTION. Get it? The efficacy of the method is irrelevant!

Is the ideological link between secret data mining efforts to find undefined evildoers and door-to-door searches too tenuous for you? I'm not surprised; that's why governments work up to revoking citizens' rights real slow and easy. You won't notice a thing. Ask a holocaust survivor.

Anonymous July 10, 2006 7:07 PM

I think that to identify a terrorist, law enforcement must have actionable information. They must have the capability to foil an actual terrorist plan and arrest the culprits. Until this is done, no terrorist has been "identified". They only have "suspects".

What the article shows is that NSA data mining is far from "identifying" terrorists. All they get is a large suspect pool that has a higher probability of containing a terrorist than the population at large. But that probability is still very low in absolute terms.

What is the next step? Do they have the capability of transforming the probabilities into an actual determination of who is a terrorist and who is innocent? Or do they just put the suspects on some "watch and harass" list in the hope this will somehow foil some kind of unidentified plot?

Unless they can reliably turn their suspect set into actual terrorist identification, the program is indeed useless.

Ralph July 10, 2006 7:16 PM

I have seven years' experience in the front line of network security and make the following limited comments from that position:

1. ANY detection system in which 74 out of 75 reported incidents are false positives (98% failure rate) is not practically manageable. By failure rate I mean failure to correctly identify an incident. Think of your spam detection tool having 98% mistakes.

2. Data mining tools are useful when they have known signatures and patterns to search on but are much less so when looking for new attack vectors or the unknown.

3. Data mining tools are powerful if you have highly trained people using them - these people are a very rare and expensive resource. The opportunity cost of having them data mine is very high so you want them using efficient systems (not systems with a 98% failure rate).

If you don't see his point, try reading the maths again, and try not to read more into what he is saying than is there; this is a mathematical observation.

If the NSA says "this person is a terrorist" on the basis of mass surveillance, they have less chance of being correct than a man flipping a coin.

Peter Glaskowsky July 10, 2006 7:21 PM

If I understood you correctly: if there was a burglary, and witnesses say that they saw a suspect leaving in a blue sedan, the police should NOT try to create a list of all the blue sedans in the area, right? There can be 100 blue sedans, so the chance that the owner of any one of them is our criminal is only 1%. So much less than 100%! So close to 0%! Nah, we do not need to try looking at blue sedan owners; it is useless and mathematically impossible.

Nice conclusion.
http://skipole.blogspot.com/2006/07/...

Freddie July 10, 2006 11:05 PM

@Neighborcat:
Right on! Most of the comments here are pretty useless. Let's hear the statistical debate from the point of view of well-known and respected statistics experts, not random readers of this blog. The real issue is that, at least since 9/11, the current administration has shown a clear propensity to disregard the constitution and other laws in its pursuit of criminals. The excuse of a "War on Terror" is total BS - it's just a permanent excuse for a police state. A long precedent of US law provides for procedures for alleged criminals to be investigated, indicted, and tried, while protecting the rights of the innocent.

Yeah you did write something similar previously Bruce and it was just as flawed/rigged as this is.

Where but in the least informed discussions is it suggested that the NSA calls database is used to identify terrorists rather than providing an unrivalled and infinitely useful investigative tool to aid existing investigations by providing an outline of a suspect's personal contact networks?

As the previous poster mentions, MV databases are completely useless in identifying criminals (using the same flawed logic you have) since there is no specific make or model specific to criminals. Yet these records are infinitely useful when suspects are identified via other means.

Now you could bring math and probability into the argument for why the MV databases are useless, or you could just drop the childish straw man argument to begin with.

If we are very generous with you, Bruce, and assume you are not intentionally misrepresenting the issue to obscure the benefits of the program, then your understanding of the subject of NSA surveillance is so low that you have no business writing about it in the first place.

It's about time you put up or shut up.
If you actually believe there is no benefit in knowing who a terrorist suspect is in communication with then say so.
If you actually believe that all disclosures stating that this call contact mapping is what the program consists of are lies then say so and state why.
If you actually believe that the NSA is using this call data as a first resource in identifying terrorists rather than as an information tool to support existing investigations then say so and cite a source which suggests this is the case.

Otherwise all you are doing is ignoring what is known about this program, ignoring the quite clear uses of it and substituting your own idea about how it could be pointlessly used without any supporting basis for assuming this is what is occurring.

Using the same rigged logic I could argue that DNA isn't useful in providing positive identification on an individual when the police line up test tubes in a room and ask a witness to pick one out as a positive ID.

There is nothing suggesting this is what police do with DNA evidence, everything that is written about the use of DNA evidence suggests it isn't and it is blatantly clear that if I used such a childish, ignorant and rigged example of how evidence could be pointlessly used that I was either intentionally misrepresenting the usefulness of such evidence or had NFI what I was writing about and should stop.

This is where you are on probably the biggest security/privacy story of the past 5 years. Wow.

Joe in Australia July 10, 2006 11:37 PM

I think there are several flaws in the assumption that the value of the program lies in identifying terrorists from their phone conversations. Firstly, these calculations assume that terrorists are isolated. My understanding is that they usually have an extended support network. This means that the program should be viewed as a tool for locating terrorist networks, not individuals.

Secondly, it is my understanding that terrorists are risk-averse. They don't want to be discovered; the people supporting them don't want to be discovered. It is possible that an unquantifiable increase in risk will discourage them, even if it is small. We see people making similar decisions with respect to other unquantifiable risks all the time.

Finally, I think the wholesale datamining is probably more useful post-facto: once you have identified a probable terrorist, you can then examine the suspect's record of calls and (depending on what gets stored) their content. The fact that the program is publicly presented as a means of pre-emptively identifying terrorists doesn't mean that that is what its true aim is: the public explanation may have been made to increase morale, or to scare terrorists, or to make the program seem more legal.

Ralph July 11, 2006 1:19 AM

@Tank

You claim the article is flawed but offer no mathematics to refute it. You suggest it might be rigged but also offer nothing to support the accusation.

Data mining for MV crime after it has been committed is not the same as looking for someone you think might commit a crime at an unknown future date.

Please don't use the word "we", because you don't speak for me. If you represent more than yourself, please could you disclose this to other readers.

Dimitris Andrakakis July 11, 2006 2:55 AM

@klassobanieras:

Do you actually expect people to take you seriously with a nick like this?

Matthai July 11, 2006 3:53 AM

The first possibility is that they are paranoid. The second is that they do not target terrorists, but political opponents.

But there is also a third possibility. They are just wasting the money. Or using the money for something else. Look, they have a great job. They can be incompetent and inefficient and they can always hide themselves under "national interest". They won't tell you their success rate and amount of spent money, because that could "endanger national security".

It's a great job, isn't it?

quincunx July 11, 2006 4:48 AM

Good point Matthai, not much attention is given to political empire building, or the general workings of Iron Law of Oligarchy in a Monopoly framework.

The way to get ahead in gov is to build an empire of employees beneath you. If you can just figure out some excuse for doing it, you will. It is also important to waste as much of your budget as possible so that you can claim that more is necessary. Of course a higher budget is necessary anyway since the previous fiscal period was entirely spent on misallocating the market economy and generally creating more problems in society.

In gov, failure is success.

(I need not go into the fact that 'cooking the books' & GNP calculation is nearly the same thing upon close inspection)

Now don't get me wrong, gov can be very successful in a narrow sense, especially when they outlaw competition, but 'catching terrorists' is not something they do as well as 'creating terrorists' (just like they are worse at 'performing useful services, economically' than 'creating fiat money'). Of course having people believe they can is a great excuse for perpetual conflict for perpetual peace.

If some willing people can just take the time to examine some history their teachers glossed over (somewhat having to do with being threatened to be forced out of the teachers' union) - they would realize that this time period we're in sure seems A LOT like other periods, almost to a tee. And if one sees how they play out (and will continue to play out if people continue to believe that society's biggest parasite [look up etymology of 'politics'] is actually its greatest benefactor), they should certainly be skeptical of the optimists & those in denial.

I advise some reading of Man, Economy, & State by Murray Rothbard and Robert Higgs' Crisis & Leviathan & Against Leviathan to any scholar that would like to approach this topic in any socially scientific manner.

The original article by Professor Rudmin looks too narrowly at the issue.

I have written something on the maths of this; however, it would not post well here because of the layout. The complete comment can be found here: http://www.camalg.co.uk/sundry_2006/...

The final textual part of my comment is as follows.

The individual score is very important, and is that aspect that Prof Rudmin has not considered sufficiently. One does not have to consider for arrest and interrogation, every legitimate terrorist suspect. A much more likely policy, for such well-informed organisations as the NSA and the FBI, is that a sorted list of the higher-scoring legit suspects would be produced, with their scores. Valuable and expensive covert (or in some cases overt) investigatory resources could then be allocated to the very highest-ranking legit suspects, as judged cost-beneficial and according to resource availability.

Now, of course some politicians and managers, of the statistically uninformed sort worried about terrorism (and the need to be seen to 'do something') might introduce the odd and serious glitch into this well-understood process. This may well cause investigatory teams to be tasked with futile investigation of the (very likely) innocent. Likewise some law enforcement 'foot soldiers', improperly tasked or insufficiently well trained in the real importance of their work, might find some of the investigatory legwork seemingly pointless.

Now for some very approximate numbers (or perhaps not).

If P(T) is 1/300,000 and investigatory resources are available for 1,000 investigations (of a particular sort and cost), we have no idea (prior to looking at the actual scores from data mining), as to what threshold 'e' should be set. However, we do know that we should consider no more than the top 1,000 candidates.

Then we should consider the scores based on the data mining evidence 'e' (that is the approximate Detection Gain, Watchlist) and also the assumed a priori probability P(T) (which is only known approximately). This is to determine whether the investigation of the least likely individuals on the hot list should actually go ahead. This decision should take into account the cost of the investigation (including the adverse motivational effect of pointless tasking on investigatory staff), together with the level of invasion of privacy and possible infringement on civil liberties (justified through the circumstances and P(T | e) in the least likely case pursued).

Now, of course there are several unknowns in all of this. The a priori probability 'P(T)' is only approximate. Likewise, the Probability Density Functions (PDFs) arising from the target (P(e|T)) and non-target (P(e|~T)) data subsets are only known approximately. [Though note that the PDF of the non-target set is known much more accurately than the PDF of the target set, and this itself is useful in avoiding bad investigatory targeting.] However, it should be quite obvious that targeting the top-ranking of a sorted list (derived according to evidence of some merit) is far better than forgetting the ranking and setting some arbitrary threshold based on very approximate assumptions.

Best regards

Bernhard July 11, 2006 6:11 AM

@Nigel

Aren't you ignoring the fact that the sorted list of top-scoring suspects will be full of false positives?
I cannot see a reason why real terrorists would on average have a higher score than false positives.
Otherwise, the probability of detecting a terrorist would be very close to 1, which is not a realistic assumption.

First, the text file layout was not very good. I've now put up a PDF file, which is a bit better. It's at URL: http://www.camalg.co.uk/sundry_2006/...

@Bernhard

Assuming the data mining algorithms provide any discrimination in favour of the target subset, every entry in the top-scoring few will have a higher probability of being a target than those not in the top-scoring few. Furthermore, the higher in the list, the more likely that person will be a legitimate suspect (even if the actual probability of them really being a terrorist is still low). This is on the basis of the "gain" in discrimination obtained from the data mining.

Consider, for example, a person who has telephoned the number abroad of a known terrorist organisation; this is against the 290+ million persons in the USA who have not phoned this organisation. Do you not think that caller is somewhat more likely to be a legitimate terrorist suspect than everyone else?

Now, there is, of course, the case that any prudent terrorist would not do something so obvious. However, he may have contact by telephone with a less clever acolyte who has, or who did somewhat earlier and did not take the excellent advice to change phone number, address, mobile phone, etc.

Each tiny bit of such evidence helps a tiny bit. If enough tiny bits are put together, cost-effectively by automatic processing, it is of some help.
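One standard way to express "putting tiny bits together" is to multiply the prior odds by a likelihood ratio for each piece of evidence. The sketch below is illustrative only: it assumes the pieces of evidence are conditionally independent, the `combine` helper is hypothetical, and the likelihood ratios are made-up values, not figures from any real system.

```python
import math

def combine(prior, likelihood_ratios):
    """Posterior probability after multiplying the prior odds by each likelihood ratio.

    Assumes the pieces of evidence are conditionally independent (an assumption).
    """
    odds = prior / (1 - prior)
    odds *= math.prod(likelihood_ratios)
    return odds / (1 + odds)

prior = 1 / 300_000
# Five made-up indicators, each assumed to be 3x more common among targets.
weak_evidence = [3.0] * 5

p = combine(prior, weak_evidence)
print(f"prior:     {prior:.5%}")
print(f"posterior: {p:.5%}")
```

With these invented values the posterior is a couple of hundred times the prior and still well below 1%, which is the shape of the disagreement in this thread: the evidence helps, and the absolute probability stays low.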

Best regards

@Brian

Spy Agency Data After Sept. 11 Led F.B.I. to Dead Ends
http://www.nytimes.com/2006/01/17/politics/...

"In the anxious months after the Sept. 11 attacks, the National Security Agency began sending a steady stream of telephone numbers, e-mail addresses and names to the F.B.I. in search of terrorists. The stream soon became a flood, requiring hundreds of agents to check out thousands of tips a month.

But virtually all of them, current and former officials say, led to dead ends or innocent Americans.

F.B.I. officials repeatedly complained to the spy agency, which was collecting much of the data by eavesdropping on some Americans' international communications and conducting computer searches of foreign-related phone and Internet traffic, that the unfiltered information was swamping investigators. Some F.B.I. officials and prosecutors also thought the checks, which sometimes involved interviews by agents, were pointless intrusions on Americans' privacy."

Davino July 11, 2006 8:12 AM

Nigel, setting the threshold from a ranked list makes sense. However, since you're looking at such a small fraction of the population (1000/300,000,000), even obscenely dramatic improvements in the P(T|e)/P(T) gain are still insignificant. A gain of like 2x or 10x (which is an amazing level of success in commercial data mining applications) would net you only 0.007 or 0.03 terrorists, with a false positive rate of 99.9993% or 99.9967% of the 1000 people investigated.
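The expected yield can be worked through as follows. A minimal sketch, assuming Nigel's prior of 1 in 300,000, 1,000 investigations, and that a gain of g simply multiplies the prior (a good approximation while the result stays small); `expected_yield` is a hypothetical helper.

```python
def expected_yield(gain, prior=1 / 300_000, investigated=1_000):
    """Return (expected terrorists found, share of investigations that are false positives)."""
    p_terrorist = gain * prior           # approximate P(T|e) for each person investigated
    return investigated * p_terrorist, 1 - p_terrorist

for gain in (2, 10):
    found, false_positive_share = expected_yield(gain)
    print(f"gain {gain}x: {found:.3f} terrorists expected, "
          f"{false_positive_share:.4%} of investigations are false positives")
```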

The only way this program could be of any use is in producing a list of persons associated with a specific person already under investigation. If we want to devote 1000 more investigations into the associates of Mohammed Atta, we'd run this program, take the top 1000 most associated with him, strike off the ones we already know about, (like Atta supposedly met Saddam, Saddam shook hands with Rumsfeld, and then Bush hired Rumsfeld) and then take a look at the ones that remain.

@Davino

You are of course right, as am I, each in our own particular way.

However, I don't rate your argument on the handshaking; that is unless it is hyperbole. If the latter, that is (I judge) too subtle for many who read here.

The first important part is that the Detection Gain (W) should be very high, to compensate for the fact that the a priori probability is very low. Thus, if the product of them is less than say 1% (my not very informed judgement) then that person is not worth further investigation. This is on the basis that the absolute value of any of the assumed figures is rather poor.
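That 1% rule of thumb implies a minimum Detection Gain. A short sketch, assuming the 1 in 300,000 prior used earlier in this exchange and treating gain times prior as the probability of interest; the names `required_gain` and `worth_investigating` are hypothetical.

```python
prior = 1 / 300_000      # a priori probability P(T)
threshold = 0.01         # the 1% cut-off suggested above

# Smallest gain for which gain * prior reaches the threshold.
required_gain = threshold / prior

def worth_investigating(gain, prior=prior, threshold=threshold):
    """True if the product of Detection Gain and prior reaches the threshold."""
    return gain * prior >= threshold

print(f"required gain: about {required_gain:,.0f}x")
print(worth_investigating(10), worth_investigating(5_000))
```

Under these assumptions the gain has to be around 3,000x before a person clears the 1% bar, far above the 2x to 10x range discussed for commercial data mining.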

The next important point is that those making the more detailed resource allocation and tasking decisions must have some grasp of the numbers and what they mean. If, as is reported just above from the NY Times, numerate judgement has gone (hopefully temporarily), the money, effort and commitment will be wasted.

Going back to Bruce's original posting, and the referenced articles, they are too pessimistic. They are also wrong to the extent that they do not consider the ranking of targets (as I describe) as an aid to resource allocation.

Finally, the 1 in 300,000 is not a particularly good starting point. Add in some Detection Gain (not absolute) concerning sex, age, ethnic background, religion, nationality, education. Then add in another set of Detection Gains, concerning good-guy attributes. One can only do these things where they are known (which is by no means common, and which itself costs). There are problems and dangers. But it's still likely to be worth doing to an appropriate extent, rather than not doing through following inadequate reasoning. Putting numbers in makes it grey. It's not black and white and it never has been, except for the simple-minded.

Best regards

Brian July 11, 2006 9:40 AM

@Nigel

Your point about relative scores assisting targeting is a good one. There is some discussion of the NSA providing rankings to the FBI (scores of 1, 2, and 3). However, those rankings don't really help if the false positive rate is very high for even your highest ranked targets.

The Bush administration has been extremely interested in publicizing any successes in the war on terror. None of those successes have been attributable to the NSA's program. The NSA's program has been going on for years, but it hasn't contributed to the capture of a single terrorist.

From this I conclude that the NSA program has been and continues to be a waste of money and a massive violation of the law without making anybody safer. If the program has in fact been successful, the NSA needs to prove it, both to Congress and the people.

@Neighborcat

"The efficacy of the method is irrelevant!"

In a court of law, you're probably right. In the court of public opinion, it makes a big difference.

1. The most optimistic analyses I've seen of wholesale data mining all ignore the obvious: the enemy, not being fools themselves, can have opted out of the universe being mined by the simple expedient of using communication channels that the NSA cannot examine.

Couriers can travel without their presence being recorded, as passengers in cars or mass transit, or as unregistered passengers on aircraft, trains, or ships. There is no electronic communication here, so monitoring is physically impossible.

Handwritten messages, or electronically recorded messages, can travel by courier, or through the mails, undetected.

Operating entirely outside the sphere of surveillance reduces the base rate to mathematical zero, obviating the entire data mining enterprise's ostensible justification for its existence.

If the terrorists are keeping their terrorist communication outside the sphere, then we are building castles in the air and having discussions about engineering and architectural concerns that simply don't matter.

2. If wholesale data mining is done diligently, it will result in complete failure, for a reason not evident in statistical analyses.

Suppose you were in charge of investigating 30,000 positives a day, and leaving cases open indefinitely was unacceptable. Even if your staff were huge, clearing 30,000 cases a day would put you all in the business of clearing cases, and only that. After the first several cases, they would all start looking alike, and your abilities to make distinctions would extinguish quickly. So, even if there were the rare occasional terrorist among your positives, you would routinely clear his case because routinely clearing cases is all you know how to do.

3. If wholesale data mining is done dishonestly, while it will never turn up a terrorist, it will generate bogus terrorists, keeping up with government demand in their publicity scams.

If the agency investigates 30,000 positives a day, the unofficial standing order would be to pick out the few who would most easily be framed. (With 30,000 random people to pick from, finding the idiots should be no trouble.) Run the picks through kangaroo courts and make sure the press sticks to the party line. Keep reminding the public what a great job the government war on 'terrism' is doing. Meanwhile remember to occasionally put out nonspecific warnings to take no specific actions at no specific time in no specific place.

4. The NSA people involved here are not stupid or innumerate. But they do know where their money comes from and they are willing to play along. It's that, or leave. Those who have left can claim honor. Those who have stayed are criminally responsible.

5. The cost to a filthy-rich government of a single false positive is negligible. The cost to that single false positive can be maximal: it can be the ruination of his life, even his execution without trial.

6. In 1776 the troublemakers in the colonies declared independence, insisting they would not tolerate shabby treatment from somebody named George. What was that line about not learning from history?

@Roy, who wrote: "In 1776 the troublemakers in the colonies declared independence, insisting they would not tolerate shabby treatment from somebody named George. What was that line about not learning from history?"

But they chose someone called George to lead them, and got the French to help (on the sound basis that they, the French, would think 'my enemy's enemy is my friend').

Which just goes to show that arbitrary facts are no help, as well as arbitrary numbers being no help.

Best regards

Anonymous July 11, 2006 12:04 PM

Bruce - I think much of your work is great but on this issue I have to inform you that you're missing a few tricks.

I can't tell you much about what I do, but in my everyday work, I use data mining techniques (admittedly quite unique and specialised ones, but data mining nevertheless) to track down fraudsters, terrorists and other 'organised' criminals. And it works. In fact, it works really well. I don't need to appeal to Bayes' theorem or to any speculation based on completely unrealistic made-up scenarios. I can simply point to the fact that I do it for real day-in, day-out on real live data and it works. It doesn't catch everyone, but it does catch many. And the impact on everyone else is minuscule. We throw away data that doesn't relate to or contain anything suspicious immediately so we don't have to waste more time and money working on it.

One of the many things you've either missed or chosen to ignore is that it is not only information about actual bombers and terrorist cell members that gives useful leads to identify a terrorist plot. There are all sorts of individuals and activities that play a part in enabling acts of terror against innocent citizens. Who sells the materials to these guys on the black market? Answer? Crooks. Greedy people. How do they get money to fund their terrorist acts? For those training camps in Afghanistan and elsewhere? Answer? Drugs, fraud, serious organised crime. Follow the money. Who runs the websites that host manuals on how to build bombs to maximise casualties? Bad guys. None of these people might be classed as 'terrorists' by your simplistic assumptions but I reckon many people would count these illegal activities as 'fair game' in the fight against terrorism. Certainly these activities are not included in any of the numbers you've used. If you tot up the number of people involved in these activities and the number of relationships amongst them, you suddenly find a lot more needles for the same amount of hay.

Also, your implication that data mining only works with known profiles is wrong; unsupervised clustering analysis can detect anomalous behaviours without ever being told what they look like. And your statement that there are no well-described terrorist profiles is plain wrong. There are hundreds of them. I use them every day.

Pat Cahalan July 11, 2006 12:12 PM

@ Boris

Your blue sedan counterexample is seriously rigged. 30,000 or 300,000,000 suspects is functionally equivalent if you have 10 cops. Investigating blue sedans makes sense if there are 100 blue sedans and you have 10 police officers, it makes no sense if there are 1,000,000 blue sedans and you have one cop.

Using the numbers from the example in the link (remembering they're pretty generous) - if you have 1000 cops, that's 30 suspects per cop to investigate. Again, remember that all of the "usual suspect" questions have already been asked (that's the point of the data mining in the first place). You already know there's "something suspicious" about these suspects, so you'd have to figure that investigating these suspects is going to require you to put in some serious work. If it takes 2 weeks to clear a suspect, that's 1,000 cops working full time for 60 weeks, more than a year, to identify the 400 terrorists.

Now, admittedly, that looks like a pretty good deal. But if you have 1,000 cops working full time for more than a year to catch 4 terrorists, that's not so much of a good deal if you can catch 8 of them by having one cop log into a chat room and pretend to be an Islamic militant and do HUMINT. If it takes a month of bugging their phone and following them (something that may require a team of investigators), the tradeoffs start looking really bad quickly...
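The workload estimate is sensitive to how long it takes to clear one suspect. A quick sketch, assuming 30,000 flagged suspects split evenly among 1,000 officers; `clearance_weeks` is a hypothetical helper and the per-suspect times are illustrative inputs.

```python
def clearance_weeks(suspects, officers, weeks_per_suspect):
    """Calendar weeks for every officer to work through an equal share of suspects."""
    return suspects / officers * weeks_per_suspect

for weeks_per_suspect in (1, 2, 4):
    weeks = clearance_weeks(30_000, 1_000, weeks_per_suspect)
    print(f"{weeks_per_suspect} week(s) per suspect: "
          f"{weeks:.0f} weeks, about {weeks / 52:.2f} years")
```

One week per suspect gives roughly three-fifths of a year, two weeks gives more than a year, and a month of surveillance per suspect pushes the effort past two years.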

@Boris

The problem with your blue sedan analogy is that the NSA is examining every blue sedan in the country, not just every one in the neighborhood.

Kevin

Benny July 11, 2006 12:49 PM

@ Anonymous (12:04 pm):

Could you please provide pointers on how to find more information about these successes? It would almost be comforting to me to see evidence that data-mining can work against terrorist networks, that we are not throwing privacy out the window for dubious gains. But it's hard for me to imagine the US government not widely publicizing any such successes to justify their efforts.

@Nigel

I think you missed the point with scoring. Scoring eventually results in a binary choice, you either investigate further or you don't. If you investigate hundreds of thousands of people, then the resources you could have applied in other areas are mostly wasted.

Or perhaps you have three options:

Very high score -- shoot on sight
Low score -- ignore

Davino July 11, 2006 1:46 PM

Kevin Davidson: I agree. And from the original post, the very highest of scores 1-(3,900/300,000,000) under the most fantastic conditions, would give maybe a 23% chance of being right, or 77% chance of shooting an innocent person.

Terrorists' moms and their moms' friends might score surprisingly high by this program.

This seems to be the thread that won't die.

You guys also seem to be overlooking what they do AFTER they've decided that (a given person) is not a threat (begging the question that they ever actually decide someone will be excluded from further 'processing'), what do they do with the information pertaining to him/her? Keep it until something he/she has done IS illegal? Let office workers take it home on a laptop and leave the hard drive sitting on the roof of their car overnight? Sell it?

Anonymous July 11, 2006 3:08 PM

@Benny

I'd rather not say anything more, but let's just say that my work is very unique and specialized and is unencumbered by stuff like math or proofs or anything like that.

Brian July 11, 2006 3:18 PM

@ Anonymous (12:04 pm)

There is no doubt in my mind that it is possible to do data mining for terrorist related activities.

I seriously doubt that trolling through the phone bills of 300 million people is a useful part of that analysis.

"unencumbered by stuff like math"....datamining without math...wow, that's pretty nifty technology

Anonymous July 11, 2006 3:49 PM

That's right, buddy. It's unique and highly specialized and very secret. All your maths and proofs and simplistic assumptions doesn't cut it. It's datamining running on time-tested cliches like "follow the money".

i think this article is absurd. actually his approach to explaining his theory does not make sense to me. it is like saying that nobody is going to win the lotto because the chances are slim:

1/(56*55*54*53*52*51)
"the number is monstrous but not infinite" --borges

but there have been a lot of winners.

-d

@Ralph at July 10, 2006 07:16 PM

>If the NSA says "this person is a terrorist" on the basis of mass surveillance they have less chance of being correct than a man flipping a coin.

Using the original assumptions (40% detected, 0.01% false positives), I calculated the following:

299 969 000 non terrorists correctly identified (99.9897%)
30 000 non terrorists misidentified as terrorists (0.01%)
400 terrorists identified correctly (0.0001%)
600 terrorists missed (0.0002%)

If my calculations are correct, it means that the system is correct 99.9898% of the time.

The statistic compared to the flipping of a coin was:
If identified as a terrorist, are they really a terrorist?
If the system gave 50% for that statistic, wouldn't it mean that only one innocent person would be detected per terrorist detected?
Surely narrowing their search from 300 million to 30 400 to find their 400 terrorists would be a worthwhile step.

> "NSA's surveillance system is useless for finding terrorists."

But useful for just helping narrow down the search?
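
The four counts in this comment can be reproduced with a short sketch. It assumes the figures used in the thread (300 million people, 1,000 terrorists, 40% detection, 0.01% false positives); the function and variable names are illustrative only.

```python
def confusion_counts(population, terrorists, detection_rate, false_positive_rate):
    """Return expected (true_pos, false_neg, false_pos, true_neg) counts."""
    innocents = population - terrorists
    true_pos = terrorists * detection_rate
    false_neg = terrorists - true_pos
    false_pos = innocents * false_positive_rate
    true_neg = innocents - false_pos
    return true_pos, false_neg, false_pos, true_neg


POPULATION = 300_000_000  # assumed, as in the thread
tp, fn, fp, tn = confusion_counts(POPULATION, 1_000, 0.40, 0.0001)

# Two different questions: "how often is the system right overall?"
# versus "if it flags someone, how often is that person a terrorist?"
accuracy = (tp + tn) / POPULATION
precision = tp / (tp + fp)

print(f"flagged: {tp + fp:,.0f} (of whom {tp:,.0f} are terrorists)")
print(f"overall accuracy: {accuracy:.4%}")
print(f"P(terrorist | flagged): {precision:.2%}")
```

Under these assumptions the overall accuracy is about 99.99%, while only about 1.3% of the people flagged are terrorists. Both numbers describe the same system.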

Jon SowdenJuly 11, 2006 11:29 PM

"That's right, buddy. It's unique and highly specialized and very secret. All your maths and proofs and simplistic assumptions doesn't cut it. It's datamining running on time-tested cliches like "follow the money"."

Wow - you've cracked biological computing too! Double wow!

I am the author of the post at "Posted by: Anonymous at July 11, 2006 12:04 PM". I didn't actually mean to post it anonymously: explorer helpfully cleared that form field for me when I wasn't looking. I've no idea who the subsequent "Anonymous" was, but I fear he/she may be taking the mickey.

Of course I cannot point to evidence of these successes and I cannot tell you where to find the technology without revealing who I work for. What I can say is that I do not work for the US government; I work in the UK.

But you don't need to believe *me*; if Bruce and others actually bothered to do some research on Google, they'd find lots and lots of people all successfully doing data mining in this way. I'll give you a clue: social network analysis. And while they were there, they could look up data mining and find out what it was. This would hopefully stop them making gob-smackingly stupid assumptions about how you would use it to look for terrorists and other criminals. The 'maths' presented assumes that data mining techniques just look at every 'indicator' one by one to see whether it's terrorist or not. This is clearly ludicrous and ignorant. The whole point of data mining is to work with relationships amongst multiple entities and statistical relationships amongst multiple indicators. Thus rendering all of this 'maths' about data mining utterly meaningless.

> @Tank You claim the article is flawed but offer no mathematics to refute it.
> Posted by: Ralph at July 11, 2006 01:19 AM

What math?
I said the assumption that this data is used to identify persons as terrorists rather than identify the human networks associated with an identified subject is flawed.
Square that or add 7 if you like but the point you missed was that adding math to a flawed assumption is pointless.

> You suggest it might be rigged but also offer nothing to support the accusation.

Yeah... i did. The problem here is apparently that you didn't read or understand anything I posted before you replied to it.

BTW did you fail to provide math to refute my assertion that DNA is completely useless for identifying criminals because vials of DNA all look the same in a line up (false positive rate) or did you get the point I was making about a rigged argument against the usefulness of data?

> Data mining for MV crime after it has been commited is not the same as
>looking for someone you think might commit a crime at an unknown future date.

Yeah that was my point.
My other one was you'd need to be ignorant or intentionally misleading if upon learning that there was an MV registry which is used frequently by all law enforcement agencies in investigations, you assumed that it was being used for predicting future crimes or identifying potential criminals.

Supporting such a ridiculous assumption with maths, however competently calculated, in no way improves upon the ridiculousness of your assumption.

> Please don't use the word we because you don't speak for me. If you represent
> more than yourself plse could you disclose this to other readers.

Yeah actually i do speak for you since you're not gonna be willing to say you disagree with what i've written.
In fact since i can't imagine anyone will i think i'll stick with the all encompassing "we" as entirely appropriate.

> And your statement that there are no well-described terrorist profiles is plain
> wrong, There are hundreds of them. I use them every day.
> before you write about it again. -- " " @ July 11, 2006 12:04 PM

This is a reoccurring problem.
Given the fact that reporting on suspect and evidence captures in terrorism cases is now worldwide mainstream news, and that there are at least a dozen published works dealing only with analysis of terrorists' motivations, personal accounts and their lives, at some point you have to conclude the ignorance is willful and purposeful.

Hell, places like SITE are now included alongside the NYT in Google News. Exactly how much research could you do on the topic of terrorism and still believe that the best characteristics the NSA has for terrorist profiles are their 7/11 purchases and which phone numbers they dialled? My guess is none.

AnonymousJuly 12, 2006 7:57 AM

Just to summarize:

All of you that use "maths" is stupid and ludicrous but I can't prove any of this because I work for some super secret stuff in the UK. But if you don't believe me, you're stupid and ludicrous too because you don't use Google.

And, bloody hell, that stupid explorer cleared my name out again when I wasn't looking! Someone's attacking me! Perhaps I can find out who really is doing this using my sophisticated data mining techniques that Bruce and everyone can't seem to comprehend the brilliance of.

@tony: your lottery analogy fails because the people who did NOT win the lottery were only out ~$1, not ostracized by their neighbors, put in jail or had their homes and possessions seized.

bob wrote: "... not ostracized by their neighbors, put in jail or had their homes and possessions seized"

That is why I like such a measure as Detection Gain, Watchlist.

It is fairly easy to understand, for example, that the legitimate suspect is thought to be approximately 10,000 times more likely to be a terrorist than the average US citizen (ie around a 97% chance that he is innocent); he needs to be investigated further, prior to any consideration of arrest or search warrants. That indicates how the suspect should be treated, much better than "he's a suspected terrorist, bring him in (dead or alive)".

And remember that the current fuss is about traffic analysis of telephone call logs: it's nowhere near evidence in the sense normally considered in a criminal prosecution.

Best regards
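
The "10,000 times more likely" figure in this comment can be checked with a few lines. This is a minimal sketch that assumes the thread's base rate of 1,000 terrorists in 300 million people and reads the detection gain as a direct multiplier on that probability; the constant names are made up for illustration.

```python
BASE_RATE = 1_000 / 300_000_000   # assumed prior: 1 terrorist per 300,000 people
DETECTION_GAIN = 10_000           # "10,000 times more likely" than the average citizen

# Multiply the prior by the gain, capping at 1.0 so it stays a probability.
p_terrorist = min(1.0, BASE_RATE * DETECTION_GAIN)
p_innocent = 1.0 - p_terrorist

print(f"P(terrorist | on watchlist) = {p_terrorist:.1%}")
print(f"P(innocent  | on watchlist) = {p_innocent:.1%}")
```

With these inputs the suspect has roughly a 3% chance of being a terrorist, which matches the "around 97% chance that he is innocent" in the comment.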

Bruce,

Do you ever respond to these comments? I usually love your site, but this post is extremely bad. My first thought upon reading it was that the 50-50 coin toss analogy was terrible, and in fact a system with one false positive for every true positive would be an excellent system indeed. And sure enough, klassobanieras and others have been hammering this point. Do you have a response?

I guess I find this worrying because I normally respect your judgement. Not to get too personal, but are you letting your feelings against the program cloud your analysis?

johnbJuly 12, 2006 2:14 PM

The comparison to flipping a coin is specious - flipping a coin on 300 million people in the US would misidentify 150 million as terrorists. A detector that was only 23% accurate, but only misidentified 3000 people, as in the example, would be quite useful.

Another example of damn lies and statistics.
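
The comparison in this comment can be made concrete. The sketch below assumes 1,000 terrorists in 300 million people, and picks a 90% detection rate with a 0.001% false-positive rate because those values reproduce the "3,000 misidentified" and "23%" figures; all names are illustrative.

```python
POPULATION = 300_000_000
TERRORISTS = 1_000  # assumed base rate used elsewhere in the thread


def flagged_counts(detection_rate, false_positive_rate):
    """Expected (true_positives, false_positives) for a given detector."""
    true_pos = TERRORISTS * detection_rate
    false_pos = (POPULATION - TERRORISTS) * false_positive_rate
    return true_pos, false_pos


# A coin flip flags terrorists and innocents alike with probability 0.5.
detectors = {
    "coin flip": (0.5, 0.5),
    "90% detection, 0.001% false positives": (0.90, 0.00001),
}

results = {}
for name, (detect, false_pos_rate) in detectors.items():
    tp, fp = flagged_counts(detect, false_pos_rate)
    results[name] = (tp, fp, tp / (tp + fp))
    print(f"{name}: {fp:,.0f} innocents flagged, "
          f"P(terrorist | flagged) = {tp / (tp + fp):.4%}")
```

A coin flipped for everyone leaves the odds at the base rate (about 1 in 300,000), whereas the second detector raises them to about 23% while flagging roughly 3,000 innocents.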

VulturetxJuly 12, 2006 7:52 PM

Wow, all the wrong assumptions, from the original article onward to the commenters.

1. Data mining, when using a seed of "known terrorist(s)", significantly increases the detection rate. Yes, there are more false positives than true positives. It turns out many decision trees have this fault; that does not stop beneficial results from occurring.
2. Data mining is a group of programs run by the NSA. When a hit is corroborated by multiple programs, the possibility of a false positive is lessened.
3. Contrary to extremists like Roy, being tagged as a "terrorist suspect" by the NSA does not even mean investigation, much less the death and imprisonment he claims, since the FBI and other agencies subject these lists to human review.
4. Yes the system has worked, and the NYT has talked about it. They just did not understand the methodology.

5. Congrats you are already the victim of data mining. Usually multiple incident victims, but you keep reading your email and going to websites.

Me -someone who has built the Data Mining collection Clusters.

winsnomoreJuly 12, 2006 9:19 PM

While the good professor doesn't know exactly what criteria NSA uses, he is surely brilliant for proving it can't work!!

Keep digging junk Bruce .. to find "scientific" arguments to agree with you .. what's next, Michael Moore's dissertation on probabilities?

Clive RobinsonJuly 13, 2006 6:41 AM

Having read through the postings, the argument appears to boil down to the probability of finding a lone terrorist before he has committed the act, based on his communications and contacts.

In practice I doubt very much that that is the main aim of most anti-terrorist activities.

The professor is probably correct: you will not find an intelligent lone terrorist by data mining or any other mass surveillance technique; it is just too easy to stay below the noise level. Also, the history of their communications and contacts is not likely to throw up any other terrorists.

Also, the lone terrorist, due to supply difficulties, is not likely to have access to sufficient materials to be "Random Target" active. They are more likely to pick a target such as an aircraft or train where a small explosion will produce a "high value" return. Due to this, the normal surveillance systems are considerably more likely to pick them up.

However, if you think instead about terrorist organisations, you are not dealing with lone individuals. This gives rise to recruitment issues, where a history of communications and contacts will have a high probability of identifying other members of the terrorist organisation.

With a terrorist organisation, the most desirable person to remove is the "Directing Mind", followed by the "Financing Hand", then either the "Supporting Network" or "Recruiting Agents". If these people are removed then the terrorist organisation will at best become dysfunctional, or cease to exist.

The terrorists who commit the actual acts are, as has been seen recently, "expendable bio-mass/DNA" and will have been kept as an isolated group for a significant period of time by the organisation for security. This means that there may well not be sufficient history in the NSA DB for their communications and contacts to be seen.

However if you can identify even one recruiter and work your way back up the command chain to the directing mind, you can then work your way back down the individual paths to quite a large part of the organisation.

The problem is that in an established terrorist organisation the recruiter is likely to know they are a marked person and will use non-conventional communications (say, cut-outs) and contacts back to the rest of the organisation.

This also suggests that the Professor might also be correct, in that you cannot mine data you don't have.

However the next line of attack the security services can take is to follow the financing and purchasing chains. Even terrorists need to eat, sleep and relax, all of which requires the expenditure of money. Unless they are out at a job then they will need to receive the money from somewhere.

Data mining for people with odd financial profiles is going to prove very, very fruitful, not just for finding terrorists but drug dealers, people traffickers and other criminals.

We do not know if the NSA has access to everybody's financial information, but it would seem unlikely that they did not at some level (tax returns etc.) or could not easily obtain it in bulk (after all, large chunks are for sale as a commercial activity and the DHS does have the power to get the information if it so desires).

Also, to commit a serious attack, terrorists need transportation and other materials, most of which can be traced back to a financial transaction, the recording of which is usually beyond their control.

Also, some materials are just not that easy to get hold of in the quantities required, so looking at abnormalities in purchases (or thefts) of certain materials and other items might well give an indication as to an event becoming likely. Likewise with importation and transportation information. Again, we do not know if the NSA has access to these types of records, but again it would seem unlikely that they do not at some level.

So if the NSA can get access to financial, purchase and transportation records as well, then the odds of finding terrorists go up a lot.

As the credit agencies do a lot of financial modelling of US citizens already, a scan through their DBs cross-correlated with even a very large list of possible terrorists will produce significant dividends.

I think that with additional data over and above the communications and contacts, an automated system could be quite effective at finding large numbers of "undesirables", not just terrorists...

If you thought the argument was wrong, you are incorrect. NSA dragnetting is not effective at finding terrists. The probability argument is quite correct. The NSA dragnet pulls in and misidentifies many many many innocents, while locating only a few 'baddies', and the problem of separating those groups still remains. Probably not very easy, given that all target suspects fall into the category, by definition, of what the NSA call 'dodgy'.

The coin-flipping is referring to looking at people who have been selected through NSA, not at looking at random members of the population. It is a confusing presentation to use.

The rest of the argument must be that the cost of invading the privacy and unjustly accusing 30,000 or however-many innocents to find 90 or however-many *potential* criminals is too high.

I think this is a discussion about gaining intelligence concerning people who *may* at some point commit a crime, more than it is about locating people who evidence indicates have committed crimes. If it was the latter, then I think a more directed approach would be taken. Would NSA, FBI etc. even consider this level of atrocity if they had an option of following hard evidence leads? I don't think so.

Either they have a secondary motive, or they have simply mis-judged the appropriateness of this response.

further on the coin flipping thing,

The point of the analogy is that even *if* the NSA dragnet is good enough to make the probability that a dragnetted identity == terrist P = 0.5, the problem is that you have still got a big bunch of people who that P applies to. Go back and look at the numbers used in the examples given, and question which of these hit/miss-rates you think are realistic for an automated system to achieve.

Make up an example for yourself using the hit and miss rates you think are real.

Then add it up like this:

* N(I) number of innocents dragged = population of US (a very large number)

* N(!t) number of innocents believed by NSA to be terrorists = population of US * misidentification rate (still a very large number)

* N(T) number of terrorists dragged = population of US * terrorism rate (a very small number)

* N(t) number of terrorists actually identified as terrorists by NSA (an even smaller number)

* P(T) probability that a person *identified by NSA as a terrist* is *actually* a terrorist = N(t) / [N(t) + N(i)]

Remember that N(i) is much larger than N(t) -- a very small number divided by a very large number, i.e. approximately zero, as explained by the good Professor.

Recall also that N(i) is very large, so sorting through the [N(i) + N(t)] group by hand is not likely to be feasible. And because P(T) is almost zero, any correlation between appearing on the NSA list and being a terrorist is specious.
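
The sum described above can be written out compactly. In this sketch the symbols N (population), r (terrorism rate), h (hit rate) and m (misidentification rate) are introduced here for illustration and do not appear in the comment; N(i) is used for the innocents flagged as terrorists.

```latex
\[
N(t) = N\,r\,h, \qquad N(i) = N\,(1 - r)\,m
\]
\[
P(T) = \frac{N(t)}{N(t) + N(i)} = \frac{r\,h}{r\,h + (1 - r)\,m}
\]
% Example with the rates used elsewhere in the thread
% (r = 1/300{,}000, h = 0.40, m = 0.0001):
\[
P(T) = \frac{1.33 \times 10^{-6}}{1.33 \times 10^{-6} + 1.00 \times 10^{-4}} \approx 0.013
\]
```

The population N cancels, so only the three rates matter: the result stays small whenever m is large compared with r times h.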

The argument given above by some readers that the NSA are nice to people who appear on the list kinda reinforces this argument, rather than weakening it. The NSA *know* that the list is meaningless.

So what is the purpose of the drag-net?

Don't ask simply, what is the purpose of the list? -- that is not necessarily the purpose of the drag-net. In fact, I hope the list is not the purpose of the drag-net, since as pointed out also by the Professor or Bruce, the list does have correlations to activities other than terrorism -- unless the identities are chosen *completely at random*.

So what we end up with is a mass of publicity, a mass of fear toward the state, a mass of fear of terrists, and a list of people who fit some set of criteria which has not been made public.

But what we don't end up with is a useful list of people who have any useable probability of being associated with terrorism.

And what was the cost in financial terms for the technical implementation, let alone the social and personal costs and future political and societal implications? This technology is not run-of-the-mill. It has been purpose-designed, and implemented at great cost, at multiple points in the system, ie multiply the cost by number of installations (it's not deployed widely enough to become cheaper with scale ... unless it is being deployed globally ...)

oops: N(!t) and N(i) should both be either N(!t) or N(i) .. take your pick which symbol I should have used.

AnonymousJuly 16, 2006 11:25 PM

And one more thing:

Don't bring the 'the probability can be further refined by additional research' argument.

The probability assigned is defined as the final probability outcome of the SYSTEM. If you think it can be refined, then assign a better probability in the first place. Doesn't matter. The sums still say you're wrong.

above is me

oh and if you have 1/300,000 terrorists (@Nigel Sedgwick), you have a problem that no dragnet is gonna cure. In a population of 300,000,000 -- 1,000 *terrorists*? Are you kidding me?! I don't see embattled militia fighting street-to-street over there yet. I don't see internal faction wars. Or do I?

I think P(T) is much, much lower. For actual terrorists, you should compare the 9/11 incident, which allegedly took 20 personnel within the US.

And given this has not happened again, it probably means that fewer than 100 people in the US would commit a major act of terrorism if you did nothing (I'm guessing).

And of those 100, how many are truly competent? And are they likely to have the same success as before? If so, why? Because your people are all busy snooping on their neighbours, instead of trying to make the country a nicer place and make its installations less usable for harm?

I think you'd be lucky to find 1,000 actual terrorists worldwide in any given year. What's that .. 1/6,000,000. Put that figure into your NSA dragnet probability calculator and watch the smoke come out.
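
For anyone who wants to try that figure, here is a minimal sketch of such a calculator. The 40% detection rate and 0.01% false-positive rate are assumptions borrowed from earlier in the thread, and the function name is made up for illustration.

```python
def posterior(base_rate, detection_rate, false_positive_rate):
    """P(terrorist | flagged) by Bayes' theorem."""
    hit = base_rate * detection_rate
    false_alarm = (1.0 - base_rate) * false_positive_rate
    return hit / (hit + false_alarm)


# Detection and false-positive rates are assumptions taken from the thread.
DETECTION_RATE = 0.40
FALSE_POSITIVE_RATE = 0.0001

for label, base_rate in [("1 in 300,000", 1 / 300_000),
                         ("1 in 6,000,000", 1 / 6_000_000)]:
    p = posterior(base_rate, DETECTION_RATE, FALSE_POSITIVE_RATE)
    print(f"base rate {label}: P(terrorist | flagged) = {p:.4%}")
```

With a base rate of 1 in 6,000,000 the chance that a flagged person is a terrorist falls from about 1.3% to under 0.07%, given the same detector.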

@Tank: "Where but in the least informed dicussions is it suggested that the NSA calls database is used to identify terrorists rather than providing an unrivalled and infinitely useful investigative tool to aid existing investigations by providing an outline of a suspects personal contact networks ?"

Probably in the FISA court room, I guess. Isn't that the exact type of scenario where a warrant is granted? So tell us, in your infinite expertise, what is the FISA-abortive NSA thing for?

... and many more responses like this .. I do not have the time. Good luck with it, hope you are over it soon America.

Oh and the best use for datamining?

Finding credit card numbers and logins to use to book hotels, cars, flights ... buy drugs, cars, guns, seduce your boss' wife, commit acts of larceny and ... terrorism I guess, in the worst-case scenario.

But we trust our friendly law-enforcement now, don't we, huh? And our government. Trust both implicit and explicit. Yeah, right.

"As terrible as the war in Iraq is, it has not managed to effectively kill 500,000 children unlike the trade sanctions earlier."

Comparing ten years with 2 years? And what about the mutagenic effects of the chemicals and heavy metals and radiological elements sprayed around, which will have the same effect as the ones sprayed about ten years ago did, ie birth defects and odd syndromes.

>> @Tank: "Where but in the least informed dicussions is it suggested that the NSA calls database is used to
>> identify terrorists rather than providing an unrivalled and infinitely useful investigative tool to aid existing
>> investigations by providing an outline of a suspects personal contact networks ?"

> Probably in the FISA court room, I guess. Isn't that the exact type of scenario where a warrant is granted?

Yep. The only thing that should generate a question mark here is how you went from sounding like you had a clue in one sentence....

> So tell us, in your infinite expertise, what is the FISA-abortive NSA thing for?
Posted by: chunkada at July 16, 2006 11:45 PM

....to sounding like you're puzzled by what you just said yourself in the following sentence.

BTW who gives a shit what FISA is doing or not ?
It doesn't factor into the conclusions of this article or my statements about the usefulness of phone contact data for mapping human networks.

Well, there are lies, damn lies, and statistics. This fits into the statistics category. The assumptions here are fundamentally flawed, sorry. You are absolutely correct in that with the overwhelming majority of non-terrorists relative to terrorists, an INITIAL positive "hit" as a terrorist is far more likely to identify a non-terrorist than a terrorist, but with each successive round of testing, the ability to identify a terrorist increases dramatically (and "further investigation" would not mean interrogation, it would mean reading a second email, or more likely, reading a first email as the initial positive hit would be from a computer identifying some anomaly like a key word or strange internet purchase).

As an example: the majority of people who test HIV positive on their first test DO NOT have HIV, but no one says the test is useless, because all of those people then take a second test, and the vast majority of people who test positive multiple times DO have HIV. Once you've gone through one or two rounds of selection, the odds of separating true-positives from false-positives becomes very favorable.
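
The repeated-testing argument can be sketched as a chain of Bayesian updates. This assumes each round is an independent test with the same 40% detection and 0.01% false-positive rates used elsewhere in the thread. Independence is a strong assumption that may not hold if every round mines the same data, and the function names are illustrative.

```python
def bayes_update(prior, detection_rate, false_positive_rate):
    """Posterior probability after one more positive result."""
    hit = prior * detection_rate
    false_alarm = (1.0 - prior) * false_positive_rate
    return hit / (hit + false_alarm)


def after_repeated_positives(prior, detection_rate, false_positive_rate, rounds):
    """Apply bayes_update once per round, assuming independent tests."""
    history = []
    p = prior
    for _ in range(rounds):
        p = bayes_update(p, detection_rate, false_positive_rate)
        history.append(p)
    return history


# Start from the assumed base rate of 1 terrorist per 300,000 people.
history = after_repeated_positives(1 / 300_000, 0.40, 0.0001, rounds=3)
for round_number, p in enumerate(history, start=1):
    print(f"after positive #{round_number}: P = {p:.4%}")
```

Under these assumptions one positive gives about 1.3% and a second independent positive gives about 98%. If the rounds share the same errors, the improvement would be smaller.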

I'm ambivalent about the use of data mining to capture terrorists, but I hate to see the credibility of statistics diminished in the eyes of the public because people without the ability or desire to use it properly try to abuse it to sway public opinion.

Oh, one more thing: Floyd Rudmin, your professor. He is a professor of psychology. Which means, right, he is not an expert on Bayesian analysis. He's just some guy as far as statistics are concerned. I feel it was dishonest to not state that he is a professor in a field not related to statistics because it leads readers to assume his is an expert opinion. It isn't, it's just propaganda.

@Tom, "Oh, one more thing: Floyd Rudmin, your professor. He is a professor of psychology. Which means, right, he is not an expert on Bayesian analysis"
Not psychology as such, but the study of groups within groups would have some merit here.
If you have 50 groups (character, ethic, lifestyle, etc.), and one of the groups is likely to be the type for a terrorist, then an 80% success rate could just be based on the selected persons who were in the neighborhood, where the other 49 groups weren't present in high numbers.
A million-dollar house in a very bad neighborhood: the people that own the million-dollar house might not show up in the result, that type of thing.
