Data Mining for Terrorists

In the post-9/11 world, there’s much focus on connecting the dots. Many believe that data mining is the crystal ball that will enable us to uncover future terrorist plots. But even in the most wildly optimistic projections, data mining isn’t tenable for that purpose. We’re not trading privacy for security; we’re giving up privacy and getting no security in return.

Most people first learned about data mining in November 2002, when news broke about a massive government data mining program called Total Information Awareness. The basic idea was as audacious as it was repellent: suck up as much data as possible about everyone, sift through it with massive computers, and investigate patterns that might indicate terrorist plots. Americans across the political spectrum denounced the program, and in September 2003, Congress eliminated its funding and closed its offices.

But TIA didn’t die. According to The National Journal, it just changed its name and moved inside the Defense Department.

This shouldn’t be a surprise. In May 2004, the General Accounting Office published a report that listed 122 different federal government data mining programs that used people’s personal information. This list didn’t include classified programs, like the NSA’s eavesdropping effort, or state-run programs like MATRIX.

The promise of data mining is compelling, and convinces many. But it’s wrong. We’re not going to find terrorist plots through systems like this, and we’re going to waste valuable resources chasing down false alarms. To understand why, we have to look at the economics of the system.

Security is always a trade-off, and for a system to be worthwhile, the advantages have to be greater than the disadvantages. A national security data mining program is going to find some percentage of real attacks, and some percentage of false alarms. If the benefits of finding and stopping those attacks outweigh the cost—in money, liberties, etc.—then the system is a good one. If not, then you’d be better off spending that cost elsewhere.

Data mining works best when there’s a well-defined profile you’re searching for, a reasonable number of attacks per year, and a low cost of false alarms. Credit card fraud is one of data mining’s success stories: all credit card companies data mine their transaction databases, looking for spending patterns that indicate a stolen card. Many credit card thieves share a pattern—purchase expensive luxury goods, purchase things that can be easily fenced, etc.—and data mining systems can minimize the losses in many cases by shutting down the card. In addition, the cost of false alarms is only a phone call to the cardholder asking him to verify a couple of purchases. The cardholders don’t even resent these phone calls—as long as they’re infrequent—so the cost is just a few minutes of operator time.

Terrorist plots are different. There is no well-defined profile, and attacks are very rare. Taken together, these facts mean that data mining systems won’t uncover any terrorist plots until they are very accurate, and that even very accurate systems will be so flooded with false alarms that they will be useless.

All data mining systems fail in two different ways: false positives and false negatives. A false positive is when the system identifies a terrorist plot that really isn’t one. A false negative is when the system misses an actual terrorist plot. Depending on how you “tune” your detection algorithms, you can err on one side or the other: you can increase the number of false positives to ensure that you are less likely to miss an actual terrorist plot, or you can reduce the number of false positives at the expense of missing terrorist plots.
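The tuning trade-off can be illustrated with a minimal sketch. The suspicion scores below are made up purely for illustration, and `count_errors` is a hypothetical helper, not part of any real detection system. Raising the alarm threshold removes false positives but adds false negatives, and lowering it does the reverse.

```python
# Hypothetical suspicion scores (0 to 1) produced by some detector.
# The values are invented to illustrate the trade-off, not real data.
innocent_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80]
plot_scores = [0.40, 0.65, 0.75, 0.90]


def count_errors(threshold):
    """Flag everything scoring at or above the threshold; count both error types."""
    false_positives = sum(score >= threshold for score in innocent_scores)
    false_negatives = sum(score < threshold for score in plot_scores)
    return false_positives, false_negatives


for threshold in (0.3, 0.5, 0.7, 0.85):
    fp, fn = count_errors(threshold)
    print(f"threshold {threshold:.2f}: {fp} false positives, {fn} false negatives")
```

With these invented scores, no threshold gets both error counts to zero, because the innocent and plot scores overlap. Only a sharper profile, meaning better-separated scores, would allow that.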

To reduce both those numbers, you need a well-defined profile. And that’s a problem when it comes to terrorism. In hindsight, it was really easy to connect the 9/11 dots and point to the warning signs, but it’s much harder before the fact. Certainly, there are common warning signs that many terrorist plots share, but each is unique, as well. The better you can define what you’re looking for, the better your results will be. Data mining for terrorist plots is going to be sloppy, and it’s going to be hard to find anything useful.

Data mining is like searching for a needle in a haystack. There are 900 million credit cards in circulation in the United States. According to the FTC September 2003 Identity Theft Survey Report, about 1% (10 million) cards are stolen and fraudulently used each year. Terrorism is different. There are trillions of connections between people and events—things that the data mining system will have to “look at”—and very few plots. This rarity makes even accurate identification systems useless.

Let’s look at some numbers. We’ll be optimistic. We’ll assume the system has a 1 in 100 false positive rate (99% accurate), and a 1 in 1,000 false negative rate (99.9% accurate).

Assume one trillion possible indicators to sift through: that’s about ten events—e-mails, phone calls, purchases, web surfings, whatever—per person in the U.S. per day. Also assume that 10 of them are actually terrorists plotting.

This unrealistically-accurate system will generate one billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999% and you’re still chasing 2,750 false alarms per day—but that will inevitably raise your false negatives, and you’re going to miss some of those ten real plots.
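The arithmetic behind those figures can be checked with a short sketch. The function and variable names are illustrative. The inputs are the assumptions stated above: one trillion indicators per year, ten real plots, a 1 in 100 false positive rate, and a 1 in 1,000 false negative rate.

```python
# Back-of-the-envelope check of the numbers above.
# All inputs are the essay's stated assumptions, not measured data.

def false_alarm_stats(indicators, real_plots, fp_rate, fn_rate):
    """Return expected yearly and daily false alarms, plots detected,
    and false alarms per detected plot."""
    innocent = indicators - real_plots
    false_alarms = innocent * fp_rate
    detected = real_plots * (1 - fn_rate)
    return {
        "false_alarms_per_year": false_alarms,
        "false_alarms_per_day": false_alarms / 365,
        "plots_detected": detected,
        "false_alarms_per_detected_plot": false_alarms / detected,
    }


base = false_alarm_stats(1e12, 10, fp_rate=1 / 100, fn_rate=1 / 1000)
strict = false_alarm_stats(1e12, 10, fp_rate=1e-6, fn_rate=1 / 1000)

for name, stats in (("99% accurate", base), ("99.9999% accurate", strict)):
    print(name)
    for key, value in stats.items():
        print(f"  {key}: {value:,.0f}")
```

Under these assumptions the sketch gives roughly 27 million false alarms per day at the 1 in 100 rate, about a billion per detected plot, and on the order of 2,700 per day at the one-in-a-million rate, in line with the figures above. It holds the false negative rate fixed, which is generous: in practice, tightening the false positive rate tends to raise it.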

This isn’t anything new. In statistics, it’s called the “base rate fallacy,” and it applies in other domains as well. For example, even highly accurate medical tests are useless as diagnostic tools if the incidence of the disease is rare in the general population. Terrorist attacks are also rare, so any “test” is going to result in an endless stream of false alarms.
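The medical-test version of the base rate effect follows directly from Bayes' theorem. Here is a minimal sketch, assuming a hypothetical test that is 99% sensitive and 99% specific; the prevalence figures are illustrative and not drawn from any real disease.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(condition | positive test), computed with Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Same hypothetical test, two different base rates.
common = positive_predictive_value(prevalence=0.10, sensitivity=0.99, specificity=0.99)
rare = positive_predictive_value(prevalence=0.0001, sensitivity=0.99, specificity=0.99)

print(f"Condition in 1 of 10 people:     {common:.1%} of positives are real")
print(f"Condition in 1 of 10,000 people: {rare:.2%} of positives are real")
```

With the same test, roughly 92% of positives are real when one person in ten has the condition, but only about 1% are real when one person in ten thousand does. The accuracy of the test has not changed; only the base rate has.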

This is exactly the sort of thing we saw with the NSA’s eavesdropping program: the New York Times reported that the computers spat out thousands of tips per month. Every one of them turned out to be a false alarm.

And the cost was enormous: not just the cost of the FBI agents running around chasing dead-end leads instead of doing things that might actually make us safer, but also the cost in civil liberties. The fundamental freedoms that make our country the envy of the world are valuable, and not something that we should throw away lightly.

Data mining can work. It helps Visa keep the costs of fraud down, just as it helps Amazon.com show me books that I might want to buy, and Google show me advertising I’m more likely to be interested in. But these are all instances where the cost of false positives is low—a phone call from a Visa operator, or an uninteresting ad—and in systems that have value even if there is a high number of false negatives.

Finding terrorism plots is not a problem that lends itself to data mining. It’s a needle-in-a-haystack problem, and throwing more hay on the pile doesn’t make that problem any easier. We’d be far better off putting people in charge of investigating potential plots and letting them direct the computers, instead of putting the computers in charge and letting them decide who should be investigated.

This essay originally appeared on Wired.com.

Tags: banking, base rate, credit cards, data mining, Department of Defense, essays, false negatives, false positives, FBI, FTC, national security policy, NSA, privacy, profiling, surveillance, terrorism, Total Information Awareness

Posted on March 9, 2006 at 7:44 AM • 93 Comments

Comments

JB • March 9, 2006 8:31 AM

Bruce, your insight is admirable. Keep up the good work.

roy • March 9, 2006 8:54 AM

Nice work, but the people who need to understand this will refuse to listen.

Stephane • March 9, 2006 8:56 AM

Thanks to data mining, you don’t even need terrorists to spread terror among people.. 🙁

Anonymous • March 9, 2006 8:58 AM

[Public health nit pick]
“For example, even highly accurate medical tests are useless as diagnostic tools if the incidence of the disease is rare in the general population.”

Actually, one needs to distinguish between screening the general population for rare diseases and diagnosing them. You screen with a test or procedure that yields low false negatives (with an acceptable rate of true positives), and then having “concentrated” the high risk population, you can diagnose with a more complex procedure with low false positive rate because you’re now using a high prevalence population. The two step procedure lets you optimize at both points.

And if you get too many false positives at step one, you change how you screen because it’s costing you and everyone else too much. Health management folks do this sort of cost-benefit stuff all the time, as do folks in other fields.

The problem with terror plot data mining is that the screening step is essentially useless, as you correctly point out, so the “diagnostic” energy of the investigation step is totally wasted.

[/nit pick]
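The two-step procedure described in this comment can be sketched numerically. Every rate below is a hypothetical chosen for illustration, and `screen` is an illustrative helper, not a real epidemiological tool. A cheap first test with few false negatives concentrates the population, and a stricter second test is then applied only to those flagged.

```python
def screen(population, prevalence, sensitivity, specificity):
    """Apply one test; return (number flagged, prevalence among the flagged)."""
    sick = population * prevalence
    healthy = population - sick
    true_pos = sick * sensitivity
    false_pos = healthy * (1 - specificity)
    flagged = true_pos + false_pos
    return flagged, true_pos / flagged


# Stage 1: cheap screen tuned for few false negatives (hypothetical rates).
flagged1, prevalence1 = screen(1_000_000, 0.001, sensitivity=0.99, specificity=0.90)
# Stage 2: costlier test with few false positives, run only on those flagged.
flagged2, prevalence2 = screen(flagged1, prevalence1, sensitivity=0.95, specificity=0.999)

print(f"after stage 1: {flagged1:,.0f} flagged, {prevalence1:.2%} truly positive")
print(f"after stage 2: {flagged2:,.0f} flagged, {prevalence2:.2%} truly positive")
```

With these made-up rates, the first stage raises prevalence from 0.1% to about 1%, and the second raises it to roughly 90%. The approach depends on the first stage concentrating the population enough to make the second stage affordable, which is the step the comment argues fails for terror plots.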

Put an HMO exec in charge of the terrorist hunting process — the data mining will stop instantly!

Derob • March 9, 2006 9:09 AM

Bruce, though I agree with your general conclusion that false positive and negative rates will make a data mining application very difficult, and I also think your government does not always apply the smartest tactics, it does depend a little on how the mining is set up. For instance, if it is used to explore social networks of people associated with terrorism and to watch events in these networks, the process will become easier. So if a specific person or object is coupled to a general pattern or profile, the haystack might become a bit more transparent and better structured. However, it seems that it is not always done like that.

Tank • March 9, 2006 9:16 AM

Quote: “Data mining is like searching for a needle in a haystack. There are 900 million credit cards in circulation in the United States. According to the FTC September 2003 Identity Theft Survey Report, about 1% (10 million) cards are stolen and fraudulently used each year. Terrorism is different. There are trillions of connections between people and events — things that the data mining system will have to “look at” — and very few plots. This rarity makes even accurate identification systems useless.”

Your example’s a bit rigged.
Apart from the fact you should be comparing 900m cards vs 250m people (or work into your evaluation of cc data mining the fact that cards are used more than once) there aren’t trillions of behaviour characteristics in al Qaeda training manuals.

The number of people who have records of overseas travel to Pakistan and have reported their passports lost/stolen is a lot smaller.

Likewise I’d love to know the % of your readership who know what the inside of a chemical supplier looks like. Round about 0% I’m guessing. That pool seems pretty small too.

And what I wonder will those recommended humans rather than machines decide to do with a phone captured in a war zone on a guerrilla enemy affiliated with secretive organizations ?

Today in 2006 WHAT clearer investigative path is there in tracking down what human networks a person is connected to than who they have in their phone logs ?

You know what the machines (programmed by the humans, mind you) have decided to do. Tell us the kind of human you hope is in the investigative loop (rather than the janitorial loop) who doesn’t do the exact same thing.

I’m assuming you won’t opt for the old beat the connections out of the guy method and turning him loose is really going to screw up your security cost vs benefit scale so I’m struggling to think of what could be more prudently done with the leads available.

Well, apart from scaling back the military to Fijian size so that they don’t capture so many people and leads all at once leading to the flood of leads you refer to. Again, really not improving the security cost vs benefit scale.

Let me guess, you got nothing.

Tank • March 9, 2006 9:25 AM

“Thanks to data mining, you don’t even need terrorists to spread terror among people.. :-(”

No the paranoid irrationally terrorfied demographic seems to take care of itself in that regard.

Seriously, was the 0 people complaining of how hard done by they were by this just enough to push you over the edge and board up your windows ?

Are you right now cowering in uncontrollable fear that you too could be one of those zero people ?

Because if you actually are terrorfied by this then you’re the biggest factor in that result, not that visit from the FBI… that you didn’t get.

Clive Robinson • March 9, 2006 9:25 AM

Bruce, I think you have missed the point when it comes to Government Data Mining.

In the UK we have a Government that likes to spend money, lots and lots of it, and actually get little in return (see stuff about NHS spending doubling and actual performance increases in single digits).

We have been told that things like ID cards etc are for Anti Terror / Drugs / Sex criminals etc etc. In reality all these systems end up being used as Tax Raising systems.

Why? Well, the Tax take from Large Organisations is falling as they offshore and virtualise their companies (ICT makes this much much simpler). This leaves the ordinary man on the street and small organisations.

In the UK we have had a series of “Catch the Benefit Cheat” help lines and adverts; in practice they have cost more than they have saved. We are now getting the “Catch the Tax Avoider” adverts where they have a bloke pretending to be a builder / electrician who works for cash, does a bad job and runs away knowing that you cannot catch him.

When you remove the window dressing it boils down to “we want to stop you using people who don’t charge you our outrageous tax”. To back this up a series of laws were introduced a year ago making it an offence to have any electrical work carried out in your home unless it was by an electrician who was registered with the Government…

The reason for Government Data Mining is to find people who might not be paying Tax, when you know you can find them then you can start hiking personal taxation….

derf • March 9, 2006 10:04 AM

There are an estimated 10 million illegal immigrants in the USA from Mexico. If the data mining can’t find them, how will it find the 100 that may be up to terrorist mischief?

It’s too easy to dodge data mining operations anyway. If you use multiple fake identification cards and addresses, along with cash, any data mining system based on computer records would be hard pressed to find you.

Dave H • March 9, 2006 10:04 AM

Data Mining vs. targeted data surveillance: as many point out, when there is evidence from OTHER sources that raises the probability of an attack within a population, the statistics change. But Bruce isn’t arguing against this kind of targeted surveillance, just the opposite. His point is that data mining as the STARTING point is ridiculous, for the statistical reasons pointed out. And of course, if you have such evidence, it isn’t mining anymore…

dlthomas • March 9, 2006 10:23 AM

Tank – your example has little to do with data mining. Numbers extracted from a phone on the battlefield, I would guess, meet probable cause for connections to terrorism as required by FISA, and you can get a warrant and listen. With provisions of the USA PATRIOT act, that warrant applies to any phone of the individual in any locality without getting another court order – it just doesn’t transfer automatically to other people.

Doug • March 9, 2006 10:30 AM

While you have a point, you must realize that data-mining is an up-and-coming field. I work for a company that does a lot of data-mining, and we are developing algorithms to understand language, among other things. While data-mining may not be the answer now, as it evolves, it may become the best solution in years to come.

stacy • March 9, 2006 10:47 AM

But if they are going to do it anyway…
http://www.uclick.com/client/wpc/nq/2006/02/18/index.html
🙂

Randy • March 9, 2006 11:02 AM

“Man can’t fly”… 150 years ago, this was a common statement, but “we” continued to try. Today, we can go from the east to west coast in a couple of hours. If the people were to have listened to you back then, we would have given up and still be spending months on the back of a Jackass.

Just because there are some flaws in design and execution today, doesn’t mean we shouldn’t try and prevent Terrorist’s activity.

paul • March 9, 2006 11:07 AM

One area where data mining can be useful, it seems, is to develop interesting information about people we already know to be bad actors. Some of that can be fed back to identify new folks. But in the initial cases the profile is absolutely as specific as one could ask, since the subjects are known to be terrorists…

Phil • March 9, 2006 11:45 AM

Randy:

I want to stop terrorists. I don’t want to waste money and time on programs that don’t do anything to actually stop them, but might look like they do.

Most data mining is a waste of manpower and money. Let’s put that towards our police and firemen, and our covert operatives. THAT will find terrorists a lot more efficiently than any computer analyzed matrix of “things most people don’t know”.

Prohias • March 9, 2006 11:46 AM

“If the people were to have listened to you back then, we would have given up and still be spending months on the back of a Jackass”

Sure we can fly from east to west coast in the matter of a few hours. Yet as far as I can tell, we are still on the back of a jackass. 5 1/2 years and counting.

Joe Buck • March 9, 2006 11:52 AM

Data mining is good for tracking the plans and actions of people trying to overthrow the government — at least if those people are the opposition party, or political demonstrators. That’s because there are a lot of them, and they operate mostly in the open. Since the senders and receivers of e-mail can be obtained without a warrant, and since travel is tracked, the party in power can use data mining to assure that it remains in power, by disrupting the organizing efforts of the opposition.

Jonathan • March 9, 2006 12:53 PM

Mr. Schneier – While I agree with your overall analysis of the problems associated with data mining, I think there are two relevant issues that you missed, one of which Tank already addressed.

The other is that the statement “There is no well-defined profile, and attacks are very rare” is almost certainly inaccurate. The proper statement is “… successful attacks are very rare.” Your statement excludes the possibility that some attacks are foiled, and thus reaches the conclusion that we only know about terrorist attacks what we can determine about the successful ones after the fact. You can certainly defend that position if you so desire, but I doubt that even the most skeptical truly believes that the billions spent on intelligence and law enforcement are wholly wasted.

While one might attempt to make an argument that successful attacks are fundamentally different from unsuccessful ones, there is at least some evidence to the contrary. For one thing, if you’re going to blow things up you either have to purchase explosives or things that can become explosives. Likewise, terrorists (of recent notoriety, anyway) predominantly come from or have ties to certain areas of the world. As you point out, applying such criteria to your data mining effort will increase the false negative rate; a system with these assumptions doesn’t catch the IRA members or Timothy McVeigh.

Larry • March 9, 2006 1:41 PM

Respectfully, I think you are missing the point behind data mining. It is not to connect the dots or to nail a specific individual. That would be grossly unrealistic. The point behind data mining is to find a few more dots that have a higher probability of being “interesting” than the rest. Data mining is just a tool, a first cut at sifting the wheat from the chaff. Human follow-up and analysis is what would connect the dots.

The “flaw” is not in the data mining but in the quality of the subsequent analysis. …or lack thereof. The solution is to provide investigators with higher quality leads.

kashmarek • March 9, 2006 2:00 PM

The data mining is not there to detect terrorists, although if it did, that would be a nice benefit. Like the ports deal, where economic (business) interests trump security, the data mining is there to give businesses information about you (and the current administration thinks that is good for them too). Terrorism is just the cover story. 1984 is the story being covered up.

Uche • March 9, 2006 2:13 PM

Does anyone have the slightest clue what Tank is going on about? I get the impression he disagrees with Bruce and at least one of the commenters, but I can’t really understand the substance of what he’s saying.

Former Zoomie • March 9, 2006 2:36 PM

When the Berlin Wall “came down” on Nov 9, 1989, HQ USAFE completely failed to anticipate and recognize what was happening. The entire command immediately went to yellow alert and a 12-hour shift military exercise posture while the scary, deadly Trabis full of tired-looking Eastern Bloc families chugged and stumbled across the opened Hungarian-Austrian border. Some “invasion”!

The reason this happened this way was because HQ USAFE Intel were not looking at unclassified information that would have clearly shown what was happening. As the administrator of their profiling system, I had been asked to route unclassified FBIS translations of Eastern Bloc printed and electronic media into a trash queue. (The attitude was that if it’s unclassified, clearly it’s worthless!)

I used to read the contents of the queue (Pravda, et al.) for my own entertainment and as such had a unique front row seat for events as they developed in Eastern Europe. Several months before the “fall” of the Wall, I went to the user responsible for analyzing this sort of thing and tried to call her attention to some really crazy stuff that I was reading, and suggested that analysts start looking very seriously at the stuff being printed in Eastern Bloc media. I got blown off, and the FBIS messages continued to be deleted in favor of classified (and in the end, useless) data.

In a word – you are right. It’s not the data – it’s the people looking at the data, and how willing they are to look at what they already have. They simply do not need more data. Sing along, you know the words: the way to find a needle in a haystack is not to pile on more hay – and I can tell you, they’ve already got more than they can reasonably look at.

Half • March 9, 2006 2:37 PM

Bruce,

The facts you lay out are so well known, and the conclusion so incontrovertible that perhaps another consideration should be introduced, in light of the evident desire of politicians to pursue these programs regardless:

It must be a mistake to assume that data mining programs are intended to find and foil terrorist plots.

Maybe these programs have other purposes, for which they’re admirably suited.

Anon1 • March 9, 2006 2:41 PM

I have been doing data mining for a variety of organizations for several years now. For the last 3 years I have spent my time working for the military developing tools to detect computers which have been exploited. I am not talking about machines that are part of bot-nets (SPAM, DDoS, etc.); these are easy to detect and hopefully the academics will figure out that we don’t need another thesis on finding these machines. What I am talking about are machines that have been specifically targeted by a (group of) skilled hacker(s). What we are trying to do is find several machines that have been exploited out of the many computers/devices on the network. Because of the orders of magnitude this is truly trying to find a few needles in large haystacks.

Tank had a good point and I think Jonathan hinted at one.

Tank’s comment about your numbers being unfair is correct. With the credit cards one is trying to find the cards which are being used for fraud. With terrorists we are trying to find the terrorist, not the connection that leads to the indication that a person is a potential terrorist. To make your article fair you would have to change the number of credit cards to the number of credit card transactions to compare with the number of potential terrorists interactions that need to be scanned. Or what would be less accurate is to change the number of interactions of people to the number of people being monitored.

Jonathan had a point about profiles. With a competent cracker using a computer to exploit other machines on the network, there is very little common pattern between crackers, or even between attacks from the same cracker. This is a problem which the data mining community has not addressed very well. As you were saying, the standard way to do data mining is to set up a profile and look for that. Techniques have been developed that get around the problem of needing a profile. I will not elaborate on the techniques here, but will say that they are being used on production networks, and are very successful.

I believe that this renders the basic premise of the article flawed. I will concede that the report that the NSA is kicking out thousands of people of interest to the FBI is not a good thing, as it is a waste of resources. But this becomes difficult to analyze because the intelligence community breaks operations up into sections and when they have a success they do not talk about it. So I would suspect that we are only getting part of the picture.

I am claiming that data mining is a powerful methodology that with proper techniques can be brought to bear on this problem. I state this because I have seen it used in practice on a similar problem (rare class problem), and the technique has worked well. Whether the way that the NSA is currently doing operations is helping at all, I don’t think that we can judge, but we can say that what we see of the program is not so great. I don’t believe that we can evaluate whether or not this system is worth the money and civil liberties that we are giving up.

another_bruce • March 9, 2006 2:54 PM

mr. schneier’s analysis is sound, but one of his sentences annoyed me. “in hindsight, it was really easy to connect the 9/11 dots and point to the warning signs…” my understanding is that a number of people actually did connect the 9/11 dots before it happened and warned their superiors, who did nothing.
something there is that wanted 9/11 to happen. not just al-qaeda, something among us. cui bono?
remember the expression on bush’s face at the “my pet goat” moment from fahrenheit 9/11? squirreliest damn presidential expression i ever saw, including all of the prime nixon moments.

S. Colcord • March 9, 2006 3:47 PM

“…for a system to be worthwhile, the advantages have to be greater than the disadvantages.”

Very true. However, the public is not conducting a rational cost-benefit analysis. They receive an illusory sense of security from knowing that ‘someone’ is doing ‘something’, and value those illusions more highly than they value their lost freedoms, or their tax money.

Their political representatives are duly spending their money to purchase those illusions. The alternative would be the politicians attempting to educate the public, and few politicians have the courage to risk their careers by showing genuine leadership.

Koray Can • March 9, 2006 3:48 PM

I think that data mining is a very advanced field, and if we all were to suddenly become experts in it, we’d be very surprised by what it can or cannot do. So, I don’t consider myself qualified to judge its capabilities accurately. I also think Bruce’s example is too simplistic.
However, I do want to know this: if wholesale surveillance can be fine-tuned to this level of accuracy (for I think extracting a terrorist plot is the hardest task for a data miner), why don’t we use this technology for preventing crime or car accidents first?

Bruce Schneier • March 9, 2006 3:59 PM

“Actually, one needs to distinguish between screening the general population for rare diseases and diagnosing them. You screen with a test or procedure that yields low false negatives (with an acceptable rate of true positives), and then having “concentrated” the high risk population, you can diagnose with a more complex procedure with low false positive rate because you’re now using a high prevalence population. The two step procedure lets you optimize at both points.”

Definitely true, and if I had more space I would have gone into that complexity. One assumes that any terrorist-detection data-mining system will try to do something similar. I don’t think it will help.

Bruce Schneier • March 9, 2006 4:01 PM

“Apart from the fact you should be comparing 900m cards vs 250m people (or work into your evaluation of cc data mining the fact that cards are used more than once) there aren’t trillions of behaviour characteristics in al Qaeda training manuals.”

I hope it’s not a rigged example; I tried not to make it such.

I decided that cards is a more useful indicator than people. Cards are not always stolen in wallets; they’re stolen online too. I ran the numbers both ways in my drafts, and it didn’t make much difference.

Of course there aren’t trillions of characteristics in the al Qaeda training manual; if you think I said that you misunderstand the example. There are trillions of legitimate data points from all of us in the U.S. going about our daily business that need to be data mined for the few al Qaeda characteristics.

Bruce Schneier • March 9, 2006 4:02 PM

“There are an estimated 10 million illegal immigrants in the USA from Mexico. If the data mining can’t find them, how will it find the 100 that may be up to terrorist mischief?”

That’s a really good point. Thank you.

Bruce Schneier • March 9, 2006 4:03 PM

“While you have a point, you must realize that data-mining is an up-and-coming field. I work for a company that does a lot of data-mining, and we are developing algorithms to understand language, among other things. While data-mining may not be the answer now, as it evolves, it may become the best solution in years to come.”

Agreed. Research in this field should be funded and encouraged. Debate on how this tool should be used in society, the balance between this kind of wholesale surveillance and liberty, etc., should also be encouraged.

I just don’t think we should waste our security dollars on it yet.

Bruce Schneier • March 9, 2006 4:05 PM

“Just because there are some flaws in design and execution today, doesn’t mean we shouldn’t try and prevent Terrorist’s activity.”

Of course not. But if you have $10B to spend on stopping terrorism, would you rather spend it on countermeasures that have a decent chance of working or countermeasures that are likely to be completely ineffective? That’s the question here. We should definitely try to prevent terrorism. This system isn’t going to do it. I suggest we spend the money on something that might.

Bruce Schneier • March 9, 2006 4:07 PM

“Respectfully, I think you are missing the point behind data mining. It is not to connect the dots or to nail a specific individual. That would be grossly unrealistic. The point behind data mining is to find a few more dots that have a higher probability of being ‘interesting’ than the rest. Data mining is just a tool, a first cut at sifting the wheat from the chaff. Human follow-up and analysis is what would connect the dots.”

Exactly. That’s not a point I missed; that’s exactly what my analysis talked about. The human follow-up is an insurmountable problem, because the false positives are so great.

Tito • March 9, 2006 4:12 PM

Wow, I am amazed to hear this as a counter to Bruce’s excellent points:

“We have to do something. This is something. Therefore we have to do this.”

Kevin Lynch • March 9, 2006 4:21 PM

“There are an estimated 10 million illegal immigrants in the USA from Mexico. If the data mining can’t find them, how will it find the 100 that may be up to terrorist mischief?”

That’s a really good point. Thank you.

I would have to disagree here. The implicit assumption is that someone actually IS trying to find them and do something about their presence. But we know that there is no concerted effort to track down, find, and deport illegal aliens. It is reported widely and often that even known illegal aliens are regularly allowed to remain unmolested by the law, and that in many jurisdictions, it is illegal to inquire about immigration status during governmental or economic transactions. Hardly a good example.

Fish don't notice the water • March 9, 2006 4:31 PM

“The fundamental freedoms that make our country the envy of the world are valuable, and not something that we should throw away lightly.”

You believe that nearly everyone understands this to be true. You think that this is self-evident. It’s not.

Most people will not value freedom until they have tasted its opposite.

1915bond • March 9, 2006 4:34 PM

While the technology of data mining may not be polished, even DoubleClick’s crude example is startling:

http://www.computerbytesman.com/privacy/banads.htm

Edward • March 9, 2006 4:40 PM

To all you skeptics, here is a thought. As an application programmer specializing in security and databases, I must agree with Mr. Schneier. We did a similar project, on a much smaller scale obviously, in one of my security classes at school, and his figures are very accurate. What I don’t think most of you get is that a data mining application is still a mathematical algorithm. Now call me crazy, but I do believe it has been proven that you can’t predict any selective behavior in any creature, or even in nature for that matter. I believe Einstein himself failed at that; it was called his Theory of Everything. He worked on it to his dying day, basically stating that with mathematical formulas you could predict even the unknown choices of people and nature itself, where we all know that is not possible. So to say you can use a mathematical formula to predict the choices of people, no matter how much you think you know about them, is a complete farce.
Thank You

InternationalCitizen • March 9, 2006 4:49 PM

Hi, I’m writing just after reading your article on why data mining won’t stop terrorism, and it was very informative and well-written. What made me write, however, was this quote:

“The fundamental freedoms that make our country the envy of the world are valuable, and not something that we should throw away lightly.”

I agree wholeheartedly that those freedoms shouldn’t be sacrificed, especially when there’s little gain to be had. But envy? Maybe from a banana republic or a dictatorship, but I can’t really see how (for example) any citizen of the EU would envy the USA when it comes to civil liberties. At least I wouldn’t get arrested (or fined, I confess to not knowing the details) for drinking on the street in, say, France. Or if I was with a group of friends, one of whom had marijuana on him (even though that has nothing to do with me), as happened to an American friend of mine the other day in New Hampshire. Although the US is an example in many areas, it is most definitely not the source of all democracy in the world or the country that all others should aspire to be.

Having pride in one’s country is no flaw, as long as it doesn’t blind us to what the rest of the world has to offer/teach, or the fact that they may do some things just as well or better. I always admired the sacredness of free speech in the USA, but in recent years even that doesn’t seem as sacred anymore.

Once again, great article. I couldn’t help but express my opinion about that sentence, which was the only thing I didn’t like about it.

Smuler • March 9, 2006 5:38 PM

Could someone please explain the math in the example? What is multiplied/divided by what?

Sorry to be such a rock, but I need a Barney-style breakdown.

Jan Theodore Galkowski • March 9, 2006 5:46 PM

Thanks much for a great article on data mining, probability of detection, and false positives.

Alas, the same matters afflict “bad guy” detection systems at airports and elsewhere. As others have commented, I just can’t see those transport systems which win with great throughput tolerating the slowdowns associated with high false alarm rates. So, while they may serve as a deterrent and make travellers feel better, I don’t have a lot of confidence they’re doing any good.

People have a hard time understanding this stuff. There was an article a bit ago showing that many physicians didn’t understand that the likelihood of someone having a disease, given a positive test, depends on the disease’s prevalence in the tested population. See for instance:

http://bmj.bmjjournals.com/cgi/content/full/327/7417/741

This is another example of why science and maths education is so important in a society, so folks understand the limitations of what can and cannot be done with systematic, technological approaches and what, if used, they might cost in time and convenience. It also would let people themselves see when authorities are simply blowing smoke.

Incidentally, there’s a great site that explains much of this stuff in detail at:

http://yudkowsky.net/bayes/bayes.html

Best of luck,

reggie • March 9, 2006 6:00 PM

Holographic databases and other multivariant techniques based on less mainstream approaches to data mining are evolving very quickly and generating results in this field.

In my opinion you’ve oversimplified the discussion so far that it’s remiss, to the point of negligence.

Tragic to see such hype in the mainstream from someone who should know better.

Pat Cahalan • March 9, 2006 6:02 PM

@ Anon1

“With a competent cracker using a computer to exploit other machines on the network, there is very little common pattern between crackers, or even between attacks from the same cracker.”

Although I’m hardly an “intrusion detection” specialist, I would say that this isn’t precisely true.

While hackers undoubtedly follow different attack patterns, and use different tools, they still need to open a communication with a system to begin. There are a limited number of ports that any given host should be listening on or opening connections from.

In other words, there is already a large “weeding out” process that can be accomplished through traffic analysis when cutting down your number of transactions to analyze in your data mining process.

This does not apply to terrorist activity.

Historically, I’ve found more intrusions through traffic analysis than anything else -> “This box should not be listening to anything on port 10245, let’s take a closer look” or “Nobody should be trying to connect to this machine over SSH from outside this subnet, let’s take a closer look”.

So, hacker activity itself may be like terrorist activity in its “nebulosity”, but the inevitable traffic generated by hacker activity is easier to discover than terrorist activity -> both activities are small pebbles thrown into a body of liquid, but the ripples caused by the pebbles are decidedly different.
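A minimal sketch of the kind of "this box should not be listening on that port" check described above, in Python. The host names, the port numbers other than 10245, and the `EXPECTED_LISTENERS` table are hypothetical and made up for illustration; a real setup would feed in observations from a port scanner or flow logs.

```python
# Hypothetical allowlist: the TCP ports each host is expected to listen on.
EXPECTED_LISTENERS = {
    "web01": {80, 443},
    "db01": {5432},
}


def unexpected_listeners(observed):
    """Return sorted (host, port) pairs that are listening but not allowlisted.

    `observed` maps host name -> set of ports seen listening (assumed input).
    Hosts missing from the allowlist are treated as having no expected ports.
    """
    findings = []
    for host, ports in sorted(observed.items()):
        allowed = EXPECTED_LISTENERS.get(host, set())
        for port in sorted(ports - allowed):
            findings.append((host, port))
    return findings


if __name__ == "__main__":
    observed = {"web01": {80, 443, 10245}, "db01": {5432}}
    for host, port in unexpected_listeners(observed):
        print(f"{host} should not be listening on port {port}; take a closer look")
```

The check works because the set of legitimate ports per host is small and known in advance, which is exactly the property the comment says terrorist activity lacks.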

Jan Theodore Galkowski • March 9, 2006 6:14 PM

“Holographic databases and other multivariant techniques based on less mainstream approaches to data mining are evolving very quickly and generating results in this field.”

ah rubbish. it doesn’t matter WHAT technique is being used. there are limitations to what you can KNOW. technology is not magic.

Boris Kolar • March 9, 2006 8:03 PM

Data mining may be very effective against terrorism, if you’re looking for related events instead of suspicious ones. For example, given a known terrorist, data mining can point to other terrorists from his cell.

People who are working together usually share some common behaviour. For example, if one finds an interesting web site, others will access the same page shortly after they are told about it. So once you find a dot, data mining makes it easier to find other dots from a huge database of past events.

I think, Bruce, that your reasoning does not apply to data mining for related events.
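A rough sketch of the "related events" idea in Python: starting from one known user, count who visited the same pages shortly afterwards. The `(user, page, timestamp)` log format, the function name `followers_of`, and the one-hour window are all assumptions made for illustration, not a description of any real system.

```python
from datetime import datetime, timedelta


def followers_of(known_user, events, window=timedelta(hours=1)):
    """Count, per user, visits that follow the known user's visit to the same page.

    `events` is a list of (user, page, timestamp) tuples (hypothetical format).
    A visit counts if it happens on the same page, no earlier than one of the
    known user's visits and no more than `window` after it.
    """
    seed_visits = [(page, ts) for user, page, ts in events if user == known_user]
    counts = {}
    for user, page, ts in events:
        if user == known_user:
            continue
        if any(page == p and timedelta(0) <= ts - t <= window for p, t in seed_visits):
            counts[user] = counts.get(user, 0) + 1
    return counts


if __name__ == "__main__":
    t0 = datetime(2006, 3, 9, 12, 0)
    log = [
        ("alice", "/a", t0),
        ("bob", "/a", t0 + timedelta(minutes=10)),
        ("carol", "/a", t0 + timedelta(hours=3)),
        ("alice", "/b", t0 + timedelta(minutes=15)),
        ("bob", "/b", t0 + timedelta(minutes=20)),
    ]
    print(followers_of("alice", log))
```

A count like this only ranks candidates. Anyone who happens to visit a popular page soon after the known user would score as well, so the result is a list of leads and not a list of associates.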

funkyj • March 9, 2006 8:06 PM

We don’t need data mining, we need not to fire good people like Sibel Edmonds who blow the whistle on folks who are doing a bad job.

http://www.sourcewatch.org/index.php?title=Sibel_Edmonds

Of course wholesale data mining is a lot less threatening to the bureaucrats who are looking to build their empires.

roy • March 9, 2006 8:14 PM

For those who want to calculate on their own:

1-a is the sensitivity of the test (so a is the false-negative rate)
1-b is the specificity of the test (so b is the false-positive rate)
p is the prevalence of the condition tested for

The probability a positive is wrong is:

   b(1-p)
   ---------------
   p(1-a) + b(1-p)

The probability a negative is wrong is:

   ap
   ---------------
   (1-b)(1-p) + ap

For extremely rare conditions, where the prevalence is vanishingly small, these probabilities are approximated by:

probability a positive is wrong  ~ 1

probability a negative is wrong  ~ ap/(1-b), which is close to 0

If the whole US population, including you, were tested for leukemia, then you should not worry if your survey result was positive: it is almost certainly mistaken.

However, if your oncologist has you sent to a diagnostic center, where the specificity and sensitivity for the diagnostic test were the same as for the survey (but the prevalence there is substantial), then if your result is positive you should worry.

If the test is for ‘terrorist’, then a nationwide survey will produce a host of false positives, the followups of which will waste even more time and money. And the rare true terrorist will almost certainly be missed.
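For anyone who would rather compute than read fractions, here is a small Python sketch of Bayes' rule for a screening test. It takes `a` as the false-negative rate (1 minus sensitivity), `b` as the false-positive rate (1 minus specificity), and `p` as the prevalence. The numeric values in the demo (a = 0.001, b = 0.01, and the two prevalences) are assumptions chosen for illustration only.

```python
def prob_positive_is_wrong(a, b, p):
    """P(no condition | positive test), by Bayes' rule."""
    true_pos = (1 - a) * p       # has the condition and tests positive
    false_pos = b * (1 - p)      # lacks the condition but tests positive
    return false_pos / (true_pos + false_pos)


def prob_negative_is_wrong(a, b, p):
    """P(condition | negative test), by Bayes' rule."""
    false_neg = a * p            # has the condition but tests negative
    true_neg = (1 - b) * (1 - p) # lacks the condition and tests negative
    return false_neg / (false_neg + true_neg)


if __name__ == "__main__":
    a, b = 0.001, 0.01  # assumed error rates, for illustration
    for label, p in [("nationwide survey", 1e-6), ("diagnostic center", 0.3)]:
        print(label,
              "positive wrong:", prob_positive_is_wrong(a, b, p),
              "negative wrong:", prob_negative_is_wrong(a, b, p))
```

The same test gives very different answers in the two settings because only the prevalence changes: with a tiny `p`, nearly every positive is a false alarm, while at a referral center most positives are real.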

David Jensen • March 9, 2006 8:15 PM

I was pleased to see a technical professional from computer science addressing this issue. We all need to speak up and try to put the debate over the use of these technologies on a stronger footing. Your work and your books, from what I know of them, have done a great deal toward that end.

However, I want to take exception to one of the arguments you offer in your Wired News piece, and recommend an alternative. You suggest that nearly any realistic classifier is likely to produce an unacceptable number of false positives. This argument is a key technical point in critiques of many screening systems, from AIDS testing, to polygraphs, to computer security systems. During the debate over TIA, it was made by an ACM committee and by a Scientific American editorial, among others. I made a version of this argument myself in a report on money laundering that I helped produce more than a decade ago when I served as a Congressional staffer:

“Even if the accuracy of the system is nearly perfect, the results are still discouraging. If the system is 99 percent accurate, then all 20 illegitimate transfers would be correctly classified, and 400 legitimate transfers would be misclassified as illegitimate. Therefore, even with a system with remarkable accuracy, nearly all of the transfers identified as illegitimate actually would be legitimate.” (p. 171).

Office of Technology Assessment, U.S. Congress, Information Technologies for the Control of Money Laundering. OTA-ITC-630, U.S. Government Printing Office, Washington DC, September 1995. http://www.wws.princeton.edu/ota/disk1/1995/9529_n.html

Since that time, however, I’ve come to realize that this argument is naive for some types of systems. Specifically, systems that examine networks of connected records (e.g., financial transactions, communications, web pages, social or organization networks) can be designed to largely escape this problem. I outlined this argument in some detail in a paper several years ago:

Jensen, D., M. Rattigan and H. Blau (2003). Information awareness: a prospective technical assessment. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://kdl.cs.umass.edu/papers/jensen-et-al-kdd2003.pdf

We show that the performance of screening systems can be improved substantially because the inferences about individual records are not made independently. That said, the alternative system is subject to a range of new and technically interesting critiques, which I think are more useful and informative for debates over public policy.

Jan Theodore Galkowski • March 9, 2006 8:16 PM

“Data mining may be very effective against terrorism, if you’re looking for related events instead of suspicious ones. For example, given a known terrorist, data mining can point to other terrorists from his cell.”

but, given that a terrorist in a sleeper cell is trying to blend in, that terrorist will also have many contacts with simple bystanders. how do you reliably tell which contacts are terrorist contacts and which aren’t, particularly since the non-terrorist contacts will inevitably outnumber the terrorist ones by a lot, and the really terrorist contacts will be covert?

Jan Theodore Galkowski • March 9, 2006 8:22 PM

“Specifically, systems that examine networks of connected records (e.g., financial transactions, communications, web pages, social or organization networks) can be designed to largely escape this problem.”

but such detection mechanisms are readily avoided by devices known to people in espionage for a long time. in particular, if the means of communicating is in public view and the agents using that means aren’t known to one another, there is no network to map.

this was done, for instance, in the long past by placing personals ads in newspapers which used appropriate words that had coded meanings known to the participants. today you could do the same on craigslist.org.

moreover, as i mentioned above, what do you consider a “link” in the network? obviously something like a financial transaction is pretty loud. but how about merely knowing someone or being a coworker?

Roger • March 9, 2006 9:57 PM

“Data mining works best when there’s a well-defined profile you’re searching for,”

I’m sorry, this is quite misleading, and (I hope I don’t seem rude here, I don’t mean to be) seems to indicate that you don’t really understand what data mining is. If you have a well-defined profile that you’re looking for in a database, that’s not data mining, it’s just called “doing a database query”. That process doesn’t need any fancy names because everyone already understands it, and it has existed since long before databases were in electronic form. In fact IBM started out when they developed mechanical ways of speeding it up back in 1911. Every police officer and FBI agent does it every day, and has done so routinely for decades now.

Data mining, on the other hand, is a set of tools to help you discover what profiles you should be looking for, assess how reliable they are, and decide whether or not they are going to be worthwhile to pursue. It isn’t an automated process. It always involves a human analyst in the loop, operating the tools, examining the results, modifying and trying again. In short, it is close to what you are asking for when you said “We’d be far better off putting people in charge of investigating potential plots and letting them direct the computers, instead of putting the computers in charge and letting them decide who should be investigated.”

“Terrorist plots are different. There is no well-defined profile,”

Even this, however, is doubtful. Terrorists actually fit profiles far better than thieves do. Oh, don’t get me wrong, there are many exceptions; but in the case of terrorists, the exceptions are far rarer than they are for thieves. The reasons for this are fairly simple. Firstly, terrorists have very strong motivations which are significantly different to the rest of humanity (and, while differing in detail, can nearly always be categorised in one or more of a small number of types, arguably three); in contrast, thieves are almost entirely motivated by a drive which is felt to a greater or lesser degree by nearly everyone. Secondly most terrorist organisations — certainly all the more effective and dangerous ones — adopt a paramilitary training model, which increases individual effectiveness but also imbues particular characteristics and methodologies.

“A false positive is when the system identifies a terrorist plot that really isn’t one. A false negative is when the system misses an actual terrorist plot.”

Those are false positives/negatives for queries. A false positive for a data mining system is when it says “people who suddenly pay off their debts are 0.1% likely to be terrorists” when the actual rate is only slightly higher than background population. A false negative is when the system fails to identify that 100% of Al Qaeda members are Muslims. False positives in data mining are likely to be extremely rare, and probably indicate a serious systematic corruption of your data. False negatives are likely to be more common and arise from a variety of sources.

“Depending on how you ‘tune’ your detection algorithms, you can err on one side or the other: you can increase the number of false positives to ensure that you are less likely to miss an actual terrorist plot, or you can reduce the number of false positives at the expense of missing terrorist plots.”

This is quite true, and is exactly the sort of thing that data mining is good at helping you to do. In particular, it should tell you whether or not there exists some combination of parameters that will give you useful results, or if you need one more data source to make it work, or you should just give up.

“There are 900 million credit cards in circulation in the United States. According to the FTC September 2003 Identity Theft Survey Report, about 1% (10 million) cards are stolen and fraudulently used each year. Terrorism is different. There are trillions of connections between people and events (things that the data mining system will have to ‘look at’) and very few plots. This rarity makes even accurate identification systems useless.”

Others have pointed out that this analogy isn’t quite right, but have not quite got to the crux of where it falls down completely. In credit card fraud detection, we are looking at a sea of transactions associated with some card, some transactions are good and some are bad. We have to pick out nearly all the bad transactions without stopping too many good transactions. This is actually quite hard, and a lot slip through the net. Additionally, we may eventually class the entire account as compromised if it is getting far too many bad transactions. Given that we already have the bad transactions identified, this latter part is trivial.

In contrast, when we are looking for a terrorist, we don’t care all that much about individual transactions. It is the individual we are concerned with. Once an individual has accumulated a lot of “bad transactions”, he may be flagged suspicious. This inverted direction of association makes the tolerances much, much easier for the terrorist hunter than for the credit card fraud algorithm. And actually, it can be made even easier; while the police may want to be certain about the individual, the counter-terrorism agency is more concerned with groups, and so can make the associations even stronger (with a mental note to be careful about selection bias) by combining results for groups of associates. Let’s see where this takes us with your example. We’ll use your probabilities (whilst noting that they are not so much “optimistic” as “plucked out of the air”, but that’s OK, this is all just for an example).

“Assume one trillion possible indicators to sift through: that’s about ten events (e-mails, phone calls, purchases, web surfings, whatever) per person in the U.S. per day. Also assume that 10 of them are actually terrorists plotting. This unrealistically-accurate system will generate one billion false alarms for every real terrorist plot it uncovers.”

No. It may flag one billion transactions as “possibly suspicious”, but at this point we haven’t gotten out of the inner join of the SQL query, never mind dumped a report on a policeman’s desk. To go on, at this point we need another parameter: let’s assume that our real terrorists repeat the suspicious activity each day. (Whether or not this is plausible would depend on the nature of the activity, which is totally hypothetical at this point. In any case, it doesn’t much affect the result unless the detected activity is so rare even for real terrorists that it effectively only occurs once every few years.) On the second day, our list of false alarms has been reduced to 1 million, while there is a 99% chance that every single one of the original true positives is still in the list. By day 3, we have 1,000 false alarms and a 98% chance that all the original true positives are still detected. To cut a long story short, by day 6 there is only a 1 in a million chance that any suspects on the list are false positives, while there is a 94% chance that all the true positives are still detected, thus not only detecting the ring but rounding up its entire membership.

This test could fail to yield positive information about the presence of the terrorist cell if the real terrorists performed the suspicious act very, very infrequently. But for it to actually drop to 0 information, they would need to do it at no higher rate than the background population (i.e., once per 1000 days, or two and three-quarters years) AND have the members of the cell otherwise unassociated and independent in the query to at least the same degree as is true in the background population. (Such a test would then indeed be useless, and it would be a very bad data mining session indeed to have suggested it.)

Conversely, if there is some other aspect of the query which enables them to be grouped before the query and independently of it, then this query is already so sensitive it is essentially impossible for them to evade detection, whilst the chance of false positives becomes “astronomically low”.
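The day-by-day arithmetic in the example above can be sketched in Python as follows. This is only a sketch of that argument under its own stated figures, treated here as assumptions: one billion initial false alarms, 10 real plotters, a background repeat rate of once per 1000 days, a 1-in-1000 chance of missing a real plotter on any given day, and full independence between days and between people. The function and parameter names are hypothetical.

```python
def repeated_screening(days, false_alarms=1e9, plotters=10,
                       false_repeat_rate=1e-3, miss_rate=1e-3):
    """Expected false alarms and P(all plotters still flagged) after `days` days.

    Day 1 is the initial screen. Each later day keeps a false alarm with
    probability `false_repeat_rate` and keeps each plotter with probability
    1 - miss_rate. Independence across days and people is assumed.
    """
    extra_days = days - 1
    expected_false = false_alarms * false_repeat_rate ** extra_days
    p_all_kept = (1 - miss_rate) ** (plotters * extra_days)
    return expected_false, p_all_kept


if __name__ == "__main__":
    for day in range(1, 7):
        expected_false, p_all_kept = repeated_screening(day)
        print(f"day {day}: expected false alarms {expected_false:g}, "
              f"chance all plotters still listed {p_all_kept:.3f}")
```

Under these assumptions the day 6 figure comes out nearer 95% than 94%; counting the first day's misses as well gives about 94%. The result depends heavily on the independence assumption: if the same innocent people tend to repeat the same flagged behaviour day after day, the false alarms shrink far more slowly than a factor of 1000 per day.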

“This is exactly the sort of thing we saw with the NSA’s eavesdropping program: the New York Times reported that the computers spat out thousands of tips per month. Every one of them turned out to be a false alarm.”

Umm, this isn’t true, Bruce. The very New York Times article you cite says most turned out to be false alarms (according to an anonymous source), but then lists no fewer than 9 persons operating within the US who have been arrested and charged with serious terrorism-related offences, who, all sources agreed, were originally identified solely as a result of the NSA program. It also cites a number of other positives which are “disputed” because they may have also been identified by other sources (but more on that in a moment). However, duplication of some “hits” doesn’t exactly invalidate the program. It’s the non-overlaps that are more significant. So far, there are 9 “hits” developed by NSA that were missed by other means, vs. 0 hits developed by other means that were missed by NSA. So if you had to decide between “data mining” and “all other means” (which would be silly, there is no such dichotomy and the methods are actually complementary), it would be a clear win for data mining, scrap everything else.

Even then, there is room for considerable doubt about whether any of the items listed were developed by other means. The NYT says that “some” officials thought that “they had already learned of the plans through prisoner interrogations or other means”, but goes on to acknowledge that “because the program was a closely guarded secret, its role in specific cases may have been disguised or hidden even from key investigators.” It is of course standard practice in SIGINT to disguise sources of information. And in this case, it seems to have been especially prudent, since the FBI officers in question did indeed go on to betray their country by releasing classified information to a newspaper — information derogatory to a rival agency, publicised illegally and anonymously a few weeks before program budgetary hearings on Capitol Hill…

Jan Theodore Galkowski • March 9, 2006 10:24 PM

Roger, i’m skeptical the NSA system could be that proficient. what are they measuring? mere phone calls or emails? financial transactions? Web site postings? what if the means of communication are less technological? that’s always been the weak point of SIGINT. terrorists have been known to avoid using cell phones for precisely that reason. do they actually track social networks? provided how? by HUMINT?

also, at some point the information needs to come for interpretation to a person. they need to make judgment calls.

remember, this intelligence apparatus mistook a Chinese embassy in the Balkans for a Serbian headquarters, making it the destiny of a precision bomb.

my personal concern is that such an apparatus can lull an administration into a false sense of security and, as well, into a sense that it is the best they can do. it is one way, one tool of pursuing counterterrorism. it’s not clear it is cost effective. it tends to appeal to people who like gizmos and technology. it’s not a kind of a process that lends itself to Lagrangian precision. worse, because the logic of these data mining devices are themselves shrouded in secrecy, there is no way to independently assess how good or bad they are, or how distorted the results are because of bureaucratic policy. one credit agency still lists me as living where i lived 10 years ago, despite my protests. they won’t change because they stupidly have a policy saying they won’t update the record with a new address.

someone can “betray their country” just as easily by setting aside laws which they are obligated to follow having longstanding traditions and interpretations. if there were an effective whistleblower mechanism within the intelligence services, extreme measures might not be needed. their whistleblower protections aren’t even as strong as those provided for people in the military. finally, if executives of all parties and stripes didn’t abuse secrecy to hide their dirty laundry, secrets might have greater respect. that happens all the time and is repeatedly abused, especially when the intelligence services want to hide the details of their operating budgets.

someone can also betray their country by pretending to provide it with a tool or device or capability which they advertise as being more potent than it is, yet accepting full funding and pay for it.

Jan Theodore Galkowski • March 9, 2006 10:27 PM

“…publicised illegally and anonymously a few weeks before program budgetary hearings on Capitol Hill…”

it’s not clear the AIPAC determinations apply to that case. moreover, the Times may just as well have had the motivation of waiting a year for publication to protect its sources, under the guess that a year-old trail would be a lot more difficult for a vendetta-prone executive to follow.

igjugarjuk • March 9, 2006 10:46 PM

The people behind TIA aren’t stupid. They’ve done the numbers, and know that the system cannot work as they are stating publicly. That means one of two things. Either:

(1) They want to create the PERCEPTION of doing something, or
(2) The system is used for something OTHER than searching for the actual terrorist (as we know it) incidents they claim they are chasing.

Point (1) may have been an option when it was publicly known as “TIA”. Now that they have hidden the program from the public, we know it cannot be for public perception purposes.

In considering point (2), consider the possibility that the number of “Incidents” is not 10, but (say) 10 million. By lowering the bar on incidents that they consider “worth examining”, the false positive issue becomes moot, largely reduced, and the “cost” is largely eliminated: the FBI (etc) agents are not running around chasing false leads, and the cost to civil liberties = 0, since it’s not their cost, it’s our cost. Thus, they would consider the system as being very viable, and worth pursuing, even if they have to surreptitiously slip it into hidden programs, away from public (and congressional) scrutiny.

The only question to be resolved then is “Exactly what level have they set as the trigger point for an ‘incident’ worth profiling, and further scrutiny, or action?” My money’s on them euphemistically referring to those they are building profiles on as “Enemies of the State” – otherwise known as “Opponents of the current regime’s hegemonic program”. In other words, me and you.

Jan Theodore Galkowski • March 9, 2006 10:54 PM

igjugarjuk, as one Air Force general illustrated military and aerospace thinking, “a weapons system that works is more militarily useful than one that doesn’t”. perhaps the same kind of thinking applies to intelligence gathering.

in other words, perhaps they really believe “It’s not whether we can detect and find them or not, it’s whether they think we can detect and find them”.

Roger • March 10, 2006 1:12 AM

@Jan Theodore Galkowski:

“Roger, i’m skeptical the NSA system could be that proficient.”

What do you mean by “that proficient”? Unless you know how many more remain, their proficiency is impossible to assess.

“what are they measuring?”

I don’t know. It’s a secret! 8^)

“mere phone calls or emails? financial transactions? Web site postings?”

Who knows. It may be a complex algorithmic combination of numerous factors. Deriving the algorithm and testing its reliability is the whole point of data mining. On the other hand, they may have happened upon something quite simple. We don’t know, and are unlikely to find out, because one of the key dictums of signals intelligence is “sources are fragile”: that is, once an opponent has any inkling that some channel is leaking SIGINT, it is usually very easy to block up the leak.

“what if the means of communication are less technological?”

Then they are unlikely to be detected by SIGINT methods. Al Qaeda recognises this, and in the infamous training manual it recommended resorting to low tech methods rather than trying to outsmart the western SIGINT agencies. However, there are serious limitations to co-ordinating a global organisation by carrier pigeon.

“that’s always been the weak point of SIGINT. terrorists have been known to avoid using cell phones for precisely that reason. do they actually track social networks?”

Does who track social networks? I suspect most people do, unless they are hermits or autistic. If you are investigating a criminal organisation larger than two or three persons, you’re pretty well bound to make some kind of effort to work out who is who.

“provided how? by HUMINT?”

Possibly. Different forms of intelligence collection tend to be complementary.

“also, at some point the information needs to come for interpretation to a person.”

Yep.

“they need to make judgment calls.”

Depends what you mean by “judgement calls”. If you mean, take a guess based on inadequate information, then maybe, but probably not. Some of these sorts of information can be quite reliable, others less so. One beauty of it though, is that unlike most other sources it actually enables fairly accurate measurements of its own reliability.

“remember, this intelligence apparatus mistook a Chinese embassy in the Balkans for a Serbian headquarters, making it the destiny of a precision bomb.”

Huh? What makes you think NSA had anything to do with that? The official conclusion was that it was caused by someone using an out-dated map, i.e. nothing to do with NSA or any other covert intelligence agency. (The most popular, erm, unofficial conclusion is that it probably wasn’t an accident!)

“my personal concern is that such an apparatus can lull an administration into a false sense of security and,”

So accurate intelligence is bad because it looks too good? Hmm. Firstly, that would be the failure of the person receiving it, not vice versa; and second, I suspect you have never had to endure intel reports if you think that even the lowliest moron is likely to regard them as infallible.

as well, into a sense that it is the best they can do.

Depends what you mean by “best” I suppose. Arguably SIGINT is the best form of intelligence collection, in that it is the least nasty. Plus, it is the most cost-effective. These two properties make it very attractive to politicians in democracies, and so it has perhaps been over-emphasised at precisely the time when the UKUSA partners have lost their dominance of the field.

Intelligence officers however realise that the best product is attained by the complementary usage of a variety of source types, even though use of HUMINT does mean that somewhere, somehow, someone is going to get it in the neck.

it is one way, one tool of pursuing counterterrorism.

Absolutely. Different forms of intelligence collection tend to be complementary.

it’s not clear it is cost effective.

On the contrary, that is one of the few things that is clear. Every independent analysis of intelligence economics that has ever been done (and there have been many) shows that SIGINT is so far ahead of all alternatives in cost-effectiveness that it introduces the danger of the bean counters under-funding everything else to get more SIGINT. This is indeed dangerous, because different forms of intelligence collection tend to be complementary.

it tends to appeal to people who like gizmos and technology.

No, in my experience it tends to appeal to politicians, because it is cheap, and involves much less risk of your people appearing on Al Jazeera with a scimitar at their necks.

People who like gizmos and technology, and are immature enough to let that influence their decisions, tend to like PHOTINT. (“Do you have any idea how many records the SR-71 still holds after 40 years!!!”)

it’s not a kind of a process that lends itself to Lagrangian precision.

I have no idea what you mean by this. I understand several meanings of “Lagrangian”, none of which fit here.

worse, because the logic of these data mining devices are themselves shrouded in secrecy,

Devices? Erm, it’s just done on big computers. If you mean algorithms, no they aren’t at all secret, there is a great deal of open publication about the techniques. Individual results are usually kept secret, because they represent a tactical advantage which can be diminished if known by your enemies/competitors/beneficiaries.

there is no way to independently assess how good or bad they are, or how distorted the results are because of bureaucratic policy.

Well, you know they caught more Al-Qaeda terrorists operating in the USA than were caught by everyone else put together.

one credit agency still lists me as living where i lived 10 years ago, despite my protests. they won’t change because they stupidly have a policy saying they won’t update the record with a new address.

Well, clearly they are idiots with contempt for their customers. I don’t see what it has to do with data mining though, except that they probably don’t do any.

if there were an effective whistleblower mechanism within the intelligence services,

There is, it’s called the intelligence ombudsman, but in the US the role has been around for less than a year. (My own country has had such a role for years.)

the Times may just as well have had the motivation of waiting a year for publication

A newspaper sit on a story for a year? Are you kidding?

Jan Theodore Galkowski • March 10, 2006 1:24 AM

“Well, you know they caught more Al-Qaeda terrorists operating in the USA than were caught by everyone else put together.”

i do not know that. source?

“A newspaper sit on a story for a year? Are you kidding?”

actually, the Times did. they were informed of the source a year before it was disclosed. you yourself accused them of waiting until just before budget deliberations regarding the program started in Congress.

i quote the New York Times from 16th December 2005: “The White House asked The New York Times not to publish this article, arguing that it could jeopardize continuing investigations and alert would-be terrorists that they might be under scrutiny. After meeting with senior administration officials to hear their concerns, the newspaper delayed publication for a year to conduct additional reporting. Some information that administration officials argued could be useful to terrorists has been omitted.”

Anonymous • March 10, 2006 5:24 AM

Data mining is a bit like Google. It’s not a magic box that will answer your question, but it will suggest where to start looking for an answer. And, like Google, it requires useful input. It won’t do well if you are looking for “something interesting”, but it may give very good results if you know what you are looking for.

I would prefer that data mining not be used, because the privacy implications are enormous if the system is abused. But it can be an extremely effective tool.

Boris Kolar • March 10, 2006 5:25 AM

I would prefer that data mining not be used, because the privacy implications are enormous if the system is abused. But it can be an extremely effective tool.

Tank • March 10, 2006 5:27 AM

I didn’t think you said this, I’m pointing out that the “mining” part of data mining throws out those trillions of uninteresting characteristics in favour of focusing on the thousands of interesting ones, then repeats.

That is after all the entire point/function/operation of the subject here.

You mention particular purchasing habits that are identifiable as suspect credit card transactions, which enable financial institutions to run more economical data mining and follow-up investigations. Yet you give no consideration to such identifiable characteristics in what intelligence agencies are targeting.
Again, you are playing a handicapped field.

We all know you’ve researched more than enough on profiling people carrying out electronic and physical security crimes. Take some time out to read IntelCenter.com and SITEinstitute.org along with some researched accounts of actual perpetrators if you are going to do the same for a different bunch of criminals who have been very liberal in letting their expertise and methods walk out of Afghanistan into Pakistan and fly back to the English-speaking world.

Stephane • March 10, 2006 5:36 AM

@Tank:
“Are you right now cowering in uncontrollable fear that you too could be one of those zero people ?”

False dilemma. Fear is not binary; you cannot force me to choose between “uncontrollable fear” and fearlessness.

“Because if you actually are terrorfied by this then you’re the biggest factor in that result, not that visit from the FBI… that you didn’t get.”

We’re not talking about the same thing. Would a multiplication of false alarms in your neighbourhood, seen on the news, make you less fearful?
IMHO many FBI visits for nothing won’t help very much; people may unfortunately start to have less confidence in and respect for them.

Dewey • March 10, 2006 8:20 AM

On Validity

1) Bruce has a very interesting theory about the value of data mining. My question is: is there a way to test this theory? Can we in some way measure the current reality of data mining to see if the numbers match Bruce’s model? Unfortunately the public (meaning me, at least) doesn’t have enough visibility into the actual data to make this decision. The decision-making agendas of those who do are currently subject to debate.

2) Bruce talks a lot about better tradeoffs, well and good. However, a comparison is only valid when it compares two things. It makes no sense to evaluate only data mining’s effectiveness; we have to have something to compare it to. Either we need a baseline number we’re willing to use (e.g. cost per death prevented = $1000) or we need an alternative proposal against which we can compare cost.

Yes, data mining doesn’t look particularly good, but what’s better? (That’s a question, not a challenge.)

For the record my personal values set the price of civil liberties very high and intuitively I agree that we’re making a lot of bad tradeoffs. I’m trying to move my own thinking from intuition to somewhat rigorous logic.

roy • March 10, 2006 8:45 AM

On the news, DHS announced that the terrorists have been positively linked to a certain type of watch, which they use as a timing device. The type? Casio.

I checked my watch: it’s a Casio. I will investigate myself exhaustively and keep you posted.

Dewey • March 10, 2006 8:47 AM

Paranoia:

There’s a very effective way to make data-mining more useful: reduce the allowed variability of the “normal” population. This is what really chafes with the civil liberties crowd (of which I’m one).

Jan Theodore Galkowski • March 10, 2006 12:41 PM

@Tank,

“Take some time out to read IntelCenter.com and SITEinstitute.org along with some researched accounts of actual perpetrators …”

i have. it’s a bunch of superficial statistics with pretty charts, or repeats of terrorist manuals, primarily a rehash of well-known threats and means of operation. there’s nothing of meaning there. past patterns or intensity of operations can’t be validly extrapolated to make specific predictions.

@Dewey,

“Yes, data mining doesn’t look particularly good, but what’s better?”

unfortunately, there’s nothing quite like HUMINT, even if it consists of highly paid informants whose information needs to be carefully vetted before being believed.

there is much IMO to be gained from careful and in-depth cooperation with major powers having deep HUMINT assets in the area, especially Russia, China, and France. we should be cultivating those relationships, not being callous about them, or even diminishing them.

Dewey • March 10, 2006 12:54 PM

@Jan,

HUMINT — agreed completely. However, can we quantify anything here? Even a back-of-the-envelope calculation might help in this discussion.

How much does an operation cost? What’s the probability of success? Failure?

My goal is to figure out what the best use of resources might be.

Eric Likness • March 10, 2006 2:40 PM

All political arguments aside, I loved your argument about the unlikelihood of the government’s TIA program ever working. I was especially intrigued by your inclusion of the term “base rate fallacy”. My math background and familiarity with statistics and probability are not what they should be, so I had to look this term up and Wikipedia had (of course) some links to more authoritative articles on this subject. The initial Wikipedia entry was somewhat vague, but at the bottom, someone had linked to another article, which as you can see is from none other than a government intelligence agency. I have to say the author of this article does a great job explaining the ‘base rate fallacy’. I just wish John Poindexter in all his infinite consulting wisdom had read this very article. Then possibly TIA wouldn’t have ever been implemented, or ‘tried out’ to see if it would work.

http://www.cia.gov/csi/books/19104/art15.html#ft145

patrick henry • March 10, 2006 3:48 PM

mr.schneier,

i read with interest your article regarding data mining for terrorists. would you speculate on the following thesis: the government’s program of data mining for terrorists is simply a cover story for the real purpose of the program: domestic spying for political purposes. how feasible would it be to create profiles for dissidents and seditionists? would it be effective to data mine to discover specific types, groups, and individuals who would be defined as political enemies of the administration? or would there be other purposes for total information awareness that the government would like to hide behind the cover story of data mining for terrorists?

Pat Cahalan • March 10, 2006 4:33 PM

@ Roger

Some good points, but I have some questions. About this:

To cut a long story short, by day 6 there is only a 1 in a million chance that
any suspects on the list are false positives, while there is a 94% chance
that all the true positives are still detected

If I’m reading you correctly, assuming a population of 250 million innocent people and 20 terrorists, your math here would indicate that you’re going to send a FBI/SWAT team to pick up 250 innocent people (1 out of every million) and 18.8 terrorists. If you’re off by an order of magnitude either way (easily done with guesstimates like this), that’s 2,500 innocent people and 1.9 terrorists. Investigating 2,501.9 people (or even 269) is pretty resource intensive.
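The arithmetic in this reading can be checked with a short sketch. It assumes, as the paragraph above does, that the 1-in-a-million figure is a per-person false-positive rate and that 94% is a per-terrorist detection rate. The function name and the factor-of-ten "worse" case are illustrative only.

```python
def expected_hits(innocents, terrorists, false_positive_rate, detection_rate):
    """Expected number of flagged innocents and flagged terrorists."""
    false_positives = innocents * false_positive_rate
    true_positives = terrorists * detection_rate
    return false_positives, true_positives

# Figures from the comment: 250 million innocent people, 20 terrorists.
fp, tp = expected_hits(250_000_000, 20, 1e-6, 0.94)

# Pessimistic case: both rates are off by an order of magnitude.
fp_bad, tp_bad = expected_hits(250_000_000, 20, 1e-5, 0.094)

print(f"as stated: {fp:.0f} innocents, {tp:.1f} terrorists flagged")
print(f"10x worse: {fp_bad:.0f} innocents, {tp_bad:.1f} terrorists flagged")
```

Under these assumptions the follow-up workload is dominated by the false positives in both cases.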

If bagging 1.9 terrorists enables you to capture the remaining 18.1, and by doing so you prevent another 9/11, this may be worth it. However, you may not actually halt the terrorist activity, you may not capture the remaining terrorists, and their planned activity may not have resulted in any event more deadly than the North Hollywood bank robbery of a few years ago.

Nobody talks about letting the government do these sorts of surveillance programs to stop armed robbery, in spite of the fact that more people have probably been killed in armed robberies than in all of the terrorist events in the U.S.

I’m not saying that these sorts of activities can’t result in halting an individual terrorist event. Sure, they might.

But, they might not, and any way you slice it you’re going to spend a lot of resources following up. Is this a worthwhile expenditure, given the potential results and the effects on the false positives?

And this:

Terrorists actually fit profiles far better than thieves do.

Although your paragraph on this makes sense, in fact you’re talking about a profile that the terrorist fits. But your data mining project is based not on the profile of the terrorist, but analyzing transactions.

Even if terrorists do fit a good profile, each may or may not exhibit behavior that matches a common profile of behavior.

(example -> a bunch of the 9/11 guys took flying lessons. However, the Unabomber didn’t, and McVeigh didn’t. They had a common profile in that they all wanted to blow something up, but the effect of that profile on their transactions was vastly different).

Jan Theodore Galkowski • March 10, 2006 5:47 PM

@Roger, but quoting Pat Cahalan,

“If bagging 1.9 terrorists enables you to capture the remaining 18.1, and by doing so you prevent another 9/11, this may be worth it. However, you may not actually halt the terrorist activity, you may not capture the remaining terrorists, and their planned activity may not have resulted in any event more deadly than the North Hollywood bank robbery of a few years ago.”

worse, the most deadly and devastating attacks against CONUS don’t require terrorist cells in place here. they simply require delivery from abroad.

a good estimate of the odds of a nuclear explosion (NOT a “dirty bomb”) in the United States during the next five years is 1-in-2 (50%). it would most likely kill 150000 outright, with 150000 more dying in the next couple of weeks from its consequences. (see my blog, at entry

http://www.algebraist.com/tkblog/index.htm#1096000194

for details. the source is Garwin.) for obvious reasons, once such a device were in the United States the operatives would want to detonate it as soon as possible. they would not loiter.

igjugarjuk • March 10, 2006 6:23 PM

Jan Theodore Galkowski:
Your point that the government may be trying to convince the terrorists that they can catch them, even though they cannot, is invalid given that the government has been trying to do this in secret. If your point had been made in the “TIA” days, it might have been valid. Not so now.

Further to my point that the real enemy is actually US, consider the following NBC report from yesterday:

” Pentagon admits errors in spying on protesters

NBC: Official says peaceful demonstrators’ names erased from database

The Department of Defense admitted in a letter obtained by NBC News on Thursday that it had wrongly added peaceful demonstrators to a database of possible domestic terrorist threats.

The letter followed an NBC report focusing on the Defense Department’s Threat and Local Observation Notice, or TALON, report.

Acting Deputy Undersecretary of Defense Roger W. Rogalski’s letter came in reply to a memo from Sen. Patrick Leahy, D-Vt., who had demanded answers about the process of identifying domestic protesters as suspicious and removing their names when they are wrongly listed.

“The recent review of the TALON Reporting System … identified a small number of reports that did not meet the TALON reporting criteria. Those reports dealt with domestic anti-military protests or demonstrations potentially impacting DoD facilities or personnel,” Rogalski wrote on Wednesday.

“While the information was of value to military commanders, it should not have been retained in the Cornerstone database.”

Threats directed against Defense Department
In 2003, the Defense Department directed a little-known agency, Counterintelligence Field Activity (CIFA), to establish and “maintain a domestic law enforcement database that includes information related to potential terrorist threats directed against the Department of Defense.” Then-Deputy Secretary of Defense Paul Wolfowitz also established TALON at that time.

The original NBC News report, from December, focused on a secret 400-page Defense Department document listing more than 1,500 “suspicious incidents” across the country over a 10-month period. One such incident was a small group of activists meeting in a Quaker Meeting House in Lake Worth, Fla., to plan a protest against military recruiting at local high schools.

In his Wednesday letter, Rogalski said such anomalies in the TALON database had been removed.

“They did not pertain to potential foreign terrorist activity and thus should never have been entered into the Cornerstone database. These reports have since been removed from the Cornerstone database and refresher training on intelligence oversight and database management is being given,” Rogalski wrote.

Rogalski said only 43 names were improperly added to the database, and those were from protest-related reports such as the Quaker meeting in Florida.

“All reports concerning protest activities have been purged,” the letter said.”

TALON reports provide “non-validated domestic threat information” from military units throughout the United States that are collected and retained in the Cornerstone CIFA database.

Nearly four dozen antiwar meetings listed
The Defense Department document provides an inside look at how the U.S. military has stepped up intelligence gathering since 9/11. The database includes nearly four dozen antiwar meetings or protests, including some that have taken place far from any military installation, post or recruitment center, according to NBC News’ Lisa Myers, who first wrote about the story in December.

Among those listed were a large antiwar protest in Los Angeles in March 2004 that included effigies of President Bush and antiwar protest banners, a planned protest against military recruiters in December 2004 in Boston, and a planned protest in April 2004 at McDonald’s National Salute to America’s Heroes – a military air and sea show in Fort Lauderdale, Fla.

The Fort Lauderdale protest was deemed not to be a credible threat, and a column in the database concludes: “U.S. group exercising constitutional rights.” Two-hundred and forty-three other incidents in the database were discounted because they had no connection to the Department of Defense – yet they all remained in the database.

The Department of Defense has strict guidelines, adopted in December 1982, that limit the extent to which it can collect and retain information on U.S. citizens.

Still, the database includes at least 20 references to U.S. citizens or U.S. persons. Other documents obtained by NBC News show that the Defense Department is clearly increasing its domestic monitoring activities. One briefing document stamped “secret” concludes: “[W]e have noted increased communication and encouragement between protest groups using the Internet,” but no “significant connection” between incidents, such as “reoccurring instigators at protests” or “vehicle descriptions.”

Earlier domestic intelligence gathering
The military’s penchant for collecting domestic intelligence is a trend, Christopher Pyle, a former Army intelligence officer, told NBC News when the report was first broadcast.

During the Vietnam War, Pyle revealed the Defense Department monitored and infiltrated antiwar and civil rights protests in an article he published in the Washington Monthly in January 1970.

The public was outraged and a lengthy congressional investigation followed that revealed the military had conducted probes on at least 100,000 American citizens. Pyle got more than 100 military agents to testify that they had been ordered to spy on U.S. citizens – many of them antiwar protesters and civil rights advocates. In the wake of the investigations, Pyle helped Congress write a law placing new limits on military spying inside the U.S.

But Pyle said some of the information in the database suggests the military may be dangerously close to repeating its past mistakes.

“The documents tell me that military intelligence is back conducting investigations and maintaining records on civilian political activity. The military made promises that it would not do this again,” he said.
NBC News’ Lisa Myers and the NBC Investigative Unit contributed to this report.
http://www.msnbc.msn.com/id/11751418/

And that’s just what has been exposed. Things like the secret-TIA activity is what’s below the water of this iceberg that is exposed by this article. Make no doubt about it: the enemy they see is US.

Jan Theodore Galkowski • March 10, 2006 6:30 PM

@igjugarjuk,

“And that’s just what has been exposed. Things like the secret-TIA activity is what’s below the water of this iceberg that is exposed by this article. Make no doubt about it: the enemy they see is US.”

i agree that given this domestic monitoring power, the temptation to use it to identify and counter domestic enemies and opponents must be overwhelming. that is precisely why having such power without the controls of some court, is so damn wrong. that is why i personally believe, even if the publication of the program’s existence violates some narrow reading of secrecy law, under the circumstances, the people in the intelligence arms and counterintelligence who revealed the program’s details to The New York Times were perfectly correct in their actions. and The Times was right to publish.

i have followed secrecy in the United States for thirty years. it does far more damage than it helps.

igjugarjuk • March 10, 2006 10:15 PM

Jan Theodore Galkowski:
Yes, the recent telephone-tapping scandal is just another datapoint of evidence for what I am saying about the data mining systems, that is, that we are the enemy, in the Government’s eyes. I guess the question now becomes: what can be done about it?

Jan Theodore Galkowski • March 10, 2006 10:30 PM

as evidence of how dissenters in the intelligence community are treated, consider the Barlow affair, documented at:

http://www.thebulletin.org/article.php?art_ofn=jf06edmonds

Jan Theodore Galkowski • March 10, 2006 10:36 PM

@igjugarjuk,

” I guess the question now becomes: what can be done about it?”

well, PGP encryption can be used for email, at least the more personal email. you can also use Web-based email instead of direct POP or IMAP services, taking care to locate the servers for the email in other countries.

there is also the open source Tor project, which attempts to obstruct casual intercepts of communications by dynamically reallocating paths on the Internet used for communications. it implies a performance penalty, however.

Jan Theodore Galkowski • March 10, 2006 11:01 PM

also, to people who believe the ongoing NSA eavesdropping program is having concrete and positive results, note there are serious doubts remaining about that by Congress, per

http://www.washingtonpost.com/wp-dyn/content/article/2006/03/09/AR2006030902181.html

igjugarjuck • March 10, 2006 11:39 PM

Jan Theodore Galkowski:
Your comments on resistance pertain to the micro, individual level. Such may protect the individual, but a rising tide lifts all boats, and the overall situation will certainly continue to deteriorate whilst we few that are technically adept may be temporarily protected.

My question was aimed more at the macro situation. What can / is to be done when we are faced with political parties that are two sides of the same coin, both intent on doing the same harm (though perhaps at differing rates), given that Presidential Executive Orders over the last 25 years effectively preclude the sort of opposition to them that would be required to solve the problem, as per:

” We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. –That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, –That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security. “

alerter • March 11, 2006 12:14 AM

Garbage-out information methodologies and operations have always invited abuse by those who wield state power and/or social authority.

“Law enforcement” at every level of government, in every country, has a long and ignoble track record of selective enforcement of laws that are already on the books. From selective enforcement come selective prosecutions, and some successful prosecutions eventually prove to be wrong. None of this is apocryphal.

Now some people want to prime the “anti-terror” pump with massive data mining.

The high rates of false-positives from this data mining will provide excellent cover for the selective pursuit of a manageable subset of the mountain of investigatory leads. This huge ocean of “leads” will bring new meaning to the phrase, “the usual suspects.”

Any investigative overload will simply be prioritized according to a near limitless set of human biases that will reside outside of the data mining system itself.

(In the major metropolitan city in which I live, auto thefts are very rarely investigated. Yes, police reports are taken. Sure, some vehicles do eventually turn up. But no officer investigates an auto theft, per se, unless there are other motivating factors, which are rare. That’s what happens when there are “too many” auto thefts and “not enough” officers on the auto theft detail.)

Increased scrutiny of select individuals/groups, under the cover of high false-positive generating anti-terror data mining leads, may turn up new circumstantial information that might take a confirmed anti-terror false-positive investigation into the realm of a potentially false-positive (non-terror) criminal investigation. There is no wall of separation between anti-terror and anti-crime (although those who are accused of crimes may enjoy a few more legal protections than those accused of terrorism).

“Honest and innocent” persons might eventually be vindicated, but some of them will be pushed and pulled through wringers of investigatory scrutiny that will not make any one of us one bit safer from terrorism or crime.

Data mining for terrorists and terror plots is not much better than relying on clairvoyance to read people’s minds. Going either route takes us down a deep and dark hole…

geoff • March 12, 2006 4:57 AM

For those who doubt Bruce’s analysis there is one question to be answered. If the threat level is high and undirected data-mining works, where are the arrests and convictions?

Discouraging a potential terrorist with a whispered “We know what you are planning” doesn’t eliminate the problem; you just make them more careful. Locking terrorists up after a trial is the only way to reduce the threat.

Roger • March 13, 2006 2:28 AM

@geoff:

For those who doubt Bruce’s analysis there is one question to be answered.
If the threat level is high

Who says it’s high? Not even the DHS, at present.

and undirected data-mining works,

Who says it’s undirected?

where are the arrests and convictions?

They were listed in the NYT article he cited.

Roger • March 13, 2006 2:33 AM

@ Pat Cahalan:

To cut a long story short, by day 6 there is only a 1 in a million chance that
any suspects on the list are false positives, while there is a 94% chance
that all the true positives are still detected

If I’m reading you correctly, assuming a population of 250 million innocent people and 20 terrorists, your math here would indicate that you’re going to send a FBI/SWAT team to pick up 250 innocent people (1 out of every million) and 18.8 terrorists.

Sorry, you were not reading me correctly. The calculation does not indicate that one out of a million people will be falsely suspected, it indicates that there is only 1 chance in a million that even ONE person will be falsely suspected. In any case, the falsely suspected persons will not be automatically arrested, they will be added to a list of possible suspects, along with some annotation about how strong or weak that suspicion may be. An investigator will then take additional investigative steps (or not) depending on the strength of the evidence and the resources available.
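The gap between the two readings of "1 in a million" can be made concrete with a small sketch. It assumes the 250 million population figure used earlier in the thread and, as a simplification that is not in the original calculation, independent errors per person. The function names are illustrative.

```python
import math

POPULATION = 250_000_000  # innocent people, figure used earlier in the thread

def prob_any_false_positive(per_person_rate, population):
    """P(at least one innocent is flagged), assuming independent errors."""
    # 1 - (1 - p)**n, computed in a numerically stable way for tiny p
    return -math.expm1(population * math.log1p(-per_person_rate))

def per_person_rate_for(prob_any, population):
    """Per-person error rate that gives the stated P(at least one false positive)."""
    return -math.expm1(math.log1p(-prob_any) / population)

# Reading 1: 1e-6 is the per-person false-positive rate.
expected_fp = POPULATION * 1e-6                      # roughly 250 people
p_any = prob_any_false_positive(1e-6, POPULATION)    # practically certain

# Reading 2: 1e-6 is the chance that anyone at all is wrongly listed.
rate = per_person_rate_for(1e-6, POPULATION)         # roughly 4e-15 per person

print(f"reading 1: {expected_fp:.0f} expected false positives, P(any) = {p_any:.6f}")
print(f"reading 2: per-person false-positive rate = {rate:.2e}")
```

The second reading implies a per-person error rate about nine orders of magnitude smaller than the first, which is why the two conclusions differ so sharply.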

If you’re off by an order of magnitude either way (easily done with guesstimates like this),

Absolutely; but I was simply using Bruce’s own figures, which are not even guesstimates so much as “for examples”. Bruce showed that the supposedly optimistic “for example” figures give lousy results, but only if you use them incorrectly. I was showing that the exact same figures, if used correctly, can give extraordinarily precise resolution. Resolution so fine, in fact, that in this case it WOULD be sufficient to justify immediately applying for a warrant, but that still would not happen; in order to protect the SIGINT resource, you would still have to perform a conventional investigation to develop enough unclassified material to justify a warrant.

A real test is probably unlikely to give such extreme precision, but one of the beauties of data mining is that it doesn’t just identify possible tests you hadn’t previously noted, it also gives quite good measures of its own accuracy and precision. That is, the output isn’t “terrorists use Casio watches”, but more like “if you plug the following thirty parameters into the following formula, it will identify persons as terrorist/non-terrorist with a false negative rate of 50% and a false positive rate of 10^-12” (actually, the output is usually even more complex).
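To see what error rates like these would mean in practice, here is a rough sketch. It borrows the population figures from earlier in the thread (250 million innocents, 20 terrorists), which are an assumption and not part of the example above, and compares the 10^-12 false-positive rate with a 10^-6 rate.

```python
def flagged_precision(innocents, terrorists, false_negative_rate, false_positive_rate):
    """Expected true/false positives and the fraction of flagged people who are real hits."""
    true_positives = terrorists * (1.0 - false_negative_rate)
    false_positives = innocents * false_positive_rate
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision

# Rates quoted in the example: 50% false negatives, 1e-12 false positives.
tp, fp, precision = flagged_precision(250_000_000, 20, 0.5, 1e-12)

# Same test with a one-in-a-million per-person false-positive rate.
_, fp_loose, precision_loose = flagged_precision(250_000_000, 20, 0.5, 1e-6)

print(f"FPR 1e-12: {tp:.0f} true hits, {fp:.5f} expected false hits, precision {precision:.6f}")
print(f"FPR 1e-6:  {tp:.0f} true hits, {fp_loose:.0f} expected false hits, precision {precision_loose:.3f}")
```

With so few real targets in the population, the usefulness of the list depends almost entirely on the false-positive rate, which is the base rate point made in the original article.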

Nobody talks about letting the government do these sorts of surveillance programs to stop armed robbery, in spite of the fact that more people have probably been killed in armed robberies than in all of the terrorist events in the U.S.

This one sentence contains quite a wealth of points that I feel need to be addressed. First and foremost, data mining is not a surveillance program. It is an analytical technique. To the extent that it might be used to detect terrorists or stop armed robbers, we are not talking about more surveillance, but more effective use of existing surveillance — and possibly even reducing surveillance by identifying some forms as not being useful for crime prevention. It still has civil liberties implications if there exist current sources of surveillance data that have historically been permitted because they were regarded as unusable, but have recently been rendered usable by more powerful analytical methods. But frankly, I don’t think that all that many things fall into that category; on the contrary, the usual tendency has been to exaggerate effectiveness by assuming that once data has been collected, the analysis will be almost perfect and will be able to extract every possible nuance of meaning from it. This is of course completely untrue (if you’ve ever worked in a large scale data processing environment, you know that probably around 90% of all data collected is never analysed at all!), but it makes for a reasonable, conservative assumption on which to base a safe level of oversight — at least for compact data sources that are easily stored. This is because even if they will not be analysed now, they may be eventually. (The assumption is not so reasonable for very bulky data sources, such as video footage, because the overwhelming majority of this data is not even retained except in the event of recording a crime.)

The second point is that your assumption that data mining techniques are not used in the case of crimes other than terrorism, such as armed robbery, is completely wrong. It is used, and quite a lot. In fact it is far more commonly used by conventional law enforcement than in counter-terrorism. The biggest law enforcement usage is probably against white collar crime and organised crime, but in some agencies it even finds usage in such lowly matters as helping to digest police patrol reports to best plan the next day’s foot and car patrols. The particular commercial toolsets currently being marketed under the name of data mining are relatively new, but broadly similar techniques, and similar but slightly more specialised tools, have been used by bureaux of police intelligence for a long time.

The third point is that the argument “threat X kills more people than threat Y, therefore we need to concentrate on threat X” is one of my pet peeves in security analysis. I personally call it “the bean counter fallacy”: the usage of techniques designed for inanimate or unreactive data sources against an intelligent and malicious opponent. You usually hear it in the form “we spent more money on the night watchman’s salary than we lost from burglary, so it is not economical to keep a night watchman.” The conclusion might or might not be correct, but it certainly doesn’t follow from the premise. To do that we need to better understand how the opponent is likely to react to some of our possible actions. Unfortunately the genuinely scientific way to approach this sort of question is often not practicable or ethical, so it often comes down to argument from opinion.

I’m not saying that these sorts of activities may result in halting an individual terrorist event. Sure, they might.

Not, “they might”, they have.

But, they might not, and any way you slice it you’re going to spend a lot of resources following up.
Is this a worthwhile expenditure, given the potential results and the effects on the false positives?

Data mining is not a method for issuing warrants. It is a method for answering questions of the type you just posed.

Terrorists actually fit profiles far better than thieves do.

Although your paragraph on this makes sense, in fact you’re talking about a profile that the terrorist fits. But your data mining project is based not on the profile of the terrorist, but analyzing transactions.

Even if terrorists do fit a good profile, each may or may not exhibit behavior that matches a common profile of behavior.

A good point; it is a practical certainty, I would think, that a model used to detect an Al Qaeda operation is unlikely to be as sensitive to, say, a right wing militia operation, if it is sensitive to the latter at all. So you will probably develop different models for each of your top two or three threats. Note also, by the way, that the counter-terrorism investigator (unlike the law enforcement officer) is usually more concerned with reliably detecting groups rather than individuals, because groups are usually far more destructive. But this gives the investigator a lot more leeway; if I have a profile which fails utterly to detect all the local, US born recruits in a cell, but does detect the foreign born agent who recruited and trained them, that is still a good lead which may unravel the whole organisation. And even if I can’t unravel the whole organisation, removing a few key individuals may be sufficient to disrupt its operations.

Roger • March 13, 2006 2:40 AM

@Jan Theodore Galkowski:

“Well, you know they caught more Al-Qaeda terrorists operating in the USA than were caught by everyone else put together.”

i do not know that. source?

The same NYT article I was citing, which Bruce linked to as the subject of his previous blogging on this issue. It doesn’t state it explicitly — their intention seems to have been to spin the matter in the opposite direction — so one has to apply a little reading comprehension.

actually, the Times did. they were informed of the source a year before it was disclosed.

Indeed it seems they did. Mea culpa, my error.

you yourself accused them of waiting until just before budget deliberations regarding the program started in Congress.

No, I didn’t, not the NYT; I believe the NYT had the highest of motives in publishing it once they had the story. I was questioning the motives of the anonymous sources who gave the NYT the story. I still do so. They illegally gave classified information to a newspaper, “spun” in a manner as derogatory as possible to a program of a rival agency. But when you carefully analyse the “scoop” they gave to the paper, they had nothing of substance, and there was no whistle to blow to justify such a serious act. The two key points of “whistle blowing” (that the program was of doubtful legality, and that it wasted resources by producing few successes) are both effectively kyboshed by their own answers to the NYT’s questions.

Questioning someone’s motives is of course generally not a valid technique in argument; it is called the fallacy of “argumentum ad hominem”, because the validity of an argument is independent of the motives of the person who presents it. However there is one occasion on which argumentum ad hominem is valid, and that — assessing the reliability of a sole source for information, rather than argument — is exactly what occurs in this case.

Roger • March 13, 2006 3:04 AM

@Dewey:

On Validity

1) Bruce has a very interesting theory about the value of data mining. My question is: is there a way to test this theory? Can we in some way measure the current reality of data mining to see if the numbers match Bruce’s model? Unfortunately the public (meaning me, at least) doesn’t have enough visibility into the actual data to make this decision. The decision making agendas of those who do is currently subject to debate.

To clarify, of course you mean “use of data mining to detect terrorist plots”. Data mining in general is a well established discipline, the validity of which is fairly easy to assess. Bruce’s argument is that while it clearly does work in other applications, including applications in which it is attempting to detect hostile agencies which are strongly motivated to conceal their activities, nevertheless there are special characteristics of terrorism which make data mining non-effective for this particular purpose.

Is there a way to test this theory? Not really. To determine the absolute success level of data mining in detecting terrorists or terrorist plots, you would first need to detect 100% of the terrorists by other means so you could compare results. This is obviously not a feasible experiment. However, you can certainly determine, or rather estimate, relative effectiveness by comparing successes from data mining programs to those from other resources. Doing so rather undermines Bruce’s arguments, because data mining based programs have been more successful than most alternatives, not less, and the ones that have supposedly offered results somewhere in the same ballpark of effectiveness are far more dubious ethically, not less.

2) Bruce talks a lot about better tradeoffs, well and good. However, a comparison is only valid to compare two things. It makes no sense to evaluate only data mining’s effectiveness; we have to have something to which we can compare it. Either we need a baseline number we’re willing to use (e.g. cost per death prevented = $1000) or we need an alternative proposal to which we can compare cost.

An interesting proposal, and one well worth doing — but exceedingly complicated. It would be worth a master’s thesis at least, I would think.

Yes, data mining doesn’t look particularly good, but what’s better? (That’s a question, not a challenge.)

I don’t think even that is true. Data mining does look good. In fact the real problem is that it looks to be so effective, it is frightening to some.

For the record my personal values set the price of civil liberties very high and intuitively I agree that we’re making a lot of bad tradeoffs. I’m trying to move my own thinking from intuition to somewhat rigorous logic.

For the record I also set a high price on civil liberty, but given that we mostly seem to agree that the authorities need to have some sort of investigative power, I find it absolutely bizarre that use of data mining should be seen as more intrusive than the alternatives, rather than much, much less intrusive. We are not talking about new surveillance programs, we are talking about making better use of data they already have, and probably even cutting back on some surveillance programs which prove not to be a good trade off. And we are not talking about squandering government money on pie in the sky schemes, we are talking about methods which inherently assess their own value and discard the useless parts, or at least allow citizens or their representatives to make better informed choices. And we aren’t talking about something radical or new, we are talking about refinements of techniques which have been used in government for decades now, and by now are used by hundreds of government departments and thousands of businesses — mostly for non-law enforcement purposes, but including dozens of law enforcement programs.

This isn’t to say that such programs are risk free and don’t require any sort of oversight. The most obvious risk is that a data source may have been collected historically on the understanding that it wasn’t practicable to abuse it, and now it can be abused. The most obvious example is credit card transactions, logs of which are now being blatantly abused by marketers. So some oversight is required. But more intelligent analysis, which is what data mining is fundamentally about, is inherently much, much less harmful than the alternatives.

Eric Likness • March 13, 2006 9:51 AM

Can Network Theory Thwart Terrorists?
By PATRICK RADDEN KEEFE
Published: March 12, 2006

Much of what Bruce wrote in his original article on Wired.com and cross-posted to his blog is borne out by this article on NYTimes.com. If you have a subscription, look for it. The most telling line of all is the statement that Congress is worried that the NSA is drowning in a sea of data. It doesn’t matter how many points they plot in the social network maps they develop of any individual; they still have not recognized the base rate fallacy, and worse yet, the idea of data mining has become fashionable in many sub-groups within the Department of Defense. Now it has diffused throughout the culture of all the intelligence fiefdoms. Will it ever be corralled? Will anyone ever learn from their mistakes?

Jan Theodore Galkowski • March 13, 2006 5:17 PM

@Roger

“‘if you plug the following thirty parameters
into the following formula, it will identify persons as terrorist/non-terrorist
with a false negative rate of 50% and a false positive rate of 10^-12′”

to the extent to which those parameters are supposed to be typical of the kinds of data mining being done, i do not for a second believe them. first of all, it is extraordinarily rare for probability of detection and probability of false alarm not to be within the same order of magnitude. second, the precision is limited by the number of candidates. surely it is meaningless to have a probability of anything which is less than 10^-8. third, i’d like to see a detailed calculation of how these are arrived at, using fictitious criteria if necessary but relevant and plausible ones. i bet it’s done wrong. calculations of statistical power in these cases are not that easy, especially since the distributions of the parameters are never going to be Gaussian. hopefully, too, they’re doing Bayesian decision theory, not Neyman-Pearson. fourth, even if an algorithm had a theoretical precision of that degree, their incoming data is not that precise. the accuracy depends upon the accuracy of given data as well as the algorithm.
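
One way to make the second objection concrete is the bound on an event rate when no events are observed. This is not a calculation from the thread, only a sketch of a standard result: if a test produces zero false alarms in n independent trials, the 95% upper bound on the false alarm rate solves (1 - p)^n = 0.05, which is roughly 3/n (the "rule of three"). The trial count of 300 million is an assumed figure and the function name is hypothetical.

```python
import math


def upper_bound_zero_events(n_trials, confidence=0.95):
    """Upper bound on an event rate after seeing 0 events in n_trials.

    Assumes independent trials. Solves (1 - p)**n_trials = 1 - confidence
    for p, using log1p/expm1 to stay accurate for very small p.
    """
    return -math.expm1(math.log1p(-confidence) / n_trials)


# Assumed: every one of 300 million people tested once, with no false alarm.
n = 300_000_000
bound = upper_bound_zero_events(n)
print(f"95% upper bound on false alarm rate: {bound:.3e}")
print(f"rule-of-three approximation:         {3 / n:.3e}")
```

Under these assumptions, even a perfect record over that many independent trials only supports a rate on the order of 10^-8, nowhere near 10^-12.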

look, the Bureau of Justice Statistics does not even collect accurate figures for the annual number of homicides in the United States. i’ve tracked that down, trying to resolve a discrepancy in reporting of homicides between the National Center for Health Statistics, where county pathologists report in, and the FBI’s CJS. i finally spoke with an assistant head of CJS who admitted that CJS depends upon voluntary reporting by sheriffs and there is no independent check on the accuracy of what they provide.

if indeed something as fundamental as counting number of murders goes awry and isn’t reliable, there’s all the more opportunity for reports provided from subjective sources to have spin, suspicion, and political motivations.

finally, relating to the NYT outing of the NSA domestic intelligence business, there are a number of aspects of your argument and the administration’s justification which just don’t make sense. from the side of the secrecy business, including sources overseas, a lot of the unhappiness regarding the program comes not only from Congress but from intelligence professionals themselves who have been trained to apply their skills to foreign sources. furthermore, i don’t for a second believe the “one side a foreign connection” explanation offered by BushCo because that kind of NSA surveillance was done routinely during the Cold War without the need for special court approvals. (Soviet agents obviously wouldn’t phone home to Moscow. they’d use calls through U.S. allies like the UK. those were always monitored.) given BushCo’s clamping down hard on and slapping the heads of people who go against policy in relatively innocent realms like climate change science (or for that matter shooting buddies on hunting trips), things that are said and leak out in more serious areas are very likely to be fragmentary and distorted. i just don’t believe them.

they are using secrecy which is supposed to protect information which if revealed “would do grave damage to the United States” to protect their fragile asses.

IMO, diversion of the NSA in this manner and failure to follow the law is an impeachable offense. and this is all the more reason why the secrecy laws ought to be amended to read that the practice of classifying information without detailed justification ought be susceptible to prosecution as a federal felony.

Jan Theodore Galkowski • March 14, 2006 6:21 PM

a consolidated terrorist database apparently has 200000 people in it, at least according to The Globe, citing Donna Bucella of the U.S. Terrorist Screening Center, per

http://www.boston.com/news/nation/washington/articles/2006/03/14/200000_people_in_us_terror_database/

so, what are the odds that on any given day someone in a list of 200000 will do something that will trigger suspicion? whether or not more than one suspicious act triggers on a given day, the likelihood of such random triggers is proportional to the number of suspicious acts they are screened for, as well as the number of people on the list.
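
The question can be sketched numerically under a strong simplifying assumption: that every (person, screened act) check fires at random, independently, with the same small daily probability. The per-check probability and the number of screened acts below are invented for illustration, and the function name is hypothetical; only the list size of 200000 comes from the report above.

```python
import math


def daily_trigger_stats(list_size, acts_screened, p_per_check):
    """Return (expected random triggers per day, P(at least one trigger)).

    Assumes every (person, act) check is independent with the same
    false-trigger probability p_per_check.
    """
    checks = list_size * acts_screened
    expected = checks * p_per_check
    # 1 - (1 - p)**checks, computed stably for tiny p.
    p_any = -math.expm1(checks * math.log1p(-p_per_check))
    return expected, p_any


# 200000 is the reported list size; the other two numbers are assumptions.
expected, p_any = daily_trigger_stats(
    list_size=200_000, acts_screened=10, p_per_check=1e-6
)
print(f"expected random triggers per day: {expected:.2f}")
print(f"probability of at least one:      {p_any:.3f}")
```

In this model the expected number of random triggers is exactly proportional to both the list size and the number of acts screened for. The chance of at least one trigger is only approximately proportional while it is small, and it saturates toward 1 as either factor grows.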

hey, you can be certain you’ve got the terrorists on your list if you just put everyone in the world on it. this is akin to forecasting “rain” every day (or “sunny”) no matter what the evidence says. you’re bound to be right at least the percentage of time it rains (or doesn’t). this could be a real win in Seattle (or Tucson).

Pat Cahalan • March 14, 2006 7:28 PM

@ Roger

First and foremost, data mining is not a surveillance program. It is an
analytical technique.

Absolutely, and to be proper we should be talking about data collection as the primary problem (rather, the issue to which I take exception), not the analysis techniques. Of course, you have to have the data to mine and you have to get the data from some source. When I use the term “surveillance”, I’m referring to the government going out and sucking up as much data as they can get their hands on, from Google/Yahoo’s search requests, to Amazon’s buy lists, etc. under the assumption that the more data they have, the better. As you yourself point out, a vast majority of this data is going to be unanalyzed for/by any official purpose.

But now the data exists in a central location, it’s been compiled and collated and organized and normalized. Data may be cheap to store, but there are definite costs to the compilation, collation, organization, and normalization, especially when the sources of the data have their own databases with their own business rules, etc. Plus you have to have a facility to analyze this data, and security around the facility. All of this costs money. In addition, the data itself, in this highly normalized form, represents a potential for abuse by unauthorized parties.

The second point is that your assumption that data mining
techniques are not used in the case of crimes other than
terrorism, such as armed robbery, is completely wrong.

Actually, this wasn’t what I was trying to say (although I can see the implication)… I’m sure the Treasury and the SEC in particular have data mining programs. However, they’re looking at the data that is directly correlated to the crimes they investigate. What I’m saying is that Total Information Awareness (the gathering of the data) wouldn’t be regarded as a suitable tool for tracking down lesser crimes, in spite of the fact that those lesser crimes probably have a much greater impact than terrorism.

The third point is that the argument “threat X kills more people than threat
Y, therefore we need to concentrate on threat X” is one of my pet peeves
in security analysis. I personally call it “the bean counter fallacy”

I agree this is dangerous ground. A better way to approach any security situation is, “Threat X has this A cost, this probability of occurrence H, and these possible countermeasures with N, M, O cost. Threat Y has this B cost, with this probability of occurrence I, with these possible countermeasures with P, Q, S cost. We want to protect ourselves from X and Y as best as we possibly can with reasonable countermeasure cost.”

A and B are hard to measure, because you’re not just talking about people being killed, but economic impact, citizen safety, confidence in societal stability, etc.
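
The framing above can be sketched as an expected-loss comparison. Everything numeric here is invented, and one ingredient is an added assumption that the comment does not state: each countermeasure is given a `risk_reduction`, the fraction by which it lowers the probability of the threat occurring. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Countermeasure:
    name: str
    cost: float            # yearly cost of the countermeasure
    risk_reduction: float  # ASSUMED fraction of the probability removed


def net_benefit(threat_cost, threat_probability, cm):
    """Expected loss avoided by the countermeasure, minus what it costs."""
    loss_before = threat_cost * threat_probability
    loss_after = threat_cost * threat_probability * (1.0 - cm.risk_reduction)
    return (loss_before - loss_after) - cm.cost


# Invented figures: threat X costs 1,000,000 with probability 0.01 per year.
watchman = Countermeasure("night watchman", cost=2_000.0, risk_reduction=0.5)
benefit = net_benefit(1_000_000.0, 0.01, watchman)
print(f"{watchman.name}: net benefit {benefit:.2f}")
```

A static calculation like this treats the probability as fixed. It says nothing about how an intelligent opponent might react to the countermeasure, which is the objection raised against pure bean counting earlier in the thread.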

You usually hear it in the form “we spent more money on the night
watchman’s salary than we lost from burglary, so it is not economical to keep
a night watchman.” The conclusion might or might not be correct, but it
certainly doesn’t follow from the premise.

Right. A correct analysis would be something like, “We spent more money on the night watchman than we could possibly lose from some incident occurring that would have been circumvented by the presence of a night watchman, so it is not economical to keep a night watchman”. In the first case, you are only comparing the relative cost of an event that has occurred; in the second you are comparing the relative costs of events that can occur.

I’m not saying that these sorts of activities may result in halting an
individual terrorist event. Sure, they might.

Not, “they might”, they have.

I am not convinced that this is the case, in a practical sense, but it is more complicated than that. First, who did they catch? Second, would they have caught them anyway? Third, is there some other investigative technique that we’re not using that would have enabled us to catch them without the privacy implications of TIA? Finally, if they didn’t catch them, what would have been the consequences?

Even if terrorists do fit a good profile, each may or may not exhibit
behavior that matches a common profile of behavior.

A good point; it is a practical certainty, I would think, that a model
used to detect an Al Qaeda operation is unlikely to be as sensitive to, say,
a right wing militia operation — if it is sensitive to the latter at all. So you
will probably develop different models for each of your top two or three
threats.

Right. And each one of these models requires you to accurately surmise some sort of behavior pattern apropos to the individual terrorist group. Positive results mean you have to have well tuned models and you have to guess precisely which threats are the top priority. This seems unlikely to produce good results in the long term.

Jan Theodore Galkowski • March 14, 2006 11:18 PM

i need to correct or qualify something i wrote above. i wrote “and this is all the more reason why the secrecy laws ought to be amended to read that the practice of classifying information without detailed justification ought be susceptible to prosecution as a federal felony”. in fact, i was mistaken. there is a component of this in the secrecy law now.

the secrecy statutes explicitly prohibit using markings of classification to conceal embarrassing, illegal or inefficient agency actions. it is suspected this is part of the motivation for increasing a category of “sensitive but unclassified information”, a category which is regulated by an incoherent mishmash of administrative policy and which does not provide for such safeguards.

for more details on this, see

http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB183/SBU%20Report%20final.pdf

Jan Theodore Galkowski • March 14, 2006 11:42 PM

there’s a nice vintage 2003 overview of what’s wanted out of these data mining systems available at

http://www.fas.org/irp/threat/agent.pdf

it even has screenshots.

alas, it’s largely in viewfoil format but, then, that’s standard for that crew.

i particularly draw your attention to the contents of Appendix B. dang, i guess i’ll have to give up that square-dancin’!

joe buck stops here • March 15, 2006 11:01 AM

I agree with Joe Buck. TIA programs are not about “terrorists” any more than the Iraq war is about Al Qaeda.

The NSA DOES KNOW the elementary statistics; they are not stupid. What they are after is a nice list of common hits, the people opposed to the current regime. Leadership of this common group is still common and easily profiled. And if several false hits are stuffed into the black hole of a Gitmo when the roundup of 100,000 “traitors” comes, who cares… See you in the Gulag!

cjh • May 3, 2006 9:37 AM

Your article is truly timely and covers a subject matter needed for today’s environment.

After 9/11 it has become evident our methods of info gathering needed a drastic change. As with any new venture, it takes time to accumulate info, decipher meaning, determine the hoax from the truth and develop a method of tracking any and all data.

We must realize that terrorists and their plans of attack are not as clear cut as a made-for-television movie or a multi-million dollar Hollywood production.
It takes time to cultivate the info, research the possible scenarios, and discuss the probabilities with intelligent people who have experience in such matters. We cannot expect info to pop up out of the ground or be presented in a box with pretty paper. It is a process of effort, sweat, tears and yes, it is life-threatening.
The expense of data mining may seem high to most, but I would rather have my tax money spent in this manner than wasted on assisting people who are disrespectful of the US and are holding this nation hostage for welfare. We have too much fraud, waste and abuse from individuals who invade this country. I think we should concentrate on preventing the abuse and horror that has beset and will again beset this Nation.

Unfortunately, a majority of people will and have dismissed this gathering of vital information as a negative disregard of privacy… they fail to realize we are living in a time in history that cannot go softly into the night. We must be fully aware of every possible breach in society from whatever source.

Looking back in our History (hindsight is 20/20), we have been had many times because the SYSTEM did not allow a full and open review of a potential situation. The results were compromising and caused the US great harm. Are we willing to take such chances again in the future and allow a few narrow-minded special interest groups to prevent us from taking a stand against people who do us harm just because CIVIL RIGHTS will be violated? TERRORISTS and INVADERS DO NOT HAVE CIVIL RIGHTS in the United States of America.

I for one encourage any form of data retrieval and how it is used to eliminate any attempt to attack the USA and its citizens.

Rick • July 15, 2006 3:39 PM

I don't disagree with the statistics argument, I just have a different viewpoint. First of all though, we must get to the definition of the term "monitoring". If you think that the term means examining the content and I think that it just means noting the termini and duration of communications sessions, then we need go no further; my belief from reading is that the telecommunications switching (and possibly the IP counterparts) records provide only the latter and my analogy deals with that data set.
Consider that the ancient Greeks knew the planets were different than the stars (hence the name). How did they know? By observing visible light over a long period (many data points) they saw changes in relationships for these few entities compared to the many others. It did nothing to help them understand the nature of either, just that they were different. Comets were observed and recognized as being different than either of these because of the briefer time they appeared, their different motion through the sky and their changing intensity. It was only as technology improved that effort to examine specific objects of interest allowed learning about these bodies. Of course, technology has also expanded in other ways (radioastronomy) that allowed us to "see" things in the sky that the ancients would never have dreamed were there. It was only with more directed observation and better tools that we learned of the myriad of star classes, binary systems, and what is truly unique about the nearer stars.
By analysis of the telecommunications switching records the NSA has developed a way to observe the telecommunications world in the way that the ancients used visible light to see what is "permanent" and what is transient in the night sky. It does not mean NSA knows any more about the classes of entities that they use to describe what they are seeing (what do they call them: clusters, families, hubs-and-spokes?). The citizens' worry is what additional methods the government could bring to bear on the entities they choose to study, and when each crosses the Constitutional line of "unreasonable" search; the adversaries' worry is that they have invited any such scrutiny, regardless of technology.
Do the adversaries expend effort to remain obscure? Does the government apply this new-found knowledge for follow-up within legal boundaries? These are questions whose answers can help us determine the effectiveness of the program; the conundrum is that we ought not have the discussion in public because it helps the adversary with his planning, but we are not comfortable that platitudes from the government mean the right path is being followed. The cry is "I want to know! Yes, there are reasons not to tell some people, but I am not one of those people and I can keep a secret." But it is a Faustian bargain to have representative government: you don't have to hear all of the details, but you don't get to hear all of the details.
