The Problems with Data Mining

Great op-ed in The New York Times on why the NSA’s data mining efforts won’t work, by Jonathan Farley, math professor at Harvard.

The simplest reason is that we’re all connected. Not in the Haight-Ashbury/Timothy Leary/late-period Beatles kind of way, but in the sense of the Kevin Bacon game. The sociologist Stanley Milgram made this clear in the 1960’s when he took pairs of people unknown to each other, separated by a continent, and asked one of the pair to send a package to the other—but only by passing the package to a person he knew, who could then send the package only to someone he knew, and so on. On average, it took only six mailings—the famous six degrees of separation—for the package to reach its intended destination.

Looked at this way, President Bush is only a few steps away from Osama bin Laden (in the 1970’s he ran a company partly financed by the American representative for one of the Qaeda leader’s brothers). And terrorist hermits like the Unabomber are connected to only a very few people. So much for finding the guilty by association.

A second problem with the spy agency’s apparent methodology lies in the way terrorist groups operate and what scientists call the “strength of weak ties.” As the military scientist Robert Spulak has described it to me, you might not see your college roommate for 10 years, but if he were to call you up and ask to stay in your apartment, you’d let him. This is the principle under which sleeper cells operate: there is no communication for years. Thus for the most dangerous threats, the links between nodes that the agency is looking for simply might not exist.

(This, by him, is also worth reading.)

Tags: data mining, NSA, terrorism

Posted on May 24, 2006 at 7:44 AM • 57 Comments

Comments

Moshe Yudkowsky • May 24, 2006 8:06 AM

I can make this exact same argument to invalidate the entire concept of traffic analysis, yet traffic analysis is one of the best methods to gather intelligence about enemy plans and intentions.

So I’m taking this argument with more than a small grain of salt. As we well know, people with a weak encryption algorithm will hype the tremendous number of keys provided by the algorithm, but completely ignore the weakness in the encryption. In the same way, this essay hypes the tremendous number of connections each person has, but ignores the way a data-mining operation can sort through that data to find the significant, active, and important connections.

Andre LePlume • May 24, 2006 8:38 AM

I’m with Moshe here. The strength of weak ties is indeed important (look at how “Brownie” of FEMA got his former job!), but this is as much an argument for digging further than it is for not digging.

The argument against massive internal spying operations shouldn’t be a consequentialist argument, IMNSHO. The activity violates the codified fundamental tenets upon which the US was founded, and is wrong regardless of its effectiveness.

Fred X. Quimby • May 24, 2006 8:38 AM

…Many years had passed since I last read Schneier’s security blog, but once a crypto junkie, always a crypto junkie. I remember quite well, that fateful summer of 2028, when I got the call from Bruce – the “DRMies” had taken control of the United States east of the Mississippi and it was time for action. While many feared that our “random” visits to local flea markets would be uncovered by data mining, Bruce assured us that, even so, the authorities would be caught completely by surprise by our attacks using seven hundred cubic miles of silly-string and 2,401 rubber chickens…

arl • May 24, 2006 8:50 AM

If this argument is correct why do we think that the NSA has the math skills to crack very difficult cryptography but cannot handle other pattern matching problems?

ruidh • May 24, 2006 8:58 AM

The whole data mining exercise just makes no sense. Let’s say I call the number in a classified ad to ask about buying a ’65 Mustang. And let’s assume the seller has a cousin who may or may not be a terrorist, terrorist sympathizer or member of a mosque and who calls him once every other year or so to invite him to a family party. Am I going to be rounded up in a terrorist sweep for supporting terrorist activities?

All of us have fleeting contacts with people who are or know others who might be suspects. The haystack is particularly huge and gnarly and the patterns they are looking for are indistinguishable from chance encounters which happen all the time and of which we are largely unaware.

If it’s a math problem, it’s a math problem which is unsolvable because it is insufficiently specified.

Simon@AutoUpdate+ • May 24, 2006 9:10 AM

The concept of six degrees of separation as an invalidation of NSA data mining definitely sounds good as a theory, but I think it is not quite complete. The missing component is the frequency of interaction. If my indirect linkage to a terrorist occurs only once every five years, I can likely be scrapped as a possible collaborator. Whereas if my indirect linkage to the terrorist occurs weekly, there is a much higher cause for suspicion.

AG • May 24, 2006 9:24 AM

I want to be a Harvard Prof…
Using 8th grade party games to sum up the world wide terrorist issue is a little unresponsible.
It’s like the “Kevin Bacon game” or “send the package only to someone he knew, and so on. ”
Except in this game the packages they send back to you have TNT, anthrax, and other nasties in them.

Justin W • May 24, 2006 9:30 AM

@Moshe, I can see two counterpoints to the critiques you (reasonably) brought up. The first is that SIGINT traffic analysis has analysts looking for fairly specific event profiles. We don’t know if the NSA even has an event profile they are looking for. If they are using this form of data mining in an attempt to construct event profiles, they are WAY off due to the base rate fallacy. Even if they do have an event profile they are looking for, the Farley article demonstrates that the base rate fallacy still applies – there’s too much irrelevant information in phone records to distinguish false positives from real ones.

AG • May 24, 2006 9:33 AM

I mean irresponsible… man I can’t post till I’ve had my starbucks…

What I’m trying to say here is the Terrorist situation is like a game of spin the bottle. Sure sometimes you have to kiss the fat girl(get attacked), but you just might get to kiss the cute girl(catch a bad guy) too.

Carlo Graziani • May 24, 2006 9:37 AM

The NSA’s problem with data mining is the same as the problem with using polygraphs for security screening: the false-positive rate.

Every system making this kind of “significant/not significant” decision has sensitivity thresholds that can be set for its trigger criteria. These can be set to a continuous range, from hair-trigger sensitivity to near blindness.

As you change the sensitivity, you alter the error rate. There are two types of errors: false positives (non-significant patterns tagged as significant), and false negatives (significant patterns that don’t trigger). The more sensitive the trigger threshold, the lower the false-negative rate, and the higher the false-positive rate. And vice-versa.

The name of the game is to design a system and select the threshold so as to maximize the sensitivity while bounding the false-positive rate to acceptable levels. This is crucial even if one is not concerned with the civil rights implications of false positives, because the FPs constitute a noise term that can mask the true positive signal — if the analysts reviewing the results spend most of their time weeding through “terrorist” networks of soccer moms and college alumni phone trees, they are more likely to miss the real thing.

This is exactly what the National Academy of Sciences told the Securocrats in 2002, with respect to polygraphs. For the first time, they actually reviewed the data on polygraph effectiveness, and found that all existing systems had characteristic response curves (True Positive rate as a function of False Positive rate) that produced enormous (10-30%) FP rates at any sensitivity likely to screen out real disloyalty.

This didn’t make a dent in the securocrats’ faith in polygraphs, which appears to be completely unbound from any genuine scientific assessment of their effectiveness as screening tools. They use them to this day.

There has been no report of any comparable scientific analysis of the NSA’s “Bobbing for Terrorists” data mining effort, but I seriously doubt that such an analysis has been performed. There is no bureaucratic incentive to challenge the effectiveness of the program, and strong budgetary reasons to talk up its successes.

However, given the vastly larger number of variables and the smaller number of constraints and controls and training sets available for calibrating terrorist detection (as opposed to polygraphing) I would be shocked if mere math whizzery had succeeded in controlling the false positive noise term.
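The threshold trade-off described above can be made concrete with a little arithmetic. What follows is only a sketch: the population size, the number of real targets, and both error rates are hypothetical figures picked for illustration, not estimates of any actual program.

```python
def screening_outcome(population, true_targets, true_positive_rate, false_positive_rate):
    """Return (flagged_targets, flagged_innocents, precision) for one screening pass."""
    innocents = population - true_targets
    flagged_targets = true_targets * true_positive_rate
    flagged_innocents = innocents * false_positive_rate
    flagged = flagged_targets + flagged_innocents
    # Precision: the share of all flags that point at a real target.
    precision = flagged_targets / flagged if flagged else 0.0
    return flagged_targets, flagged_innocents, precision


# Hypothetical inputs (assumptions, not data): a very accurate detector
# applied to a very large population containing very few real targets.
hits, false_alarms, precision = screening_outcome(
    population=300_000_000,
    true_targets=1_000,
    true_positive_rate=0.99,
    false_positive_rate=0.001,
)
print(f"real targets flagged:         {hits:,.0f}")
print(f"innocent people flagged:      {false_alarms:,.0f}")
print(f"share of flags that are real: {precision:.2%}")
```

Under these assumed numbers, a detector that is wrong about innocent people only one time in a thousand still produces roughly 300,000 false alarms against 990 real hits, so only about one flag in three hundred is genuine. Tightening the threshold shrinks the false alarms but also the real hits, which is the trade-off the comment describes.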

dhasenan • May 24, 2006 9:38 AM

The issue isn’t just fleeting contact between terrorists and non-terrorists. The article spoke of a specific issue–cells in which contact between members is minimized–and pointed out that that easily defeats one class of NSA datamining; and then went on to point out that that class of datamining often results in large numbers of false positives.

So he showed that the method sucks both coming and going.

Then he suggested, why don’t we start grouping people by habits and characteristics instead of something so opaque as who they call? This seems like it would be more accurate. After all, a person is a terrorist because of what they do, not who they call; so naively, looking at what people are doing should predict better whether they are terrorists.

To be more precise, though, Farley suggested grouping people by habits, not determining whether someone is a terrorist based on their habits. (Do you blow up buildings often? Yes? Okay, we’ll put you down as a likely terrorist, thanks for your cooperation.)

This just tells you what a person’s interests are, and only in the span of observation.

Would this have helped for, say, WTC? We had a few recent immigrants from Islamic countries learning to fly at the same time. It might have alerted us enough to monitor them, but the kind of computing power and observational ability required to assess the situation is unreasonable.

That’s another reason to restrict such observation to non-citizens, though–a smaller group to watch.

Archangel • May 24, 2006 9:43 AM

AG, you’re being a troll. What you call ‘party games’ are legitimate social sciences research results, which are now trite because someone turned them into pablum. Milgram was one of the great social researchers of the time; his name pops up everywhere. (Particularly in some ethics coursework — the cutting edge of new fields can cut you, sometimes.) But we didn’t know this stuff then. And most people don’t know it now, either. You just say “Six degrees” and their eyes glaze over and they think they get it.

McGavin • May 24, 2006 10:10 AM

“If this argument is correct why do we think that the NSA has the math skills to crack very difficult cryptography but cannot handle other pattern matching problems?”

Mathematicians don’t make policy; generals and politicians do.
The cryptanalysts there are good, but they also have the benefit of years of good funding and dedicated secret research. They aren’t necessarily more talented in other math than what you would find in academia.

michael Gracie • May 24, 2006 10:14 AM

Then finding weak ties could only be had through data mining, eh? Instead of studying those data points wrapped tightly around the best fit line, look at the outliers instead.

David • May 24, 2006 10:23 AM

I agree that data mining can be effective.

Maybe I’m naive, but what I’m not sure of is whether or not mining phone calls and internet traffic is effective.

Without intel to back up the mining (to query the database properly), it seems to me it’s useless. Crime statistics don’t solve crime; they might tell you where to put resources, but they don’t do the legwork. Also, if you have the intel, you don’t need to data mine, you just subpoena the records using a secret warrant (which if I’m not mistaken is still legal 🙂).

I’m still waiting for someone to show me a single case where the Patriot Act actually caught a terrorist.

So, uh, what is this illegal data mining to be used for again?

Ale • May 24, 2006 10:29 AM

As I have posted earlier on this blog, data mining technology has been transformed by the current US government and media into a myth: some sort of omniscient oracle to which we must sacrifice our privacy.

If we expect to approach the privacy vs. data mining debate rationally, we must have some idea of the actual tradeoffs that we are looking at. In a sense, what J. Farley is doing is precisely that.

This does not mean that his results are definitive, however. Some of the results on which J. Farley based his commentary were widely publicized some time ago by A. L. Barabasi and his team at Notre Dame, and, from a scientific point of view, some of their claims (mostly those regarding the genome/proteome networks and the Internet) have been successfully challenged. This however does not mean that all their results should be disregarded, only that the “Six Degrees” phenomenon should not be hastily associated with other structural or topological characteristics of the network.

What he later talks about (grouping by habits) is a long-standing discipline in AI and data analysis: Clustering Algorithms. The crux of almost every clustering problem is the correct definition of its metric: To be able to cluster “similar” elements together, a measure of how similar two elements are is needed. And in the case of complex datasets with rich metadata (such as the databases that the NSA is thought to possess) the definition of this metric is a huge problem on its own.

The problem of finding patterns on very large networks has been approached effectively (sometimes extremely profitably – think Google) but is by no means a trivial endeavour. It is refreshing to hear someone countering the hype and placing Data Mining where it belongs, in the realm of technology, rather than as some disembodied mathematical sorcery that will give us the answer to every question.
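The point about the metric being the crux of clustering can be shown on a toy scale. The sketch below is an assumption-laden illustration, not anyone's actual method: the three records, their two features, and the scale factors are all invented. It asks which record is "most similar" to record A under two different distance measures.

```python
import math

# Hypothetical records: (phone calls per week, dollars wired abroad per year).
records = {
    "A": (2, 10_000),
    "B": (40, 10_200),
    "C": (3, 9_000),
}


def euclidean(p, q):
    """Plain Euclidean distance on the raw feature values."""
    return math.dist(p, q)


def scaled(p, q, scales=(10.0, 5_000.0)):
    """Euclidean distance after dividing each feature by an assumed typical spread."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, scales)))


def nearest(name, metric):
    """Return the name of the record closest to `name` under `metric`."""
    others = (n for n in records if n != name)
    return min(others, key=lambda n: metric(records[name], records[n]))


print(nearest("A", euclidean))  # B: the dollar column dominates the raw distance
print(nearest("A", scaled))     # C: once rescaled, calling habits matter too
```

The same three records group differently depending on a choice the analyst makes before any algorithm runs. With dozens of heterogeneous fields instead of two, deciding how to weigh them is the hard part the comment refers to.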

Tank • May 24, 2006 10:37 AM

@ ruidh at May 24, 2006 08:58 AM
“The whole data mining exercise just makes no sense. Let’s say I call the number … let’s assume the seller has a cousin who may or may not be a terrorist … Am I going to be rounded up in a terrorist sweep for supporting terrorist activities?”

If you have a 2nd contact with a 2nd associate of that terrorist, why not put you under surveillance? Press reports have even gone as far as telling you this is the type of thing that is being done.

“If it’s a math problem, it’s a math problem which is unsolvable because it is insufficiently specified.”

And ? The idea isn’t for you to figure it out. There are smarter people with access to this data and the details of known terrorist contacts working on that.

Mike Marsh • May 24, 2006 10:48 AM

Milgram’s “six-degrees” result is really widely over-stated. Most of the packages he sent were never delivered. In the most famous experiment, what he actually found was that of the number that were delivered, there was an average separation of six degrees for white middle-class Americans. Attempts to deliver packages across racial or socioeconomic divisions were considerably less successful, as I recall.

Anonymous Troll • May 24, 2006 10:52 AM

Don’t forget that in this game, tis better for a thousand innocent people to be sent to Guantanamo than for one terrorist to escape!

Moshe Yudkowsky • May 24, 2006 11:14 AM

Justin, here’s a response to your comments. The article isn’t about phone records; it’s about data mining, which is very different and includes tremendous amounts of information.

The commercial world has a lot of experience with data-mining. A friend who designed data (not databases, but the actual data) for a large credit-card firm explained to me one day how their fraud detection software was able to discern that a purchase on a US credit card, made in a jewelry store in Hong Kong, was fraudulent. And they determined this in real time, fast enough to catch the perpetrator in the act.

The article throws up hypothetical roadblocks which contradict everyday experience of commercial data-mining operations.

And it also propagates a very, very old fallacy: Because data mining is not perfect, it should not be attempted at all. Without actual research, without trials to determine actual error rates, without tuning, and without a lot of sweat, it’s utterly irresponsible to dismiss data mining as a possible tool.

Whether it’s a tool we should use — whether it’s good for our society — is a completely separate discussion.

Tank • May 24, 2006 11:28 AM

@ Jeff at May 24, 2006 09:23 AM
“Through all of this, one thing I haven’t heard is any comments about how this data mining exercise integrates with NSA’s overall activities… taken on its own, the validity of the results can be debated (as seen above).”

That’s because it is easier to criticise and dismiss this way. The previous article on the same topic was the same exercise.

In that case the idea was that data mining isn’t efficient because it is too hard to distinguish normal people’s 7-11 purchases from those of terrorists.

Because after 19 hijackers entered the country using the same routes, with the same travel histories, from the same countries, with the same religious backgrounds and did so because this was what was standard operating procedure for the same terrorist group everyone is looking for today, the most well-funded intelligence and law enforcement agencies in the world given one standout priority would really be looking at 7-11 purchases instead of known flags.

Literally, data mining doesn’t work if you pretend nobody knows what to mine for.

Here the idea is that calling patterns cannot identify terrorists because a phone call isn’t a telling sign of evil and everybody knowing everybody muddies the waters of what significant contacts are.

The idea is that when the Pakistani security services capture the guy who recruited the Madrid bombers and turn him and his cell phone over to the US intelligence services who find out he has been in contact with US residents abroad, there’s still really no way to know what phone numbers in the US might be of interest in terms of mapping what human networks those US numbers have.

You know, in 2006, when for the love of god it will take you until 2008 to come up with a better resource for those investigating for determining and mapping human networks than who calls who on their cell phone.

No, forget blatantly obvious uses like that which have even been flat out stated as the whole point for quite a while now, the fact that people know people means that when it comes to capturing known al Qaeda operatives there could be any reason they contacted people in the US and those people in turn contacted each other, chemical suppliers, etc.

The law of Kevin Bacon dictates that the world’s best funded and advanced intelligence agency can’t think up the kind of shit anyone can if they consider it for 3 minutes.
They’re just grappling with the same straw men Bruce here is. Yeah.

DM • May 24, 2006 11:32 AM

Whilst it is questionable if data mining is useful for identifying previously unknown points of interest in the graph, we do know that data mining is particularly good at looking at the links to known points of interest in the graph, for example, journalists.

Clive Robinson • May 24, 2006 11:36 AM

As pointed out, the problem is not so much one of association but of meaningful association.

Without qualification a simple contact is like using an OR term in a database search, it rapidly increases the number of rows selected, which rapidly becomes meaningless (ie all but a minority of rows get selected).

Even with qualification it would still produce a lot of garbage at the output.
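The way an unqualified "contact OR contact-of-contact" selection balloons can be sketched with a tiny hand-built contact graph. Everything here is hypothetical: the names and links are invented for illustration (loosely echoing the classified-ad example earlier in the thread), and real call graphs are far denser than this tree.

```python
from collections import defaultdict

# Hypothetical contact pairs; each pair means "these two have been in contact".
edges = [
    ("suspect", "cousin"), ("suspect", "landlord"),
    ("cousin", "car_seller"), ("cousin", "imam"),
    ("landlord", "plumber"),
    ("car_seller", "buyer_1"), ("car_seller", "buyer_2"),
    ("imam", "member_1"), ("imam", "member_2"),
    ("plumber", "client_1"),
]


def build_graph(pairs):
    """Build an undirected adjacency map from contact pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Everyone reachable from `start` in at most `max_hops` contacts."""
    selected = {start}
    frontier = {start}
    for _ in range(max_hops):
        # OR-ing in one more degree of contact: take every neighbour of the
        # current frontier that has not already been selected.
        frontier = {n for person in frontier for n in graph[person]} - selected
        selected |= frontier
    return selected


graph = build_graph(edges)
for hops in range(4):
    print(hops, len(within_hops(graph, "suspect", hops)))
# prints 0 1, then 1 3, then 2 6, then 3 11
```

Even in this sparse toy graph, three hops from one suspect selects every person in the data. Each extra hop behaves like another OR term: it can only add rows, never remove them.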

Think of it in the same way you would the contact frequency of letters, used in an automated cryptanalysis program. In and of itself it can provide an indication that you have come across a “statistically” valid plaintext. You then need to check first that it actually makes sense, and second, within context, whether it is meaningful or not.

However, if the opposition know how you are doing your search, it does not take too long for the smart ones to work out how to keep below the “noise level”.

An example of this was that the Germans worked out that the Allies were using traffic analysis to work out troop movements etc. When the V1 rocket research establishment was moved the Germans faked valid traffic in a different part of Europe. The only reason it was not successful was that the Germans were unaware that the Allies “fingerprinted” operators and transmitters, so could tell it was a fake simply due to the same operator/transmitter being used for supposedly different commands.

However, also during the Second World War, captured U-Boat sailors used a simple code to send messages back home which easily got past the Allied censors.

Admiral Donitz had the genuine letters sent home by U-Boat sailors to their families analysed. The code was then worked out to use common phrases etc as a low bandwidth code. The result was the messages stayed sufficiently below the noise threshold to get past the censors.

There are lots of examples of this in the history of communications.

Therefore the real question is not whether data mining can be successfully done (it can be), but how long it is going to remain valid in the face of an alert and intelligent adversary.

You will end up with an evolutionary type of war where the terrorist will evolve their activities and communications to avoid detection, as the penalty for not doing this is failure.

An already visible (possible) example of this is the drop in the number of western-born Muslims going to Pakistan schools etc. Is the drop in numbers due to innocent people not wishing to be targeted by the security forces, or potential terrorists not wishing to be identified, or some combination of both?

Ale • May 24, 2006 11:39 AM

@Tank:

“anyone can if they consider it for 3 minutes”

Well, I will be expecting to see your paper in the next ICDM (www.comp.hkbu.edu.hk/~wii06/icdm/) so that you can dazzle everybody with the thing that you cooked up in 3 minutes.

Mike • May 24, 2006 11:54 AM

“If my indirect linkage to a terrorist occurs only once every five years, I can likely be scrapped as a possible collaborator. Whereas if my indirect linkage to the terrorist occurs weekly, there is a much higher cause for suspicion.”

But that’s the entire point – you can’t make that blanket assertion. If you are close enough, there is no frequent link, unless the NSA has bugged your apartment and can trace face to face conversations. Hundreds of people who attended the same mosque as the London bombers would have lots of frequent contacts. Nothing about the frequency or the existence of the contact says anything about the nature of the contact.

And as pointed out in the article, “strength of weak ties” invalidates your first point – I may not see my co-conspirators for years, and then meet up with them to perform the act.

Matt Palmer • May 24, 2006 12:01 PM

The problem as far as I can see it is sifting through vast amounts of data to get at nuggets of information, when you don’t even know what correlations are important. With large data sets it’s easy to find random correlations. Note that correlation does not imply causation. You only get answers to the questions you know to pose, and even then the “answer” may be meaningless.

Given the minuscule number of terrorists versus the large number of normal people of the same religion, ethnicity, interests, etc., it will be hard to distinguish information from random noise.

It will also be easy to game the system by pushing through people of similar characteristics (hiding information in noise), or people doing suspicious things who aren’t really doing anything bad at all (generating false leads).

Haven’t many people in the intelligence community been saying they need more field operatives, cultural specialists and human intelligence gathering, not more ways to overwhelm them with vast amounts of data, most of which lead nowhere and take away resource from other promising areas?

mdf • May 24, 2006 12:03 PM

“The article throws up hypothetical roadblocks which contradict everyday experience of commercial data-mining operations.”

The training sets for these credit-card fraud models have (unfortunately) millions of instances to examine. Are the models used to detect terrorists based on similarly large tables? If not (and I would imagine this is the case), then appealing to the commercial experience may be a fallacy.

AG • May 24, 2006 12:31 PM

@Archangel
I am not being a troll I am saying the research is inherently flawed.

On PAPER I may be six degrees away from a Terrorist or Kevin Bacon and/or I may be breathing in a single atom of oxygen that Julius Caesar exhaled in his last dying gasp.

SO!?! That does not make me in any REAL way connected to Julius Caesar or Kevin Bacon or whomever.

It does not allow me to find Bin Laden. It does not allow me to send him a message. It would not interfere with a data mining operation.

Data mining WORKS… Given the right amount of information and the right system to search through that information, the possibilities are endless.
BUT that is the problem: you could use the information for anything.

Brian • May 24, 2006 12:35 PM

The Bush administration has generally not been shy about trumpeting their successes. I suspect if the NSA’s program had successfully tracked down terrorists, we would have heard about it. We wouldn’t necessarily have heard how they were caught, but I think their names would have been in the paper.

I suspect the base rate fallacy is affecting the NSA’s data mining attempts, and that is why this program hasn’t caught anybody.

Pat Cahalan • May 24, 2006 12:40 PM

Proponents of data mining keep bringing up successful uses of data mining (particularly in the credit card/financial industry) as justification for using data mining techniques for finding terrorists.

Please stop.

Data mining works well in tracking financial transactions simply because they are looking for one criminal activity (fraud) that really can only be executed in a basic way -> either take money (liquid cash) out of an account, or buy something using an account. You’re looking essentially for two possible activities, and that’s it. You also have a large quantity of simple triggers (someone made a purchase in Tibet within 10 minutes of making a purchase in North Dakota), and a large quantity of baseline data (each account holder’s activities) to use in fine-tuning your mining algorithms. Finally, the consequences of a false positive are an irritated customer and a phone call.
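To show how simple such a trigger can be, here is a minimal sketch of the "Tibet within 10 minutes of North Dakota" rule. It is an illustration under stated assumptions, not how any card issuer actually works: the `Purchase` record, the 1,000 km/h speed limit, and the approximate coordinates are all invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Purchase:
    when: datetime
    lat: float  # degrees north
    lon: float  # degrees east


def haversine_km(a, b):
    """Great-circle distance between two purchases, in kilometres."""
    p1, p2 = radians(a.lat), radians(b.lat)
    dphi = p2 - p1
    dlmb = radians(b.lon - a.lon)
    h = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def is_impossible_travel(a, b, max_kmh=1000.0):
    """Flag a pair of purchases whose implied travel speed exceeds max_kmh."""
    distance = haversine_km(a, b)
    hours = abs((b.when - a.when).total_seconds()) / 3600.0
    if hours == 0:
        return distance > 0
    return distance / hours > max_kmh


# Approximate, illustrative coordinates (assumptions, not real transactions).
north_dakota = Purchase(datetime(2006, 5, 24, 12, 0), 46.81, -100.78)
tibet = Purchase(datetime(2006, 5, 24, 12, 10), 29.65, 91.14)
same_town = Purchase(datetime(2006, 5, 24, 12, 10), 46.82, -100.77)

print(is_impossible_travel(north_dakota, tibet))      # True
print(is_impossible_travel(north_dakota, same_town))  # False
```

The rule fits in a dozen lines because "fraud" has a crisp definition and every account supplies its own baseline. Nothing comparably crisp is available for the open-ended category discussed in the next paragraph.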

Looking for terrorists is completely unrelated to this activity. First, we don’t know what “terrorism” is. We know what “fraud” is (taking money you’re not authorized to take). That’s simple and easy. Terrorism is not in any wise definable. Look at the collection of activities we call “terrorism” on the movie-plot threads. You can think of literally thousands of possible terrorist activities in a half hour of idle musing.

@ Tank

“Data mining doesn’t work if you pretend nobody knows what to mine for.”

That’s the entire point. Nobody knows what to mine for. They know some things they could mine for (previous terrorist activity indicators), but they don’t know all of the things that they can mine for. And any attempt to preemptively guess what to mine for is just going to increase the false positive rate. Hey, maybe someone will release chlorine gas in a tunnel. Let’s add a trigger for chlorine gas purchases. Hrm, but now we need to account for legitimate chlorine gas purchases, so we need to add negative indicators for chemical supply companies. Oh, but a terrorist may get a job at a chemical supply company, so add a trigger…

Even if, theoretically, you put in checks and balances to ensure that the false positives could be adequately screened out of such a process, the only way this can scale for preventative measures (predicting new, previously-untried terrorist activities like the 9-11 hijackings) is if you continually add more data, more processing power, more manpower, more potential threats, etc. etc. This is Total Information Awareness. In order to protect the citizenry from a completely unknown threat, the watchers need to know literally everything that everyone does, because anything may be an indicator of some unknown potential activity.

Finally, a common argument for data mining is, “We would have caught one/some/all of the 9/11 terrorists if we had been doing this.”

That may well be true. However, we would have caught some/all of the 9/11 terrorists if the intelligence and law enforcement communities didn’t have their collective heads up their collective unmentionables. The right way to solve this problem is to solve the communications problem and get the organizations to use the tools THEY ALREADY POSSESS intelligently, NOT give law enforcement and intelligence new tools to screw up with, especially when those new tools require a loss of liberty that fundamentally alters the nation.

roy • May 24, 2006 12:52 PM

Consider how disconnected a single degree of separation can be. Take for example: http://www.schneier.com/blog

Decent people worried about Diebold’s voting machines obviating the entire electorate may visit this site. But so will the goblins at Diebold working on damage control.

The two sides are not poles apart, they are in different dimensions, yet data mining would link them together, treating them as tightly united.

Some of the responses are about winnowing the wheat from the chaff, as if finding any wheat justifies the entire harvest. This is simply the problem of eliminating false positives from the set of all positives to find the real positives. That takes manpower, and lots of it. If data mining turns up 1,000,000 possible connections, eliminating the false positives is not a matter of “I looked into it but didn’t find anything” : it means proving the falsity instead of guessing that it was a false alarm. Treat every positive as a ‘guilty’ until it is proven ‘innocent’. That will take tremendous resources, which don’t exist. What will happen is that the great bulk of all positives will be summarily and randomly dismissed simply because their number is gargantuan, leaving a manageable number, and then the investigators will go with their instincts — no better than dowsing rods — and will make arrests. Most of the cases that get that far will be eventually dismissed for lack of actual evidence. Most of the few cases heading to trial will be dismissed along the way. The prosecutors will try framing some people just to make some numbers (any number is better than zero), but that will all be for show, ersatz wheat to justify the huge cost of the harvest.

If we have resources, put them to direct use. Get out in the field and on the ground. Penetrate the enemy. Turn insiders into informants. Bug them. Oh, by the way, learn their language(s) as well.

As a simple illustration, say Abdul the Chemist is suspected of being a terrorist. We want to know who he calls. So, either monitor the entire planet to track the calls he makes, or sneak a look at his cell phone’s directory. Which would be your choice?

An additional problem is that only some of the information of concern is passed electronically. I could single-handedly defeat TIA by using postcards. If I and my associate both pay cash at the lunch counter, no database will ever know we met or had lunch together.

And some of what does travel electronically will be unrecognizable to surveillance. As an example, plotters can communicate with each other by altering drafts of email never sent by simply sharing the same email account. Anything monitoring the account will see only the spam it was sent, but what will it make of the low level of email activity?

Worse yet, digital information can travel without using landlines or the airwaves. A courier with a thumb drive can deliver a huge amount of information without the telecommunications world knowing a thing. A courier could operate as a full time job without anyone suspecting, provided he was driving a cab. Imagine trying to dissect the taxi-linked communications — FM radio, landline, and cell phone — in New York City even for just one day. How would you tie passengers together? They are all anonymous. Without people on the ground, you would be utterly blind.

It can get worse: steganography. Something as seemingly innocent as a weekly phone call from one person to a friend or relative may carry in the digital stream a secret low-rate channel. All the ‘external’ information about the call will look innocuous, and so will all the ‘internal’ information, even though looks will be dead wrong.

A weekly posting to a bowling team’s website of crappy pictures from last weekend’s league activities could in the images and videos carry low-rate channels of completely hidden information that TIA would be blind to. All the members of the team may have IP addresses in New Jersey and ordinary New Jersey names, yet the actual people could all be Tamil Tiger trainers. NSA would never know.

A note on weak ties: A small number of people may meet secretly just once for a long discussion, and thus are never seen together ever. But they can be working together for decades afterwards.

For a pedestrian example of this, consider the LAPD Internal Affairs, where they recruit ‘shoe flies’ — cops to spy on cops — during a single one-on-one meeting in secret while the recruit is still at the police academy. They may never be seen together by anybody for the duration of the secret double life of the ‘field associate’.

DragonHunter • May 24, 2006 1:22 PM

It’s a question of cost/benefit to me. Let’s do some back-of-the-envelope calculations….

300,000,000 phones in the US (probably more), 3.3 calls per day on average from each. That’s one billion calls per day, or 365 billion calls per year to analyze. If we only have a false positive rate of 1/100th of 1 percent, that gives us 36.5 million false positives to do whatever with every year.
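The arithmetic above can be checked with a short sketch. The phone count, call rate, and false positive rate are the commenter's figures; the ten minutes per check and the 2,000-hour work year are purely hypothetical assumptions, added only to put a rough size on the manpower.

```python
# Figures taken from the comment above.
PHONES = 300_000_000
CALLS_PER_PHONE_PER_DAY = 3.3
FALSE_POSITIVE_RATE = 0.0001  # 1/100th of 1 percent

# Hypothetical assumptions, not from the comment.
MINUTES_PER_CHECK = 10
WORK_HOURS_PER_YEAR = 2_000


def false_positives_per_year(calls_per_day, fp_rate, days=365):
    """Expected number of innocent calls flagged in one year."""
    return calls_per_day * days * fp_rate


def analyst_years(false_positives, minutes_per_check, work_hours_per_year):
    """Full-time analyst-years needed to look at every flagged call once."""
    hours = false_positives * minutes_per_check / 60
    return hours / work_hours_per_year


calls_per_day = PHONES * CALLS_PER_PHONE_PER_DAY  # about 990 million
rounded = false_positives_per_year(1_000_000_000, FALSE_POSITIVE_RATE)
exact = false_positives_per_year(calls_per_day, FALSE_POSITIVE_RATE)

print(f"calls per day: {calls_per_day:,.0f}")
print(f"false positives per year (rounded to 1 billion calls/day): {rounded:,.0f}")
print(f"false positives per year (unrounded): {exact:,.0f}")
print(f"analyst-years at {MINUTES_PER_CHECK} min each: "
      f"{analyst_years(rounded, MINUTES_PER_CHECK, WORK_HOURS_PER_YEAR):,.0f}")
```

Rounding 990 million calls up to one billion is what produces the 36.5 million figure; the unrounded product is slightly lower, which does not change the conclusion.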

Sounds like a waste of manpower to me.

Not to mention the potential disruption to those law-abiding private citizens who are having their 4th Amendment rights trampled upon by this program in the first place!

All of this sounds like, “Hey, we could use computers to do this” to me. It’s more fun and it’s not as hard as good work in many, many endeavors. I know. I’ve been guilty in my line of work.

jayh • May 24, 2006 1:29 PM

@dhasenan >>That’s another reason to restrict such observation to non-citizens, though–a smaller group to watch.<<

The OKC bomber and the London subway bombers were native citizens.

@Moshe Yudkowsky >>The commercial world has a lot of experience with data-mining. A friend who designed data (not databases, but the actual data) for a large credit-card firm explained to me one day how their fraud detection software was able to discern that a purchase on a US credit card, made in a jewelry store in Hong Kong, was fraudulent. And they determined this in real time, fast enough to catch the perpetrator in the act.<<

That is not data mining, it’s quite the opposite: That is pattern recognition based on experience with thousands of previous fraudulent transactions and with extensive knowledge of the buying habits of the legitimate cardholder. With terrorism you have unknown actors following different patterns, and you have limited knowledge about them.

In business, data mining is mostly used in marketing, where false negatives and false positives (missing a potential customer, targeting the wrong customer) are fairly frequent, but their consequences are not serious beyond a bit of wasted money.

jayh • May 24, 2006 1:31 PM

Oops garbled up the above a bit

@dhasenan >>That’s another reason to restrict such observation to non-citizens, though–a smaller group to watch.<<

The OKC bomber and the London subway bombers were native citizens. This pattern will likely continue.

First Principles • May 24, 2006 1:56 PM

For my part, I don’t understand why we’re debating whether datamining is effective. To me it’s like debating whether torture is effective: whether it is or not, it’s still just wrong.

Bottom line: I don’t care how convenient or effective it would be for this or that purpose, we as Americans should not have our every move tracked by the U.S. government. PERIOD.

Mike • May 24, 2006 2:08 PM

@ First Principles….

datamining is like torture, huh? Good thing you aren’t living in a country that DOES resort to torture for the discovery of information…

ac • May 24, 2006 2:13 PM

This program is only ineffective if its purpose is to catch terrorists.

This program could be very effective in doing the following:
– Linking everyone in the US (via several degrees of separation) with a known Al Qaeda operative. Then, should we detain that person, we can truthfully claim that they are “linked to Al Qaeda” and thus are enemy combatants, do not have the right to due process, habeas corpus, counsel, human rights, etc.
– Examining communication networks around known “problem” reporters who are getting information about criminal activity that we don’t want the public to know. Fire/arrest everyone within 3 degrees of that reporter–not necessarily to catch whistleblowers, but to send a message to potential whistleblowers.
– Collecting data on political opponents (extramarital affairs, etc) that can be used in the next election.

So, I think it’s pretty clear that this could be an extremely effective program, if its purpose isn’t to find terrorists. Why on earth would we assume it is? Supposedly invading Iraq had something to do with terrorism too, if you remember. Why are we suddenly so credulous about this new fiasco?

geoff lane • May 24, 2006 2:26 PM

You can’t compensate for bad data by having a lot of it. You have to use carefully selected data to derive any useful information. If you don’t then the noise just swamps out the good information.

I’ve not read anything about how the data will be processed and suspect that it’s all just some boondoggle to get more funding.

ac • May 24, 2006 2:27 PM

@Mike

I think First Principles’ point was not that this program is as bad as torture on the absolute moral scale, but that both are simply flat-out illegal. On that scale, jaywalking, shoplifting, robbing a bank, and NSA warrantless domestic spying operations are equivalent. Yes, some laws are silly at times (jaywalking). Nevertheless, in the old days (Emerson, MLK), when they broke a law they felt was unfair, they fully expected to be arrested (these arrests led to protests, which then led to changes in the law). Now, you can simply ignore laws you don’t like (depending on who you are).

The fact that the conversation is focusing on the debatable effectiveness of the program rather than the simple fact that it’s illegal is a little disturbing to say the least. After all, if it were effective, and it DIDN’T invade privacy, the correct thing for the federal government would be: 1) Repeal the 4th amendment, 2) Pass a law allowing the NSA to snoop on US citizens without a warrant, and then 3) Do it. Skipping to step 3 is where laws are being broken.

And how, in fact, do you know he lives in a country that doesn’t torture to get information? I do, and I’m certain I’m not the only person posting on this blog who does.

Mike • May 24, 2006 2:45 PM

@ ac…

first of all, I was being sarcastic, sorry for not appending a popular sarcastic denotation to the end of my post. Thus, I do not know if Mr. First Principles is living in an Islamic country that condones torture or America that does not…

But that aside, you and he state that the program is illegal as if you know that for a fact? Do you have a law degree? Please, cite your sources?

As for me, no I’m not a lawyer… however, I have studied the subject. I would refer you to powerlineblog.com. John Hinderaker IS a lawyer and has posted a number of very insightful, fact-based comments on the legality of the NSA programs…. Specifically, I would refer you to the May 16th post http://powerlineblog.com/archives/014106.php

It links to a document written by Andrew McCarthy, David Rivkin, and Lee Casey that explains exactly why the NSA program IS legal in great detail…. enjoy the reading!

First Principles • May 24, 2006 3:29 PM

Mike wrote:

But that aside, you and he state that the program is illegal as if you know that for a fact? Do you have a law degree? Please, cite your sources?

Some things are wrong on their face. Torture is one of them, despite anything that lawyers like John Yoo might ever say.

If we need a lawyer to tell us what America stands for now, then God help us.

roy • May 24, 2006 3:44 PM

Think of lawyers as liars for hire, then ask should we trust them.

Dan Lewis • May 24, 2006 3:46 PM

I hit a Times Select wall, so here it is (or a similar editorial) in the Sacramento Bee.
http://www.sacbee.com/content/opinion/story/14256165p-15071218c.html

I am a little surprised that no one has brought up the false-positive rate that has already been mentioned in the press. This is from the Washington Post in February 2006.

“Intelligence officers who eavesdropped on thousands of Americans in overseas calls under authority from President Bush have dismissed nearly all of them as potential suspects after hearing nothing pertinent to a terrorist threat, according to accounts from current and former government officials and private-sector sources with knowledge of the technologies in use.
…
Fewer than 10 U.S. citizens or residents a year, according to an authoritative account, have aroused enough suspicion during warrantless eavesdropping to justify interception of their domestic calls, as well. That step still requires a warrant from a federal judge, for which the government must supply evidence of probable cause.”
http://www.washingtonpost.com/wp-dyn/content/article/2006/02/04/AR2006020401373.html

As for the legality of the program, I’d direct you to Glenn Greenwald. He’s a former litigator who has been on the multifarious domestic spying scandals since they began. He does not think that the administration’s position is tenable.

Here’s an index to several arguments he made. Of course, this is a few months old, so there’s a lot more reading you can do if you page through his blog.
http://glenngreenwald.blogspot.com/2006/02/nsa-legal-arguments.html

Here’s something John Hinderaker said last July:

“It must be very strange to be President Bush. A man of extraordinary vision and brilliance approaching to genius, he can’t get anyone to notice. He is like a great painter or musician who is ahead of his time, and who unveils one masterpiece after another to a reception that, when not bored, is hostile.”
http://powerlineblog.com/archives/011183.php

So please forgive me if I suspect Hinderaker of being a pro-Bush BS artist on this and other issues.

Mike • May 24, 2006 4:03 PM

@ Dan Lewis…

While I may not agree with your viewpoint, thank you for at least presenting me with sources. It’s easy to get on a chat board and say “it’s illegal because… because.. well, because I said so”… but it’s another thing to do some research (right, wrong, or indifferent…). At least you researched your case and presented evidence to support your views.

I will read your links and consider the argument they present.

ac • May 24, 2006 4:16 PM

@Mike

Thanks for the entertaining reading. However, as others have pointed out, it’s better to cite laws than lawyers. Lawyers are paid to take the sides of criminals, and lawyers outside the courtroom have little preventing them from outright deception-for-hire. See, for example, the cadre of lawyers who regularly say it’s perfectly legal for the US Government to detain and torture US citizens indefinitely without judicial review, even if no evidence can be found of any wrongdoing (Yoo et al.). Legal ethics are pretty lax these days.

Here’s some much shorter reading for you: http://caselaw.lp.findlaw.com/data/constitution/amendment04/

Until that amendment is repealed, no law passed by Congress can make warrantless surveillance of people not remotely suspected of a crime legal. If a court ever has the chance to review this, it’s a foregone conclusion. That’s why the emphasis right now is on keeping this case from ever getting heard by a court.

jonny paycheck • May 24, 2006 4:20 PM

Somehow the pattern of corruption with the PACs and lobbyists totally escapes the Department of Justice who have access to phone records of Abrahmson and Rep. Wade and associates.

They are supposed to find the “terrorists”? They can’t even find their own crap when it is in their own pants.

Nick Lancaster • May 24, 2006 5:38 PM

@AG:

“If we have the right amount of information and the right system to search thru it …”

It strikes me that this is very much like entering over-broad terms into a search engine, or not understanding the impact of boolean relationships.

The NSA program, in (allegedly) gathering tens of millions of numbers, would seem to require knowledge of one party in order to locate another. But it is also naive to believe that ALL calls made by terrorists are connected to their activities. If a terrorist orders a pizza every Tuesday, and orders Chinese takeout on the last Wednesday of the month, we have calls from a terrorist occurring with regular frequency. To assess the relevance of the pizza joint or the Chinese restaurant, we must then check every number that calls for delivery, to see if we have any other known terrorists, or if a number on the terrorist’s list of calls matches another customer of the pizza place.

This neglects false-positives generated by customers who call the pizza place despite there being another pizza franchise which is closer, and also neglects to note if the association between the terrorists and the pizza place is actually valid. (That is, if I’m marking a phone booth with chalk as a dead drop or other spycraft … it doesn’t mean the phone company is part of my operation.)

PATTERNS CAN BE EXPLOITED. Yet the NSA program seems to be designed with a blind eye to this fact.

(Incidentally, I have a Bacon Number of 3, and you have a Bacon Number of 4 by reading this.)

Marko • May 24, 2006 6:04 PM

This op-ed unintentionally makes an argument that supports the program. Farley implies identifying Ken Lay’s secretary would be considered a failure. The truth is it would be a HUGE win if the program identified Bin Laden’s secretary.

Farley then suggests that even if the program finds “central” players, the program is no good because it didn’t find all of the players. In other words, if the program can’t find all players, it is not a very useful program. That is not a reasonable standard for success.

ekzept • May 24, 2006 8:54 PM

IMO, it’s possible to mine for specific kinds of interactions, if the statistical distribution of those interactions is well-defined. In military work this is often successful, because operations are conducted as their participants have been trained to conduct them. The military implements a successful pattern of execution or doctrine.

The question is, do terrorist groups have doctrines? To the extent they do, and to the extent operatives follow these doctrines, there may be a chance of finding unique interactions. To the extent terrorist operatives are home-grown, there is no chance.

The basic problem even if there are unique doctrines remains one of sample size. If the number of non-terrorists overwhelms the number of terrorists in the sample, random fluctuations in behavior among non-terrorists will mimic any peculiar pattern of behavior assigned to a terrorist cell or doctrine.
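The sample-size point can be made concrete with Bayes' rule. This is a minimal sketch: every number in it (population, number of real targets, detection rate, false positive rate) is a hypothetical assumption chosen for illustration, not a figure from any actual program.

```python
def probability_flagged_is_real(population, real_targets,
                                detection_rate, false_positive_rate):
    """Share of flagged people who are real targets (Bayes' rule on counts)."""
    innocents = population - real_targets
    true_hits = real_targets * detection_rate
    false_hits = innocents * false_positive_rate
    return true_hits / (true_hits + false_hits)


# Hypothetical inputs, for illustration only.
POPULATION = 300_000_000
REAL_TARGETS = 1_000
DETECTION_RATE = 0.99         # the pattern catches 99% of real targets
FALSE_POSITIVE_RATE = 0.0001  # and wrongly flags 0.01% of everyone else

share = probability_flagged_is_real(
    POPULATION, REAL_TARGETS, DETECTION_RATE, FALSE_POSITIVE_RATE
)
print(f"{share:.1%} of flagged people are real targets")
```

Even with a detector this accurate, the innocent population is so much larger than the target group that the great majority of flags land on innocents. Making the pattern rarer among non-terrorists helps, but only if the false positive rate falls faster than the base rate.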

Moreover if the operatives are smart at all, their ability to out-innovate a large monitoring organization ought to be plenty of advantage. Ironically, that ability to innovate is directly dependent upon the degree to which they can take initiatives and exercise independent thought and judgment.

Let’s consider another aspect to all this. Suppose the national technical apparatus did record everyone’s conversations in detail at all times, all correspondence, all email, all Web site accesses. Does anyone seriously think good use could be made of that? Anyone who tried to make something out of it would drown in detail and data.

To paraphrase Marvin Minsky, having too much information is far worse of a problem than having too little.

Nick Lancaster • May 24, 2006 10:05 PM

@ Marko:

We already know who Osama’s second-in-command is, so how would a result identifying Ayman al-Zawahiri as Osama’s second-in-command be a win?

This is like basing our national defense on predictions made by tabloid psychics. We ALL need to be wary of this kind of mumbo-jumbo. Whether it’s General Hayden asserting that if we had had a secret surveillance program (authorized by wartime powers) before 9/11 (the incident that prompted authorization of military force), we could have caught some of the hijackers; or Dick Cheney making the wild leap that since there have been no further domestic attacks, it MUST BE due to the SPECIFIC policies of the administration.

It’s superstitious thinking, and it’s a dangerous precedent when it comes to planning national security.

Nigel Sedgwick • May 25, 2006 2:42 AM

David wrote, on May 24, 2006 10:23 AM: “Crime statistics don’t solve crime, they might tell you where to put resources, but they don’t do the legwork.”

This was as a parallel of NSA’s data mining; which he then goes on to rubbish.

Surely, on the basis of his own argument, he is wrong. Better use of resources is useful.

Better regards

Brian • May 25, 2006 8:32 AM

If you trust Seymour Hersh and his unnamed sources, there’s an interesting piece in the New Yorker this week.

http://www.newyorker.com/talk/content/articles/060529ta_talk_hersh

… One unexamined issue was the effectiveness of the N.S.A. program. “The vast majority of what we did with the intelligence was ill-focussed and not productive,” a Pentagon consultant told me. “It’s intelligence in real time, but you have to know where you’re looking and what you’re after.” …

Tank • May 25, 2006 9:44 PM

@Ale:

Well, I will be expecting to see your paper in the next ICDM (www.comp.hkbu.edu.hk/~wii06/icdm/) so that you can dazzle everybody with the thing that you cooked up in 3 minutes.
Posted by: Ale at May 24, 2006 11:39 AM

What are you talking about ?
Would I really need to make an argument in a whitepaper that the most useful records available in establishing a map of human networks connected to a surveillance subject is who they call? Or is that something so self-evident that it doesn’t require proof?
What is it you are doubting here?

Benny • May 26, 2006 8:50 AM

@ Tank:

I think Ale interpreted your comment “the kind of shit anyone can if they consider it for 3 minutes” to be referring to the field of data-mining research. Which may or may not have been what you meant. But it’s pretty hard to tell, since the post that comment originated from is not exactly easy to follow.

Ale • May 26, 2006 11:12 AM

It was just like Benny said. However, you do touch interesting points in your reply:

“Would I really need to make an argument in a whitepaper that the most useful records available in establishing a map of human networks connected to a surveillance subject is who they call?”

I think you should. Social network topology is very dependent on the definition of what constitutes a link. Given a basic set of people, the network defined by which of them address each other on a first-name basis is much less sparse than the network defined by which of them have sex with each other. However, the “six degrees” phenomenon is present in both cases, as a byproduct of the power-law scaling of the degree distribution of the network. So I think that the definition of what precisely constitutes a link, under what conditions, and how that definition changes with time and with other environmental factors is a valid direction for research. Social interaction is nuanced, full of meanings and connotations, and this metadata (that is the bread and butter of Human Intelligence operations) is not easy to access automatically from call records. There have been excellent posts on this particular issue earlier, and I will not elaborate more on that.
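A toy sketch of how the choice of link definition changes the network. The five people and both link lists are hypothetical, and the second definition (who has phoned whom) is swapped in only to give a sparser graph over the same set of people.

```python
from collections import deque


def build_graph(people, links):
    """Undirected graph as an adjacency map: person -> set of neighbours."""
    graph = {person: set() for person in people}
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degrees_of_separation(graph, start, goal):
    """Length of the shortest chain of links between two people (BFS)."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbour in graph[person]:
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # no chain of links connects them


def density(graph):
    """Fraction of all possible pairs that are actually linked."""
    n = len(graph)
    edges = sum(len(neighbours) for neighbours in graph.values()) / 2
    return edges / (n * (n - 1) / 2)


# Hypothetical data: the same people under two link definitions.
people = ["ann", "bob", "cat", "dan", "eve"]
first_name_links = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
                    ("dan", "eve"), ("ann", "cat"), ("bob", "dan")]
phone_call_links = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
                    ("dan", "eve")]

first_name_graph = build_graph(people, first_name_links)
phone_call_graph = build_graph(people, phone_call_links)

for name, graph in [("first-name", first_name_graph),
                    ("phone-call", phone_call_graph)]:
    print(name, density(graph), degrees_of_separation(graph, "ann", "eve"))
```

The people are identical in both graphs; only the rule for drawing an edge differs, and with it the density and the distance between any two individuals.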

“What is it you are doubting here ?”

I am doubting that the definition of metrics used to build network structures is trivial, and that something out of someone’s head after 3 minutes of thought will be of any use at all. AND, I am arguing that at the end of the day, Data Mining is a technology, and as such, its limits and capabilities must be realistically understood in order to make correct cost/benefit analyses of it. Is the loss of privacy compensated by an increase in intelligence capabilities? This question is unanswerable from a scientific standpoint as long as Data Mining is seen as a mystical oracle with the capability of correlating any possible behavior under any possible scenario. Just for the record – from an ethical standpoint, I think that the cost-benefit for this eavesdropping is dismally bad.

Ceri • June 26, 2010 6:55 AM

Data mining as a means of tracking illegal activity is inherently flawed. You know how many roleplayers trip the flags for terrorist activity? I know I do with great regularity when I check the specifics of assault rifles, chemical compounds used in the manufacture of IEDs (curiosity about explosive potential of the chemical mentioned as an example), looking up information about specific religious beliefs, cults, etc for background. All stupidly innocent things given what I do for fun, but I know I trip the alarm bells every time I wonder what a chemical’s explosive potential is, and given some of my friends are paranoid paramilitary types that doesn’t help, add the stories I write and I wouldn’t be surprised to see the local security organization tracking my movement as a PTT regularly.
The sad part is I no longer feel like I’m paranoid discussing things like this after seeing the treatment of individuals whose only sin was being friends of the sibling of another suspected terrorist in custody.
