The Problems with Data Mining
Great op-ed in The New York Times by Jonathan Farley, a math professor at Harvard, on why the NSA’s data mining efforts won’t work:
The simplest reason is that we’re all connected. Not in the Haight-Ashbury/Timothy Leary/late-period Beatles kind of way, but in the sense of the Kevin Bacon game. The sociologist Stanley Milgram made this clear in the 1960’s when he took pairs of people unknown to each other, separated by a continent, and asked one of the pair to send a package to the other—but only by passing the package to a person he knew, who could then send the package only to someone he knew, and so on. On average, it took only six mailings—the famous six degrees of separation—for the package to reach its intended destination.
Looked at this way, President Bush is only a few steps away from Osama bin Laden (in the 1970’s he ran a company partly financed by the American representative for one of the Qaeda leader’s brothers). And terrorist hermits like the Unabomber are connected to only a very few people. So much for finding the guilty by association.
A second problem with the spy agency’s apparent methodology lies in the way terrorist groups operate and what scientists call the “strength of weak ties.” As the military scientist Robert Spulak has described it to me, you might not see your college roommate for 10 years, but if he were to call you up and ask to stay in your apartment, you’d let him. This is the principle under which sleeper cells operate: there is no communication for years. Thus for the most dangerous threats, the links between nodes that the agency is looking for simply might not exist.
(This, by him, is also worth reading.)
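The op-ed's first point, that everyone is only a few steps from everyone else, can be illustrated with a small sketch. This is not from the op-ed: it runs a breadth-first search over a synthetic random contact graph and counts how many people fall within k hops of a starting person. The graph size, the number of contacts per person, and the function names build_contact_graph and people_within_hops are all assumptions chosen for illustration, and a random graph is only a rough stand-in for a real social network.

```python
import random
from collections import deque


def build_contact_graph(num_people, contacts_per_person, seed=0):
    """Build a synthetic undirected contact graph (hypothetical data)."""
    rng = random.Random(seed)
    graph = {person: set() for person in range(num_people)}
    for person in range(num_people):
        for _ in range(contacts_per_person):
            other = rng.randrange(num_people)
            if other != person:
                # Contacts are mutual: record the link in both directions.
                graph[person].add(other)
                graph[other].add(person)
    return graph


def people_within_hops(graph, start, max_hops):
    """Return a list where entry k is the number of people
    (including the start) reachable in at most k hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # Do not expand beyond the hop limit.
        for neighbor in graph[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    counts = [0] * (max_hops + 1)
    for hops in distance.values():
        counts[hops] += 1
    # Turn per-distance counts into cumulative counts.
    for k in range(1, max_hops + 1):
        counts[k] += counts[k - 1]
    return counts


if __name__ == "__main__":
    # Assumed parameters: 10,000 people, 5 contacts initiated by each.
    graph = build_contact_graph(num_people=10_000, contacts_per_person=5)
    reach = people_within_hops(graph, start=0, max_hops=6)
    for hops, count in enumerate(reach):
        print(f"within {hops} hops: {count} people")
```

In a graph like this the count typically grows very quickly with each extra hop, which is the op-ed's point: a rule of the form "flag anyone within a few hops of a suspect" ends up flagging a large share of the population. The sketch says nothing about weighting links by how active or significant they are, which is the objection raised in the comment below.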
Moshe Yudkowsky • May 24, 2006 8:06 AM
I can make this exact same argument to invalidate the entire concept of traffic analysis, yet traffic analysis is one of the best methods to gather intelligence about enemy plans and intentions.
So I’m taking this argument with more than a small grain of salt. As we well know, people with a weak encryption algorithm will hype the tremendous number of keys provided by the algorithm, but completely ignore the weakness in the encryption. In the same way, this essay hypes the tremendous number of connections each person has, but ignores the way a data-mining operation can sort through that data to find the significant, active, and important connections.