Data Mining for Terrorists Doesn't Work
The report was written by a committee whose members include William Perry, a professor at Stanford University; Charles Vest, the former president of MIT; W. Earl Boebert, a retired senior scientist at Sandia National Laboratories; Cynthia Dwork of Microsoft Research; R. Gil Kerlikowske, Seattle's police chief; and Daryl Pregibon, a research scientist at Google.
They admit that far more Americans live their lives online than a decade ago, using everything from VoIP phones to Facebook to RFID tags in automobiles, and that the databases created by those activities are tempting targets for federal agencies. And they draw a distinction between subject-based data mining (starting with one individual and looking for connections) and pattern-based data mining (looking for anomalous activities that could indicate illegal activity).
But the authors conclude the type of data mining that government bureaucrats would like to do--perhaps inspired by watching too many episodes of the Fox series 24--can't work. "If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so."
A summary of the recommendations:
- U.S. government agencies should be required to follow a systematic process to evaluate the effectiveness, lawfulness, and consistency with U.S. values of every information-based program, whether classified or unclassified, for detecting and countering terrorists before it can be deployed, and periodically thereafter.
- Periodically after a program has been operationally deployed, and in particular before a program enters a new phase in its life cycle, policy makers should (carefully review) the program before allowing it to continue operations or to proceed to the next phase.
- To protect the privacy of innocent people, the research and development of any information-based counterterrorism program should be conducted with synthetic population data... At all stages of a phased deployment, data about individuals should be rigorously subjected to the full safeguards of the framework.
- Any information-based counterterrorism program of the U.S. government should be subjected to robust, independent oversight of the operations of that program, a part of which would entail a practice of using the same data mining technologies to "mine the miners and track the trackers."
- Counterterrorism programs should provide meaningful redress to any individuals inappropriately harmed by their operation.
- The U.S. government should periodically review the nation's laws, policies, and procedures that protect individuals' private information for relevance and effectiveness in light of changing technologies and circumstances. In particular, Congress should re-examine existing law to consider how privacy should be protected in the context of information-based programs (e.g., data mining) for counterterrorism.
EDITED TO ADD (10/10): More commentary:
As the NRC report points out, not only is the training data lacking, but the input data that you'd actually be mining has been purposely corrupted by the terrorists themselves. Terrorist plotters actively disguise their activities using operational security measures (opsec) like code words, encryption, and other forms of covert communication. So, even if we had access to a copious and pristine body of training data that we could use to generalize about the "typical terrorist," the new data that's coming into the data mining system is suspect.
To return to the credit reporting analogy, credit scores would be worthless to lenders if everyone could manipulate their credit history (e.g., hide past delinquencies) the way that terrorists can manipulate the data trails that they leave as they buy gas, enter buildings, make phone calls, surf the Internet, etc.
So this application of data mining bumps up against the classic GIGO (garbage in, garbage out) problem in computing, with the terrorists deliberately feeding the system garbage. What this means in real-world terms is that the success of our counter-terrorism data mining efforts is completely dependent on the failure of terrorist cells to maintain operational security.
The GIGO problem and the lack of suitable training data combine to make big investments in automated terrorist identification a futile and wasteful effort. Furthermore, these two problems are structural, so they're not going away. All legitimate concerns about false positives and corrosive effects on civil liberties aside, data mining will never give authorities the ability to identify terrorists or terrorist networks with any degree of confidence.
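To put rough numbers on the argument above, here is a minimal sketch of a toy model of pattern-based screening. Nothing in it comes from the NRC report: the function name `expected_flags`, the parameters, and every figure (population size, number of plotters, hit rate, false-alarm rate) are hypothetical, chosen only to show how the outcome depends on the plotters' operational security.

```python
def expected_flags(population, plotters, opsec_rate, hit_rate, false_alarm_rate):
    """Expected results of a pattern-based screen under a toy model.

    All parameters are hypothetical illustrations, not measured values:
      opsec_rate        fraction of plotters who successfully hide their trail
      hit_rate          chance the screen flags a plotter whose trail is visible
      false_alarm_rate  chance the screen flags an innocent person
    """
    innocents = population - plotters
    # Only plotters whose opsec fails leave a pattern the screen can match.
    visible_plotters = plotters * (1.0 - opsec_rate)
    true_positives = visible_plotters * hit_rate
    false_positives = innocents * false_alarm_rate
    flagged = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        # Share of flagged people who are actually plotters.
        "precision": true_positives / flagged if flagged else 0.0,
        # Share of all plotters who get flagged.
        "recall": true_positives / plotters if plotters else 0.0,
    }


# Hypothetical scenario: 10 plotters hidden among 1,000,000 people.
for opsec in (0.0, 0.5, 1.0):
    result = expected_flags(
        population=1_000_000,
        plotters=10,
        opsec_rate=opsec,
        hit_rate=0.9,
        false_alarm_rate=0.001,
    )
    print(
        f"opsec={opsec:.1f}  "
        f"caught={result['true_positives']:.1f}  "
        f"false alarms={result['false_positives']:.0f}  "
        f"precision={result['precision']:.4f}  "
        f"recall={result['recall']:.2f}"
    )
```

In this toy model the number of false alarms never changes, because it depends only on the innocent population, while the number of plotters caught shrinks in direct proportion to how well they hide. That is the dependence on failed operational security described above, stated as arithmetic.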
Posted on October 10, 2008 at 6:35 AM