Entries Tagged "data mining"

Data Mining for Terrorists Doesn't Work

According to a massive report from the National Research Council, data mining for terrorists doesn’t work. Here’s a good summary:

The report was written by a committee whose members include William Perry, a professor at Stanford University; Charles Vest, the former president of MIT; W. Earl Boebert, a retired senior scientist at Sandia National Laboratories; Cynthia Dwork of Microsoft Research; R. Gil Kerlikowske, Seattle’s police chief; and Daryl Pregibon, a research scientist at Google.

They admit that far more Americans live their lives online, using everything from VoIP phones to Facebook to RFID tags in automobiles, than a decade ago, and the databases created by those activities are tempting targets for federal agencies. And they draw a distinction between subject-based data mining (starting with one individual and looking for connections) and pattern-based data mining (looking for anomalous activities that could indicate illegal activity).

But the authors conclude the type of data mining that government bureaucrats would like to do–perhaps inspired by watching too many episodes of the Fox series 24–can’t work. “If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so.”

A summary of the recommendations:

  • U.S. government agencies should be required to follow a systematic process to evaluate the effectiveness, lawfulness, and consistency with U.S. values of every information-based program, whether classified or unclassified, for detecting and countering terrorists before it can be deployed, and periodically thereafter.
  • Periodically after a program has been operationally deployed, and in particular before a program enters a new phase in its life cycle, policy makers should carefully review the program before allowing it to continue operations or to proceed to the next phase.
  • To protect the privacy of innocent people, the research and development of any information-based counterterrorism program should be conducted with synthetic population data… At all stages of a phased deployment, data about individuals should be rigorously subjected to the full safeguards of the framework.
  • Any information-based counterterrorism program of the U.S. government should be subjected to robust, independent oversight of the operations of that program, a part of which would entail a practice of using the same data mining technologies to “mine the miners and track the trackers.”
  • Counterterrorism programs should provide meaningful redress to any individuals inappropriately harmed by their operation.
  • The U.S. government should periodically review the nation’s laws, policies, and procedures that protect individuals’ private information for relevance and effectiveness in light of changing technologies and circumstances. In particular, Congress should re-examine existing law to consider how privacy should be protected in the context of information-based programs (e.g., data mining) for counterterrorism.

Here are more news articles on the report. I explained why data mining wouldn’t find terrorists back in 2005.

EDITED TO ADD (10/10): More commentary:

As the NRC report points out, not only is the training data lacking, but the input data that you’d actually be mining has been purposely corrupted by the terrorists themselves. Terrorist plotters actively disguise their activities using operational security measures (opsec) like code words, encryption, and other forms of covert communication. So, even if we had access to a copious and pristine body of training data that we could use to generalize about the “typical terrorist,” the new data that’s coming into the data mining system is suspect.

To return to the credit reporting analogy, credit scores would be worthless to lenders if everyone could manipulate their credit history (e.g., hide past delinquencies) the way that terrorists can manipulate the data trails that they leave as they buy gas, enter buildings, make phone calls, surf the Internet, etc.

So this application of data mining bumps up against the classic GIGO (garbage in, garbage out) problem in computing, with the terrorists deliberately feeding the system garbage. What this means in real-world terms is that the success of our counter-terrorism data mining efforts is completely dependent on the failure of terrorist cells to maintain operational security.

The combination of the GIGO problem and the lack of suitable training data combine to make big investments in automated terrorist identification a futile and wasteful effort. Furthermore, these two problems are structural, so they’re not going away. All legitimate concerns about false positives and corrosive effects on civil liberties aside, data mining will never give authorities the ability to identify terrorists or terrorist networks with any degree of confidence.
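
The false-positive concern is easy to make concrete with some rough base-rate arithmetic. The numbers below are illustrative assumptions, not figures from the NRC report, but the shape of the result holds for any plausible values:

```python
# Base-rate arithmetic for an automated terrorist-detection system.
# All numbers are illustrative assumptions.

population = 300_000_000      # people being screened
terrorists = 1_000            # actual plotters hidden in that population
true_positive_rate = 0.99     # system flags 99% of real plotters
false_positive_rate = 0.001   # system wrongly flags 0.1% of innocents

flagged_guilty = terrorists * true_positive_rate
flagged_innocent = (population - terrorists) * false_positive_rate

# Probability that a flagged person is actually a terrorist
precision = flagged_guilty / (flagged_guilty + flagged_innocent)

print(f"innocent people flagged: {flagged_innocent:,.0f}")
print(f"chance a flag is a real terrorist: {precision:.2%}")
```

Even with an implausibly accurate system, nearly 300,000 innocent people get flagged and only about one flag in 300 points at a real terrorist. That's the base-rate problem, before the GIGO problem even enters the picture.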

Posted on October 10, 2008 at 6:35 AM

NSA Snooping on Cell Phone Calls

From CNet:

A recent article in the London Review of Books revealed that a number of private companies now sell off-the-shelf data-mining solutions to government spies interested in analyzing mobile-phone calling records and real-time location information. These companies include ThorpeGlen, VASTech, Kommlabs, and Aqsacom–all of which sell “passive probing” data-mining services to governments around the world.

ThorpeGlen, a U.K.-based firm, offers intelligence analysts a graphical interface to the company’s mobile-phone location and call-record data-mining software. Want to determine a suspect’s “community of interest”? Easy. Want to learn if a single person is swapping SIM cards or throwing away phones (yet still hanging out in the same physical location)? No problem.

In a Web demo (PDF) (mirrored here) to potential customers back in May, ThorpeGlen’s vice president of global sales showed off the company’s tools by mining a dataset of a single week’s worth of call data from 50 million users in Indonesia, which it crunched in order to try to discover small anti-social groups that only call each other.

Posted on September 17, 2008 at 12:49 PM

Data Mining to Detect Pump-and-Dump Scams

I don’t know any of the details, but this seems like a good use of data mining:

Mr Tancredi said Verisign’s fraud detection kit would help “decrease the time between the attack being launched and the brokerage being able to respond”.

Before now, he said, brokerages relied on counter measures such as restrictive stock trading or analysis packages that only spotted a problem when money had gone.

Verisign’s software is a module that brokers can add to their in-house trading system that alerts anti-fraud teams to look more closely at trades that exhibit certain behaviour patterns.

“What this self-learning behavioural engine does is look at the different attributes of the event, not necessarily about the computer or where you are logging on from but about the actual transaction, the trade, the amount of the trade,” said Mr Tancredi.

“For example have you liquidated all of your assets in stock that you own in order to buy one penny stock?” he said. “Another example is when a customer who normally trades tech stock on Nasdaq all of a sudden trades a penny stock that has to do with health care and is placing a trade four times more than normal.”

This is a good use of data mining because, as I said previously:

Data mining works best when there’s a well-defined profile you’re searching for, a reasonable number of attacks per year, and a low cost of false alarms.
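
To make the idea concrete, here is a sketch of what a behavioural rule engine of this flavour might look like. The field names, thresholds, and rules are invented for illustration; Verisign's actual engine is self-learning and surely more sophisticated:

```python
# Toy behavioural check for suspicious trades, in the spirit of the
# examples quoted above. All profile fields and thresholds are invented.

def suspicious(trade, profile):
    """Return the reasons a trade deviates from a customer's normal behaviour."""
    reasons = []
    if trade["price"] < 1.00 and trade["sector"] not in profile["usual_sectors"]:
        reasons.append("penny stock outside usual sectors")
    if trade["amount"] > 4 * profile["avg_trade_amount"]:
        reasons.append("trade size more than 4x normal")
    if trade.get("liquidated_portfolio"):
        reasons.append("liquidated holdings to fund the purchase")
    return reasons

# A Nasdaq tech trader suddenly dumps everything into a health-care penny stock
profile = {"usual_sectors": {"tech"}, "avg_trade_amount": 2_000}
trade = {"price": 0.08, "sector": "health care", "amount": 9_000,
         "liquidated_portfolio": True}

print(suspicious(trade, profile))  # all three rules fire
```

Note how well this fits the criteria: the profile of a compromised account is well defined, attacks are rare enough to investigate individually, and a false alarm costs only an analyst's second look, not an innocent person's liberty.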

Another news article here.

Posted on August 14, 2008 at 6:10 AM

NSA's Domestic Spying

This article from The Wall Street Journal outlines how the NSA is increasingly engaging in domestic surveillance, data collection, and data mining. The result is essentially the same as Total Information Awareness.

According to current and former intelligence officials, the spy agency now monitors huge volumes of records of domestic emails and Internet searches as well as bank transfers, credit-card transactions, travel and telephone records. The NSA receives this so-called “transactional” data from other agencies or private companies, and its sophisticated software programs analyze the various transactions for suspicious patterns. Then they spit out leads to be explored by counterterrorism programs across the U.S. government, such as the NSA’s own Terrorist Surveillance Program, formed to intercept phone calls and emails between the U.S. and overseas without a judge’s approval when a link to al Qaeda is suspected.


Two former officials familiar with the data-sifting efforts said they work by starting with some sort of lead, like a phone number or Internet address. In partnership with the FBI, the systems then can track all domestic and foreign transactions of people associated with that item — and then the people who associated with them, and so on, casting a gradually wider net. An intelligence official described more of a rapid-response effect: If a person suspected of terrorist connections is believed to be in a U.S. city — for instance, Detroit, a community with a high concentration of Muslim Americans — the government’s spy systems may be directed to collect and analyze all electronic communications into and out of the city.

The haul can include records of phone calls, email headers and destinations, data on financial transactions and records of Internet browsing. The system also would collect information about other people, including those in the U.S., who communicated with people in Detroit.

The information doesn’t generally include the contents of conversations or emails. But it can give such transactional information as a cellphone’s location, whom a person is calling, and what Web sites he or she is visiting. For an email, the data haul can include the identities of the sender and recipient and the subject line, but not the content of the message.

Intelligence agencies have used administrative subpoenas issued by the FBI — which don’t need a judge’s signature — to collect and analyze such data, current and former intelligence officials said. If that data provided “reasonable suspicion” that a person, whether foreign or from the U.S., was linked to al Qaeda, intelligence officers could eavesdrop under the NSA’s Terrorist Surveillance Program.


The NSA uses its own high-powered version of social-network analysis to search for possible new patterns and links to terrorism. The Pentagon’s experimental Total Information Awareness program, later renamed Terrorism Information Awareness, was an early research effort on the same concept, designed to bring together and analyze as much and as many varied kinds of data as possible. Congress eliminated funding for the program in 2003 before it began operating. But it permitted some of the research to continue and TIA technology to be used for foreign surveillance.

Some of it was shifted to the NSA — which also is funded by the Pentagon — and put in the so-called black budget, where it would receive less scrutiny and bolster other data-sifting efforts, current and former intelligence officials said. “When it got taken apart, it didn’t get thrown away,” says a former top government official familiar with the TIA program.

Two current officials also said the NSA’s current combination of programs now largely mirrors the former TIA project. But the NSA offers less privacy protection. TIA developers researched ways to limit the use of the system for broad searches of individuals’ data, such as requiring intelligence officers to get leads from other sources first. The NSA effort lacks those controls, as well as controls that it developed in the 1990s for an earlier data-sweeping attempt.

Barry Steinhardt of the ACLU comments:

I mean, when we warn about a “surveillance society,” this is what we’re talking about. This is it, this is the ballgame. Mass data from a wide variety of sources — including the private sector — is being collected and scanned by a secretive military spy agency. This represents nothing less than a major change in American life — and unless stopped the consequences of this system for everybody will grow in magnitude along with the rivers of data that are collected about each of us — and that’s more and more every day.

More commentary.

Posted on March 26, 2008 at 6:02 AM

Searching for Terrorists in World of Warcraft

So, you’re sitting around the house with your buddies, playing World of Warcraft. One of you wonders: “How can we get paid for doing this?” Another says: “I know; let’s pretend we’re fighting terrorism, and then get a government grant.”

Having eliminated all terrorism in the real world, the U.S. intelligence community is working to develop software that will detect violent extremists infiltrating World of Warcraft and other massive multiplayer games, according to a data-mining report from the Director of National Intelligence.

Another article.

You just can’t make this stuff up.

EDITED TO ADD (3/13): Funny.

Posted on March 11, 2008 at 2:42 PM

Anonymity and the Netflix Dataset

Last year, Netflix published 10 million movie rankings by 500,000 customers, as part of a challenge for people to come up with better recommendation systems than the one the company was using. The data was anonymized by removing personal details and replacing names with random numbers, to protect the privacy of the recommenders.

Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, de-anonymized some of the Netflix data by comparing rankings and timestamps with public information in the Internet Movie Database, or IMDb.

Their research (.pdf) illustrates some inherent security problems with anonymous data, but first it’s important to explain what they did and did not do.

They did not reverse the anonymity of the entire Netflix dataset. What they did was reverse the anonymity of the Netflix dataset for those sampled users who also entered some movie rankings, under their own names, in the IMDb. (While IMDb’s records are public, crawling the site to get them is against the IMDb’s terms of service, so the researchers used a representative few to prove their algorithm.)

The point of the research was to demonstrate how little information is required to de-anonymize information in the Netflix dataset.

On one hand, isn’t that sort of obvious? The risks of anonymous databases have been written about before, such as in this 2001 paper published in an IEEE journal. The researchers working with the anonymous Netflix data didn’t painstakingly figure out people’s identities — as others did with the AOL search database last year — they just compared it with an already identified subset of similar data: a standard data-mining technique.

But as opportunities for this kind of analysis pop up more frequently, lots of anonymous data could end up at risk.

Someone with access to an anonymous dataset of telephone records, for example, might partially de-anonymize it by correlating it with a catalog merchant’s telephone-order database. Or Amazon’s online book reviews could be the key to partially de-anonymizing a public database of credit card purchases, or a larger database of anonymous book reviews.

Google, with its database of users’ internet searches, could easily de-anonymize a public database of internet purchases, or zero in on searches of medical terms to de-anonymize a public health database. Merchants who maintain detailed customer and purchase information could use their data to partially de-anonymize any large search engine’s data, if it were released in an anonymized form. A data broker holding databases of several companies might be able to de-anonymize most of the records in those databases.

What the University of Texas researchers demonstrate is that this process isn’t hard, and doesn’t require a lot of data. It turns out that if you eliminate the top 100 movies everyone watches, our movie-watching habits are all pretty individual. This would certainly hold true for our book reading habits, our internet shopping habits, our telephone habits and our web searching habits.

The obvious countermeasures for this are, sadly, inadequate. Netflix could have randomized its dataset by removing a subset of the data, changing the timestamps or adding deliberate errors into the unique ID numbers it used to replace the names. It turns out, though, that this only makes the problem slightly harder. Narayanan and Shmatikov’s de-anonymization algorithm is surprisingly robust, and works with partial data, data that has been perturbed, and even data with errors in it.

With only eight movie ratings (of which two may be completely wrong), and dates that may be up to two weeks in error, they can uniquely identify 99 percent of the records in the dataset. After that, all they need is a little bit of identifiable data: from the IMDb, from your blog, from anywhere. The moral is that it takes only a small named database for someone to pry the anonymity off a much larger anonymous database.
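
The linkage attack itself fits in a few lines of code. The scoring rule below is a simplification of Narayanan and Shmatikov's actual algorithm, and the records, names, and tolerances are invented, but it shows the shape of the attack: score every named candidate against the anonymous record, and accept the best one only if it clearly stands out from the rest:

```python
from datetime import date

# Each record maps movie -> (rating, date). "anon" is a Netflix-style
# anonymous record; "public" holds named IMDb-style records.

def score(anon, candidate, day_slack=14):
    """Count how well a named candidate's ratings line up with the anon record."""
    s = 0.0
    for movie, (rating, when) in candidate.items():
        if movie in anon:
            a_rating, a_when = anon[movie]
            if abs((a_when - when).days) <= day_slack:
                s += 1.0 if a_rating == rating else 0.5  # partial credit
    return s

def best_match(anon, public, margin=2.0):
    """Name the best candidate only if it clearly beats the runner-up."""
    scores = {name: score(anon, public[name]) for name in public}
    ranked = sorted(scores, key=scores.get, reverse=True)
    top, second = ranked[0], ranked[1]
    return top if scores[top] - scores[second] >= margin else None

anon = {"Brazil": (5, date(2005, 3, 1)), "Heat": (3, date(2005, 6, 2)),
        "Ran": (4, date(2005, 7, 9))}
public = {
    "alice": {"Brazil": (5, date(2005, 3, 4)), "Heat": (3, date(2005, 6, 1)),
              "Ran": (4, date(2005, 7, 12))},
    "bob": {"Brazil": (2, date(2004, 1, 1))},
}
print(best_match(anon, public))  # alice's IMDb ratings match the anon record
```

The real algorithm weights rare movies more heavily than popular ones, which is why eliminating the top 100 titles matters so much.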

Other research reaches the same conclusion. Using public anonymous data from the 1990 census, Latanya Sweeney found that 87 percent of the population in the United States, 216 million of 248 million, could likely be uniquely identified by their five-digit ZIP code, combined with their gender and date of birth. About half of the U.S. population is likely identifiable by gender, date of birth and the city, town or municipality in which the person resides. Expanding the geographic scope to an entire county reduces that to a still-significant 18 percent. “In general,” she wrote, “few characteristics are needed to uniquely identify a person.”

Stanford University researchers reported similar results using 2000 census data. It turns out that date of birth, which (unlike birthday month and day alone) sorts people into thousands of different buckets, is incredibly valuable in disambiguating people.
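
The uniqueness claim is easy to check on any dataset you hold: bucket people by their quasi-identifier tuple and count the singleton buckets. A toy illustration, with invented records rather than census data:

```python
from collections import Counter

# Quasi-identifier tuples: (ZIP code, gender, date of birth). Invented data.
people = [
    ("02138", "F", "1960-07-07"),
    ("02138", "M", "1945-01-02"),
    ("02138", "M", "1945-01-02"),   # two people share this combination
    ("90210", "F", "1971-11-30"),
]

buckets = Counter(people)
unique = sum(1 for count in buckets.values() if count == 1)
print(f"{unique} of {len(people)} people are uniquely identifiable "
      f"by (ZIP, gender, date of birth)")
```

Run this over real demographic data and, per Sweeney's result, the singleton fraction lands near 87 percent.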

This has profound implications for releasing anonymous data. On one hand, anonymous data is an enormous boon for researchers — AOL did a good thing when it released its anonymous dataset for research purposes, and it’s sad that the CTO resigned and an entire research team was fired after the public outcry. Large anonymous databases of medical data are enormously valuable to society: for large-scale pharmacology studies, long-term follow-up studies and so on. Even anonymous telephone data makes for fascinating research.

On the other hand, in the age of wholesale surveillance, where everyone collects data on us all the time, anonymization is very fragile and riskier than it initially seems.

Like everything else in security, anonymity systems shouldn’t be fielded before being subjected to adversarial attacks. We all know that it’s folly to implement a cryptographic system before it’s rigorously attacked; why should we expect anonymity systems to be any different? And, like everything else in security, anonymity is a trade-off. There are benefits, and there are corresponding risks.

Narayanan and Shmatikov are currently working on developing algorithms and techniques that enable the secure release of anonymous datasets like Netflix’s. That’s a research result we can all benefit from.

This essay originally appeared on Wired.com.

Posted on December 18, 2007 at 5:53 AM

Programming for Wholesale Surveillance and Data Mining

AT&T has done the research:

They use high-tech data-mining algorithms to scan through the huge daily logs of every call made on the AT&T network; then they use sophisticated algorithms to analyze the connections between phone numbers: who is talking to whom? The paper literally uses the term “Guilt by Association” to describe what they’re looking for: what phone numbers are in contact with other numbers that are in contact with the bad guys?

When this research was done, back in the last century, the bad guys were people who wanted to rip off AT&T by making fraudulent credit-card calls. (Remember, back in the last century, intercontinental long-distance voice communication actually cost money!) But it’s easy to see how the FBI could use this to chase down anyone who talked to anyone who talked to a terrorist. Or even to a “terrorist.”
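
At its core, "guilt by association" is just a breadth-first search outward from known bad numbers over the call graph. A minimal sketch, with invented phone numbers (the AT&T work used far more sophisticated link analysis):

```python
from collections import deque

# Undirected edges between phone numbers that called each other. Invented data.
calls = [("555-0001", "555-0002"), ("555-0002", "555-0003"),
         ("555-0003", "555-0004"), ("555-0005", "555-0006")]

graph = {}
for a, b in calls:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def within_hops(bad, max_hops=2):
    """Every number reachable within max_hops calls of a known bad number."""
    seen = {n: 0 for n in bad}
    queue = deque(bad)
    while queue:
        n = queue.popleft()
        if seen[n] == max_hops:
            continue
        for m in graph.get(n, ()):
            if m not in seen:
                seen[m] = seen[n] + 1
                queue.append(m)
    return set(seen) - set(bad)

print(sorted(within_hops({"555-0001"})))  # everyone one or two calls away
```

Note how fast the net widens: every extra hop multiplies the suspect pool by the average number of contacts per phone, which is exactly why "talked to anyone who talked to" sweeps in so many innocents.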

Posted on October 31, 2007 at 12:03 PM

Technical Details on the FBI's Wiretapping Network

There’s a must-read article on Wired.com about DCSNet (Digital Collection System Network), the FBI’s high-tech point-and-click domestic wiretapping network. The information is based on nearly 1,000 pages of documentation released under FOIA to the EFF.

Together, the surveillance systems let FBI agents play back recordings even as they are being captured (like TiVo), create master wiretap files, send digital recordings to translators, track the rough location of targets in real time using cell-tower information, and even stream intercepts outward to mobile surveillance vans.

FBI wiretapping rooms in field offices and undercover locations around the country are connected through a private, encrypted backbone that is separated from the internet. Sprint runs it on the government’s behalf.

The network allows an FBI agent in New York, for example, to remotely set up a wiretap on a cell phone based in Sacramento, California, and immediately learn the phone’s location, then begin receiving conversations, text messages and voicemail pass codes in New York. With a few keystrokes, the agent can route the recordings to language specialists for translation.

The numbers dialed are automatically sent to FBI analysts trained to interpret phone-call patterns, and are transferred nightly, by external storage devices, to the bureau’s Telephone Application Database, where they’re subjected to a type of data mining called link analysis.

FBI endpoints on DCSNet have swelled over the years, from 20 “central monitoring plants” at the program’s inception, to 57 in 2005, according to undated pages in the released documents. By 2002, those endpoints connected to more than 350 switches.

Today, most carriers maintain their own central hub, called a “mediation switch,” that’s networked to all the individual switches owned by that carrier, according to the FBI. The FBI’s DCS software links to those mediation switches over the internet, likely using an encrypted VPN. Some carriers run the mediation switch themselves, while others pay companies like VeriSign to handle the whole wiretapping process for them.

Much, much more in the article. (And much chatter on this Slashdot thread.)

EDITED TO ADD (8/31): Commentary by Matt Blaze and Steve Bellovin.

Posted on August 29, 2007 at 11:39 AM

Police Data Mining Done Right

It’s nice to find an example of the police using data mining correctly: not as security theater, but more as a business-intelligence tool:

When Munroe took over as chief two years ago, his department was drowning in crime and data. Police had a mass of data from 911 calls and crime reports; what they didn’t have was a way to connect the dots and see a pattern of behaviour.

Using some sophisticated software and hardware they started overlaying crime reports with other data, such as weather, traffic, sports events and paydays for large employers. The data was analyzed three times a day and something interesting emerged: Robberies spiked on paydays near cheque cashing storefronts in specific neighbourhoods. Other clusters also became apparent, and pretty soon police were deploying resources in advance and predicting where crime was most likely to occur.
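
The analysis described needs nothing fancier than counting incidents by their contextual attributes and looking at the biggest buckets. A sketch with invented data (the real system overlaid many more data sources):

```python
from collections import Counter

# Each report: (crime type, neighbourhood, whether it fell on a local payday).
# Invented data for illustration.
reports = [
    ("robbery", "eastside", True), ("robbery", "eastside", True),
    ("robbery", "eastside", True), ("robbery", "westside", False),
    ("burglary", "eastside", False), ("robbery", "eastside", False),
]

clusters = Counter(reports)
hot = clusters.most_common(1)[0]
print(hot)  # ('robbery', 'eastside', True) is the standout cluster
```

The payday-robbery spike falls straight out of the counts, and the police can pre-position patrols. Again, this works because it's business intelligence, not needle-in-a-haystack prediction: the events are common, the pattern is well defined, and a wrong guess costs only a wasted patrol shift.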

Posted on August 10, 2007 at 6:51 AM

Risks of Data Reuse

We learned the news in March: Contrary to decades of denials, the U.S. Census Bureau used individual records to round up Japanese-Americans during World War II.

The Census Bureau normally is prohibited by law from revealing data that could be linked to specific individuals; the law exists to encourage people to answer census questions accurately and without fear. And while the Second War Powers Act of 1942 temporarily suspended that protection in order to locate Japanese-Americans, the Census Bureau had maintained that it only provided general information about neighborhoods.

New research proves they were lying.

The whole incident serves as a poignant illustration of one of the thorniest problems of the information age: data collected for one purpose and then used for another, or “data reuse.”

When we think about our personal data, what bothers us most is generally not the initial collection and use, but the secondary uses. I personally appreciate it when Amazon.com suggests books that might interest me, based on books I have already bought. I like it that my airline knows what type of seat and meal I prefer, and my hotel chain keeps records of my room preferences. I don’t mind that my automatic road-toll collection tag is tied to my credit card, and that I get billed automatically. I even like the detailed summary of my purchases that my credit card company sends me at the end of every year. What I don’t want, though, is any of these companies selling that data to brokers, or for law enforcement to be allowed to paw through those records without a warrant.

There are two bothersome issues about data reuse. First, we lose control of our data. In all of the examples above, there is an implied agreement between the data collector and me: It gets the data in order to provide me with some sort of service. Once the data collector sells it to a broker, though, it’s out of my hands. It might show up on some telemarketer’s screen, or in a detailed report to a potential employer, or as part of a data-mining system to evaluate my personal terrorism risk. It becomes part of my data shadow, which always follows me around but I can never see.

This, of course, affects our willingness to give up personal data in the first place. The reason U.S. census data was declared off-limits for other uses was to placate Americans’ fears and assure them that they could answer questions truthfully. How accurate would you be in filling out your census forms if you knew the FBI would be mining the data, looking for terrorists? How would it affect your supermarket purchases if you knew people were examining them and making judgments about your lifestyle? I know many people who engage in data poisoning: deliberately lying on forms in order to propagate erroneous data. I’m sure many of them would stop that practice if they could be sure that the data was only used for the purpose for which it was collected.

The second issue about data reuse is error rates. All data has errors, and different uses can tolerate different amounts of error. The sorts of marketing databases you can buy on the web, for example, are notoriously error-filled. That’s OK; if the database of ultra-affluent Americans of a particular ethnicity you just bought has a 10 percent error rate, you can factor that cost into your marketing campaign. But that same database, with that same error rate, might be useless for law enforcement purposes.

Understanding error rates and how they propagate is vital when evaluating any system that reuses data, especially for law enforcement purposes. A few years ago, the Transportation Security Administration’s follow-on watch list system, Secure Flight, was going to use commercial data to give people a terrorism risk score and determine how much they were going to be questioned or searched at the airport. People rightly rebelled against the thought of being judged in secret, but there was much less discussion about whether the commercial data from credit bureaus was accurate enough for this application.

An even more egregious example of error-rate problems occurred in 2000, when the Florida Division of Elections contracted with Database Technologies (since merged with ChoicePoint) to remove convicted felons from the voting rolls. The databases used were filled with errors and the matching procedures were sloppy, which resulted in thousands of disenfranchised voters — mostly black — and almost certainly changed a presidential election result.
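
The danger of loose matching is easy to demonstrate. This sketch uses an invented (and deliberately sloppy) rule of the same general flavour as the Florida purge: match on a truncated surname plus birth year, and watch the wrong people get swept up:

```python
# A deliberately sloppy matching rule: first four letters of the surname
# plus birth year. All names and records are invented.

felons = [("Johnson", 1962), ("Smith", 1970)]
voters = [("Johnston", 1962), ("Johnson", 1962), ("Smithers", 1970)]

def sloppy_key(name, year):
    return (name[:4].lower(), year)

felon_keys = {sloppy_key(n, y) for n, y in felons}
purged = [v for v in voters if sloppy_key(*v) in felon_keys]
print(purged)  # Johnston and Smithers are wrongly purged alongside Johnson
```

Two of the three "matches" are innocent voters who merely resemble a felon under the sloppy key, and in the real purge the people harmed had no idea until they were turned away at the polls.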

Of course, there are beneficial uses of secondary data. Take, for example, personal medical data. It’s personal and intimate, yet valuable to society in aggregate. Think of what we could do with a database of everyone’s health information: massive studies examining the long-term effects of different drugs and treatment options, different environmental factors, different lifestyle choices. There’s an enormous amount of important research potential hidden in that data, and it’s worth figuring out how to get at it without compromising individual privacy.

This is largely a matter of legislation. Technology alone can never protect our rights. There are just too many reasons not to trust it, and too many ways to subvert it. Data privacy ultimately stems from our laws, and strong legal protections are fundamental to protecting our information against abuse. But at the same time, technology is still vital.

Both the Japanese internment and the Florida voting-roll purge demonstrate that laws can change … and sometimes change quickly. We need to build systems with privacy-enhancing technologies that limit data collection wherever possible. Data that is never collected cannot be reused. Data that is collected anonymously, or deleted immediately after it is used, is much harder to reuse. It’s easy to build systems that collect data on everything — it’s what computers naturally do — but it’s far better to take the time to understand what data is needed and why, and only collect that.

History will record what we, here in the early decades of the information age, did to foster freedom, liberty and democracy. Did we build information technologies that protected people’s freedoms even during times when society tried to subvert them? Or did we build technologies that could easily be modified to watch and control? It’s bad civic hygiene to build an infrastructure that can be used to facilitate a police state.

This article originally appeared on Wired.com.

Posted on June 28, 2007 at 8:34 AM
