Entries Tagged "data mining"


AOL Releases Massive Amount of Search Data

From TechCrunch:

AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with “buy ecstasy” and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.

This is search data for roughly 658,000 anonymized users over a three-month period from March to May—about one-third of one percent of AOL’s total search data for that period.

Now AOL says it was all a mistake. They pulled the data, but it’s still out there—and probably will be forever. And there’s some pretty scary stuff in it.

You can read more on Slashdot and elsewhere.

Anyone who wants to play NSA can start data mining for terrorists. Let us know if you find anything.

EDITED TO ADD (8/9): The New York Times:

And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”

It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her.

Posted on August 8, 2006 at 11:02 AM

Terrorists, Data Mining, and the Base Rate Fallacy

I have already explained why NSA-style wholesale surveillance data-mining systems are useless for finding terrorists. Here’s a more formal explanation:

Floyd Rudmin, a professor at a Norwegian university, applies the mathematics of conditional probability, known as Bayes’ Theorem, to demonstrate that the NSA’s surveillance cannot successfully detect terrorists unless both the percentage of terrorists in the population and the accuracy rate of their identification are far higher than they are. He correctly concludes that “NSA’s surveillance system is useless for finding terrorists.”

The surveillance is, however, useful for monitoring political opposition and stymieing the activities of those who do not believe the government’s propaganda.

And here’s the analysis:

What is the probability that people are terrorists given that NSA’s mass surveillance identifies them as terrorists? If the probability is zero (p=0.00), then they certainly are not terrorists, and NSA was wasting resources and damaging the lives of innocent citizens. If the probability is one (p=1.00), then they definitely are terrorists, and NSA has saved the day. If the probability is fifty-fifty (p=0.50), that is the same as guessing the flip of a coin. The conditional probability that people are terrorists, given that the NSA surveillance system says they are, had better be very near to one (p=1.00) and very far from zero (p=0.00).

The mathematics of conditional probability was worked out by the English logician Thomas Bayes. If you Google “Bayes’ Theorem”, you will get more than a million hits. Bayes’ Theorem is taught in all elementary statistics classes. Everyone at NSA certainly knows Bayes’ Theorem.

To know whether mass surveillance will work, Bayes’ Theorem requires three estimates:

  1. The base-rate for terrorists, i.e. what proportion of the population are terrorists;
  2. The accuracy rate, i.e., the probability that real terrorists will be identified by NSA;
  3. The misidentification rate, i.e., the probability that innocent citizens will be misidentified by NSA as terrorists.

No matter how sophisticated and super-duper NSA’s methods for identifying terrorists are, no matter how big and fast NSA’s computers are, NSA’s accuracy rate will never be 100% and its misidentification rate will never be 0%. That fact, plus the extremely low base-rate for terrorists, means it is logically impossible for mass surveillance to be an effective way to find terrorists.

I will not put Bayes’ computational formula here. It is available in all elementary statistics books and is on the web should any readers be interested. But I will compute some conditional probabilities that people are terrorists given that NSA’s system of mass surveillance identifies them to be terrorists.

The US Census shows that there are about 300 million people living in the USA.

Suppose that there are 1,000 terrorists there as well, which is probably a high estimate. The base-rate would be 1 terrorist per 300,000 people. In percentages, that is .00033%, which is way less than 1%. Suppose that NSA surveillance has an accuracy rate of .40, which means that 40% of real terrorists in the USA will be identified by NSA’s monitoring of everyone’s email and phone calls. This is probably a high estimate, considering that terrorists are doing their best to avoid detection. There is no evidence thus far that NSA has been so successful at finding terrorists. And suppose NSA’s misidentification rate is .0001, which means that .01% of innocent people will be misidentified as terrorists, at least until they are investigated, detained and interrogated. Note that .01% of the US population is 30,000 people. With these suppositions, the probability that people are terrorists given that NSA’s system of surveillance identifies them as terrorists is only p=0.0132, which is near zero, very far from one. Ergo, NSA’s surveillance system is useless for finding terrorists.

Suppose that NSA’s system is more accurate than .40, let’s say, .70, which means that 70% of terrorists in the USA will be found by mass monitoring of phone calls and email messages. Then, by Bayes’ Theorem, the probability that a person is a terrorist if targeted by NSA is still only p=0.0228, which is near zero, far from one, and useless.

Suppose that NSA’s system is really, really good, with an accuracy rate of .90 and a misidentification rate of .00001, which means that only 3,000 innocent people are misidentified as terrorists. With these suppositions, the probability that people are terrorists given that NSA’s system of surveillance identifies them as terrorists is only p=0.2308, which is far from one and well below flipping a coin. NSA’s domestic monitoring of everyone’s email and phone calls is useless for finding terrorists.
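
For readers who want to check Rudmin’s arithmetic, here is a minimal sketch in Python of the conditional-probability calculation; the base rate, accuracy rate, and misidentification rate are his assumed figures, not measured values.

    def p_terrorist_given_flagged(base_rate, accuracy, misid_rate):
        # Bayes' Theorem: P(terrorist | flagged by the surveillance system)
        true_positives = accuracy * base_rate
        false_positives = misid_rate * (1 - base_rate)
        return true_positives / (true_positives + false_positives)

    base_rate = 1_000 / 300_000_000  # assumed 1,000 terrorists among 300 million people

    print(p_terrorist_given_flagged(base_rate, 0.40, 0.0001))   # ~0.013
    print(p_terrorist_given_flagged(base_rate, 0.70, 0.0001))   # ~0.023
    print(p_terrorist_given_flagged(base_rate, 0.90, 0.00001))  # ~0.23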

As an exercise for the reader, you can use the same analysis to show that data mining is an excellent tool for finding stolen credit cards or stolen cell phones. Data mining is by no means useless; it’s just useless for this particular application.
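
A rough illustration of that exercise, with invented fraud numbers rather than figures from any study: because the base rate of card fraud is thousands of times higher than the base rate of terrorists, the same arithmetic gives a very different answer.

    base_rate = 0.01         # assumed: ~1% of cards misused in a year
    accuracy = 0.90          # assumed: detector catches 90% of fraud
    false_positive = 0.001   # assumed: 0.1% of legitimate activity gets flagged

    p_fraud_given_flagged = (accuracy * base_rate) / (
        accuracy * base_rate + false_positive * (1 - base_rate)
    )
    print(p_fraud_given_flagged)  # ~0.90 -- a flag is usually worth the phone call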

Posted on July 10, 2006 at 7:15 AM

Privacy-Enhanced Data Mining

There are a variety of encryption technologies that allow you to analyze data without knowing details of the data:

Largely by employing the head-spinning principles of cryptography, the researchers say they can ensure that law enforcement, intelligence agencies and private companies can sift through huge databases without seeing names and identifying details in the records.

For example, manifests of airplane passengers could be compared with terrorist watch lists—without airline staff or government agents seeing the actual names on the other side’s list. Only if a match were made would a computer alert each side to uncloak the record and probe further.

“If it’s possible to anonymize data and produce … the same results as clear text, why not?” John Bliss, a privacy lawyer in IBM’s “entity analytics” unit, told a recent workshop on the subject at Harvard University.

This is nothing new. I’ve seen papers on this sort of stuff since the late 1980s. The problem is that no one in law enforcement has any incentive to use them. Privacy is rarely a technological problem; it’s far more often a social or economic problem.
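
To make the blind-matching idea concrete, here is a deliberately simplified sketch: each side compares keyed fingerprints of names rather than the names themselves, and only a match triggers uncloaking the record. The key handling, names, and matching rule are invented for illustration; a real system would use a proper private set intersection protocol rather than a single shared key.

    import hashlib
    import hmac

    def fingerprint(name: str, key: bytes) -> str:
        # Keyed hash of a normalized name; neither side reveals its raw list.
        return hmac.new(key, name.strip().lower().encode(), hashlib.sha256).hexdigest()

    shared_key = b"held by a trusted matching service"  # assumption for this sketch

    watch_list = {fingerprint(n, shared_key) for n in ["John Doe", "Jane Roe"]}
    manifest = ["Alice Smith", "John Doe", "Bob Jones"]

    for passenger in manifest:
        if fingerprint(passenger, shared_key) in watch_list:
            print("Match -- uncloak this record and investigate further:", passenger)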

Posted on June 20, 2006 at 6:26 AM

Data Mining Software from IBM

In the long term, corporate data mining efforts are more of a privacy risk than government data mining efforts. And here’s an off-the-shelf product from IBM:

IBM Entity Analytic Solutions (EAS) is unique identity disambiguation software that provides public sector organizations or commercial enterprises with the ability to recognize and mitigate the incidence of fraud, threat and risk. This IBM EAS offering provides insight on demand, and in context, on “who is who,” “who knows who,” and “anonymously.”

This industry-leading, patented technology enables enterprise-wide identity insight, full attribution and self-correction in real time, and scales to process hundreds of millions of entities—all while accumulating context about those identities. It is the only software in the market that provides in-context information regarding non-obvious and obvious relationships that may exist between identities and can do it anonymously to enhance privacy of information.

For most businesses and government agencies, it is important to figure out when a person is using more than one identity package (that is, name, address, phone number, social insurance number and other such personal attributes), intentionally or unintentionally. Identity resolution software can help determine when two or more different-looking identity packages are describing the same person, even if the data is inconsistent. For example, by comparing names, addresses, phone numbers, social insurance numbers and other personal information across different records, this software might reveal that three customers calling themselves Tom R., Thomas Rogers, and T. Rogers are really just the same person.

It may also be useful for organizations to know with whom such a person associates. Relationship resolution software can process resolved identity data to find out whether people have worked for some of the same companies, for example. This would be useful to an organization that tracks down terrorists, but it can also help businesses such as banks, for example, to see whether the Hope Smith who just applied for a loan is related to Rock Smith, the account holder with a sterling credit rating.
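
As a toy illustration of the identity-resolution idea (the records and matching rules below are invented; commercial products use far more elaborate scoring), records that share a normalized phone number or address can be flagged as probably describing the same person:

    from collections import defaultdict

    # Invented records for illustration
    records = [
        {"name": "Tom R.",        "phone": "555-0101", "address": "12 Oak Street"},
        {"name": "Thomas Rogers", "phone": "5550101",  "address": "12 Oak St"},
        {"name": "T. Rogers",     "phone": "555-0199", "address": "12 oak st."},
        {"name": "Hope Smith",    "phone": "555-0123", "address": "9 Elm Ave"},
    ]

    def norm_phone(p):
        # Strip punctuation so "555-0101" and "5550101" compare equal
        return "".join(ch for ch in p if ch.isdigit())

    def norm_address(a):
        # Crude normalization: lowercase, drop periods, abbreviate "street"
        return a.lower().replace(".", "").replace("street", "st").strip()

    # Group records that share a normalized phone number or address
    groups = defaultdict(list)
    for r in records:
        groups[("phone", norm_phone(r["phone"]))].append(r["name"])
        groups[("address", norm_address(r["address"]))].append(r["name"])

    for key, names in groups.items():
        if len(names) > 1:
            print("Possibly the same person:", key, names)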

Posted on May 31, 2006 at 6:52 AM

The Problems with Data Mining

Great op-ed in The New York Times on why the NSA’s data mining efforts won’t work, by Jonathan Farley, math professor at Harvard.

The simplest reason is that we’re all connected. Not in the Haight-Ashbury/Timothy Leary/late-period Beatles kind of way, but in the sense of the Kevin Bacon game. The sociologist Stanley Milgram made this clear in the 1960’s when he took pairs of people unknown to each other, separated by a continent, and asked one of the pair to send a package to the other—but only by passing the package to a person he knew, who could then send the package only to someone he knew, and so on. On average, it took only six mailings—the famous six degrees of separation—for the package to reach its intended destination.

Looked at this way, President Bush is only a few steps away from Osama bin Laden (in the 1970’s he ran a company partly financed by the American representative for one of the Qaeda leader’s brothers). And terrorist hermits like the Unabomber are connected to only a very few people. So much for finding the guilty by association.

A second problem with the spy agency’s apparent methodology lies in the way terrorist groups operate and what scientists call the “strength of weak ties.” As the military scientist Robert Spulak has described it to me, you might not see your college roommate for 10 years, but if he were to call you up and ask to stay in your apartment, you’d let him. This is the principle under which sleeper cells operate: there is no communication for years. Thus for the most dangerous threats, the links between nodes that the agency is looking for simply might not exist.

(This, by him, is also worth reading.)

Posted on May 24, 2006 at 7:44 AM

Data Mining for Terrorists

In the post 9/11 world, there’s much focus on connecting the dots. Many believe that data mining is the crystal ball that will enable us to uncover future terrorist plots. But even in the most wildly optimistic projections, data mining isn’t tenable for that purpose. We’re not trading privacy for security; we’re giving up privacy and getting no security in return.

Most people first learned about data mining in November 2002, when news broke about a massive government data mining program called Total Information Awareness. The basic idea was as audacious as it was repellent: suck up as much data as possible about everyone, sift through it with massive computers, and investigate patterns that might indicate terrorist plots. Americans across the political spectrum denounced the program, and in September 2003, Congress eliminated its funding and closed its offices.

But TIA didn’t die. According to The National Journal, it just changed its name and moved inside the Defense Department.

This shouldn’t be a surprise. In May 2004, the General Accounting Office published a report that listed 122 different federal government data mining programs that used people’s personal information. This list didn’t include classified programs, like the NSA’s eavesdropping effort, or state-run programs like MATRIX.

The promise of data mining is compelling, and convinces many. But it’s wrong. We’re not going to find terrorist plots through systems like this, and we’re going to waste valuable resources chasing down false alarms. To understand why, we have to look at the economics of the system.

Security is always a trade-off, and for a system to be worthwhile, the advantages have to be greater than the disadvantages. A national security data mining program is going to find some percentage of real attacks, and some percentage of false alarms. If the benefits of finding and stopping those attacks outweigh the cost—in money, liberties, etc.—then the system is a good one. If not, then you’d be better off spending that cost elsewhere.

Data mining works best when there’s a well-defined profile you’re searching for, a reasonable number of attacks per year, and a low cost of false alarms. Credit card fraud is one of data mining’s success stories: all credit card companies data mine their transaction databases, looking for spending patterns that indicate a stolen card. Many credit card thieves share a pattern—purchase expensive luxury goods, purchase things that can be easily fenced, etc.—and data mining systems can minimize the losses in many cases by shutting down the card. In addition, the cost of false alarms is only a phone call to the cardholder asking him to verify a couple of purchases. The cardholders don’t even resent these phone calls—as long as they’re infrequent—so the cost is just a few minutes of operator time.

Terrorist plots are different. There is no well-defined profile, and attacks are very rare. Taken together, these facts mean that data mining systems won’t uncover any terrorist plots until they are very accurate, and that even very accurate systems will be so flooded with false alarms that they will be useless.

All data mining systems fail in two different ways: false positives and false negatives. A false positive is when the system identifies a terrorist plot that really isn’t one. A false negative is when the system misses an actual terrorist plot. Depending on how you “tune” your detection algorithms, you can err on one side or the other: you can increase the number of false positives to ensure that you are less likely to miss an actual terrorist plot, or you can reduce the number of false positives at the expense of missing terrorist plots.

To reduce both those numbers, you need a well-defined profile. And that’s a problem when it comes to terrorism. In hindsight, it was really easy to connect the 9/11 dots and point to the warning signs, but it’s much harder before the fact. Certainly, there are common warning signs that many terrorist plots share, but each is unique, as well. The better you can define what you’re looking for, the better your results will be. Data mining for terrorist plots is going to be sloppy, and it’s going to be hard to find anything useful.

Data mining is like searching for a needle in a haystack. There are 900 million credit cards in circulation in the United States. According to the FTC’s September 2003 Identity Theft Survey Report, about 1% (10 million) of those cards are stolen and used fraudulently each year. Terrorism is different. There are trillions of connections between people and events—things that the data mining system will have to “look at”—and very few plots. This rarity makes even accurate identification systems useless.

Let’s look at some numbers. We’ll be optimistic. We’ll assume the system has a 1 in 100 false positive rate (99% accurate), and a 1 in 1,000 false negative rate (99.9% accurate).

Assume one trillion possible indicators to sift through: that’s about ten events—e-mails, phone calls, purchases, web surfings, whatever—per person in the U.S. per day. Also assume that 10 of them are actually terrorists plotting.

This unrealistically accurate system will generate one billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999% and you’re still chasing 2,750 false alarms per day—but that will inevitably raise your false negatives, and you’re going to miss some of those ten real plots.
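
A back-of-the-envelope sketch of that arithmetic, using the essay’s assumed rates:

    events_per_year = 1_000_000_000_000   # one trillion indicators to sift through
    real_plots = 10                       # assumed real plots hidden among them

    false_positive_rate = 1 / 100         # system flags 1% of innocent events
    false_negative_rate = 1 / 1_000       # system misses 0.1% of real plots

    false_alarms = (events_per_year - real_plots) * false_positive_rate
    plots_found = real_plots * (1 - false_negative_rate)

    print(false_alarms / plots_found)     # ~1 billion false alarms per plot found
    print(false_alarms / 365)             # ~27 million alarms to chase every day

    # Even at an absurd 99.9999% false-positive accuracy:
    print(events_per_year * 0.000001 / 365)  # still ~2,740 false alarms per day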

This isn’t anything new. In statistics, it’s called the “base rate fallacy,” and it applies in other domains as well. For example, even highly accurate medical tests are useless as diagnostic tools if the incidence of the disease is rare in the general population. Terrorist attacks are also rare, so any “test” is going to result in an endless stream of false alarms.
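
A quick worked version of the medical analogy, with prevalence and test rates invented for illustration:

    prevalence = 1 / 10_000       # assumed: 1 in 10,000 people has the disease
    sensitivity = 0.99            # test catches 99% of real cases
    false_positive_rate = 0.01    # test wrongly flags 1% of healthy people

    p_sick_given_positive = (sensitivity * prevalence) / (
        sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    )
    print(p_sick_given_positive)  # ~0.0098 -- about 99% of positives are false alarms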

This is exactly the sort of thing we saw with the NSA’s eavesdropping program: the New York Times reported that the computers spat out thousands of tips per month. Every one of them turned out to be a false alarm.

And the cost was enormous: not just the cost of the FBI agents running around chasing dead-end leads instead of doing things that might actually make us safer, but also the cost in civil liberties. The fundamental freedoms that make our country the envy of the world are valuable, and not something that we should throw away lightly.

Data mining can work. It helps Visa keep the costs of fraud down, just as it helps Amazon.com show me books that I might want to buy, and Google show me advertising I’m more likely to be interested in. But these are all instances where the cost of false positives is low—a phone call from a Visa operator, or an uninteresting ad—and in systems that have value even if there is a high number of false negatives.

Finding terrorism plots is not a problem that lends itself to data mining. It’s a needle-in-a-haystack problem, and throwing more hay on the pile doesn’t make that problem any easier. We’d be far better off putting people in charge of investigating potential plots and letting them direct the computers, instead of putting the computers in charge and letting them decide who should be investigated.

This essay originally appeared on Wired.com.

Posted on March 9, 2006 at 7:44 AM

The Terrorist Threat of Paying Your Credit Card Balance

This article shows how badly terrorist profiling can go wrong:

They paid down some debt. The balance on their JCPenney Platinum MasterCard had gotten to an unhealthy level. So they sent in a large payment, a check for $6,522.

And an alarm went off. A red flag went up. The Soehnges’ behavior was found questionable.

And all they did was pay down their debt. They didn’t call a suspected terrorist on their cell phone. They didn’t try to sneak a machine gun through customs.

They just paid a hefty chunk of their credit card balance. And they learned how frighteningly wide the net of suspicion has been cast.

After sending in the check, they checked online to see if their account had been duly credited. They learned that the check had arrived, but the amount available for credit on their account hadn’t changed.

So Deana Soehnge called the credit-card company. Then Walter called.

“When you mess with my money, I want to know why,” he said.

They both learned the same astounding piece of information about the little things that can set the threat sensors to beeping and blinking.

They were told, as they moved up the managerial ladder at the call center, that the amount they had sent in was much larger than their normal monthly payment. And if the increase hits a certain percentage higher than that normal payment, Homeland Security has to be notified. And the money doesn’t move until the threat alert is lifted.

The article goes on to blame something called the Bank Privacy Act, but that’s not correct. The real culprits are the amendments made to the Bank Secrecy Act by the USA Patriot Act, Sections 351 and 352. There’s a general discussion here, and the Federal Register here.

There has been some rumbling on the net that this story is badly garbled—or even a hoax—but certainly this kind of thing is what financial institutions are required to report under the Patriot Act.

Remember, all the time spent chasing down silly false alarms is time wasted. Finding terrorist plots is a signal-to-noise problem, and stuff like this substantially decreases that ratio: it adds a lot of noise without adding enough signal. It makes us less safe, because it makes terrorist plots harder to find.

Posted on March 6, 2006 at 10:45 AM

Secure Flight Suspended

The TSA has announced that Secure Flight, its comprehensive program to match airline passengers against terrorist watch lists, has been suspended:

And because of security concerns, the government is going back to the drawing board with the program called Secure Flight after spending nearly four years and $150 million on it, the Senate Commerce Committee was told.

I have written about this program extensively, most recently here. It’s an absolute mess in every way, and doesn’t make us safer.

But don’t think this is the end. Under Section 4012 of the Intelligence Reform and Terrorism Prevention Act, Congress mandated that the TSA put in place a program to screen every domestic passenger against the watch list. Until Congress repeals that mandate, these postponements and suspensions are the best we can hope for. Expect it all to come back under a different name—and a clean record in the eyes of those not paying close attention—soon.

EDITED TO ADD (2/15): Ed Felten has some good commentary:

Instead of sticking to this more modest plan, Secure Flight became a vehicle for pie-in-the-sky plans about data mining and automatic identification of terrorists from consumer databases. As the program’s goals grew more ambitious and collided with practical design and deployment challenges, the program lost focus and seemed to have a different rationale and plan from one month to the next.

Posted on February 13, 2006 at 6:09 AM

Data Mining and Amazon Wishlists

Data Mining 101: Finding Subversives with Amazon Wishlists.

Now, imagine the false alarms and abuses that are possible if you have lots more data, and lots more computers to slice and dice it.

Of course, there are applications where this sort of data mining makes a whole lot of sense. But finding terrorists isn’t one of them. It’s a needle-in-a-haystack problem, and piling on more hay doesn’t help matters much.

Posted on January 5, 2006 at 6:15 AM
