## Entries Tagged "data mining"


### Data Mining and Terrorism

Nice article from CIO Magazine about data mining and terrorism.

### Terrorists, Data Mining, and the Base Rate Fallacy

I have already explained why NSA-style wholesale surveillance data-mining systems are useless for finding terrorists. Here’s a more formal explanation:

Floyd Rudmin, a professor at a Norwegian university, applies the mathematics of conditional probability, known as Bayes’ Theorem, to demonstrate that the NSA’s surveillance cannot successfully detect terrorists unless both the percentage of terrorists in the population and the accuracy rate of their identification are far higher than they are. He correctly concludes that “NSA’s surveillance system is useless for finding terrorists.”

The surveillance is, however, useful for monitoring political opposition and stymieing the activities of those who do not believe the government’s propaganda.

What is the probability that people are terrorists given that NSA’s mass surveillance identifies them as terrorists? If the probability is zero (p=0.00), then they certainly are not terrorists, and NSA was wasting resources and damaging the lives of innocent citizens. If the probability is one (p=1.00), then they definitely are terrorists, and NSA has saved the day. If the probability is fifty-fifty (p=0.50), that is the same as guessing the flip of a coin. The conditional probability that people are terrorists given that the NSA surveillance system says they are, that had better be very near to one (p=1.00) and very far from zero (p=0.00).

The mathematics of conditional probability were figured out by the Scottish logician Thomas Bayes. If you Google “Bayes’ Theorem”, you will get more than a million hits. Bayes’ Theorem is taught in all elementary statistics classes. Everyone at NSA certainly knows Bayes’ Theorem.

To know if mass surveillance will work, Bayes’ theorem requires three estimations:

1. The base-rate for terrorists, i.e. what proportion of the population are terrorists;
2. The accuracy rate, i.e., the probability that real terrorists will be identified by NSA;
3. The misidentification rate, i.e., the probability that innocent citizens will be misidentified by NSA as terrorists.

No matter how sophisticated and super-duper are NSA’s methods for identifying terrorists, no matter how big and fast are NSA’s computers, NSA’s accuracy rate will never be 100% and their misidentification rate will never be 0%. That fact, plus the extremely low base-rate for terrorists, means it is logically impossible for mass surveillance to be an effective way to find terrorists.

I will not put Bayes’ computational formula here. It is available in all elementary statistics books and is on the web should any readers be interested. But I will compute some conditional probabilities that people are terrorists given that NSA’s system of mass surveillance identifies them to be terrorists.
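For reference, the formula the passage leaves out is short. The symbols are labels chosen here for illustration, not the quoted author's notation: $T$ means "the person is a terrorist" and $F$ means "the system flags the person."

$$
P(T \mid F) = \frac{P(F \mid T)\,P(T)}{P(F \mid T)\,P(T) + P(F \mid \neg T)\,\bigl(1 - P(T)\bigr)}
$$

Here $P(T)$ is the base rate (item 1 in the list above), $P(F \mid T)$ is the accuracy rate (item 2), and $P(F \mid \neg T)$ is the misidentification rate (item 3).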

The US Census shows that there are about 300 million people living in the USA.

Suppose that there are 1,000 terrorists there as well, which is probably a high estimate. The base-rate would be 1 terrorist per 300,000 people. In percentages, that is .00033%, which is way less than 1%. Suppose that NSA surveillance has an accuracy rate of .40, which means that 40% of real terrorists in the USA will be identified by NSA’s monitoring of everyone’s email and phone calls. This is probably a high estimate, considering that terrorists are doing their best to avoid detection. There is no evidence thus far that NSA has been so successful at finding terrorists. And suppose NSA’s misidentification rate is .0001, which means that .01% of innocent people will be misidentified as terrorists, at least until they are investigated, detained and interrogated. Note that .01% of the US population is 30,000 people. With these suppositions, then the probability that people are terrorists given that NSA’s system of surveillance identifies them as terrorists is only p=0.0132, which is near zero, very far from one. Ergo, NSA’s surveillance system is useless for finding terrorists.

Suppose that NSA’s system is more accurate than .40, let’s say, .70, which means that 70% of terrorists in the USA will be found by mass monitoring of phone calls and email messages. Then, by Bayes’ Theorem, the probability that a person is a terrorist if targeted by NSA is still only p=0.0228, which is near zero, far from one, and useless.

Suppose that NSA’s system is really, really, really good, really, really good, with an accuracy rate of .90, and a misidentification rate of .00001, which means that only 3,000 innocent people are misidentified as terrorists. With these suppositions, then the probability that people are terrorists given that NSA’s system of surveillance identifies them as terrorists is only p=0.2308, which is far from one and well below flipping a coin. NSA’s domestic monitoring of everyone’s email and phone calls is useless for finding terrorists.
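The three scenarios above are easy to check. This is a minimal sketch using the standard form of Bayes' Theorem; the function and variable names are mine, and the inputs are exactly the suppositions quoted above.

```python
def posterior(base_rate: float, hit_rate: float, false_alarm_rate: float) -> float:
    """P(terrorist | flagged), computed with Bayes' Theorem."""
    true_alarms = hit_rate * base_rate
    false_alarms = false_alarm_rate * (1.0 - base_rate)
    return true_alarms / (true_alarms + false_alarms)


# 1,000 terrorists among 300 million people, as supposed in the text.
BASE_RATE = 1_000 / 300_000_000

# (accuracy rate, misidentification rate) for each scenario in the text.
scenarios = [
    (0.40, 0.0001),
    (0.70, 0.0001),
    (0.90, 0.00001),
]

results = [posterior(BASE_RATE, hit, miss) for hit, miss in scenarios]
for (hit, miss), p in zip(scenarios, results):
    print(f"accuracy={hit:.2f} misidentification={miss:.5f} -> p={p:.4f}")
```

Rounded to four places, the results match the quoted figures: 0.0132, 0.0228, and 0.2308.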

As an exercise to the reader, you can use the same analysis to show that data mining is an excellent tool for finding stolen credit cards, or stolen cell phones. Data mining is by no means useless; it’s just useless for this particular application.
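Here is one way to start that exercise. The only thing that changes is the base rate. I am assuming that about 1% of cards are stolen and misused each year (the FTC figure quoted elsewhere on this page), and I am reusing the 0.40 accuracy rate and 0.0001 misidentification rate from the first terrorist scenario. Those detector rates are illustrative assumptions, not measurements of any real fraud system.

```python
def posterior(base_rate: float, hit_rate: float, false_alarm_rate: float) -> float:
    """P(target | flagged), computed with Bayes' Theorem."""
    true_alarms = hit_rate * base_rate
    false_alarms = false_alarm_rate * (1.0 - base_rate)
    return true_alarms / (true_alarms + false_alarms)


HIT_RATE = 0.40            # assumed: same detector quality in both cases
FALSE_ALARM_RATE = 0.0001  # assumed: same detector quality in both cases

terrorist_p = posterior(1_000 / 300_000_000, HIT_RATE, FALSE_ALARM_RATE)
card_p = posterior(0.01, HIT_RATE, FALSE_ALARM_RATE)  # assumed 1% base rate

print(f"flagged person is a terrorist: p={terrorist_p:.4f}")
print(f"flagged card is stolen:        p={card_p:.4f}")
```

With the same detector, a flagged card is very likely to be a stolen one, while a flagged person is almost certainly innocent. The base rate does nearly all the work.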

### Privacy-Enhanced Data Mining

There are a variety of encryption technologies that allow you to analyze data without knowing details of the data:

Largely by employing the head-spinning principles of cryptography, the researchers say they can ensure that law enforcement, intelligence agencies and private companies can sift through huge databases without seeing names and identifying details in the records.

For example, manifests of airplane passengers could be compared with terrorist watch lists — without airline staff or government agents seeing the actual names on the other side’s list. Only if a match were made would a computer alert each side to uncloak the record and probe further.
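To make the idea concrete, here is a deliberately simplified sketch of blinded list matching. It is not the researchers' method: real proposals use protocols such as private set intersection, while this stand-in uses a keyed hash (HMAC) with a shared secret, which anyone holding the key could attack by hashing guessed names. The key, names, and helper function are all hypothetical.

```python
import hashlib
import hmac


def blind(name: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest of a case- and whitespace-normalized name."""
    normalized = " ".join(name.lower().split())
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: both sides agreed on this secret out of band.
SHARED_KEY = b"hypothetical-shared-secret"

# Each side blinds its own list locally and exchanges only the digests.
manifest = ["Alice Example", "Bob  Sample", "Carol Placeholder"]
watch_list = ["bob sample", "Dave Nobody"]

blinded_manifest = {blind(name, SHARED_KEY): row for row, name in enumerate(manifest)}
blinded_watch = {blind(name, SHARED_KEY) for name in watch_list}

# Only rows whose digests collide are candidates for uncloaking.
matches = sorted(blinded_manifest[d] for d in blinded_manifest.keys() & blinded_watch)
print(matches)
```

Only the matching manifest row (index 1 here) would be uncloaked; the other names never leave their owner in readable form.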

“If it’s possible to anonymize data and produce … the same results as clear text, why not?” John Bliss, a privacy lawyer in IBM’s “entity analytics” unit, told a recent workshop on the subject at Harvard University.

This is nothing new. I’ve seen papers on this sort of stuff since the late 1980s. The problem is that no one in law enforcement has any incentive to use them. Privacy is rarely a technological problem; it’s far more often a social or economic problem.

### Data Mining Software from IBM

In the long term, corporate data mining efforts are more of a privacy risk than government data mining efforts. And here’s an off-the-shelf product from IBM:

IBM Entity Analytic Solutions (EAS) is unique identity disambiguation software that provides public sector organizations or commercial enterprises with the ability to recognize and mitigate the incidence of fraud, threat and risk. This IBM EAS offering provides insight on demand, and in context, on “who is who,” “who knows who,” and “anonymously.”

This industry-leading, patented technology enables enterprise-wide identity insight, full attribution and self-correction in real time, and scales to process hundreds of millions of entities — all while accumulating context about those identities. It is the only software in the market that provides in-context information regarding non-obvious and obvious relationships that may exist between identities and can do it anonymously to enhance privacy of information.

For most businesses and government agencies, it is important to figure out when a person is using more than one identity package (that is, name, address, phone number, social insurance number and other such personal attributes) intentionally or unintentionally. Identity resolution software can help determine when two or more different-looking identity packages are describing the same person, even if the data is inconsistent. For example, by comparing names, addresses, phone numbers, social insurance numbers and other personal information across different records, this software might reveal that three customers calling themselves Tom R., Thomas Rogers, and T. Rogers are really just the same person.
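A toy version of identity resolution shows the basic mechanics. This is my own sketch, not IBM's algorithm: it merges records that share an exact phone number or social insurance number, using a small union-find structure. The records and field names are invented.

```python
from collections import defaultdict

# Hypothetical records; values are made up for illustration.
records = [
    {"id": 1, "name": "Tom R.", "phone": "555-0100", "sin": None},
    {"id": 2, "name": "Thomas Rogers", "phone": "555-0100", "sin": "123-456-789"},
    {"id": 3, "name": "T. Rogers", "phone": "555-0199", "sin": "123-456-789"},
    {"id": 4, "name": "Hope Smith", "phone": "555-0142", "sin": "987-654-321"},
]

parent = {r["id"]: r["id"] for r in records}


def find(x):
    """Return the representative id of x's cluster, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(a, b):
    """Merge the clusters containing a and b."""
    parent[find(a)] = find(b)


# Link any two records that share a phone number or a SIN.
first_seen = {}
for r in records:
    for field in ("phone", "sin"):
        value = r[field]
        if value is None:
            continue
        key = (field, value)
        if key in first_seen:
            union(r["id"], first_seen[key])
        else:
            first_seen[key] = r["id"]

clusters = defaultdict(list)
for r in records:
    clusters[find(r["id"])].append(r["name"])

resolved = sorted(sorted(names) for names in clusters.values())
print(resolved)
```

Tom R. and T. Rogers share no attribute directly, but each shares one with Thomas Rogers, so all three collapse into one identity. A real product would also need fuzzy matching on names and addresses, which this sketch omits.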

It may also be useful for organizations to know with whom such a person associates. Relationship resolution software can process resolved identity data to find out whether people have worked for some of the same companies, for example. This would be useful to an organization that tracks down terrorists, but it can also help businesses such as banks, for example, to see whether the Hope Smith who just applied for a loan is related to Rock Smith, the account holder with a sterling credit rating.
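The "worked for some of the same companies" check can be sketched in a few lines. Again, this is an illustration under my own assumptions, not the vendor's implementation; the employment histories are invented.

```python
from itertools import combinations

# Hypothetical employment histories keyed by resolved identity.
employment = {
    "Hope Smith": {"Acme Corp", "Globex"},
    "Rock Smith": {"Globex", "Initech"},
    "Thomas Rogers": {"Umbrella Ltd"},
}

# Record every pair of people with at least one employer in common.
links = {}
for a, b in combinations(sorted(employment), 2):
    shared = employment[a] & employment[b]
    if shared:
        links[(a, b)] = sorted(shared)

print(links)
```

The comparison is quadratic in the number of people, which is fine for a demonstration and one reason real systems need much more careful engineering.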

### The Problems with Data Mining

Great op-ed in The New York Times on why the NSA’s data mining efforts won’t work, by Jonathan Farley, math professor at Harvard.

The simplest reason is that we’re all connected. Not in the Haight-Ashbury/Timothy Leary/late-period Beatles kind of way, but in the sense of the Kevin Bacon game. The sociologist Stanley Milgram made this clear in the 1960’s when he took pairs of people unknown to each other, separated by a continent, and asked one of the pair to send a package to the other — but only by passing the package to a person he knew, who could then send the package only to someone he knew, and so on. On average, it took only six mailings — the famous six degrees of separation — for the package to reach its intended destination.

Looked at this way, President Bush is only a few steps away from Osama bin Laden (in the 1970’s he ran a company partly financed by the American representative for one of the Qaeda leader’s brothers). And terrorist hermits like the Unabomber are connected to only a very few people. So much for finding the guilty by association.
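The "degrees of separation" idea is just a shortest-path count in an acquaintance graph. Here is a minimal breadth-first search sketch over an invented graph; the names and links are placeholders, not data from the op-ed.

```python
from collections import deque


def degrees_of_separation(graph, start, goal):
    """Return the fewest hops from start to goal, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None


# Hypothetical acquaintance lists.
acquaintances = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
    "Hermit": [],
}

print(degrees_of_separation(acquaintances, "A", "E"))
print(degrees_of_separation(acquaintances, "A", "Hermit"))
```

Both failure modes from the op-ed show up here: well-connected people are a few hops from almost anyone, and a hermit with no links is invisible to this kind of search.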

A second problem with the spy agency’s apparent methodology lies in the way terrorist groups operate and what scientists call the “strength of weak ties.” As the military scientist Robert Spulak has described it to me, you might not see your college roommate for 10 years, but if he were to call you up and ask to stay in your apartment, you’d let him. This is the principle under which sleeper cells operate: there is no communication for years. Thus for the most dangerous threats, the links between nodes that the agency is looking for simply might not exist.

(This, by him, is also worth reading.)

### Data Mining for Terrorists

In the post 9/11 world, there’s much focus on connecting the dots. Many believe that data mining is the crystal ball that will enable us to uncover future terrorist plots. But even in the most wildly optimistic projections, data mining isn’t tenable for that purpose. We’re not trading privacy for security; we’re giving up privacy and getting no security in return.

Most people first learned about data mining in November 2002, when news broke about a massive government data mining program called Total Information Awareness. The basic idea was as audacious as it was repellent: suck up as much data as possible about everyone, sift through it with massive computers, and investigate patterns that might indicate terrorist plots. Americans across the political spectrum denounced the program, and in September 2003, Congress eliminated its funding and closed its offices.

But TIA didn’t die. According to The National Journal, it just changed its name and moved inside the Defense Department.

This shouldn’t be a surprise. In May 2004, the General Accounting Office published a report that listed 122 different federal government data mining programs that used people’s personal information. This list didn’t include classified programs, like the NSA’s eavesdropping effort, or state-run programs like MATRIX.

The promise of data mining is compelling, and convinces many. But it’s wrong. We’re not going to find terrorist plots through systems like this, and we’re going to waste valuable resources chasing down false alarms. To understand why, we have to look at the economics of the system.

Security is always a trade-off, and for a system to be worthwhile, the advantages have to be greater than the disadvantages. A national security data mining program is going to find some percentage of real attacks, and some percentage of false alarms. If the benefits of finding and stopping those attacks outweigh the cost — in money, liberties, etc. — then the system is a good one. If not, then you’d be better off spending that cost elsewhere.

Data mining works best when there’s a well-defined profile you’re searching for, a reasonable number of attacks per year, and a low cost of false alarms. Credit card fraud is one of data mining’s success stories: all credit card companies data mine their transaction databases, looking for spending patterns that indicate a stolen card. Many credit card thieves share a pattern — purchase expensive luxury goods, purchase things that can be easily fenced, etc. — and data mining systems can minimize the losses in many cases by shutting down the card. In addition, the cost of false alarms is only a phone call to the cardholder asking him to verify a couple of purchases. The cardholders don’t even resent these phone calls — as long as they’re infrequent — so the cost is just a few minutes of operator time.

Terrorist plots are different. There is no well-defined profile, and attacks are very rare. Taken together, these facts mean that data mining systems won’t uncover any terrorist plots until they are very accurate, and that even very accurate systems will be so flooded with false alarms that they will be useless.

All data mining systems fail in two different ways: false positives and false negatives. A false positive is when the system identifies a terrorist plot that really isn’t one. A false negative is when the system misses an actual terrorist plot. Depending on how you “tune” your detection algorithms, you can err on one side or the other: you can increase the number of false positives to ensure that you are less likely to miss an actual terrorist plot, or you can reduce the number of false positives at the expense of missing terrorist plots.
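The tuning trade-off is easy to see with a toy detector that assigns each event a suspicion score and flags anything at or above a threshold. The scores and labels below are invented purely for illustration.

```python
# (suspicion score, is it a real plot?) pairs; made-up data.
scored = [
    (0.95, True), (0.80, False), (0.70, True), (0.60, False),
    (0.40, False), (0.35, True), (0.20, False), (0.10, False),
]


def errors_at(threshold):
    """Return (false positives, false negatives) when flagging score >= threshold."""
    false_positives = sum(1 for s, real in scored if s >= threshold and not real)
    false_negatives = sum(1 for s, real in scored if s < threshold and real)
    return false_positives, false_negatives


for threshold in (0.3, 0.5, 0.9):
    fp, fn = errors_at(threshold)
    print(f"threshold={threshold:.1f} false_positives={fp} false_negatives={fn}")
```

Lowering the threshold catches every plot at the cost of more false alarms; raising it removes the false alarms and misses plots. Only a better-defined profile, meaning better scores, improves both at once.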

To reduce both those numbers, you need a well-defined profile. And that’s a problem when it comes to terrorism. In hindsight, it was really easy to connect the 9/11 dots and point to the warning signs, but it’s much harder before the fact. Certainly, there are common warning signs that many terrorist plots share, but each is unique, as well. The better you can define what you’re looking for, the better your results will be. Data mining for terrorist plots is going to be sloppy, and it’s going to be hard to find anything useful.

Data mining is like searching for a needle in a haystack. There are 900 million credit cards in circulation in the United States. According to the FTC September 2003 Identity Theft Survey Report, about 1% (10 million) of cards are stolen and fraudulently used each year. Terrorism is different. There are trillions of connections between people and events (things that the data mining system will have to “look at”) and very few plots. This rarity makes even accurate identification systems useless.

Let’s look at some numbers. We’ll be optimistic. We’ll assume the system has a 1 in 100 false positive rate (99% accurate), and a 1 in 1,000 false negative rate (99.9% accurate).

Assume one trillion possible indicators to sift through: that’s about ten events — e-mails, phone calls, purchases, web surfings, whatever — per person in the U.S. per day. Also assume that 10 of them are actually terrorists plotting.

This unrealistically-accurate system will generate one billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999% and you’re still chasing 2,750 false alarms per day — but that will inevitably raise your false negatives, and you’re going to miss some of those ten real plots.
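The arithmetic behind those figures can be reproduced directly. This sketch uses only the assumptions stated above (one trillion indicators a year, ten real plots, the stated error rates); the variable names are mine.

```python
INDICATORS_PER_YEAR = 1_000_000_000_000  # one trillion events to sift through
REAL_PLOTS_PER_YEAR = 10
FALSE_NEGATIVE_RATE = 1 / 1_000
DAYS_PER_YEAR = 365


def false_alarms(false_positive_rate):
    """Return (false alarms per year, false alarms per day)."""
    innocent_events = INDICATORS_PER_YEAR - REAL_PLOTS_PER_YEAR
    per_year = innocent_events * false_positive_rate
    return per_year, per_year / DAYS_PER_YEAR


# 1-in-100 false positive rate (99% accurate).
year_1pct, day_1pct = false_alarms(0.01)
plots_found = REAL_PLOTS_PER_YEAR * (1 - FALSE_NEGATIVE_RATE)
per_plot = year_1pct / plots_found

# 99.9999% accurate on false positives.
year_tight, day_tight = false_alarms(0.000001)

print(f"false alarms per real plot found: {per_plot:,.0f}")
print(f"false alarms per day at 99%:      {day_1pct:,.0f}")
print(f"false alarms per day at 99.9999%: {day_tight:,.0f}")
```

This gives about a billion false alarms per plot found and about 27 million a day at 99% accuracy. At 99.9999% it gives roughly 2,740 a day, in line with the rounded 2,750 above.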

This isn’t anything new. In statistics, it’s called the “base rate fallacy,” and it applies in other domains as well. For example, even highly accurate medical tests are useless as diagnostic tools if the incidence of the disease is rare in the general population. Terrorist attacks are also rare, so any “test” is going to result in an endless stream of false alarms.

This is exactly the sort of thing we saw with the NSA’s eavesdropping program: the New York Times reported that the computers spat out thousands of tips per month. Every one of them turned out to be a false alarm.

And the cost was enormous: not just the cost of the FBI agents running around chasing dead-end leads instead of doing things that might actually make us safer, but also the cost in civil liberties. The fundamental freedoms that make our country the envy of the world are valuable, and not something that we should throw away lightly.

Data mining can work. It helps Visa keep the costs of fraud down, just as it helps Amazon.com show me books that I might want to buy, and Google show me advertising I’m more likely to be interested in. But these are all instances where the cost of false positives is low — a phone call from a Visa operator, or an uninteresting ad — and in systems that have value even if there is a high number of false negatives.

Finding terrorism plots is not a problem that lends itself to data mining. It’s a needle-in-a-haystack problem, and throwing more hay on the pile doesn’t make that problem any easier. We’d be far better off putting people in charge of investigating potential plots and letting them direct the computers, instead of putting the computers in charge and letting them decide who should be investigated.

This essay originally appeared on Wired.com.

### The Terrorist Threat of Paying Your Credit Card Balance

They paid down some debt. The balance on their JCPenney Platinum MasterCard had gotten to an unhealthy level. So they sent in a large payment, a check for \$6,522.

And an alarm went off. A red flag went up. The Soehnges’ behavior was found questionable.

And all they did was pay down their debt. They didn’t call a suspected terrorist on their cell phone. They didn’t try to sneak a machine gun through customs.

They just paid a hefty chunk of their credit card balance. And they learned how frighteningly wide the net of suspicion has been cast.

After sending in the check, they checked online to see if their account had been duly credited. They learned that the check had arrived, but the amount available for credit on their account hadn’t changed.

So Deana Soehnge called the credit-card company. Then Walter called.

“When you mess with my money, I want to know why,” he said.

They both learned the same astounding piece of information about the little things that can set the threat sensors to beeping and blinking.

They were told, as they moved up the managerial ladder at the call center, that the amount they had sent in was much larger than their normal monthly payment. And if the increase hits a certain percentage higher than that normal payment, Homeland Security has to be notified. And the money doesn’t move until the threat alert is lifted.

The article goes on to blame something called the Bank Privacy Act, but that’s not correct. The culprit here is the amendments made to the Bank Secrecy Act by the USA Patriot Act, Sections 351 and 352. There’s a general discussion here, and the Federal Register here.

There has been some rumbling on the net that this story is badly garbled — or even a hoax — but certainly this kind of thing is what financial institutions are required to report under the Patriot Act.

Remember, all the time spent chasing down silly false alarms is time wasted. Finding terrorist plots is a signal-to-noise problem, and stuff like this substantially decreases that ratio: it adds a lot of noise without adding enough signal. It makes us less safe, because it makes terrorist plots harder to find.

### Secure Flight Suspended

The TSA has announced that Secure Flight, its comprehensive program to match airline passengers against terrorist watch lists, has been suspended:

And because of security concerns, the government is going back to the drawing board with the program called Secure Flight after spending nearly four years and \$150 million on it, the Senate Commerce Committee was told.

I have written about this program extensively, most recently here. It’s an absolute mess in every way, and doesn’t make us safer.

But don’t think this is the end. Under Section 4012 of the Intelligence Reform and Terrorism Prevention Act, Congress mandated the TSA put in place a program to screen every domestic passenger against the watch list. Until Congress repeals that mandate, these postponements and suspensions are the best we can hope for. Expect it all to come back under a different name — and a clean record in the eyes of those not paying close attention — soon.

EDITED TO ADD (2/15): Ed Felten has some good commentary:

Instead of sticking to this more modest plan, Secure Flight became a vehicle for pie-in-the-sky plans about data mining and automatic identification of terrorists from consumer databases. As the program’s goals grew more ambitious and collided with practical design and deployment challenges, the program lost focus and seemed to have a different rationale and plan from one month to the next.

### Data Mining and Amazon Wishlists

Now, imagine the false alarms and abuses that are possible if you have lots more data, and lots more computers to slice and dice it.

Of course, there are applications where this sort of data mining makes a whole lot of sense. But finding terrorists isn’t one of them. It’s a needle-in-a-haystack problem, and piling on more hay doesn’t help matters much.

### NSA and Bush's Illegal Eavesdropping

When President Bush directed the National Security Agency to secretly eavesdrop on American citizens, he transferred an authority previously under the purview of the Justice Department to the Defense Department and bypassed the very laws put in place to protect Americans against widespread government eavesdropping. The reason may have been to tap the NSA’s capability for data-mining and widespread surveillance.

Illegal wiretapping of Americans is nothing new. In the 1950s and ’60s, in a program called “Project Shamrock,” the NSA intercepted every single telegram coming into or going out of the United States. It conducted eavesdropping without a warrant on behalf of the CIA and other agencies. Much of this became public during the 1975 Church Committee hearings and resulted in the now famous Foreign Intelligence Surveillance Act (FISA) of 1978.

The purpose of this law was to protect the American people by regulating government eavesdropping. Like many laws limiting the power of government, it relies on checks and balances: one branch of the government watching the other. The law established a secret court, the Foreign Intelligence Surveillance Court (FISC), and empowered it to approve national-security-related eavesdropping warrants. The Justice Department can request FISA warrants to monitor foreign communications as well as communications by American citizens, provided that they meet certain minimal criteria.

The FISC issued about 500 FISA warrants per year from 1979 through 1995, and the number has slowly increased since then; 1,758 were issued in 2004. The process is designed for speed and even has provisions where the Justice Department can wiretap first and ask for permission later. In all that time, only four warrant requests were ever rejected: all in 2003. (We don’t know any details, of course, as the court proceedings are secret.)

FISA warrants are carried out by the FBI, but in the days immediately after the terrorist attacks, there was a widespread perception in Washington that the FBI wasn’t up to dealing with these new threats — they couldn’t uncover plots in a timely manner. So instead the Bush administration turned to the NSA. They had the tools, the expertise, the experience, and so they were given the mission.

The NSA’s ability to eavesdrop on communications is exemplified by a technological capability called Echelon. Echelon is the world’s largest information “vacuum cleaner,” sucking up a staggering amount of voice, fax, and data communications — satellite, microwave, fiber-optic, cellular and everything else — from all over the world: an estimated 3 billion communications per day. These communications are then processed through sophisticated data-mining technologies, which look for simple phrases like “assassinate the president” as well as more complicated communications patterns.

Supposedly Echelon only covers communications outside of the United States. Although there is no evidence that the Bush administration has employed Echelon to monitor communications to and from the U.S., this surveillance capability is probably exactly what the president wanted and may explain why the administration sought to bypass the FISA process of acquiring a warrant for searches.

Perhaps the NSA just didn’t have any experience submitting FISA warrants, so Bush unilaterally waived that requirement. And perhaps Bush thought FISA was a hindrance (in 2002 there was a widespread but false belief that the FISC got in the way of the investigation of Zacarias Moussaoui, the presumed “20th hijacker”) and bypassed the court for that reason.

Most likely, Bush wanted a whole new surveillance paradigm. You can think of the FBI’s capabilities as “retail surveillance”: It eavesdrops on a particular person or phone. The NSA, on the other hand, conducts “wholesale surveillance.” It, or more exactly its computers, listens to everything. An example might be to feed the computers every voice, fax, and e-mail communication looking for the name “Ayman al-Zawahiri.” This type of surveillance is more along the lines of Project Shamrock, and not legal under FISA. As Sen. Jay Rockefeller wrote in a secret memo after being briefed on the program, it raises “profound oversight issues.”

It is also unclear whether Echelon-style eavesdropping would prevent terrorist attacks. In the months before 9/11, Echelon noticed considerable “chatter”: bits of conversation suggesting some sort of imminent attack. But because much of the planning for 9/11 occurred face-to-face, analysts were unable to learn details.

The fundamental issue here is security, but it’s not the security most people think of. James Madison famously said: “If men were angels, no government would be necessary. If angels were to govern men, neither external nor internal controls on government would be necessary.” Terrorism is a serious risk to our nation, but an even greater threat is the centralization of American political power in the hands of any single branch of the government.

Over 200 years ago, the framers of the U.S. Constitution established an ingenious security device against tyrannical government: they divided government power among three different bodies. A carefully thought-out system of checks and balances among the executive branch, the legislative branch, and the judicial branch ensured that no single branch became too powerful.

After watching tyrannies rise and fall throughout Europe, this seemed like a prudent way to form a government. Courts monitor the actions of police. Congress passes laws that even the president must follow. Since 9/11, the United States has seen an enormous power grab by the executive branch. It’s time we brought back the security system that’s protected us from government for over 200 years.

A version of this essay originally appeared in Salon.

I wrote another essay about the legal and constitutional implications of this. The Minneapolis Star Tribune will publish it either Wednesday or Thursday, and I will post it here at that time.

I didn’t talk about the political dynamics in either essay, but they’re fascinating. The White House kept this secret, but they briefed at least six people outside the administration. The current and former chief justices of the FISC knew about this. Last Sunday’s Washington Post reported that both of them had misgivings about the program, but neither did anything about it. The White House also briefed the Committee Chairs and Ranking Members of the House and Senate Intelligence Committees, and they didn’t do anything about it. (Although Sen. Rockefeller wrote a bizarre I’m-not-going-down-with-you memo to Cheney and for his files.)

Cheney was on television this weekend citing this minimal disclosure as evidence that Congress acquiesced to the program. I see it as evidence of something else: if people from both the Legislative and the Judiciary branches knowingly permitted unlawful surveillance by the Executive branch, then the current system of checks and balances isn’t working.

It’s also evidence about how secretive this administration is. None of the other FISC judges, and none of the other House or Senate Intelligence Committee members, were told about this, even under clearance. And if there’s one thing these people hate, it’s being kept in the dark on a matter within their jurisdiction. That’s why Senator Feinstein, a member of the Senate Intelligence Committee, was so upset yesterday. And it’s pushing Senator Specter, and some of the Republicans in these Judiciary committees, further into the civil liberties camp.

There are about a zillion links worth reading, but here are some of them you might not yet have seen. Some good newspaper commentaries. An excellent legal analysis. Three blog posts. Four more blog posts. Daniel Solove on FISA. Two legal analyses. An interesting “Democracy Now” commentary, including interesting comments on the NSA’s capabilities by James Bamford. And finally, my 2004 essay on the security of checks and balances.

“Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.” — William Pitt, House of Commons, 11/18/1783.
