Entries Tagged "data mining"

Risks of Data Reuse

We learned the news in March: Contrary to decades of denials, the U.S. Census Bureau used individual records to round up Japanese-Americans during World War II.

The Census Bureau normally is prohibited by law from revealing data that could be linked to specific individuals; the law exists to encourage people to answer census questions accurately and without fear. And while the Second War Powers Act of 1942 temporarily suspended that protection in order to locate Japanese-Americans, the Census Bureau had maintained that it only provided general information about neighborhoods.

New research proves they were lying.

The whole incident serves as a poignant illustration of one of the thorniest problems of the information age: data collected for one purpose and then used for another, or “data reuse.”

When we think about our personal data, what bothers us most is generally not the initial collection and use, but the secondary uses. I personally appreciate it when Amazon.com suggests books that might interest me, based on books I have already bought. I like it that my airline knows what type of seat and meal I prefer, and my hotel chain keeps records of my room preferences. I don’t mind that my automatic road-toll collection tag is tied to my credit card, and that I get billed automatically. I even like the detailed summary of my purchases that my credit card company sends me at the end of every year. What I don’t want, though, is any of these companies selling that data to brokers, or for law enforcement to be allowed to paw through those records without a warrant.

There are two bothersome issues about data reuse. First, we lose control of our data. In all of the examples above, there is an implied agreement between the data collector and me: It gets the data in order to provide me with some sort of service. Once the data collector sells it to a broker, though, it’s out of my hands. It might show up on some telemarketer’s screen, or in a detailed report to a potential employer, or as part of a data-mining system to evaluate my personal terrorism risk. It becomes part of my data shadow, which always follows me around but I can never see.

This, of course, affects our willingness to give up personal data in the first place. The reason U.S. census data was declared off-limits for other uses was to placate Americans’ fears and assure them that they could answer questions truthfully. How accurate would you be in filling out your census forms if you knew the FBI would be mining the data, looking for terrorists? How would it affect your supermarket purchases if you knew people were examining them and making judgments about your lifestyle? I know many people who engage in data poisoning: deliberately lying on forms in order to propagate erroneous data. I’m sure many of them would stop that practice if they could be sure that the data was only used for the purpose for which it was collected.

The second issue about data reuse is error rates. All data has errors, and different uses can tolerate different amounts of error. The sorts of marketing databases you can buy on the web, for example, are notoriously error-filled. That’s OK; if the database of ultra-affluent Americans of a particular ethnicity you just bought has a 10 percent error rate, you can factor that cost into your marketing campaign. But that same database, with that same error rate, might be useless for law enforcement purposes.

Understanding error rates and how they propagate is vital when evaluating any system that reuses data, especially for law enforcement purposes. A few years ago, the Transportation Security Administration’s follow-on watch list system, Secure Flight, was going to use commercial data to give people a terrorism risk score and determine how much they were going to be questioned or searched at the airport. People rightly rebelled against the thought of being judged in secret, but there was much less discussion about whether the commercial data from credit bureaus was accurate enough for this application.
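To make the error-rate point concrete, here is a rough sketch of the arithmetic. Every number in it is a hypothetical assumption chosen for illustration, not a figure from Secure Flight or any real system, and the function name `screening_outcomes` is made up for this example. It shows how a small error rate, applied to a large population in which real targets are rare, produces far more false alarms than true ones.

```python
def screening_outcomes(population, actual_targets,
                       false_positive_rate, false_negative_rate):
    """Return (true_alarms, false_alarms, precision) for a screening system.

    All inputs are illustrative assumptions, not measurements.
    """
    innocents = population - actual_targets
    # Real targets the system correctly flags.
    true_alarms = actual_targets * (1 - false_negative_rate)
    # Innocent people the system wrongly flags.
    false_alarms = innocents * false_positive_rate
    flagged = true_alarms + false_alarms
    # Fraction of flagged people who are real targets.
    precision = true_alarms / flagged if flagged else 0.0
    return true_alarms, false_alarms, precision


# Hypothetical scenario: 1,000,000 travelers, 10 real targets,
# and a system that is wrong 1 percent of the time in each direction.
true_alarms, false_alarms, precision = screening_outcomes(
    population=1_000_000,
    actual_targets=10,
    false_positive_rate=0.01,
    false_negative_rate=0.01,
)
print(f"true alarms:  {true_alarms:.1f}")
print(f"false alarms: {false_alarms:.1f}")
print(f"share of flagged people who are real targets: {precision:.4%}")
```

Under these assumed numbers, roughly ten thousand innocent people are flagged for every ten real targets. A marketer can absorb that kind of miss rate as a cost; a system that decides who gets searched cannot.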

An even more egregious example of error-rate problems occurred in 2000, when the Florida Division of Elections contracted with Database Technologies (since merged with ChoicePoint) to remove convicted felons from the voting rolls. The databases used were filled with errors and the matching procedures were sloppy, which resulted in thousands of disenfranchised voters—mostly black—and almost certainly changed a presidential election result.

Of course, there are beneficial uses of secondary data. Take, for example, personal medical data. It’s personal and intimate, yet valuable to society in aggregate. Think of what we could do with a database of everyone’s health information: massive studies examining the long-term effects of different drugs and treatment options, different environmental factors, different lifestyle choices. There’s an enormous amount of important research potential hidden in that data, and it’s worth figuring out how to get at it without compromising individual privacy.

This is largely a matter of legislation. Technology alone can never protect our rights. There are just too many reasons not to trust it, and too many ways to subvert it. Data privacy ultimately stems from our laws, and strong legal protections are fundamental to protecting our information against abuse. But at the same time, technology is still vital.

Both the Japanese internment and the Florida voting-roll purge demonstrate that laws can change … and sometimes change quickly. We need to build systems with privacy-enhancing technologies that limit data collection wherever possible. Data that is never collected cannot be reused. Data that is collected anonymously, or deleted immediately after it is used, is much harder to reuse. It’s easy to build systems that collect data on everything—it’s what computers naturally do—but it’s far better to take the time to understand what data is needed and why, and only collect that.
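As a minimal sketch of what "only collect that" can look like in practice, consider an imagined road-toll billing service. The field names and the `minimize` helper below are hypothetical, invented for this illustration; the idea is simply that fields outside an explicit allowlist are dropped before anything is stored, so they are never available for reuse.

```python
# Hypothetical allowlist: the only fields this imagined billing service needs.
NEEDED_FIELDS = frozenset({"account_id", "toll_amount"})


def minimize(record, needed=NEEDED_FIELDS):
    """Return a copy of record containing only the allowlisted fields."""
    return {key: value for key, value in record.items() if key in needed}


# A made-up raw event, as a toll reader might produce it.
raw_event = {
    "account_id": "A-1001",
    "toll_amount": 2.50,
    "license_plate": "ABC123",
    "location": "Plaza 7",
    "timestamp": "2007-06-28T08:34:00",
}

# Only the minimized record is kept; the rest is discarded at the point of collection.
stored_event = minimize(raw_event)
print(stored_event)
```

The design choice here is an allowlist rather than a blocklist: a new field added upstream is dropped by default unless someone decides it is needed and says why.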

History will record what we, here in the early decades of the information age, did to foster freedom, liberty and democracy. Did we build information technologies that protected people’s freedoms even during times when society tried to subvert them? Or did we build technologies that could easily be modified to watch and control? It’s bad civic hygiene to build an infrastructure that can be used to facilitate a police state.

This article originally appeared on Wired.com.

Posted on June 28, 2007 at 8:34 AM

"Data Mining and the Security-Liberty Debate"

Good paper: “Data Mining and the Security-Liberty Debate,” by Daniel J. Solove.

Abstract: In this essay, written for a symposium on surveillance for the University of Chicago Law Review, I examine some common difficulties in the way that liberty is balanced against security in the context of data mining. Countless discussions about the trade-offs between security and liberty begin by taking a security proposal and then weighing it against what it would cost our civil liberties. Often, the liberty interests are cast as individual rights and balanced against the security interests, which are cast in terms of the safety of society as a whole. Courts and commentators defer to the government’s assertions about the effectiveness of the security interest. In the context of data mining, the liberty interest is limited by narrow understandings of privacy that neglect to account for many privacy problems. As a result, the balancing concludes with a victory in favor of the security interest. But as I argue, important dimensions of data mining’s security benefits require more scrutiny, and the privacy concerns are significantly greater than currently acknowledged. These problems have undermined the balancing process and skewed the results toward the security side of the scale.

My only complaint: it’s not a liberty vs. security debate. Liberty is security. It’s a liberty vs. control debate.

Posted on June 12, 2007 at 7:11 AM

Is Big Brother a Big Deal?

Big Brother isn’t what he used to be. George Orwell extrapolated his totalitarian state from the 1940s. Today’s information society looks nothing like Orwell’s world, and watching and intimidating a population today isn’t anything like what Winston Smith experienced.

Data collection in 1984 was deliberate; today’s is inadvertent. In the information society, we generate data naturally. In Orwell’s world, people were naturally anonymous; today, we leave digital footprints everywhere.

1984’s police state was centralized; today’s is decentralized. Your phone company knows who you talk to, your credit card company knows where you shop and Netflix knows what you watch. Your ISP can read your email, your cell phone can track your movements and your supermarket can monitor your purchasing patterns. There’s no single government entity bringing this together, but there doesn’t have to be. As Neal Stephenson said, the threat is no longer Big Brother, but instead thousands of Little Brothers.

1984’s Big Brother was run by the state; today’s Big Brother is market driven. Data brokers like ChoicePoint and credit bureaus like Experian aren’t trying to build a police state; they’re just trying to turn a profit. Of course these companies will take advantage of a national ID; they’d be stupid not to. And the correlations, data mining and precise categorizing they can do are why the U.S. government buys commercial data from them.

1984-style police states required lots of people. East Germany employed one informant for every 66 citizens. Today, there’s no reason to have anyone watch anyone else; computers can do the work of people.

1984-style police states were expensive. Today, data storage is constantly getting cheaper. If some data is too expensive to save today, it’ll be affordable in a few years.

And finally, the police state of 1984 was deliberately constructed, while today’s is naturally emergent. There’s no reason to postulate a malicious police force and a government trying to subvert our freedoms. Computerized processes naturally throw off personalized data; companies save it for marketing purposes, and even the most well-intentioned law enforcement agency will make use of it.

Of course, Orwell’s Big Brother had a ruthless efficiency that’s hard to imagine in a government today. But that completely misses the point. A sloppy and inefficient police state is no reason to cheer; watch the movie Brazil and see how scary it can be. You can also see hints of what it might look like in our completely dysfunctional “no-fly” list and useless projects to secretly categorize people according to potential terrorist risk. Police states are inherently inefficient. There’s no reason to assume today’s will be any more effective.

The fear isn’t an Orwellian government deliberately creating the ultimate totalitarian state, although with the U.S.’s programs of phone-record surveillance, illegal wiretapping, massive data mining, a national ID card no one wants and Patriot Act abuses, one can make that case. It’s that we’re doing it ourselves, as a natural byproduct of the information society. We’re building the computer infrastructure that makes it easy for governments, corporations, criminal organizations and even teenage hackers to record everything we do, and, yes, even change our votes. And we will continue to do so unless we pass laws regulating the creation, use, protection, resale and disposal of personal data. It’s precisely the attitude that trivializes the problem that creates it.

This essay appeared in the May issue of Information Security, as the second half of a point/counterpoint with Marcus Ranum. Here’s his half.

Posted on May 11, 2007 at 9:19 AM

NSA Hiring Data Miners

Certainly looks that way:

The Algorithm Developer will work with massive amounts of inter-related data and develop and implement algorithms to search, sort and find patterns and hidden relationships in the data. The preferred candidate would be required to be able to work closely with Analysts to develop Rapid Operational Prototypes. The candidate would have the availability of existing algorithms as a model to begin.

Posted on January 24, 2007 at 2:57 PM

DHS Privacy Office Report on MATRIX

The Privacy Office of the Department of Homeland Security has issued a report on MATRIX: The Multistate Anti-Terrorism Information Exchange. MATRIX is a now-defunct data mining and data sharing program among federal, state, and local law enforcement agencies, one of the many data-mining programs going on in government (TIA—Total Information Awareness—being the most famous, and Tangram being the newest).

The report is short, and very critical of the program’s inattention to privacy and lack of transparency. That’s probably why it was released to the public just before Christmas, burying it in the media.

Posted on January 3, 2007 at 11:58 AM

CATO Report on Data Mining and Terrorism

Definitely worth reading:

Though data mining has many valuable uses, it is not well suited to the terrorist discovery problem. It would be unfortunate if data mining for terrorism discovery had currency within national security, law enforcement, and technology circles because pursuing this use of data mining would waste taxpayer dollars, needlessly infringe on privacy and civil liberties, and misdirect the valuable time and energy of the men and women in the national security community.

Posted on December 13, 2006 at 1:38 PM

New U.S. Customs Database on Trucks and Travellers

It’s yet another massive government surveillance program:

US Customs and Border Protection issued a notice in the Federal Register yesterday which detailed the agency’s massive database that keeps risk assessments on every traveler entering or leaving the country. Citizens who are concerned that their information is inaccurate are all but out of luck: the system “may not be accessed under the Privacy Act for the purpose of contesting the content of the record.”

The system in question is the Automated Targeting System, which is associated with the previously-existing Treasury Enforcement Communications System. TECS was built to screen people and assets that moved in and out of the US, and its database contains more than one billion records that are accessible by more than 30,000 users at 1,800 sites around the country. Customs has adapted parts of the TECS system to its own use and now plans to screen all passengers, inbound and outbound cargo, and ships.

The system creates a risk assessment for each person or item in the database. The assessment is generated from information gleaned from federal and commercial databases, provided by people themselves as they cross the border, and the Passenger Name Record information recorded by airlines. This risk assessment will be maintained for up to 40 years and can be pulled up by agents at a moment’s notice in order to evaluate potential threats against the US.

If you leave the country, the government will suddenly know a lot about you. The Passenger Name Record alone contains names, addresses, telephone numbers, itineraries, frequent-flier information, e-mail addresses—even the name of your travel agent. And this information can be shared with plenty of people:

  • Federal, state, local, tribal, or foreign governments
  • A court, magistrate, or administrative tribunal
  • Third parties during the course of a law enforcement investigation
  • Congressional office in response to an inquiry
  • Contractors, grantees, experts, consultants, students, and others performing or working on a contract, service, or grant
  • Any organization or person who might be a target of terrorist activity or conspiracy
  • The United States Department of Justice
  • The National Archives and Records Administration
  • Federal or foreign government intelligence or counterterrorism agencies
  • Agencies or people when it appears that the security or confidentiality of their information has been compromised.

That’s a lot of people who could be looking at your information and your government-designed risk assessment. The one person who won’t be looking at that information is you. The entire system is exempt from inspection and correction under provision 552a (j)(2) and (k)(2) of US Code Title 5, which allows such exemptions when the data in question involves law enforcement or intelligence information.

This means you can’t review your data for accuracy, and you can’t correct any errors.

But the system can be used to give you a risk assessment score, which presumably will affect how you’re treated when you return to the U.S.

I’ve already explained why data mining does not find terrorists or terrorist plots. So have actual math professors. And we’ve seen this kind of “risk assessment score” idea and the problems it causes with Secure Flight.

This needs some mainstream press attention.

EDITED TO ADD (11/4): More commentary here, here, and here.

EDITED TO ADD (11/5): It’s buried in the back pages, but at least The Washington Post wrote about it.

Posted on November 4, 2006 at 9:19 AM

Total Information Awareness Is Back

Remember Total Information Awareness?

In November 2002, the New York Times reported that the Defense Advanced Research Projects Agency (DARPA) was developing a tracking system called “Total Information Awareness” (TIA), which was intended to detect terrorists through analyzing troves of information. The system, developed under the direction of John Poindexter, then-director of DARPA’s Information Awareness Office, was envisioned to give law enforcement access to private data without suspicion of wrongdoing or a warrant.

TIA purported to capture the “information signature” of people so that the government could track potential terrorists and criminals involved in “low-intensity/low-density” forms of warfare and crime. The goal was to track individuals through collecting as much information about them as possible and using computer algorithms and human analysis to detect potential activity.

The project called for the development of “revolutionary technology for ultra-large all-source information repositories,” which would contain information from multiple sources to create a “virtual, centralized, grand database.” This database would be populated by transaction data contained in current databases such as financial records, medical records, communication records, and travel records as well as new sources of information. Also fed into the database would be intelligence data.

The public found it so abhorrent, and objected so forcefully, that Congress killed funding for the program in September 2003.

None of us thought that meant the end of TIA, only that it would turn into a classified program and be renamed. Well, the program is now called Tangram, and it is classified:

The government’s top intelligence agency is building a computerized system to search very large stores of information for patterns of activity that look like terrorist planning. The system, which is run by the Office of the Director of National Intelligence, is in the early research phases and is being tested, in part, with government intelligence that may contain information on U.S. citizens and other people inside the country.

It encompasses existing profiling and detection systems, including those that create “suspicion scores” for suspected terrorists by analyzing very large databases of government intelligence, as well as records of individuals’ private communications, financial transactions, and other everyday activities.

The information about Tangram comes from a government document looking for contractors to help design and build the system.

DefenseTech writes:

The document, which is a description of the Tangram program for potential contractors, describes other, existing profiling and detection systems that haven’t moved beyond so-called “guilt-by-association models,” which link suspected terrorists to potential associates, but apparently don’t tell analysts much about why those links are significant. Tangram wants to improve upon these methods, as well as investigate the effectiveness of other detection links such as “collective inferencing,” which attempt to create suspicion scores of entire networks of people simultaneously.

Data mining for terrorists has always been a dumb idea. And the existence of Tangram illustrates the problem with Congress trying to stop a program by killing its funding; it just comes back under a different name.

Posted on October 31, 2006 at 6:59 AM
