Strong Laws, Smart Tech Can Stop Abusive 'Data Reuse'

Bruce Schneier
Wired
June 28, 2007

We learned the news in March: Contrary to decades of denials, the U.S. Census Bureau used individual records to round up Japanese-Americans during World War II.

The Census Bureau normally is prohibited by law from revealing data that could be linked to specific individuals; the law exists to encourage people to answer census questions accurately and without fear. And while the Second War Powers Act of 1942 temporarily suspended that protection in order to locate Japanese-Americans, the Census Bureau had maintained that it provided only general information about neighborhoods.

New research proves that the bureau was lying.

The whole incident serves as a poignant illustration of one of the thorniest problems of the information age: data collected for one purpose and then used for another, or “data reuse.”

When we think about our personal data, what bothers us most is generally not the initial collection and use, but the secondary uses. I personally appreciate it when Amazon.com suggests books that might interest me, based on books I have already bought. I like it that my airline knows what type of seat and meal I prefer, and my hotel chain keeps records of my room preferences. I don’t mind that my automatic road-toll collection tag is tied to my credit card, and that I get billed automatically. I even like the detailed summary of my purchases that my credit card company sends me at the end of every year. What I don’t want, though, is any of these companies selling that data to brokers, or law enforcement being allowed to paw through those records without a warrant.

Data reuse raises two bothersome issues. First, we lose control of our data. In all of the examples above, there is an implied agreement between the data collector and me: It gets the data in order to provide me with some sort of service. Once the data collector sells it to a broker, though, it’s out of my hands. It might show up on some telemarketer’s screen, or in a detailed report to a potential employer, or as part of a data-mining system to evaluate my personal terrorism risk. It becomes part of my data shadow, which always follows me around but which I can never see.

This, of course, affects our willingness to give up personal data in the first place. The reason U.S. census data was declared off-limits for other uses was to allay Americans’ fears and assure them that they could answer questions truthfully. How accurate would you be in filling out your census forms if you knew the FBI would be mining the data, looking for terrorists? How would it affect your supermarket purchases if you knew people were examining them and making judgments about your lifestyle? I know many people who engage in data poisoning: deliberately lying on forms in order to propagate erroneous data. I’m sure many of them would stop that practice if they could be sure that the data was used only for the purpose for which it was collected.

The second issue about data reuse is error rates. All data has errors, and different uses can tolerate different amounts of error. The sorts of marketing databases you can buy on the web, for example, are notoriously error-filled. That’s OK; if the database of ultra-affluent Americans of a particular ethnicity you just bought has a 10 percent error rate, you can factor that cost into your marketing campaign. But that same database, with that same error rate, might be useless for law enforcement purposes.
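
To make that arithmetic concrete, here is a rough sketch in Python. Nearly every number in it is hypothetical: the list size, the cost per mailing and the function names are invented for illustration, and only the 10 percent error rate comes from the example above.

```python
# Hypothetical illustration: the same error rate, two very different costs.
# All figures except the 10 percent error rate are made up.

def expected_bad_records(records: int, error_rate: float) -> int:
    """Expected number of wrong entries in a list of the given size."""
    return round(records * error_rate)


def cost_per_valid_contact(records: int, error_rate: float,
                           cost_per_mailing: float) -> float:
    """Marketing view: errors just raise the price of each good lead."""
    valid = records - expected_bad_records(records, error_rate)
    return records * cost_per_mailing / valid


def people_wrongly_flagged(records: int, error_rate: float) -> int:
    """Law enforcement view: every bad entry is a person misjudged."""
    return expected_bad_records(records, error_rate)


if __name__ == "__main__":
    size, rate = 100_000, 0.10
    print(f"Cost per valid contact: ${cost_per_valid_contact(size, rate, 0.50):.2f}")
    print(f"People wrongly flagged: {people_wrongly_flagged(size, rate)}")
```

With these invented figures, the marketer pays about 56 cents per valid contact instead of 50, an overhead that fits in a budget. The same list, used to decide who gets questioned or searched, would misjudge 10,000 people.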

Understanding error rates and how they propagate is vital when evaluating any system that reuses data, especially for law enforcement purposes. A few years ago, the Transportation Security Administration’s follow-on watch list system, Secure Flight, was going to use commercial data to give people a terrorism risk score and determine how much they were going to be questioned or searched at the airport. People rightly rebelled against the thought of being judged in secret, but there was much less discussion about whether the commercial data from credit bureaus was accurate enough for this application.

An even more egregious example of error-rate problems occurred in 2000, when the Florida Division of Elections contracted with Database Technologies (since merged with ChoicePoint) to remove convicted felons from the voting rolls. The databases used were filled with errors and the matching procedures were sloppy, which resulted in thousands of disenfranchised voters—mostly black—and almost certainly changed a presidential election result.

Of course, there are beneficial uses of secondary data. Take, for example, personal medical data. It’s personal and intimate, yet valuable to society in aggregate. Think of what we could do with a database of everyone’s health information: massive studies examining the long-term effects of different drugs and treatment options, different environmental factors, different lifestyle choices. There’s an enormous amount of important research potential hidden in that data, and it’s worth figuring out how to get at it without compromising individual privacy.

This is largely a matter of legislation. Technology alone can never protect our rights. There are just too many reasons not to trust it, and too many ways to subvert it. Data privacy ultimately stems from our laws, and strong legal protections are fundamental to protecting our information against abuse. But at the same time, technology is still vital.

Both the Japanese internment and the Florida voting-roll purge demonstrate that laws can change … and sometimes change quickly. We need to build systems with privacy-enhancing technologies that limit data collection wherever possible. Data that is never collected cannot be reused. Data that is collected anonymously, or deleted immediately after it is used, is much harder to reuse. It’s easy to build systems that collect data on everything—it’s what computers naturally do—but it’s far better to take the time to understand what data is needed and why, and only collect that.
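
As a minimal sketch of that discipline, assume a hypothetical toll-billing system. The purpose name, the field names and the `minimize` function are all invented for illustration; the point is only that anything not needed for the stated purpose is dropped before it is ever stored.

```python
# Hypothetical data-minimization filter. Field and purpose names are
# assumptions for illustration, not taken from any real system.

# For each declared purpose, the only fields that may be kept.
PURPOSE_FIELDS = {
    "toll_billing": {"account_id", "amount_cents"},
}


def minimize(record: dict, purpose: str) -> dict:
    """Return a copy of record holding only the fields the purpose needs.

    An undeclared purpose raises KeyError, so nothing is kept by default.
    """
    allowed = PURPOSE_FIELDS[purpose]
    return {key: value for key, value in record.items() if key in allowed}


if __name__ == "__main__":
    raw = {
        "account_id": "A-1001",
        "amount_cents": 250,
        "license_plate": "ABC123",
        "location": "Exit 12",
        "timestamp": "2007-06-28T08:15:00",
    }
    # Only the account and the amount survive; the travel details are
    # never written anywhere, so they cannot be reused later.
    print(minimize(raw, "toll_billing"))
```

The design choice is to list what is needed rather than what is forbidden: a new field added upstream is discarded unless someone decides, on purpose, that billing requires it.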

History will record what we, here in the early decades of the information age, did to foster freedom, liberty and democracy. Did we build information technologies that protected people’s freedoms even during times when society tried to subvert them? Or did we build technologies that could easily be modified to watch and control? It’s bad civic hygiene to build an infrastructure that can be used to facilitate a police state.
