Entries Tagged "search engines"
Page 3 of 4
TrackMeNot runs in Firefox as a low-priority background process that periodically issues randomized search-queries to popular search engines, e.g., AOL, Yahoo!, Google, and MSN. It hides users’ actual search trails in a cloud of indistinguishable ‘ghost’ queries, making it difficult, if not impossible, to aggregate such data into accurate or identifying user profiles. TrackMeNot integrates into the Firefox ‘Tools’ menu and includes a variety of user-configurable options.
Let’s count the ways this doesn’t work.
One, it doesn’t hide your searches. If the government wants to know who’s been searching on “al Qaeda recruitment centers,” it won’t matter that you’ve made ten thousand other searches as well—you’ll be targeted.
Two, it’s too easy to spot. There are only 1,673 search terms in the program’s dictionary. Here, as a random example, are the program’s “G” words:
gag, gagged, gagging, gags, gas, gaseous, gases, gassed, gasses, gassing, gen, generate, generated, generates, generating, gens, gig, gigs, gillion, gillions, glass, glasses, glitch, glitched, glitches, glitching, glob, globed, globing, globs, glue, glues, gnarlier, gnarliest, gnarly, gobble, gobbled, gobbles, gobbling, golden, goldener, goldenest, gonk, gonked, gonking, gonks, gonzo, gopher, gophers, gorp, gorps, gotcha, gotchas, gribble, gribbles, grind, grinding, grinds, grok, grokked, grokking, groks, ground, grovel, groveled, groveling, grovelled, grovelling, grovels, grue, grues, grunge, grunges, gun, gunned, gunning, guns, guru, gurus
The program’s authors claim that this list is temporary, and that there will eventually be a TrackMeNot server with an ever-changing word list. Of course, that list can be monitored by any analysis program—as could any queries to that server.
In any case, every twelve seconds—exactly—the program picks a random pair of words and sends it to either AOL, Yahoo, MSN, or Google. My guess is that your searches contain more than two words, you don’t send them out in precise twelve-second intervals, and you favor one search engine over the others.
Three, some of the program’s searches are worse than yours. The dictionary includes:
HIV, atomic, bomb, bible, bibles, bombing, bombs, boxes, choke, choked, chokes, choking, chain, crackers, empire, evil, erotics, erotices, fingers, knobs, kicking, harier, hamster, hairs, legal, letterbomb, letterbombs, mailbomb, mailbombing, mailbombs, rapes, raping, rape, raper, rapist, virgin, warez, warezes, whack, whacked, whacker, whacking, whackers, whacks, pistols
Does anyone reall think that searches on “erotic rape,” “mailbombing bibles,” and “choking virgins” will make their legitimate searches less noteworthy?
And four, it wastes a whole lot of bandwidth. A query every twelve seconds translates into 2,400 queries a day, assuming an eight-hour workday. A typical Google response is about 25K, so we’re talking 60 megabytes of additional traffic daily. Imagine if everyone in the company used it.
I suppose this kind of thing would stop someone who has a paper printout of your searches and is looking through them manually, but it’s not going to hamper computer analysis very much. Or anyone who isn’t lazy. But it wouldn’t be hard for a computer profiling program to ignore these searches.
Imagine a cop pulls you over for speeding. As he approaches, you realize you left your wallet at home. Without your driver’s license, you could be in a lot of trouble. When he approaches, you roll down your window and shout. “Hello Officer! I don’t have insurance on this vehicle! This car is stolen! I have weed in my glovebox! I don’t have my driver’s license! I just hit an old lady minutes ago! I’ve been running stop lights all morning! I have a dead body in my trunk! This car doesn’t pass the emissions tests! I’m not allowed to drive because I am under house arrest! My gas tank runs on the blood of children!” You stop to catch a breath, confident you have supplied so much information to the cop that you can’t possibly be caught for not having your license now.
Yes, data mining is a signal-to-noise problem. But artificial noise like this isn’t going to help much. If I were going to improve on this idea, I would make the plugin watch the user’s search patterns. I would make it send queries only to the search engines the user does, only when he is actually online doing things. I would randomize the timing. (There’s a comment to that effect in the code, so presumably this will be fixed in a later version of the program.) And I would make it monitor the web pages the user looks at, and send queries based on keywords it finds on those pages. And I would make it send queries in the form the user tends to use, whether it be single words, pairs of words, or whatever.
But honestly, I don’t know that I would use it even then. The way serious people protect their web-searching privacy is through anonymization. Use Tor for serious web anonymization. Or Black Box Search for simple anonymous searching (here’s a Greasemonkey extension that does that automatically.) And set your browser to delete search engine cookies regularly.
AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.
The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with “buy ecstasy” and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.
This is search data for roughly 658,000 anonymized users over a three month period from March to May—about 1/3 of 1 per cent of their total data for that period.
Anyone who wants to play NSA can start datamining for terrorists. Let us know if you find anything.
EDITED TO ADD (8/9): The New York Times:
And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”
It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her.
It’s easy to blow the cover of CIA agents using the Internet:
The CIA asked the Tribune not to publish her name because she is a covert operative, and the newspaper agreed. But unbeknown to the CIA, her affiliation and those of hundreds of men and women like her have somehow become a matter of public record, thanks to the Internet.
When the Tribune searched a commercial online data service, the result was a virtual directory of more than 2,600 CIA employees, 50 internal agency telephone numbers and the locations of some two dozen secret CIA facilities around the United States.
Only recently has the CIA recognized that in the Internet age its traditional system of providing cover for clandestine employees working overseas is fraught with holes, a discovery that is said to have “horrified” CIA Director Porter Goss.
Seems to be serious:
Not all of the 2,653 employees whose names were produced by the Tribune search are supposed to be working under cover. More than 160 are intelligence analysts, an occupation that is not considered a covert position, and senior CIA executives such as Tenet are included on the list.
Covert employees discovered
But an undisclosed number of those on the list—the CIA would not say how many—are covert employees, and some are known to hold jobs that could make them terrorist targets.
Other potential targets include at least some of the two dozen CIA facilities uncovered by the Tribune search. Most are in northern Virginia, within a few miles of the agency’s headquarters. Several are in Florida, Ohio, Pennsylvania, Utah and Washington state. There is one in Chicago.
Some are heavily guarded. Others appear to be unguarded private residences that bear no outward indication of any affiliation with the CIA.
A senior U.S. official, reacting to the computer searches that produced the names and addresses, said, “I don’t know whether Al Qaeda could do this, but the Chinese could.”
Over the past 20 years, there’s been a sea change in the battle for personal privacy.
The pervasiveness of computers has resulted in the almost constant surveillance of everyone, with profound implications for our society and our freedoms. Corporations and the police are both using this new trove of surveillance data. We as a society need to understand the technological trends and discuss their implications. If we ignore the problem and leave it to the “market,” we’ll all find that we have almost no privacy left.
Most people think of surveillance in terms of police procedure: Follow that car, watch that person, listen in on his phone conversations. This kind of surveillance still occurs. But today’s surveillance is more like the NSA’s model, recently turned against Americans: Eavesdrop on every phone call, listening for certain keywords. It’s still surveillance, but it’s wholesale surveillance.
Wholesale surveillance is a whole new world. It’s not “follow that car,” it’s “follow every car.” The National Security Agency can eavesdrop on every phone call, looking for patterns of communication or keywords that might indicate a conversation between terrorists. Many airports collect the license plates of every car in their parking lots, and can use that database to locate suspicious or abandoned cars. Several cities have stationary or car-mounted license-plate scanners that keep records of every car that passes, and save that data for later analysis.
More and more, we leave a trail of electronic footprints as we go through our daily lives. We used to walk into a bookstore, browse, and buy a book with cash. Now we visit Amazon, and all of our browsing and purchases are recorded. We used to throw a quarter in a toll booth; now EZ Pass records the date and time our car passed through the booth. Data about us are collected when we make a phone call, send an e-mail message, make a purchase with our credit card, or visit a website.
Much has been written about RFID chips and how they can be used to track people. People can also be tracked by their cell phones, their Bluetooth devices, and their WiFi-enabled computers. In some cities, video cameras capture our image hundreds of times a day.
The common thread here is computers. Computers are involved more and more in our transactions, and data are byproducts of these transactions. As computer memory becomes cheaper, more and more of these electronic footprints are being saved. And as processing becomes cheaper, more and more of it is being cross-indexed and correlated, and then used for secondary purposes.
Information about us has value. It has value to the police, but it also has value to corporations. The Justice Department wants details of Google searches, so they can look for patterns that might help find child pornographers. Google uses that same data so it can deliver context-sensitive advertising messages. The city of Baltimore uses aerial photography to surveil every house, looking for building permit violations. A national lawn-care company uses the same data to better market its services. The phone company keeps detailed call records for billing purposes; the police use them to catch bad guys.
In the dot-com bust, the customer database was often the only salable asset a company had. Companies like Experian and Acxiom are in the business of buying and reselling this sort of data, and their customers are both corporate and government.
Computers are getting smaller and cheaper every year, and these trends will continue. Here’s just one example of the digital footprints we leave:
It would take about 100 megabytes of storage to record everything the fastest typist input to his computer in a year. That’s a single flash memory chip today, and one could imagine computer manufacturers offering this as a reliability feature. Recording everything the average user does on the Internet requires more memory: 4 to 8 gigabytes a year. That’s a lot, but “record everything” is Gmail’s model, and it’s probably only a few years before ISPs offer this service.
The typical person uses 500 cell phone minutes a month; that translates to 5 gigabytes a year to save it all. My iPod can store 12 times that data. A “life recorder” you can wear on your lapel that constantly records is still a few generations off: 200 gigabytes/year for audio and 700 gigabytes/year for video. It’ll be sold as a security device, so that no one can attack you without being recorded. When that happens, will not wearing a life recorder be used as evidence that someone is up to no good, just as prosecutors today use the fact that someone left his cell phone at home as evidence that he didn’t want to be tracked?
In a sense, we’re living in a unique time in history. Identification checks are common, but they still require us to whip out our ID. Soon it’ll happen automatically, either through an RFID chip in our wallet or face-recognition from cameras. And those cameras, now visible, will shrink to the point where we won’t even see them.
We’re never going to stop the march of technology, but we can enact legislation to protect our privacy: comprehensive laws regulating what can be done with personal information about us, and more privacy protection from the police. Today, personal information about you is not yours; it’s owned by the collector. There are laws protecting specific pieces of personal data—videotape rental records, health care information—but nothing like the broad privacy protection laws you find in European countries. That’s really the only solution; leaving the market to sort this out will result in even more invasive wholesale surveillance.
Most of us are happy to give out personal information in exchange for specific services. What we object to is the surreptitious collection of personal information, and the secondary use of information once it’s collected: the buying and selling of our information behind our back.
In some ways, this tidal wave of data is the pollution problem of the information age. All information processes produce it. If we ignore the problem, it will stay around forever. And the only way to successfully deal with it is to pass laws regulating its generation, use and eventual disposal.
This essay was originally published in the Minneapolis Star-Tribune.
Daniel Solove on Google and privacy:
A New York Times editorial observes:
At a North Carolina strangulation-murder trial this month, prosecutors announced an unusual piece of evidence: Google searches allegedly done by the defendant that included the words “neck” and “snap.” The data were taken from the defendant’s computer, prosecutors say. But it might have come directly from Google, which—unbeknownst to many users—keeps records of every search on its site, in ways that can be traced back to individuals.
This is an interesting fact—Google keeps records of every search in a way that can be traceable to individuals. The op-ed goes on to say:
The government can gain access to Google’s data storehouse simply by presenting a valid warrant or subpoena. . . .
Solove goes on to argue that if companies like Google want to collect people’s data (even if people are willing to supply it), the least they can do is fight for greater protections against government access to that data. While this won’t address all the problems, it would be a step forward to see companies like Google use their power to foster meaningful legislative change.
EDITED TO ADD (12/3): Here’s an op ed from The Boston Globe on the same topic.
We all know that Google can be used to find all sorts of sensitive data, but here’s a new twist on that:
A Spanish astronomer has admitted he accessed internet telescope logs of another astronomer’s observations of a giant object orbiting beyond Neptune but denies doing anything wrong.
Jose-Luis Ortiz of the Institute of Astrophysics of Andalusia in Granada told New Scientist that it was “perfectly legitimate” because he found the logs on a publicly available website via a Google search. But Mike Brown, the Caltech astronomer whose logs Ortiz uncovered, claims that accessing the information was at least “unethical” and may, if Ortiz misused the data, have crossed the line into scientific fraud.
Sidebar photo of Bruce Schneier by Joe MacInnis.