In the wake of AOL’s publication of search data, and the New York Times article demonstrating how easy it is to figure out who did the searching, we have TrackMeNot:
TrackMeNot runs in Firefox as a low-priority background process that periodically issues randomized search-queries to popular search engines, e.g., AOL, Yahoo!, Google, and MSN. It hides users’ actual search trails in a cloud of indistinguishable ‘ghost’ queries, making it difficult, if not impossible, to aggregate such data into accurate or identifying user profiles. TrackMeNot integrates into the Firefox ‘Tools’ menu and includes a variety of user-configurable options.
Let’s count the ways this doesn’t work.
One, it doesn’t hide your searches. If the government wants to know who’s been searching on “al Qaeda recruitment centers,” it won’t matter that you’ve made ten thousand other searches as well—you’ll be targeted.
Two, it’s too easy to spot. There are only 1,673 search terms in the program’s dictionary. Here, as a random example, are the program’s “G” words:
gag, gagged, gagging, gags, gas, gaseous, gases, gassed, gasses, gassing, gen, generate, generated, generates, generating, gens, gig, gigs, gillion, gillions, glass, glasses, glitch, glitched, glitches, glitching, glob, globed, globing, globs, glue, glues, gnarlier, gnarliest, gnarly, gobble, gobbled, gobbles, gobbling, golden, goldener, goldenest, gonk, gonked, gonking, gonks, gonzo, gopher, gophers, gorp, gorps, gotcha, gotchas, gribble, gribbles, grind, grinding, grinds, grok, grokked, grokking, groks, ground, grovel, groveled, groveling, grovelled, grovelling, grovels, grue, grues, grunge, grunges, gun, gunned, gunning, guns, guru, gurus
The program’s authors claim that this list is temporary, and that there will eventually be a TrackMeNot server with an ever-changing word list. Of course, that list can be monitored by any analysis program—as could any queries to that server.
In any case, every twelve seconds—exactly—the program picks a random pair of words and sends it to either AOL, Yahoo, MSN, or Google. My guess is that your searches contain more than two words, you don’t send them out in precise twelve-second intervals, and you favor one search engine over the others.
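To make the point concrete, here is a minimal sketch of how an analysis program might discard these queries. It assumes the analyst has a copy of the program's word list; `GHOST_WORDS` below is a hypothetical stand-in holding only a handful of the "G" words quoted above, and the function names are illustrative, not taken from any real tool.

```python
from typing import Iterable

# Hypothetical stand-in for the program's 1,673-word dictionary.
# Only a few of the "G" words quoted above are included here.
GHOST_WORDS = {"gag", "glitch", "gopher", "grok", "grue", "guru"}


def looks_like_ghost(query: str, dictionary: set[str] = GHOST_WORDS) -> bool:
    """Flag a query that is exactly two words, both from the known list."""
    words = query.lower().split()
    return len(words) == 2 and all(w in dictionary for w in words)


def real_queries(log: Iterable[str]) -> list[str]:
    """Return the queries that do not match the ghost-query pattern."""
    return [q for q in log if not looks_like_ghost(q)]
```

A real profiler would presumably also check the fixed twelve-second spacing, which would catch ghost queries even after the word list changes.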
Three, some of the program’s searches are worse than yours. The dictionary includes:
HIV, atomic, bomb, bible, bibles, bombing, bombs, boxes, choke, choked, chokes, choking, chain, crackers, empire, evil, erotics, erotices, fingers, knobs, kicking, harier, hamster, hairs, legal, letterbomb, letterbombs, mailbomb, mailbombing, mailbombs, rapes, raping, rape, raper, rapist, virgin, warez, warezes, whack, whacked, whacker, whacking, whackers, whacks, pistols
Does anyone really think that searches on “erotic rape,” “mailbombing bibles,” and “choking virgins” will make their legitimate searches less noteworthy?
And four, it wastes a whole lot of bandwidth. A query every twelve seconds translates into 2,400 queries a day, assuming an eight-hour workday. A typical Google response is about 25K, so we’re talking 60 megabytes of additional traffic daily. Imagine if everyone in the company used it.
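The arithmetic is easy to check. This sketch uses the figures above (one query every twelve seconds, an eight-hour day, about 25K per response) and assumes decimal units, 1,000K to the megabyte. The `users` parameter and the 500-person company in the usage note are hypothetical.

```python
INTERVAL_SECONDS = 12   # one ghost query every twelve seconds
WORKDAY_HOURS = 8       # assumed eight-hour workday
RESPONSE_KB = 25        # "about 25K" per search response


def queries_per_day() -> int:
    """Number of ghost queries sent during one workday."""
    return WORKDAY_HOURS * 3600 // INTERVAL_SECONDS


def daily_traffic_mb(users: int = 1) -> float:
    """Extra daily traffic in megabytes (decimal: 1 MB = 1,000 KB)."""
    return users * queries_per_day() * RESPONSE_KB / 1000
```

`queries_per_day()` gives 2,400 and `daily_traffic_mb()` gives 60.0. For a hypothetical company of 500 users, `daily_traffic_mb(500)` gives 30,000 megabytes, or about 30 gigabytes a day.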
I suppose this kind of thing would stop someone who has a paper printout of your searches and is looking through them manually, but it’s not going to hamper computer analysis very much, or anyone who isn’t lazy. It wouldn’t be hard for a computer profiling program to ignore these searches.
As one commentator put it:
Imagine a cop pulls you over for speeding. As he approaches, you realize you left your wallet at home. Without your driver’s license, you could be in a lot of trouble. When he approaches, you roll down your window and shout. “Hello Officer! I don’t have insurance on this vehicle! This car is stolen! I have weed in my glovebox! I don’t have my driver’s license! I just hit an old lady minutes ago! I’ve been running stop lights all morning! I have a dead body in my trunk! This car doesn’t pass the emissions tests! I’m not allowed to drive because I am under house arrest! My gas tank runs on the blood of children!” You stop to catch a breath, confident you have supplied so much information to the cop that you can’t possibly be caught for not having your license now.
Yes, data mining is a signal-to-noise problem. But artificial noise like this isn’t going to help much. If I were going to improve on this idea, I would make the plugin watch the user’s search patterns. I would make it send queries only to the search engines the user does, only when he is actually online doing things. I would randomize the timing. (There’s a comment to that effect in the code, so presumably this will be fixed in a later version of the program.) And I would make it monitor the web pages the user looks at, and send queries based on keywords it finds on those pages. And I would make it send queries in the form the user tends to use, whether it be single words, pairs of words, or whatever.
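As an illustration of the timing point only, here is a minimal sketch of randomized delays between queries. The exponential distribution is an assumption chosen for the example, not something taken from the program's code or its authors' plans, and `next_delay` is a hypothetical helper.

```python
import random


def next_delay(mean_seconds: float, rng: random.Random) -> float:
    """Draw the wait before the next query from an exponential distribution.

    The gaps average mean_seconds but vary from one query to the next,
    unlike a fixed twelve-second timer. The distribution is an assumption.
    """
    return rng.expovariate(1.0 / mean_seconds)


# Example: five delays averaging twelve seconds, seeded for repeatability.
rng = random.Random(0)
delays = [next_delay(12.0, rng) for _ in range(5)]
```

Randomized timing alone would not be enough. The queries would still need to match the user's own search engines, query length, and topics, as described above.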
But honestly, I don’t know that I would use it even then. The way serious people protect their web-searching privacy is through anonymization. Use Tor for serious web anonymization, or Black Box Search for simple anonymous searching (here’s a Greasemonkey extension that does that automatically). And set your browser to delete search engine cookies regularly.
Posted on August 23, 2006 at 6:53 AM