TrackMeNot
In the wake of AOL’s publication of search data, and the New York Times article demonstrating how easy it is to figure out who did the searching, we have TrackMeNot:
TrackMeNot runs in Firefox as a low-priority background process that periodically issues randomized search-queries to popular search engines, e.g., AOL, Yahoo!, Google, and MSN. It hides users’ actual search trails in a cloud of indistinguishable ‘ghost’ queries, making it difficult, if not impossible, to aggregate such data into accurate or identifying user profiles. TrackMeNot integrates into the Firefox ‘Tools’ menu and includes a variety of user-configurable options.
Let’s count the ways this doesn’t work.
One, it doesn’t hide your searches. If the government wants to know who’s been searching on “al Qaeda recruitment centers,” it won’t matter that you’ve made ten thousand other searches as well—you’ll be targeted.
Two, it’s too easy to spot. There are only 1,673 search terms in the program’s dictionary. Here, as a random example, are the program’s “G” words:
gag, gagged, gagging, gags, gas, gaseous, gases, gassed, gasses, gassing, gen, generate, generated, generates, generating, gens, gig, gigs, gillion, gillions, glass, glasses, glitch, glitched, glitches, glitching, glob, globed, globing, globs, glue, glues, gnarlier, gnarliest, gnarly, gobble, gobbled, gobbles, gobbling, golden, goldener, goldenest, gonk, gonked, gonking, gonks, gonzo, gopher, gophers, gorp, gorps, gotcha, gotchas, gribble, gribbles, grind, grinding, grinds, grok, grokked, grokking, groks, ground, grovel, groveled, groveling, grovelled, grovelling, grovels, grue, grues, grunge, grunges, gun, gunned, gunning, guns, guru, gurus
The program’s authors claim that this list is temporary, and that there will eventually be a TrackMeNot server with an ever-changing word list. Of course, that list can be monitored by any analysis program—as could any queries to that server.
In any case, every twelve seconds—exactly—the program picks a random pair of words and sends it to either AOL, Yahoo, MSN, or Google. My guess is that your searches contain more than two words, you don’t send them out in precise twelve-second intervals, and you favor one search engine over the others.
Three, some of the program’s searches are worse than yours. The dictionary includes:
HIV, atomic, bomb, bible, bibles, bombing, bombs, boxes, choke, choked, chokes, choking, chain, crackers, empire, evil, erotics, erotices, fingers, knobs, kicking, harier, hamster, hairs, legal, letterbomb, letterbombs, mailbomb, mailbombing, mailbombs, rapes, raping, rape, raper, rapist, virgin, warez, warezes, whack, whacked, whacker, whacking, whackers, whacks, pistols
Does anyone reall think that searches on “erotic rape,” “mailbombing bibles,” and “choking virgins” will make their legitimate searches less noteworthy?
And four, it wastes a whole lot of bandwidth. A query every twelve seconds translates into 2,400 queries a day, assuming an eight-hour workday. A typical Google response is about 25K, so we’re talking 60 megabytes of additional traffic daily. Imagine if everyone in the company used it.
I suppose this kind of thing would stop someone who has a paper printout of your searches and is looking through them manually, but it’s not going to hamper computer analysis very much. Or anyone who isn’t lazy. But it wouldn’t be hard for a computer profiling program to ignore these searches.
Imagine a cop pulls you over for speeding. As he approaches, you realize you left your wallet at home. Without your driver’s license, you could be in a lot of trouble. When he approaches, you roll down your window and shout. “Hello Officer! I don’t have insurance on this vehicle! This car is stolen! I have weed in my glovebox! I don’t have my driver’s license! I just hit an old lady minutes ago! I’ve been running stop lights all morning! I have a dead body in my trunk! This car doesn’t pass the emissions tests! I’m not allowed to drive because I am under house arrest! My gas tank runs on the blood of children!” You stop to catch a breath, confident you have supplied so much information to the cop that you can’t possibly be caught for not having your license now.
Yes, data mining is a signal-to-noise problem. But artificial noise like this isn’t going to help much. If I were going to improve on this idea, I would make the plugin watch the user’s search patterns. I would make it send queries only to the search engines the user does, only when he is actually online doing things. I would randomize the timing. (There’s a comment to that effect in the code, so presumably this will be fixed in a later version of the program.) And I would make it monitor the web pages the user looks at, and send queries based on keywords it finds on those pages. And I would make it send queries in the form the user tends to use, whether it be single words, pairs of words, or whatever.
But honestly, I don’t know that I would use it even then. The way serious people protect their web-searching privacy is through anonymization. Use Tor for serious web anonymization. Or Black Box Search for simple anonymous searching (here’s a Greasemonkey extension that does that automatically.) And set your browser to delete search engine cookies regularly.
Grant Gould • August 23, 2006 7:13 AM
One approach that occurs to me is to have the “extra” queries be actual queries from other users. This would spread out incriminating queries over many users, bumping up the false-positive rate of any data-mining — as you’ve said, it’s the false-positive rate that is limiting in data-mining analyses.
In addition, there’s no reason to randomize the timing — the plugin could just fire off a large burst of queries every time the user fires one, as timing analysis is unlikely to help much in picking out the real query unless the user is typo-fixing, and that would be detectable under any amount of noise.
(Also — is it reasonable to criticise the bandwidth use of fake-query methods, then recommed TOR which moves many times as many bits per query?)
Of course the best answer of all would be for someone to put together a web search business that doesn’t track its users or store their queries, or does so only in some sort of third-party-controlled out-of-jurisdiction drop-safe. But that doesn’t seem to fit anyone’s business models at the moment.