AOL Releases Massive Amount of Search Data

From TechCrunch:

AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with "buy ecstasy" and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.

This is search data for roughly 658,000 anonymized users over a three month period from March to May -- about 1/3 of 1 per cent of their total data for that period.

Now AOL says it was all a mistake. They pulled the data, but it's still out there -- and probably will be forever. And there's some pretty scary stuff in it.

You can read more on Slashdot and elsewhere.

Anyone who wants to play NSA can start datamining for terrorists. Let us know if you find anything.

EDITED TO ADD (8/9): The New York Times:

And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia."

It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. "Those are my searches," she said, after a reporter read part of the list to her.

Posted on August 8, 2006 at 11:02 AM • 41 Comments

Comments

regular_reader • August 8, 2006 11:39 AM

I'm interested in examples of user identification. That is, what are some of the ways people can be solidly identified from their searches? Who's got hard examples?

Searching on your own name does not seem good enough to me.

Mike Schiraldi • August 8, 2006 11:42 AM

With basic grep skills, you can easily find credit card numbers and SSNs in the dataset. Also, there are at least two username/password/website triplets.
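The kind of pattern scan described above can be sketched roughly as follows. This is only an illustration of why such strings are easy to spot (and therefore what a sanitizing pass would have needed to catch), not a reconstruction of what anyone actually ran. The two regular expressions, the function names, and the assumption that SSNs appear as 123-45-6789 and card numbers as 13 to 16 digits are all hypothetical; the Luhn checksum is the standard check used to weed out digit runs that cannot be card numbers.

```python
import re

# Assumed formats (not taken from the dataset): SSNs written as 123-45-6789,
# card numbers as 13-16 digits optionally separated by spaces or hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(query: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(query)]
    for m in CARD_RE.finditer(query):
        digits = re.sub(r"\D", "", m.group())
        if luhn_valid(digits):
            hits.append(("card", m.group()))
    return hits
```

A scan like this will produce false positives (any nine digits in that shape look like an SSN), so it is better suited to flagging lines for redaction than to proving anything about a user.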

another_bruce • August 8, 2006 11:47 AM

aol is the short bus of the internet. i don't watch "who wants to be a millionaire" very often, but the funniest thing: one of the lifelines is "ask the audience" and the show displays the audience's answers, and separately, the answers of the aol audience. there is a discrepancy between the two displays, and 99% of the time it runs in only one direction. the user featured in the link "wife killer" clearly doesn't know how to frame search terms to get maximum useful results without a lot of extraneous material. lol@aol's corporate wanker apologizing on there, tell us, wanker, did anyone lose their job over this?

Nicholas Weaver • August 8, 2006 12:49 PM

As a researcher, this data is amazingly useful. You can potentially do all sorts of interesting correlation/understanding.

I think the big problem with the release was that the words were left in the clear. If instead the words, and the domains, were all replaced with "word1 word2 word3" (namely, by the same algorithm as the user replacement), you could do MOST of the interesting things, but not all.

After all, it really IS useful to know how many people will type in the SEARCH as "asiansexygoddess.com" rather than going there directly.

It's probably as much related to the typosquatting problem: type the url into the web crawler and it is far more likely to fix the typos!
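The word-replacement idea in this comment might look something like the sketch below. It assumes the simplest possible scheme: each distinct word gets a stable, meaningless label in order of first appearance, the same way a username could be mapped to an ID number. The class name `Pseudonymizer` and its methods are hypothetical, and nothing here is known to match what AOL did for user IDs.

```python
class Pseudonymizer:
    """Map each distinct token to a stable, meaningless label (word1, word2, ...)."""

    def __init__(self) -> None:
        self._ids: dict[str, str] = {}

    def token(self, word: str) -> str:
        # Case-insensitive: "Lilburn" and "lilburn" share one label.
        key = word.lower()
        if key not in self._ids:
            self._ids[key] = f"word{len(self._ids) + 1}"
        return self._ids[key]

    def query(self, text: str) -> str:
        """Replace every whitespace-separated word in a query with its label."""
        return " ".join(self.token(w) for w in text.split())
```

For example, with one shared instance, "landscapers in lilburn" becomes "word1 word2 word3" and a later "homes in lilburn" becomes "word4 word2 word3", so co-occurrence and repetition survive while the words themselves do not. Word frequencies also survive, though, so very common words could probably be guessed by frequency analysis; this is a sketch of the idea, not a privacy guarantee.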

Jeff Heard • August 8, 2006 12:54 PM

Speaking up for my colleagues, I have to say that these kinds of unfiltered collections are absolutely necessary for research in information retrieval and natural language processing, not simply for the scary stuff, but for people who are interested in modern language usage and better search results. That they released it onto a public site was unwise; however, I would hate to see this kind of collection not be released in a controlled manner in the future, as it would make my own and many of my colleagues' research that much harder. It has been shown time and time again that contrived queries and small, reduced, or filtered relevance sets, such as those published by NIST, are useless in determining the actual performance of a search engine.

Joseph • August 8, 2006 12:55 PM

What I would find most interesting is if there are any references to *my* name, or company, or cause, or whatever. Is there anybody out there searching for me? I might just download all 500 megs to find out-

Petréa Mitchell • August 8, 2006 1:02 PM

So how do you tell the difference between the person who types "buy ecstasy" or "how to kill your wife" who is planning to commit an actual crime, and an aspiring mystery writer?

SM • August 8, 2006 1:16 PM

Well it seems that this has generated enough noise to serve certain people in a particular way. More of the "we need to monitor you for your own protection. see how many bad people are out there so it is ok to give up your privacy". I think this may be staged for this purpose...

Jeff Heard • August 8, 2006 1:20 PM

Petrea, you can't, of course. There's no way for even the most sophisticated data-mining system armed with an incredibly diverse corpus of training data to take 2-6 word queries (the average user query length +- one standard deviation) and classify you as criminal/non-criminal. The best it will do is provide some ranking of the individuals on this list.

Without a clear fix on the criminality of any of the individuals in the list, you can't figure out how far down the list you need to go before you've looked at all the most likely criminals. What you have is a big list that you have to look through by hand.

What you would have to do to improve that is have the search records of several dozen known criminals injected into the collection and see where they come up on the ranking. The improvement, however, would probably not be all that great due to the gigantic amount of white noise in natural language data, especially the peculiar "natural" pidgin language of web-search.
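The evaluation idea in the last paragraph (rank everyone, then see where a few planted, known records land) could be sketched along these lines. Everything here is an assumption made for illustration: the term weights, the additive scoring rule, and the function names are hypothetical, and a real system would use something far more elaborate than summing per-word weights.

```python
def score(queries: list[str], weights: dict[str, float]) -> float:
    """Sum the (assumed) weight of every word across a user's queries."""
    return sum(weights.get(w, 0.0) for q in queries for w in q.lower().split())


def rank_users(logs: dict[str, list[str]], weights: dict[str, float]) -> list[str]:
    """Return user IDs ordered from highest to lowest score."""
    return sorted(logs, key=lambda user: score(logs[user], weights), reverse=True)


def positions_of(known: list[str], ranking: list[str]) -> dict[str, int]:
    """Report the 1-based rank of each planted, known user."""
    return {user: ranking.index(user) + 1 for user in known}
```

If the planted records end up scattered throughout the ranking rather than clustered near the top, that is a concrete sign of the noise problem described above: the list still has to be read by hand.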

Brian • August 8, 2006 1:46 PM

A buddy of mine has a blog where he posts the occasional photo and story. He has few enough hits that it is entertaining to check the referers, to see what search strings led people to his obscure blog.

One day, one of the search strings was " naked". The search came from the same city where was working. We figure it was one of her coworkers. has opted not to speculate on which coworker, because the whole thing is creepy enough as is.

We move in small circles.

Brian • August 8, 2006 1:48 PM

Oh bugger. In my previous post I used angle brackets, which Bruce's blog stripped out. I wonder if it set off the IDS, too? Here is a corrected version:

A buddy of mine has a blog where he posts the occasional photo and story. He has few enough hits that it is entertaining to check the referers, to see what search strings led people to his obscure blog.

One day, one of the search strings was "-female friend- naked". The search came from the same city where -female friend- was working. We figure it was one of her coworkers. -female friend- has opted not to speculate on which coworker, because the whole thing is creepy enough as is.

We move in small circles.

larry • August 8, 2006 1:49 PM

@Joseph
>What I would find most interesting is if there are any references to *my* name, or company, or cause, or whatever. Is there anybody out there searching for me? I might just download all 500 megs to find out-

Web crawlers have probably indexed the whole database by now, so just google for your name and aol:
joseph yournamehere aol

Andre LePlume • August 8, 2006 2:14 PM

@Jeff Heard:

Wouldn't the natural language processing research be equally doable without the "anonymous identifier" AOL used in the datasets? Or is it important in this area to understand whether different queries came from the same source?

McGavin • August 8, 2006 2:49 PM

"AOL username has been changed to a random ID number"


AOL needs to hire a Cryptanalyst.

Anonymous • August 8, 2006 2:51 PM

@Andre LePlume

For most applications, yes. However, there is an interesting thread of research (not mine) going on in the security world called "misuse detection" which aims at detecting identity theft inside a term-oriented interaction environment, such as a search engine (pardon the conflation of both search and NL -- they often go together and this research focuses more on NL, it's just the example I'm giving)

One of the things you must do is build a profile of terms that go with an identity. These are created using terms from queries and of course, must be keyed to an identity like the one AOL gave along with each query.

Once this profile is built, anyone claiming the identity that goes with the profile is checked against it whenever they issue a query, and if they fail to match up sufficiently, they are sent a secure message which tells them that someone might be committing identity theft against them.

One of the real problems with the line of research is obtaining useful profiles. Profiles were created selecting a number of queries from NIST and arbitrarily assigning identifiers to them, but accuracy of the system is highly dependent on the diversity of the profile (it can be too high or too low), so a realistic profile like one that could be generated from this dataset would eliminate a major stumbling block.

This kind of software can be used in all kinds of applications, from preventing corporate espionage (think about the Acxiom SSN/CCN/Name/Address theft case from a few years back) to showing supporting evidence to the RIAA that in fact you hate Britney Spears and wouldn't be caught dead downloading it despite the fact that they have an IP address claiming to be you. I'm sure the research has nefarious uses as well, but the intent of it is actually to *prevent* identity theft.
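A toy version of the profile check described in this comment might look like the following. It assumes a bag-of-words profile and cosine similarity with an arbitrary threshold; the comment does not say which representation or measure the actual research uses, so the function names, the threshold value, and the choice of cosine similarity are all stand-ins.

```python
import math
from collections import Counter


def build_profile(queries: list[str]) -> Counter:
    """Count how often each word appears across an identity's past queries."""
    return Counter(w for q in queries for w in q.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def looks_like_owner(profile: Counter, query: str, threshold: float = 0.1) -> bool:
    """Return True if a new query overlaps the stored profile enough (assumed cutoff)."""
    return cosine(profile, build_profile([query])) >= threshold
```

A query that fails the check would trigger the warning message described above. As the comment notes, how well this works depends heavily on how diverse the profile is, which a single fixed threshold like this one does not account for.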

I Am Not Your Lab Rat • August 8, 2006 3:06 PM

Nicholas Weaver wrote:
>As a researcher, this data is amazingly useful. You can potentially do all sorts of interesting correlation/understanding.

There are ways to gather this kind of data for the legitimate research that folks like you and Jeff Heard do.

As I'm sure you know, researchers in many other fields go through the time and effort to get informed consent from their subjects. As a matter of ethics as well as law, in some cases.

Technology makes it much easier for folks like you to collect your data than it is for people in medical fields. You can create a browser search-bar plug-in and get people to use it, after consenting to having their searches revealed. Or you could go hat-in-hand to Google et al., and ask them to partner with you in explicitly offering research participation to their users.

My new mantra: "Convenient use is no excuse."

Seth Finkelstein • August 8, 2006 3:13 PM

This could be a great "teachable moment" to educate the public about all the flaws of data-mining, all the ways people leap to conclusions from scanty data. Look at the mini-panic here (over e.g. the "how to kill your wife" string), and imagine if the topic was terrorism and people's careers were on the line.

another_bruce • August 8, 2006 3:35 PM

what we need is a screensaver that will also do random searches during spare cpu cycles. take that, you nosy researchers!

derf • August 8, 2006 4:26 PM

Lots of interesting stuff. Someone should correlate the scary searches and date data with prominent TV show and movie themes. Was there a CSI episode about wife killing near the "how to kill your wife" search?

Jeff Heard • August 8, 2006 5:28 PM

@I am not your lab rat.

I agree with you, and once again, I was not condoning AOL's mistake, simply saying that this kind of data has legitimate use.

I also take some amount of exception in the implication that our field does not do informed consent research. On the contrary, I've been a part of a dozen or so informed-consent studies that have to do with our natural language processing research and will probably be a part of many more. We *do* this. Most or at least many of us *get* it.

However, the big search companies opening their doors to outside researchers levels the playing field for a lot of us, and we like that. It could have been done better, much better. Web-search is not a dead research field because it's been done, but because we don't have access to the volumes and kinds of data that the big guys do.

What you may not think of is that AOL, Google, Yahoo, Amazon, and your flavor of the day search engine keep their query logs and researchers use them internally in ways that would have most privacy advocates cringing. They do it to keep up with each other. Individuals inside the company don't need special security clearance to view these logs, and the logs as an aggregate are largely considered the intellectual property of the company, not the people who issued the queries to the system.

Once again, this is not a legitimizing statement. It's more of a caveat emptor. The internet's dangerous. Anything sufficiently powerful is dangerous. Use it with caution. Just because Google refused to release their data to the FBI doesn't mean that the people inside Google are briefed and vetted on their stances to users' personal privacy and that there are big padlocks on the servers so that people who aren't sufficiently privacy-conscious can't get a query log. There may be, but I'd venture a guess, having worked inside a large publicly-traded growth company centered around individuals' data, that there aren't. People go home with 40 or 50,000 queries to test some new project that they're going to unveil to the boss.

There are several other interesting points to make here, and one of them is to pin down exactly what is a query in terms of property analogy? Is it like trash, in that it's public property once the user has issued it? After all, the little box which everyone hides as soon as they install their browser warns that anything you transmit over this line is insecure, etcetera. Is it like a transaction with a librarian, privileged private communication? After all, it's communication with the same intent. Or is it the property of the entity that listened to it? After all, this is the model most often used by makers of software that runs on the internet.

They've all got their after-alls, but what do you think? My vote goes with the librarian and that a search company should keep to the same code of ethics as a library, but I can see arguments for the other sides that are compelling.

-- J

I Am Not Your Lab Rat • August 8, 2006 7:43 PM

@Jeff Heard

I can see that you do "get it", and I'm sure from Nicholas Weaver's post that he does as well, so I'm sorry for the misdirected lecture on ethics.

I was reacting to the tone of response I've seen in much coverage that seems to minimize the individual privacy rights issue as being an impediment to research. As if that were even a valid consideration, without informed consent.

Clearly the AOL release is waking a few people to the mismatch between their expectation of privacy online and the reality you describe.

(I'm also personally aware of practices relating to HIPAA that would shock insurance customers who think their medical privacy gets any consideration.)

I would have thought by now there would be methods and policies for sanitizing datasets containing individuals' private data for use by researchers and all those marketing consultants who keep losing laptops. (Seriously, isn't Nicholas Weaver's obscurity idea published in some journal by now?)

Like security, respect for privacy seems to require a cherished place in a corporation's culture in order to survive.

But AOL, Yahoo, and even Google won't pay the cost and inconvenience of protecting individual privacy if we, the public, don't demand it.

I, for one, want to have a reasonable expectation of privacy. I'm pretty sure Google can find a way to use aggregate data and still meet that expectation.

Evan • August 8, 2006 9:57 PM

A Face Is Exposed for AOL Searcher No. 4417749:
http://www.nytimes.com/2006/08/09/technology/09aol.html?ex=1312776000&en=996f61c946da4d34&ei=5088&partner=rssnyt&emc=rss

"And search by search, click by click, the identity of AOL user No. 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”

It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches,” she said, after a reporter read part of the list to her."

Jon Sowden • August 8, 2006 11:04 PM

From the NYT link
"... AOL spokesman, Andrew Weinstein, ... said he knew of no other cases thus far where users had been identified as a result of the search data ..."
One _confirmed_ hit in ~ 3 days. Guess what just became the newest national US sport ...

Curt Sampson • August 9, 2006 2:39 AM

I do almost all of my searches via Google (though I've been thinking about changing that, for various reasons).

Earlier this year, I removed all of my Google cookies, disabled cookies from Google sites (well, all sites are disabled by default), and started doing most of my searches through scroogle.com, using Google only very rarely (such as for the odd image search, or a define: search).

I'd been wondering how paranoid this seems to others. Probably a lot less, now.

Martin Ingram • August 9, 2006 2:52 AM

This was a really dumb thing to do and is indicative of a more general underlying problem. This is all about data classification and the different labels that the players involved would put on the data.

AOL clearly regarded (for a while) the data as not being sensitive and so felt free to publish it publicly. Users had a different, and understandable, view that this information was confidential. Hence all the fireworks.

This is not the first time an organisation gets this wrong and I doubt it will be the last.

Martin.

TOMBOT • August 9, 2006 6:33 AM

I thought we just got done (for the umpteenth time) talking about how Data Mining To Find Terrorists is not a worthwhile pursuit. So, Bruce, why are you linking to some bored (and dumb) websurfer's search terms? What does that prove?

TOMBOT • August 9, 2006 6:39 AM

Obviously what we have to look for is evidence of a power curve at work in seemingly random instances of "poop" and "steak and cheese" amidst the endless streams of misspelled symptoms of sociopathy, and then we've got our man. Note that normally "poop" is a state occurring after "steak and cheese" but this individual has reversed that order. Very interesting.

CJ • August 9, 2006 10:30 AM

@Nicholas:

"After all, it really IS useful to know how many people will type in the SEARCH as "asiansexygoddess.com" rather than going there directly."

Actually, there could be a good reason for that. If you're browsing from a mobile phone, and you do a search through Google, any links you visit from those search results will be nicely formatted for your tiny screen by Google.

bob • August 9, 2006 10:58 AM

The advertising infrastructure types keep saying they want this information to make advertising more focused and effective. Ok, fine; so when will it start being such? Most of the spam I get is complete horseshirt that I haven't the remotest interest in.

another_bruce • August 9, 2006 11:24 AM

today's random search terms, pick one from column "a", one from column "b", one from column "c":

column "a"

anorexic
bulimic
counterfeit
deflowered
ergonomic
fractal
gorgeous
holistic
independent
jejune

column "b"

katabatic
luminous
mnemonic
nihilistic
orgasmic
preposterous
quintessential
revolutionary
supplemental
tantric

column "c"

ubermensch
vagina
worcestershire
xylophone
yacht
zymurgy
antimatter
botulism
cephalopod (!)
dracula

have fun!

Joe • August 14, 2006 9:07 AM

If you are on AOL and try to email their Privacy people to find out if you are one of the people they betrayed, they send back the standard press release that they have plastered all over the news. I asked them if I was one of the people who should shore myself up against ID theft, and I asked them if they plan to allow users to OPT OUT of their future search data collection. I have already written to them three times just to see if they are responsible enough to warn the people they placed naked on the net. All I get from them is the same press release response. The other thing is, while all the web news outlets covered this revolting occurrence, AOL chose not to cover their own news. Despicable. A dumpable ISP for sure. Can you imagine anybody ever using their search engine again? Cha ching!

Ricardo • August 15, 2006 5:01 AM

This AOL release of data was a real wake up call for me. I downloaded the data and reviewed some of it. It is FULL of - err - interesting stuff. It should be possible to find the true IDs of quite a few searchers ... which could cause them some severe embarrassment ... especially those who query teacher training courses and then view "speciality websites".

I kill my Google cookies whenever I exit Firefox but I would now love to hide my IP address from Google etc ... but I feel that using a meta-portal is probably worse. The owners could well be gangsters hoping that many illegal queries will go through their portal, thus providing blackmail opportunities. At least Google is flooded with ordinary queries and is run by a real corporation.

Just think what could happen if one of my kids becomes a politician in 20 years time ... a Google database mining exercise could show that he liked searching for, say, "large pregnant women" when he was 14. One political career over. My kids are going to get a lecture from me on the risks to their careers/lives from data mining!

AOL have in fact done the world a great favour ... they have warned us that our whole lives are now being monitored and the data is being stashed away, probably for eternity.


Photo of Bruce Schneier by Per Ervland.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.