Schneier on Security
A blog covering security and security technology.
May 21, 2009
On the Anonymity of Home/Work Location Pairs
Philippe Golle and Kurt Partridge of PARC have a cute paper on the anonymity of geo-location data. They analyze data from the U.S. Census and show that for the average person, knowing their approximate home and work locations -- to a block level -- identifies them uniquely.
Even if we look at the much coarser granularity of a census tract -- tracts correspond roughly to ZIP codes; there are on average 1,500 people per census tract -- for the average person, there are only around 20 other people who share the same home and work location. There's more: 5% of people are uniquely identified by their home and work locations even if it is known only at the census tract level. One reason for this is that people who live and work in very different areas (say, different counties) are much more easily identifiable, as one might expect.
"On the Anonymity of Home/Work Location Pairs," by Philippe Golle and Kurt Partridge:
Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual's home and workplace can both be deduced from a location trace, then the median size of the individual's anonymity set in the U.S. working population is 1, 21 and 34,980, for locations known at the granularity of a census block, census tract and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual's home and work locations can both be deduced from the data. To preserve anonymity, we offer guidance for obfuscating location traces before they are disclosed.
This is all very troubling, given the number of location-based services springing up and the number of databases that are collecting location data.
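To make the "anonymity set" idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' method or data: the records, region labels, and the function name `anonymity_set_sizes` are all made up for illustration. It simply counts how many people share each home/work pair, then reports the median set size and the fraction of people who are unique.

```python
from collections import Counter
from statistics import median


def anonymity_set_sizes(records):
    """Map each person to the number of people sharing their (home, work) pair."""
    pair_counts = Counter((home, work) for _, home, work in records)
    return {person: pair_counts[(home, work)] for person, home, work in records}


# Hypothetical toy data: (person, home_region, work_region).
records = [
    ("a", "H1", "W1"),
    ("b", "H1", "W1"),
    ("c", "H1", "W2"),
    ("d", "H2", "W1"),
    ("e", "H2", "W2"),
    ("f", "H2", "W2"),
    ("g", "H2", "W2"),
]

sizes = anonymity_set_sizes(records)
median_size = median(sizes.values())
unique_fraction = sum(1 for s in sizes.values() if s == 1) / len(sizes)

print(f"median anonymity set size: {median_size}")
print(f"fraction uniquely identified: {unique_fraction:.2f}")
```

Making the regions finer (block instead of tract) splits the pairs into smaller groups, which is why the median shrinks toward 1.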
Posted on May 21, 2009 at 6:15 AM
• 39 Comments
Wolfram Alpha will most likely get the census data. Extrapolate ...
This is a subject I have been acutely aware of for over 10 years now, and I'd assumed it was common knowledge.
My knowledge is based around tracking mobile phone handsets as they pass from cell to cell for collecting traffic (as in roads) information for census type activities for town planning etc.
There is also another aspect that people need to be aware of, which is time and geolocation.
The data supposedly is not traceable to an individual user, as all the identifying information has been randomly changed (but remains the same for each journey). However, it is but a statistical moment's work to nail people cold.
The reason is most people are at a certain place (their home) until a certain time in the morning; they then leave (with their mobile phone).
The handover messages as the phone moves from cell site to cell site provide a fairly accurate map of their journey to work. All of this journey usually happens within a quite narrow time frame.
So simply looking for a journey along the same route at approximately the same time has the user identified.
If a user changes their mobile phone then you do not need to know any details other than a new number is following the same (partial) route at the same times over a couple of days to have the information confirmed with a very high degree of confidence.
Hiding user identification is a very, very hard task, and in some cases it just is not possible without making the data you need useless.
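The route-and-time matching described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the cell IDs, pseudonyms, the 20-minute window, and the 0.6 overlap threshold are all invented, and a real analysis would compare several days of journeys rather than one.

```python
def route_overlap(route_a, route_b):
    """Jaccard overlap between the sets of cell IDs two journeys passed through."""
    a, b = set(route_a), set(route_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def link_journeys(known, candidates, max_minutes=20, min_overlap=0.6):
    """Return pseudonyms whose journey resembles `known` in route and start time."""
    known_start, known_route = known
    matches = []
    for pseudonym, (start_minute, route) in candidates.items():
        close_in_time = abs(start_minute - known_start) <= max_minutes
        if close_in_time and route_overlap(known_route, route) >= min_overlap:
            matches.append(pseudonym)
    return matches


# Hypothetical data: start time in minutes after midnight, then the cells visited.
monday = (7 * 60 + 40, ["c1", "c2", "c3", "c4", "c5"])
tuesday = {
    "x91": (7 * 60 + 45, ["c1", "c2", "c3", "c4", "c5"]),  # same commute, new pseudonym
    "q17": (7 * 60 + 50, ["c9", "c8", "c3", "c7"]),         # similar time, different route
    "m03": (13 * 60, ["c1", "c2", "c3", "c4", "c5"]),       # same route, wrong time of day
    "k55": (7 * 60 + 30, ["c2", "c3", "c4", "c5"]),         # partial route only
}

linked = link_journeys(monday, tuesday)
print(linked)
```

A single day can produce more than one candidate, as it does here; repeating the comparison over a couple of days is what narrows it to one.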
Oh one thing I forgot to mention.
If you take another source of information, such as credit/charge card purchase times, and compare, you get a lot of really useful information.
All you need now to make the world into a goldfish bowl is to have RFIDs in clothes, credit/charge cards, passports, etc. "sing out" to doorway type readers, tie the info up with cell data, and suddenly you know almost everything a "peeping tom" / "pervert" / "stalker" wants to know...
Two things come to mind, first is that, besides privacy implications, this speaks to the futility of carpooling.
Second, I wonder if the data was restricted to those for whom there exist distinct home/work location information? I know in my neighborhood, there are significant numbers of people who A) telecommute, B) are self-employed, or C) are retired. All of those people share the same home/work pair.
So this is why census takers are getting gps measurements of everyone's front door?
@Lazlo "All of those people share the same home/work pair."
Wouldn't that reduce the data set for those of us with distinctly separate home and work locations? There's about 1000 people in my neighborhood, and I'd say 500 are retired. That means that there are only 500 data-sets to sift through, and some of those may be telecommuters or house wives.
I once worked on a healthcare site that aspired to be anonymous.
However, I realized that if you are able to connect just age, gender and zip code (no name, no phone, no address, no SSN, no login, no email necessary), you can uniquely identify plenty of people with utter certainty, and some with great confidence.
After all, how many 82 year old women live in a particular zip code? A fair number if it's in Manhattan or Los Angeles; very few -- perhaps even one -- if it's in a sparsely-populated rural area.
Another problem -- if you store data in unconnected tables (e.g. no foreign keys), or even in different databases, the rows can still be linked using timestamps. Let's say a row was inserted in database_one at 3:45 am, and another was inserted in database_two at approximately the same time. If the site is low-traffic, it's quite possible that this was one, single person visiting the site during off-hours.
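The timestamp problem is easy to demonstrate. The sketch below is a toy, not the site's actual schema: the table contents, row IDs, and the 5-second window are assumptions. It pairs rows from two unlinked tables whenever exactly one row in the second table was inserted close in time to a row in the first.

```python
from datetime import datetime, timedelta


def correlate_by_time(rows_a, rows_b, window=timedelta(seconds=5)):
    """Pair rows from two unlinked tables whose insert times are close and unambiguous."""
    pairs = []
    for id_a, ts_a in rows_a:
        near = [id_b for id_b, ts_b in rows_b if abs(ts_b - ts_a) <= window]
        if len(near) == 1:  # exactly one candidate: plausibly the same visitor
            pairs.append((id_a, near[0]))
    return pairs


# Hypothetical rows: (row_id, insert_time) from two tables with no shared key.
profiles = [
    ("p1", datetime(2009, 5, 21, 3, 45, 2)),   # off-hours visit
    ("p2", datetime(2009, 5, 21, 14, 0, 0)),   # busy period
]
answers = [
    ("a7", datetime(2009, 5, 21, 3, 45, 4)),
    ("a8", datetime(2009, 5, 21, 14, 0, 1)),
    ("a9", datetime(2009, 5, 21, 14, 0, 3)),
]

linked = correlate_by_time(profiles, answers)
print(linked)
```

The off-hours row links cleanly; the busy-period row does not, because two candidates fall inside the window. Traffic volume is doing the anonymizing, not the schema.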
@Lazlo - It only means that carpooling is futile if people doing carpooling are not willing to pick up anyone who's more than a block or two from their house.
Sometimes carpooling will work, sometimes it doesn't. I live 10 miles from work, and it isn't until I get within 3 miles of work that I have anyone anywhere near my route that works where I do, so it doesn't much work for me. So I ride a bike instead.
I actually think that the GPS sampling has to do with error checking. A great many districts are not compliant (and probably won't be for a very long time) with E911 street/road addressing standards. Having GPS data allows the statistical pre-sampling (as they are not permitted to post-sample the data) process to be refined. This then permits them to miscount fewer people--and to have a better "idiot check" on their data.
Collecting the GPS information also allows the Tiger Line and Shapefile data (which is often used by public safety agencies) to be updated without much extra cost (for instance as opposed to buying it from Navteq).
As for this somehow being the end of privacy... I've long accepted that this kind of analysis is possible. Privacy for many people is not an end in and of itself, but a means to an end. By controlling the information that we make it EASY for other people to know we structure working relationships around varying degrees of trust. This autonomy is for most people what they think of as privacy--the mechanism for selectively disseminating data. (This is oddly the reason why many young people think nothing of posting all sorts of stuff on social networking sites--they made the choice to put it up there, so their "privacy" alarms remain undisturbed.)
I just noticed that your "posted by" times aren't taking into account local DST rules.
I guess I'm lucky in both living and working in some of the highest density locations in the U.S. I'm willing to bet that hundreds, perhaps thousands, of people both live and work in the same zip codes as I do.
Doesn't surprise me, to be honest. Might be something about the accuracy of urban postal codes in Canada, but typing "(my postal code) to (my work postal code)" into Google Maps almost literally gives me a route from my door to my workplace (less than a quarter-block off my apartment, dead on for my workplace). Of the 100 or so people in close range of either destination, I'm reasonably sure I'm the only overlap. Of the coworkers I know well, I can't think of any who even live within 10 blocks of each other unless they're already connected somehow (roommates, married, blood relatives, etc.).
"After all, how many 82 year old women live in a particular zip code? A fair number if it's in Manhattan or Los Angeles; very few -- perhaps even one -- if it's in a sparsely-populated rural area."
I thought people-per-zip was fairly constant. That is, as population density increases, zip geographic area decreases. Is that not true?
This should not be surprising to any Google user. A few separate pieces of data can be used to find nearly anything.
True story: We found a dog a while back near our cabin. It had one tag on, with the Vet's name & # (a place a hundred miles away), the dog's name, and a second phone # etched on the back of the tag.
Sunday afternoon, so there was no access to the vet. The other number was disconnected. Plugged it into google and got a hit for a defunct hair salon. Took that name to the state's business license site and got the owner's name (still with that bad phone number). The name was relatively uncommon, and another google search with the vet office and the name gave me a street address. Still no phone, though.
Plugged the dog's name and the owner's name into google and listing 1 was the owner's wedding pictures, complete with the dog, on a family web site. The owners were on their honeymoon.
The wedding party, family, etc were all listed. Searching by name and location where we found the dog came up with the mom-in-law's house, only a 1/4 mile from where we found the dog. Called her up, and made her day. She'd been in a panic about that dog.
So did you tell the mom-in-law the story of how you tracked her down? Then was she in a panic about how easy it was?
Anonymity is an illusion at best.
But, I disagree with this assumption as a generality because it only works in dense population areas. Where I live, there are only two zip codes for 50K+ people. And there are 200+ that work in the same building as me. So, probably 40% live in the same zip code.
When I dealt with this kind of information it was common to not return results if there were fewer than some number (usually 5) of results.
That is, if you wanted to know the average income of everyone in the zip code with an income over 500,000, and only 4 people met that criterion, you would not get any response. This prevented someone from creating a specific query to find out about a particular individual. This, of course, doesn't prevent the DB creator from finding out the information, but should prevent users.
The problem now is that people are getting access to the full datasets instead of summary data. The fact that the data is available at the block level is a huge problem.
Also, even at the zip code level there are issues. My workplace has its own zip code, so work zip code uniquely identifies workplace in some situations.
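The "don't answer if fewer than 5 results" rule can be sketched like this. The record layout, the function name `average_income`, and the sample figures are hypothetical; only the threshold of 5 comes from the description above.

```python
MIN_GROUP_SIZE = 5  # threshold mentioned above; refuse to answer below this


def average_income(records, zip_code, min_income):
    """Average income of matching people, or None if the group is too small to release."""
    matching = [
        r["income"]
        for r in records
        if r["zip"] == zip_code and r["income"] > min_income
    ]
    if len(matching) < MIN_GROUP_SIZE:
        return None  # suppressed: answering would expose a handful of individuals
    return sum(matching) / len(matching)


# Hypothetical data for one zip code.
records = [
    {"zip": "11111", "income": 600_000},
    {"zip": "11111", "income": 700_000},
    {"zip": "11111", "income": 800_000},
    {"zip": "11111", "income": 900_000},
    {"zip": "11111", "income": 40_000},
    {"zip": "11111", "income": 50_000},
    {"zip": "11111", "income": 60_000},
]

print(average_income(records, "11111", 500_000))  # only 4 people match, so suppressed
print(average_income(records, "11111", 0))        # 7 people match, so answered
```

Note that a threshold on its own is a weak guard: someone who can run two overlapping queries that each pass the threshold can sometimes subtract one answer from the other. And, as said above, it does nothing once the full dataset is handed out.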
@Rich: Re: ZipCode Area
Zip code size is limited by population in dense areas, but is limited by area in rural areas. Zip codes basically identify the particular post office that delivers your mail.
Google will happily translate any street address into a latitude/longitude pair (for Google Maps) with an error not much larger than a GPS device would produce. If the census is recording GPS positions, it doesn't seem to me to be a privacy issue, since it provides no more information than your street address does in most cases.
@ Joe: "I once worked on a healthcare site, that aspired to be anonymous. However, I realized that if you are able to connect just age, gender and zip code...you can uniquely identify plenty of people with utter certainty, and some with great confidence."
Under HIPAA all geographic subdivisions smaller than a state are considered identifiers. Dates smaller than one year and ages greater than 89 are also identifiers.
A little location data goes a long way? Interesting, and not too surprising.
I've always thought that the ideal way for a location service to work, from the standpoint of personal privacy, was for the infrastructure to allow you to *locate yourself* and then to selectively expose/share that data. This is different than the way many location systems work today, where central collection and correlation of data allow a service provider to hold so much information, not to mention share it with their business partners.
Let me locate myself, and then ask questions like "are there coffee shops" or "are there gas stations" nearby. Let me locate myself and then share that selectively with my friends. This sort of selective exposure of "presence" data serves the users best, in my opinion. It may suffer from the need to have many location-related messages being sent and processed. Still, I have the impression that some newer services are considering this sort of model. We'll see.
Regardless of how a user service works, though, it's hard to prevent the carrier or service provider, who has their own interest in the location data and already has a unique identifier for you, from collecting and storing the data.
@ BF Skinner,
"So...privacy is dead?"
Apparently only if you are of that age...
And according to AlanS it's 90 when you officially do not respond to queries about your health ;)
Hippa Hippa horay I'm more than half way to the point of not being "put out of my privacy" 8)
Yup. It seems I quite regularly run into demos of location services that will record your GPS track. They say it's all carefully anonymized.
I say, "so nobody will be able to tell me apart from all the other people who go to my house and office every day?"
They don't get it. The lure of the location services is too seductive, and the lure of just doing them in the cloud (so your device sends your location to the cloud, which does something with it including storing it, rather than your device knowing your location and fetching data to help you) is also too tempting.
Of course, even the "device knows where you are and fetches" reveals your location if it is just constantly fetching data on where you are at the moment. To be protected, your device has to be pre-fetching data for entire zones before you go to them, which eats more bandwidth.
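A rough sketch of the "pre-fetch whole zones" idea follows. Everything here is an assumption made for illustration: the 0.1-degree tile size, the `ZoneCache` class, and the `fake_server` stand-in for a real location service. The point is only that the server sees a coarse tile, and sees it once, rather than a stream of exact positions.

```python
import math


def tile_for(lat, lon, tile_degrees=0.1):
    """Snap a precise position to the index of the coarse grid tile containing it."""
    return (math.floor(lat / tile_degrees), math.floor(lon / tile_degrees))


class ZoneCache:
    """Fetch points of interest a whole tile at a time and answer later queries locally."""

    def __init__(self, fetch_tile):
        self._fetch_tile = fetch_tile  # hypothetical callable: tile -> list of places
        self._tiles = {}
        self.requests = []  # every tile the server has been asked for

    def nearby(self, lat, lon):
        tile = tile_for(lat, lon)
        if tile not in self._tiles:
            self.requests.append(tile)  # the only thing the server learns
            self._tiles[tile] = self._fetch_tile(tile)
        return self._tiles[tile]


def fake_server(tile):
    """Stand-in for a real service; returns made-up data for the tile."""
    return [f"coffee shop in tile {tile}"]


cache = ZoneCache(fake_server)
first = cache.nearby(37.4021, -122.1476)   # triggers one coarse request
second = cache.nearby(37.4450, -122.1610)  # same tile, answered from the cache

print(cache.requests)
```

The cost is the one named above: each request pulls data for the whole tile, so bigger tiles mean more privacy and more bandwidth. A device that fetches a new tile every time it crosses a boundary still leaks a coarse track, so tiles would need to be fetched ahead of time to hide even that.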
Canadian postal codes are very accurate. Ours is shared by just 3 houses.
I bet they didn't account for large aggregations.
No college town is going to be susceptible to this sort of analysis (at least, for people who work at the college). Heck, there's literally dozens of people who work within three blocks of my house (let alone my zip code) who work the same place I do.
Why are the census takers trying to insist they get coordinates from ONLY front doors? Do you have any special insight for that?
It is upsetting many people who have fences, and dogs.
(I know my own dog is going to pitch a fit. It has taken me years to convince him the nexus of all evil does not live in the UPS man.)
So the point is - there is a big (HUGE) punch to your privacy if you start using location-based services. Uh... well, I sorta saw it coming. It kinda makes sense.
IMHO, it is a natural part of such services. You can not have your cake and eat it too ;)
Also, if you are REALLY concerned about your privacy, why not circumvent census in some manner, and choose living and workplaces in a manner that complicates profiling and search?
I disagree with this assumption as a generality because it only works in dense population areas.
Well, anonymity and privacy are being phased out in our society, which is too bad. It does hit at the core of the society. We will survive but what will come out of it is not clear at this point.
"Why are the census takers trying to insist they get coordinates from ONLY front doors? Do you have any special insight for that?"
I'll give you a clue: their employer does not trust them...
Some houses are trouble, the census person knows before they even open the gate that they don't want anything to do with the house or what is in it.
Imagine, if you will, that you are the census person; as you approach the house you hear mad, frenzied barking. Now as the dog's owner you know little Snuffels is just a great big ball of fluff and fun. But to the census collector it's potentially a trip in an ambulance...
So don't go near the door; just fill in the form yourself. After all, having done a couple of hundred houses, you could probably answer the question just by one look at the house.
Now the man in the census does not like his minions shirking off just because of a little woof action. They figure if you actually have to stand on the doorstep then you are going to do your job instead of faking it...
Further, in the UK it is a legal requirement to fill in the census form, but most people do not want to know for any reason due to the Thatcher Gov stupidity over the "poll tax".
So the census bod has to go back to houses that have not given the forms back over and over again, to get the form to get their daily dollar...
As a census collector at what point do you give up and write your time off?
Now your smarter-than-average census taker realises that, rather than just stuffing the form into the letter box, if they knock on the door and somebody opens it and takes the form meekly, the chances are they are going to fill it in and be there the next time the census collector calls. In other words, they are not going to be a problem.
Now if you don't answer the door when the census collector knocks either you are out or you are trouble.
The census bod makes a note of the house number lightly in pencil etc on the top of the form and puts it back in their bag.
The next day they try to redeliver the form; if there is no answer at the door then you definitely are more trouble than the money they are getting.
So they take the form home and fill it in themselves instead of the householder...
As for your lovable ball of fluff and fun, they probably know more about the UPS man than you do...
As for the Census Man, now there is the embodiment of all evil that Government legislation & taxation represents...
So if you "new world" types are going to throw a fit over a little discounted tea taxation, your god alone knows what you are capable of.
Therefore by definition the Census Man is going to make Terminator 1 look cute and fluffy, which is most definitely going to upset little Snuffels big time ;)
A cable program on police investigations, called "first forty eight hours", recently showed Dallas police investigating a homicide that occurred in front of a number of people.
Police were very frustrated by the lack of willing witnesses, until a 12 year old girl identified the shooter and said she saw him shooting. The television show showed the girl with her face blurred, but she wore very distinctive clothing and the investigator addressed her as "brianna". The shooter has a street name of "K cotty". The apparent milieu of the people seems like one of those communities with the "no snitching" trend going around.
This girl's identity was as good as revealed to all who live close to her in that community, though it is obscure to the huge television audience. How many people in that community would recognize her by the distinctive pants and the identifiers: 12 year old, and brianna? This was an actual case.
That is a smart restriction. Thank you for the correction.
Hopefully my memory is cloudy about the granularity of location (maybe it was just state, and not zip?) and that we stored age range, rather than age. It was four years ago.
But I do recall the timestamp-correlation problem being real, and flagging it to my managers.
And I do think that the general assertion -- that it's easy to pinpoint someone with surprisingly little "anonymous" data -- still stands, and is significantly underestimated by folks.
@Kelly ...It has taken me years to convince him the nexus of all evil does not live in the UPS man.
Ah, but the UPS man fits the ideal definition of a dog defending his turf. An "intruder" invades the dog's territory. Dog puts up a valiant defense. The intruder runs away (actually walks away to save face, but the dog knows better). And this little scene gets repeated over and over, reinforcing the dog's belief that it's properly defending its territory.
@telewatcher: maybe 48hrs used a fake name and gave her a sweatshirt for the interview...
@JimFive: Well I used to work in a building that had four (five digit) zip codes...
I guess it was a good idea for the Germans to eliminate their census in 1987.
As a crew leader for those people taking GPS locations at every front door I may be able to shed some light on what's up with that. We are recording the location of every spot people can live. This includes houses, boats, tents, even cardboard boxes under the freeway overpass. Not all of those locations have addresses. The locations that have addresses we can send census survey packages to, the ones that don't we will need to deliver them ourselves. Also the 45% of the people who won't do their civic duty and fill out the survey by mail we have to call or visit in person. It's good to know where they live when we do that.
We also check and correct the maps, adding new streets, deleting streets that are not there, and correcting street names. Oh, and we do not keep any information on buildings that are non-residential. So we won't have a GPS spot for your business unless someone is living there.
As for the security of the information: we're not allowed to share any personally identifying information with anyone else, not the police, INS or even our own management. Even if you're running a crack house, we can't tell anyone.
Incidentally, this is the same thing the Census has done for every decennial census in the past. The only difference is that it's being collected with GPS and a hand held computer instead of by manually drawing dots on a paper map.
While I'm not allowed to discuss the accuracy of the data or problems we've had collecting it, I will say that I'm not terribly worried about it being used to track me.
Since when is it wrong to assert your rights to privacy, CensusGirl? I give all the necessary information annually, with a big check, to the IRS. Are you asserting I am lying to them on a regular basis? If so, charge me. If not, keep your boondoggle to yourself, and leave me alone.
If you want privacy, you will need to turn off your gadgets and move. You will have peace and quiet... Until some night at 10:20pm, when CensusGirl shows up...
Here's an idea: Take the battery out of your device and carry it separately. Install it when you want to use the device. This is a hassle, for sure, and some devices (eg Blackberry) take forever to start up after such an indignity. But you have it when you want to use it and it is not tattling on you when you don't.
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.