Deanonymizing Taxi Passenger and Fare Data

Interesting essay on the sorts of things you can learn from anonymized taxi passenger and fare data.

Posted on October 22, 2014 at 5:54 AM • 4 Comments

Comments

nonnee • October 22, 2014 1:00 PM

When you have big data, you can always infer something from it that will lead to some deanonymizing. Get enough tweets and you can build a profile of each user based solely on what they talk about or whom they follow.

In WWII, the British would hear encoded transmissions from the Germans, and just based on the voice tone they could know when a troop was moving or not.

blackbox • October 22, 2014 1:31 PM

Beautiful example for those who struggle to get their heads around why big data anonymity breaches really affect you as an individual.

Having said that, I am a sucker for data analysis and can't wait to play with the data myself. Here are the torrents with the original data set (for those who've sadly realized that the andresmh.com link is broken):

http://chriswhong.com/wp-content/uploads/2014/06/nycTaxiTripData2013.torrent
http://chriswhong.com/wp-content/uploads/2014/06/nycTaxiFareData2013.torrent

Clive Robinson • October 22, 2014 4:07 PM

@ nonnee,

In WWII, the British would hear encoded transmissions from the Germans, and just based on the voice tone they could know when a troop was moving or not.

Err, no. The Germans used Morse or RTTY for encoded messages; voice was for tactical use, such as by tank commanders and pilots.

What the "Brits" did do in WWII was learn the operator's "fist" and the transmitter's "tone", which did change when a unit was on the move.

What it also did was show attempts by the Germans at deception. One of the ways the V1 rocket testing site was found was that the Germans moved a signals unit but tried to cover it up with dummy traffic between just a handful of operators and transmitters. You can read more about it in Prof. R.V. Jones's 1973 book.

x0017A • October 23, 2014 3:44 PM

Somewhat tangential to the security/disclosure aspect of this, I find it interesting that neither of the celebrities tipped their drivers, assuming the author just picked two at random and didn't specifically look for ones who didn't tip. Given that these people are known to be well-off and have a public image to maintain, I find it more plausible that the drivers neglected to record a cash tip than that the passengers didn't tip at all. I'm thinking that adds at least a bit of fuzziness to some of the things you can calculate (e.g. the drivers' annual income, which is probably something like 5-15% higher than what you'd calculate from the fares).

