Apple's Differential Privacy

At the Apple Worldwide Developers Conference earlier this week, Apple talked about something called “differential privacy.” We know very little about the details, but it seems to be an anonymization technique designed to collect user data without revealing personal information.
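Apple hasn't said exactly which mechanism it uses, but the textbook building block for collecting data this way is randomized response, where each device randomizes its answer before reporting it and only the aggregate is meaningful. A minimal sketch, purely illustrative (the parameters and names below are mine, not Apple's):

```python
import random

# A generic sketch of randomized response, the classic building block of
# local differential privacy. Illustrative only -- Apple has not published
# which mechanism it actually uses, and these parameter names are mine.

def randomized_response(truth: bool, p_lie: float = 0.25) -> bool:
    """With probability p_lie report a coin flip, otherwise report the truth.

    Any single report is deniable, yet an aggregator holding many reports
    can still estimate the population-level frequency."""
    if random.random() < p_lie:
        return random.random() < 0.5  # random answer, independent of the truth
    return truth

def estimate_true_rate(reports, p_lie: float = 0.25) -> float:
    """Invert the known noise: observed = (1 - p_lie) * true + p_lie * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - p_lie * 0.5) / (1 - p_lie)

if __name__ == "__main__":
    population = [random.random() < 0.3 for _ in range(100_000)]  # 30% true rate
    reports = [randomized_response(x) for x in population]
    print(round(estimate_true_rate(reports), 3))  # prints something close to 0.3
```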

What we know about anonymization is that it’s much harder than people think, and it’s likely that this technique will be full of privacy vulnerabilities. (See, for example, the excellent work of Latanya Sweeney.) As expected, security experts are skeptical. Here’s Matt Green trying to figure it out.

So while I applaud Apple for trying to improve privacy within its business models, I would like some more transparency and some more public scrutiny.

EDITED TO ADD (6/17): Adam Shostack comments. And more commentary from Tom’s Guide.

EDITED TO ADD (6/17): Here’s a slide deck on privacy from the WWDC.

Posted on June 16, 2016 at 9:30 PM • 15 Comments

Comments

Wael June 16, 2016 11:46 PM

    I would like some more transparency and some more public scrutiny.

Perhaps at the right time the information will be shared.

    1. The more information you intend to “ask” of your database, the more noise has to be injected in order to minimize the privacy leakage. This means that in DP there is generally a fundamental tradeoff between accuracy and privacy, which can be a big problem when training complex ML models.
    2. Once data has been leaked, it’s gone. Once you’ve leaked as much data as your calculations tell you is safe, you can’t keep going — at least not without risking your users’ privacy. At this point, the best solution may be to just destroy the database and start over. If such a thing is possible.

While correct, this isn’t complete.
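
To make the quoted trade-off and budget concrete, here is a minimal sketch of the standard Laplace mechanism for a counting query plus a toy privacy budget. This is generic textbook DP, assuming nothing about Apple's actual implementation:

```python
import math
import random

# Generic, textbook differential privacy -- nothing here is Apple's code.

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    """A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    Smaller epsilon = stronger privacy = more noise = worse accuracy."""
    return true_count + laplace_noise(1.0 / epsilon)

class PrivacyBudget:
    """Point 2 above: each answered query spends epsilon; once the budget is
    gone, answering further queries would risk the users' privacy."""
    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def answer(self, true_count: int, epsilon: float) -> float:
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted -- stop querying")
        self.remaining -= epsilon
        return private_count(true_count, epsilon)

if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    print(budget.answer(true_count=500, epsilon=0.5))  # noisy answer, noise scale 2
    print(budget.answer(true_count=500, epsilon=0.5))  # noisy answer, budget now empty
    # A third call would raise RuntimeError: the budget is spent.
```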

Hunter S. June 17, 2016 1:30 AM

Like any large corporate entity, Apple is trying to have its cake and eat it too. At your expense, naturally.

Tim Cook must pay those marketing execs pretty well for them to conjure up such an artful deception:

“Yes we’re spying on you (but, trust me, we’re not) because the money is so good. Oh, and would you like to donate some blood to Apple while you’re at it?”

No doubt Mr. Cook offers a known security celebrity a hat tip for helping to spread Apple’s latest discharge of propaganda to all the rubes. Well played!

ianf June 17, 2016 2:06 AM

Yes, it’d be better for Apple to do nothing and arrest development of its data handling, because then at least they wouldn’t be accused of being opaque by privacy-warrior security experts: all those for whom plugging holes is the butter on their daily bread (apologies for overt procarbohydratism!).

Clive Robinson June 17, 2016 4:40 AM

Apple say the following,

    To obscure an individual’s identity, Differential Privacy adds mathematical noise to a small sample of the individual’s usage pattern.

Which is both inaccurate and misleading.

The constraint on the noise is that its magnitude and sequence are such that, over two or more real data items/subjects,

    The differences added by the noise sum to –or close to– zero.

This is the fatal flaw: whilst this can appear to work for discrete data points, a problem arises for continuous data.

In general, continuous data from physical-world activities has limited variability with respect to the sampling rate or time.

For example, think in terms of a drunk staggering around an empty parking lot and a CCTV camera taking discrete images. Depending on the frequency of the images (one over the sample time), the drunk can move in any random direction, but only over a very small distance. Secondly, the drunk’s direction, though containing a random component, is based in large part on the direction and velocity of his previous movements, in part due to inertia. You do not expect the drunk to pop up around the lot like a “whack-a-mole” in a totally random way, even if the sampling time were large.

If you try adding noise, differential or otherwise, you find that unless it is tailored to the individual drunk’s moves and velocity, the noise can be mapped out. Further, even if it is, if you have two or more drunks, other factors arise that let you determine how the noise is added and thus remove it.
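
As a toy sketch of that “mapping out” (an illustration of the general idea only, not of Apple’s system): a simple moving average over a released track with independent per-sample noise recovers much of the underlying path, because the real signal is correlated from sample to sample while the noise is not.

```python
import random

# A toy 1-D version of the parking-lot example -- an illustration of the
# general idea only, not of any real product.

def drunk_walk(steps: int, step_size: float = 1.0) -> list[float]:
    """Each position is close to the previous one: limited variability."""
    path, pos = [], 0.0
    for _ in range(steps):
        pos += random.uniform(-step_size, step_size)
        path.append(pos)
    return path

def add_noise(path: list[float], noise: float) -> list[float]:
    """Independent per-sample noise, the naive way to hide a track."""
    return [p + random.uniform(-noise, noise) for p in path]

def smooth(path: list[float], window: int = 15) -> list[float]:
    """Moving average: the real signal changes slowly and is kept, while the
    independent noise largely cancels -- it gets 'mapped out'."""
    out = []
    for i in range(len(path)):
        lo, hi = max(0, i - window), min(len(path), i + window + 1)
        out.append(sum(path[lo:hi]) / (hi - lo))
    return out

def rms_error(a: list[float], b: list[float]) -> float:
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

if __name__ == "__main__":
    true_path = drunk_walk(2_000)
    released = add_noise(true_path, noise=10.0)
    recovered = smooth(released)
    print("error before smoothing:", round(rms_error(true_path, released), 2))
    print("error after smoothing: ", round(rms_error(true_path, recovered), 2))
```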

I could go on and explain how other aspects apply, but as a general rule, the more characteristics you have for any given data subject, the easier it is to strip the noise. That is, there is correlation between data items. Some correlations are loose, like “blue eyes and blond hair” or “pale skin, freckles, easy sunburn and red hair”. Some are more interesting, like the relationship between skirt/trouser wearing and a data subject’s sex, and some rules are tight.

The point is that when you add your noise to make your differential data set, you do it in a limited model with just a few rules of your choosing. When another person views your data set, they apply rules of their choosing from an unlimited set, and the difference between your rules and their rules allows them to determine your rule set and peel the noise off like the layers of an onion.

To get anonymity you actually have to either destroy data or encrypt it in a unique and reliable manner; either way it’s of no use for analysis. We know this from the cryptanalysis of stream cipher systems, with the likes of “messages in depth”, “known message structure”, “known message content” and other attacks such as traffic analysis.

The desire to have the two data sets produce similar results is “the loose thread” that an analyst can “pull on to unravel the garment” and “leave the data subject naked”. And the more details in the database, the easier it is to unravel and the more devastating it is for the data subject.

Clive Robinson June 17, 2016 4:55 AM

@ Blake,

It’s funny you should mention the earlier thread…

Because as I said there,

    So some companies will keep with collecting data and some will stop collecting it for now. What will almost certainly happen as the smell of “free money” is almost irresistible to modern corporates is that some way will be found to externalize this risk to a third party, probably through a large data aggregator who can most easily give the data a value added twist.

De-anonymisation is part of the “value added twist” large data aggregators can easily do, if we do not have ways to stop them. And as I noted above, there is no way to have useful data sets that cannot be de-anonymized; it’s that “loose thread”, ‘that will snag and run, and leave a data subject undone’.

Simon Leinen June 17, 2016 6:45 AM

Come on folks, Differential Privacy is not an invention of Apple’s marketing department. It’s a pretty well-defined concept with ample documentation in the scientific literature, and has been around for about ten years. Please check out the Wikipedia page at least, or some of the introductory articles in places like Communications of the ACM.

Yes, it will be important to understand how Apple will build this into actual useful systems, and to characterize to what extent these applications of the method actually protect users’ privacy. But for now I would give them the benefit of the doubt and applaud them for looking for scientifically sound approaches to privacy-respecting analysis of user data, rather than just saying “oh but we anonymize everything” or “but it’s only metadata” or “but you gave us permission” or “trust us”.

Carlos June 17, 2016 9:33 AM

I dunno, for the purpose they state it smells of bullshit.

I mean, I get that DP is probably a good idea when you’re doing general telemetry where individual opinions are irrelevant and you’re only interested in averages and other stats.

You know, the kind of telemetry Microsoft collects in Windows 10, so DP is something that Microsoft should absolutely be doing to the telemetry data.

But Apple is saying they’ll use this data to provide better suggestions to individuals, and that’s a completely different use case.

An example: suppose that I never use the poop emoji, and never ever type the word “banana” on my iDevice. But suppose also that most other iDevice users simply love the poop emoji and can’t stop talking about bananas. Now, when Apple uses this statistical data to provide me with emoji and typing suggestions, because they went out of their way not to know me, they’ll suggest I use poop emojis, and will suggest “banana” when I start to type “ba”.

And that’s not really a better user experience.
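
As a toy sketch of that point (the words and counts below are invented): ranking completions by population-level counts alone keeps suggesting the population’s favourites, no matter what this particular user actually types.

```python
from collections import Counter

# Toy illustration only: the words and counts are made up.

# Population-level counts an aggregator might learn.
population_counts = Counter({"banana": 90_000, "bagel": 4_000, "backup": 2_000})

# This user's own typing history, which never included "banana".
my_counts = Counter({"backup": 40, "bagel": 12})

def suggest(prefix: str, counts: Counter, k: int = 2) -> list[str]:
    """Return the k most frequent completions of `prefix` under `counts`."""
    candidates = [(n, w) for w, n in counts.items() if w.startswith(prefix)]
    return [w for n, w in sorted(candidates, reverse=True)[:k]]

print(suggest("ba", population_counts))              # ['banana', 'bagel']
print(suggest("ba", population_counts + my_counts))  # still ['banana', 'bagel']
print(suggest("ba", my_counts))                      # ['backup', 'bagel']
```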

Orwellian June 17, 2016 10:20 PM

Bottom line: Transmitting data over any communication system that the user does not encrypt before point of entry is no different from passing a note in a classroom and hoping no one will read the note while passing it along. Given the transmitted data’s individualized & aggregated potential sigint value to governments & corporations, any expectation of privacy is foolishly naive, as Edward Snowden has made abundantly clear.

I suspect that the old-school tradecraft of transmitting information will continue to enjoy a resurgence into the foreseeable future, and that the window of opportunity will continue to expand for a private, upgraded commercial version of the US Mail service where you can self-encrypt the information content; with enough popular demand, that could give rise to another Silicon Valley billion-dollar unicorn start-up.

Sean June 23, 2016 2:58 PM

“As expected, security experts are skeptical.”

The comment is misleading. The article quotes only Matthew Green, who is an excellent and respected researcher, but that is one “expert”, not “experts.”

Clive Robinson June 23, 2016 4:31 PM

@ Sean,

    The article quotes only Matthew Green, who is an excellent and respected researcher, but that is one “expert”, not “experts.”

Well, it’s not just Matt Green who is skeptical; by inference it’s also Bruce Schneier.

The simple fact is Apple are trying to go from zero to all customers in as short a time as possible, with a very complex project whose underlying assumptions are currently only theoretical, not practical.

Just about every project with even a small amount of complexity has had to go through several iterations before it got close to the desired requirements. This project not only has a lot of complexity; much of the design requirements are, and will remain, unknown until it gets used by a lot of people.

Thus, whatever code Apple produces, there is a very high probability that it will have not just coding errors but design and specification errors. Those are three very large red flags for security.

So I suspect there will be further experts “stating the obvious” as time goes on.

Does this mean that Apple should not do this project? Well... It would be better from a security aspect if they did not collect the data in the first place. Secondly, any database of such information is a single point of failure/attack. To date, anonymizing systems have not fared well: where people have tried seriously to break them, they have generally succeeded in stripping some or all anonymity away. However, somebody has to put their toe in the water at some point with such technology. It is a matter of personal opinion as to whether Apple will be any better than the likes of Facebook, Google, IBM, Microsoft et al at developing this idea securely. All that can be said is that, so far, Apple appear to take their users’ privacy more seriously than the others.

But the one thing I can certainly say is that the probability of new attack methods being developed against this project is very high. Thus, overall, I suspect that even with the best will in the world Apple will initially fail to produce a secure system, because they cannot design against currently unknown classes of attack, except by luck.

Now the $64,000 question is “Am I being skeptical or pragmatic?”. I’d like to think the latter, even though I come across as the former on this technology.

Curious June 24, 2016 8:52 AM

I wonder if Apple’s “thing” here might depend on this “tech” being more like a paradigm than anything technological. That is, for “this” idea to work, maybe it is assumed that it could only work as an ideal, and would thus require some kind of totalitarian data collection scheme/stunt for “it” to work “in theory”.

I am intentionally vague here, in order to try to sketch out this notion of mine about things perhaps not being as they seem; more importantly, about how there might be undisclosed assumptions on Apple’s part about how they will want things to work. Or in other words: not being a technologist myself, it wouldn’t surprise me if ‘Differential Privacy’ is just a byproduct of something more intrusive, with ‘Differential Privacy’ perhaps turning out to be something of an oxymoron in the end (opposite meanings, sort of).
