Schneier on Security
A blog covering security and security technology.
August 10, 2010
A Revised Taxonomy of Social Networking Data
Lately I've been reading about user security and privacy -- control, really -- on social networking sites. The issues are hard and the solutions harder, but I'm seeing a lot of confusion in even forming the questions. Social networking sites deal with several different types of user data, and it's essential to separate them.
Below is my taxonomy of social networking data, which I first presented at the Internet Governance Forum meeting last November, and again -- revised -- at an OECD workshop on the role of Internet intermediaries in June.
- Service data is the data you give to a social networking site in order to use it. Such data might include your legal name, your age, and your credit-card number.
- Disclosed data is what you post on your own pages: blog entries, photographs, messages, comments, and so on.
- Entrusted data is what you post on other people's pages. It's basically the same stuff as disclosed data, but the difference is that you don't have control over the data once you post it -- another user does.
- Incidental data is what other people post about you: a paragraph about you that someone else writes, a picture of you that someone else takes and posts. Again, it's basically the same stuff as disclosed data, but the difference is that you don't have control over it, and you didn't create it in the first place.
- Behavioral data is data the site collects about your habits by recording what you do and who you do it with. It might include games you play, topics you write about, news articles you access (and what that says about your political leanings), and so on.
- Derived data is data about you that is derived from all the other data. For example, if 80 percent of your friends self-identify as gay, you're likely gay yourself.
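As a rough illustration of the list above, the three "posted content" categories can be told apart by asking who wrote an item and on whose page it sits. The sketch below is only a minimal reading of the definitions, seen from the point of view of one person (the subject); the names `DataType` and `classify_post` are hypothetical and not taken from any real site's API.

```python
from enum import Enum


class DataType(Enum):
    """The six categories in the taxonomy."""
    SERVICE = "service"
    DISCLOSED = "disclosed"
    ENTRUSTED = "entrusted"
    INCIDENTAL = "incidental"
    BEHAVIORAL = "behavioral"
    DERIVED = "derived"


def classify_post(author: str, page_owner: str, subject: str) -> DataType:
    """Classify a posted item from the subject's point of view.

    Hypothetical helper: it only covers content that someone posts,
    not service, behavioral, or derived data.
    """
    if author == subject:
        # The subject wrote it: on their own page it is disclosed,
        # on someone else's page it is entrusted.
        if page_owner == subject:
            return DataType.DISCLOSED
        return DataType.ENTRUSTED
    # Someone else wrote it about the subject.
    return DataType.INCIDENTAL


own_page = classify_post(author="alice", page_owner="alice", subject="alice")
friends_page = classify_post(author="alice", page_owner="bob", subject="alice")
about_alice = classify_post(author="bob", page_owner="bob", subject="alice")
```

The same item gets a different label depending on whose point of view is taken, which is one reason the categories are worth keeping separate.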
There are other ways to look at user data. Some of it you give to the social networking site in confidence, expecting the site to safeguard the data. Some of it you publish openly and others use it to find you. And some of it you share only within an enumerated circle of other users. At the receiving end, social networking sites can monetize all of it: generally by selling targeted advertising.
Different social networking sites give users different rights for each data type. Some are always private, some can be made private, and some are always public. Some can be edited or deleted -- I know one site that allows entrusted data to be edited or deleted within a 24-hour period -- and some cannot. Some can be viewed and some cannot.
It's also clear that users should have different rights with respect to each data type. We should be allowed to export, change, and delete disclosed data, even if the social networking sites don't want us to. It's less clear what rights we have for entrusted data -- and far less clear for incidental data. If you post pictures from a party with me in them, can I demand you remove those pictures -- or at least blur out my face? (Go look up the conviction of three Google executives in Italian court over a YouTube video.) And what about behavioral data? It's frequently a critical part of a social networking site's business model. We often don't mind if a site uses it to target advertisements, but are less sanguine when it sells data to third parties.
As we continue our conversations about what sorts of fundamental rights people have with respect to their data, and more countries contemplate regulation on social networking sites and user data, it will be important to keep this taxonomy in mind. The sorts of things that would be suitable for one type of data might be completely unworkable and inappropriate for another.
This essay previously appeared in IEEE Security & Privacy.
Edited to add: this post has been translated into Portuguese.
Posted on August 10, 2010 at 6:51 AM
• 39 Comments
I have a tiny suggestion about the presentation.
If I understand this correctly then the entrusted data part could use an addition that makes it clear that I not only entrust data to others but that they can also entrust data to me. I am sure that was implied but it took me a moment to work it out.
"Disclosed data" is too broad a category, IMO. When someone posts something to their Facebook profile, but their profile is not public, even though they have disclosed it it doesn't mean that they are OK with it being available on the public interwebs.
"Disclosed data is what you post on your own pages: blog entries, photographs, messages, comments, and so on."
I have a problem with this in that it should be broken down into subgroups that fall into a matrix of
1, Object tagging info.
2, Aide-mémoire info.
3, Work in progress.
4, Published.
By "Private", "Restricted", "Limited", "Public", "Global".
Object tagging info : is effectively metadata about objects, such as object names, creation dates, upload dates, and other info embedded into objects, such as camera info in some picture files.
Aide-mémoire info : is additional, usually private, information a person might add to an object to help them remember associated information at a later date.
Work in progress : Some services don't allow a user to "build bit by bit" thus others get to see something as published that is not yet complete.
Published : is information the person consents to others seeing within the limits they have prescribed.
The minimum limits a person should have are,
Private : Not available to any other person.
Restricted : Only available to "named individuals" who cannot provide secondary access.
Limited : Available to a named "group" or "community", where secondary access within the group is allowed but not across groups, even when the person who owns the data and the person making secondary access available share several groups the data owner has put the data in.
Public : Available to any user of the service.
Global : Available outside of the service.
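The five limits above could be modelled as a simple read-access check. This is only a sketch under assumptions: the names `Visibility`, `Item`, and `can_view` are made up for illustration, and it covers direct viewing on the service only. It does not attempt to enforce the "no secondary access" conditions, since nothing in code can stop a viewer from copying what they have seen.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class Visibility(Enum):
    PRIVATE = 1     # not available to any other person
    RESTRICTED = 2  # only named individuals
    LIMITED = 3     # members of a named group or community
    PUBLIC = 4      # any user of the service
    GLOBAL = 5      # also available outside the service


@dataclass
class Item:
    owner: str
    visibility: Visibility
    named_viewers: Set[str] = field(default_factory=set)  # used by RESTRICTED
    groups: Set[str] = field(default_factory=set)         # used by LIMITED


def can_view(item: Item, viewer: Optional[str],
             viewer_groups: Set[str], is_service_user: bool) -> bool:
    """Return True if the viewer may read the item directly."""
    if viewer is not None and viewer == item.owner:
        return True
    if item.visibility is Visibility.PRIVATE:
        return False
    if item.visibility is Visibility.RESTRICTED:
        return is_service_user and viewer in item.named_viewers
    if item.visibility is Visibility.LIMITED:
        # Allowed only if the viewer shares a group the owner chose.
        return is_service_user and bool(item.groups & viewer_groups)
    if item.visibility is Visibility.PUBLIC:
        return is_service_user
    return True  # GLOBAL: readable even without an account


family_photo = Item("alice", Visibility.LIMITED, groups={"family"})
bob_sees_it = can_view(family_photo, "bob", {"family"}, True)
carol_sees_it = can_view(family_photo, "carol", {"work"}, True)
```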
Persistent Embarrassment Data -- the cross collection of materials that will haunt you the rest of your life.
This taxonomy is useful to anyone who handles other people's data, too.
The above is one of the articles about the Google case in Italy. The Google executives are accused of not having Google react fast enough to a request to take down the video in question. However they reacted within 2 hours to a request from police which seems pretty good.
We have DMCA requests to take down material that infringes copyright and the police are the ones that should be requesting that illegal material be taken down.
It's not uncommon for someone to want to have things taken down because they are embarrassing - for example when a friend wanted a picture of himself smoking a cigar taken off the net because his parents thought he didn't smoke. But it's vastly different from protecting a victim of a crime from the continued attack of having the crime shown on Youtube for the amusement of other bad people - which is what started the Youtube case in Italy.
What about behavioral data that is automatically made public? I'm thinking of geopositioning data that might come with a picture uploaded from your phone. Not sure what category that falls into. The norm seems to be to assume it will be public and accept that but I'm not at all sure that's a good idea.
Useful way to look at where data comes from, just as it is useful to find various ways to express how we'd like to restrict its dissemination. However, it assumes everybody has one and only one "identity", here that figment of administration called their legal name.
I think this is not tenable in the long term simply because even with only one legal, synthetic identity, people have multiple actual "faces", just as they have multiple groups of family, friends, acquaintances, co-workers, and so on. If you're to codify this then it ought to be expressible such that everyone else is none the wiser. And that includes the social website provider.
One might perhaps allow law enforcement to lift the veil under specific circumstances, but that is assuming a righteous government, which a truly righteous government doesn't do. And anyway, even if deemed to be undesirable, it's still a possibility, so the theory must be capable of coping with people sporting multiple identities, synthetic or otherwise.
The one thing missing on social sites is a time-to-live field for "entrusted" data. Allowing a sharing expiration date would solve a number of my concerns.
I like the fact that you don't try to introduce more granularity into entrusted data.
Any scheme (such as the "Restricted" mentioned in one of the comments) that assumes there is some way to have data made available to others that cannot be used in some undesirable way is doomed to fail.
Having more granularity within social networking sites about the "level" of association one has with other members is something that is completely lacking. Social networking sites take the richness that is present in our personal relationships with others, be they children, parents, siblings, close family, family, close friends, friends, co-workers, or acquaintances, and collapse it down into the one-dimensional linearity that is "friends".
There are some crude mechanisms that provide limited capability to distinguish "level" of association, but they are arcane and primitive for right now.
Until this underlying shortcoming is addressed, attempts to limit entrusted information using the same rules we use in the real world are futile.
The interesting question is whether or not a social networking site needs to enforce this distorting of reality in order to gain scale and only then can the granularity be added afterwards. Or is this just an artifact of the newness of social networking within society?
I believe our transition to an open society is underway, and it would seem to be a good thing, inasmuch as we all need to know that our neighbors are trustworthy. That being said, the magnitude of the potential for deviousness is commensurate with the power to inflict harm, and currently our system of trust affords far too much trust to people who can hide behind a corporate or governmental identity. I've been indoctrinated to the concept that the individual is paramount and the larger systems should be subordinate. This thread of conversation reveals the fallacy that we can create a system to police a public trust in the absence of clear standards and consequences. To date, no corporation or agency of government has been required to compensate an individual for a breach of security in regards to identity theft, let alone any lesser set of circumstance. The difficulty seems to be in determining whether or not individual 'privacy' should have a place at all.
I think the taxonomy Bruce presents may be of little value when it comes to the ongoing privacy debate.
The use of behavioral data to target ads is legitimate. There are many much more sinister uses of social network data: stalking, tracking people, blackmail, etc. And sometimes these are done by people in positions of power, law enforcement, corporations. This is what we need to limit and control.
How to control 'surveillance' through social networks is not an easy question. Legislation may help, privacy settings may help too.
I advocate a somewhat radical approach - make everything explicitly public. Make CCTV feeds public. Use Twitter, instead of e-mail. Use Snoopon.me. People will readjust soon to the brave new world of no privacy and there will be less opportunity for exploiting 'asymmetric information'.
'Incidental' makes trolling sound polite.
Doesn't this taxonomy work with any data out there?
service-data: postal code
disclosed data: the clothes we wear
entrusted data: things we tell others about
incidental data: things others write / tell / know about us without us telling them (in)directly
behavioral data: things people know about us from what we do (meeting the same people every day on the train to work)
derived data: well, whatever others make of all the above data they could get from us (postal code of a poor area -> poor himself?)
Bruce, you should do a more thorough writeup on this from the viewpoint of how these sites do or do not currently implement the 'social contract' that we know from sociology, whether or not there is an analogy there (and to what extent, and where it would deviate from the idea as we know it from sociology), and whether or not that would be desired or feasible.
-Maxim K: You have not tied the "brave new world of no privacy" to the reduction of informational asymmetry. If anything, your idea would exaggerate info asymmetry by making a stakeholder that keeps a secret more powerful than the average Joe Public (who has no secrets at all and thinks no one else does either!). Further, privacy has an intrinsic value besides as an obstacle for investigation.
I advocate privacy for individuals only. Groups/organizations can create an order of magnitude more mischief, so they should have less privacy. I don't think that gov'ts should ever be allowed any privacy at all.
Two other resources worth introducing into this discussion are the following:
- 'I've Got Nothing to Hide' and Other Misunderstandings of Privacy - Daniel Solove (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=998565)
- Confessions of an Online Stalker - Kashmir Hill (http://www.assemblyjournal.com/2010/07/confessions-of-an-online-stalker/)
With Mr. Solove's paper, it would be interesting to create a matrix of his Privacy Taxonomy on one axis w/Bruce's Social Networking Data Taxonomy on the other axis, and see where the various services sit against these.
Ms. Hill's article adds the twist that while we might not have any problems w/individual pieces of data being collected or exposed, it's the aggregation of that data that gets super creepy, and perhaps beyond our comfort.
Two good resources to include in this discussion.
As image processing AI gets better, derived data might be the scariest, as it would affect your everyday life. You will be labeled based on your actions and/or visual representation in available "public" photos. Your obesity, sexual proclivities, tobacco use, alcohol use, fashion sense (or lack thereof), hobbies, etc. are public record in these images and trollable by various government agencies in addition to all of the businesses seeking to exploit your habits. Health and car insurance rates, for example, could increase if photos of you indicate that your drinking falls outside the norm.
Thanks for such a good post. Concern about privacy is growing, we have to react, and some basic concepts should be drawn in order to make things easier to talk about.
Your essay is a very clever and simple starting point. Great.
I wonder if "Donated data" might also be a useful category?
Donated data is what other people post on your page. Your Donated data is another person's Entrusted data (with some overtones of Incidental data) but you now have control over it.
It differs from your own Disclosed data in that it exists in your space until you next have the ability to review it and might be accessed or even archived during that time.
This is a good start but it seems very mono-dimensional; it suffers from a single-user view that is not always followed.
Entrusted data, for example, might be something you post on someone else's site but you also might still retain the ability to modify it because...you are that someone else.
The multi-user approach is far more popular with newer users less used to the old models that tried to manage attribution and ownership. Newer users instead have an abandonment strategy that also does not seem to fit into the taxonomy -- noise generation that comes from play/fake data.
>some of it you share only within an enumerated circle of other users
Seems to me that one of the greatest challenges, once we get past shared understanding of taxonomy etc, is how to design a user interface that makes defining those circles of users as simple as possible. At http://uxmatters.com/mt/archives/2010/04/... I describe how the idea of tagging can be extended for that purpose.
Don't use these sites, and don't post anything anywhere about yourself that you wouldn't want the whole world to know forever. End of problem, and Bruce can go back to working on more important issues that are vital to us, like cryptography, electronic and physical security, etc.
You just posted a public comment on a site. Anonymously, but that's the start.
What about things like Amazon, where supplying information allows you to buy goods?
What about when more and more people do this, and it becomes harder to find comparatively rare items without buying online?
The more time goes by, the harder it will be to not post anything, so this thinking isn't wasted work.
There's also a data type that's orthogonal to the ones listed: parasitic data. That's any information that is submitted along with intended data, usually without the knowledge of the user. Examples: EXIF data in an image, user-agent string from a browser, the IP address used, persistent mis-spellings.
There seem to be two kinds of "service data". Some of the service data is required to fulfill legal obligations (e.g., minimum age restrictions). Some of this data would never be disclosed publicly (e.g., a credit card number), but other service data is intended to be disclosed publicly (usually the name and image at a minimum). This is usually to allow others to find you in the network. Then there are things like birthdate and location that may or may not be disclosed, but are often asked for at signup time. There are reasons why that information gets asked for, such as determining whether users are adults or which legal jurisdiction applies to them. Users should be given a clear message about the disclosure of this "service data" when they sign up.
I recently changed my location in facebook to Pitcairn Island. I'm still waiting to see advertising for adventure travel...
I am not a lawyer, but when someone inputs any data into a record-keeping system that is owned by another person, then that other person acquires ownership of the data. Any data that any person legally acquires about any other person becomes their own personal property.
Except for some laws and regulations which govern the use and disclosure of some data such as medical records, and messages which are shielded by attorney-client privilege (not a right), the owner of the data can exercise all customary rights to possess, use, encumber, and dispose of their property as they see fit. Yet, it seems to me that almost every comment on this subject, including Bruce's initial post, inherently assumes that the person who inputs the data has some ultimate control over it which, in fact, they do not have at all.
If you really want privacy to have any meaning, you must campaign to change the laws and precedents of over a hundred years (in the United States), so that recipients of personal data do not own it, and have obligations to respect the ownership of the data by the person to whom it pertains. Until you succeed in doing that, you have done nothing essential, and will never do much that is truly significant in this matter.
First of all the viewpoint you have is of services operating wholly within the US with people in the US. The global reality is very much different, and as we have seen with strong PI legislation in Europe, you can get around it by "off shoring" the data in some way through "call centers" etc.
Secondly if you want to have legislation introduced or changed you have to provide a compelling case. As has been seen with climate change legislation you are at best wrestling with a greased pig in a mud hole.
To provide any kind of case you have to be able to show cause and effect, which means you have to have an agreed set of measurements, which in turn need an agreed set of things to be measured.
Currently we don't have an agreed set of things to measure; setting up a framework by which this can be done is often an involved process. The first step is usually agreeing "common terms of reference", which is what this taxonomy is part of.
Importantly if it is to be of use it has to be global in nature and thus needs to be "culturally aware" either by being "culturally inclusive" or "culturally neutral" the latter being preferred.
@ Kevin McCurley,
I don't know if you realised that your two types of "service data" fall into the "provable" and "non provable" categories.
That is, age, location, etc. are effectively physical attributes that have no way of being reliably verified in the intangible "information world".
Whereas credit card number details etc. are effectively intangible attributes that can be readily verified in the intangible "information world".
Because there is no verifiable mapping between tangible and intangible attributes, a lot of unsound assumptions are made at quite fundamental levels, and currently it is a very open research area.
Thus it may be better to lump them together for the moment until they can be sorted out and better characterised. It may well be the case that what we currently consider service data may not be such on further investigation.
> Entrusted data, for example, might be something you
> post on someone else's site but you also might still
> retain the ability to modify it because...you are that
> someone else.
+5 points to Davi, good observation.
Also, Bruce, I'm not sure where relationship data fits in here. Given that you're talking about social networking data, that seems to be a pretty foundational class you need in your taxonomy.
"Alice knows Bob" (or "Alice hates Bob", or "Alice works with Bob", etc.) isn't exactly disclosed data in this taxonomy, nor is it entrusted, incidental, or behavioral. It's not necessarily owned by a user, because it involves two entities.
In some cases (like Facebook), you have only one way to define your relationship with another entity, and both entities have to affirm the relationship.
In other cases (like, say, a customer review site), you have many different ways to define your relationship ("I hated this restaurant","Best Place in Town for ribs!"), and one entity (the reviewed party) doesn't always have a right to qualifying the relationship. In some other cases, (say, a site that reviews college professors), they may.
> The one thing missing on social sites is
> a time-to-live field for "entrusted" data.
The problem with a TTL field for entrusted data is that honoring it is inherently an honor-system proposition.
It is well understood that DNS resolvers often hold cached answers well past their specified TTL, not out of any malicious intent or because they want to keep obsolete data, but simply because speedy expiration is not a high priority compared to performance. If that's true of mere cached technical data that can be easily retrieved again and are of no value to anyone if they become obsolete, imagine trying to get everyone on the whole internet to "play nice" about speedy expiration and religious TTL observance when the data in question are no longer available from the source but DO have value to the people who retrieved copies. You may as well argue for the RFC 3514 evil bit.
Once the data are disclosed, you can't put the proverbial feathers back in the pillow. The social networking site (Facebook or whoever) through which you initially disclosed them (or entrusted them or whatever) could make them disappear from their site after a time, sure, but the data would still be "out there", assuming anybody was allowed to access the site while the data were visible.
Fundamentally, a TTL field could only ever give you a false sense of privacy. It could never actually return control of your data to you.
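To make the argument above concrete, here is a toy model of a time-to-live field. It is a sketch under assumptions: `Post` and `Site` are hypothetical classes, not any real service's API, and time is passed in as plain numbers so the example is deterministic. The site honours the TTL faithfully, yet a copy that a reader saved before expiry is untouched.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Post:
    body: str
    posted_at: float    # seconds on some clock (assumed)
    ttl_seconds: float  # requested time-to-live

    def expired(self, now: float) -> bool:
        return now >= self.posted_at + self.ttl_seconds


class Site:
    """A well-behaved site that stops serving posts once their TTL passes."""

    def __init__(self) -> None:
        self._posts: Dict[str, Post] = {}

    def publish(self, post_id: str, post: Post) -> None:
        self._posts[post_id] = post

    def read(self, post_id: str, now: float) -> Optional[str]:
        post = self._posts.get(post_id)
        if post is None or post.expired(now):
            return None
        return post.body


site = Site()
site.publish("p1", Post("party photo", posted_at=0.0, ttl_seconds=60.0))

# A reader fetches the post while it is still live and keeps a copy.
saved_copy = site.read("p1", now=10.0)

# After the TTL the site no longer serves it, but the copy still exists.
after_expiry = site.read("p1", now=120.0)
```

The site can only control what it serves; `saved_copy` lives outside its reach, which is the false sense of privacy described above.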
Good write up. I think it's also important for all readers and those analyzing this to overlay the individual control of privacy settings of this data - as that can skew the data sets and categorization.
The taxonomy is interesting, but technologists (and Americans) need to appreciate that *all* personally identifiable information, no matter how it comes to be held by a business, is already subject to information privacy law in many other countries.
If an OSN holds personally identifiable information about its members, then all of that information is subject to privacy law, irrespective of whether it is "service data", "incidental data", "behavioural data" or "derived data". Accordingly (and despite the intuitions of @stardance or the "open society" vision of @gloom), an OSN is tightly constrained in what it can do with behavioural and derived data. Such data -- if personally identifiable -- cannot be used for secondary purposes including advertising without consent, and it cannot be onsold.
Existing European style privacy laws are probably adequate for regulating OSNs, if the laws are applied diligently. And if all concerned remember that it doesn't matter where personal information comes from (including the "public domain"); it still needs to be protected by any organisation that comes to hold it.
A look at California's celebrity-protection laws -- particularly the ones concerning control of one's name and image, and about limiting stalkery paparazzi behavior -- would probably provide some useful legal terms for the discussion. I haven't been able to find a good guide to those laws online, though.
It would also be interesting to see how the laws define who they apply to. I've been wondering if a broad reading might make them applicable to users of social networks. The problems being faced now by the users are basically the same ones celebrities have been dealing with forever anyway, just with smaller amounts of money being made from each piece of information.
This is a useful taxonomy, although I find the definition of "behavioral data" a bit too broad.
I would make a distinction between "traffic data" generated automatically by the interactions of users with the system:
- who your friends are
- how often you communicate with them, browse their pages
- which communities you belong to
- are you a social hub or a poorly connected marginal node?
- from which locations you access the service, at which times
- which devices you use
- which pages you browse
and the content (consciously) uploaded by users to the network:
- which topics you talk about
- in which language, dialect
- spelling mistakes? slang? articulate?
Also, the technologies that address these two types of data are totally different. While encryption can prevent the network from extracting behavioral data from content, the protection of traffic data is far more expensive and difficult.
A better term for derived data might be "inferred data." Derived implies certainty whereas inferred denotes a probability of being true.
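The distinction can be seen in a toy calculation. The sketch below assumes a site holds self-reported attributes for a user's friends and computes the share of friends reporting a given attribute; the function name `share_of_friends_with` and the sample data are hypothetical. The result is at best a crude signal about the user, not a fact and not a calibrated probability.

```python
from typing import Dict, Set


def share_of_friends_with(friends: Dict[str, Set[str]], attribute: str) -> float:
    """Return the fraction of friends who self-report the attribute."""
    if not friends:
        raise ValueError("no friends to infer from")
    hits = sum(1 for attrs in friends.values() if attribute in attrs)
    return hits / len(friends)


# Hypothetical self-reported attributes for five friends of one user.
friends_of_user = {
    "ann": {"cyclist", "vegetarian"},
    "ben": {"cyclist"},
    "cat": {"cyclist", "runner"},
    "dan": {"cyclist"},
    "eve": {"runner"},
}

share = share_of_friends_with(friends_of_user, "cyclist")  # 4 of 5 friends
```

A site that stores this number as a guess is holding inferred data; one that stores "user is a cyclist" has turned a guess into an assertion.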
Schneier.com is a personal website. Opinions expressed are not necessarily those of Co3 Systems, Inc.