Schneier on Security
A blog covering security and security technology.
November 19, 2009
A Taxonomy of Social Networking Data
At the Internet Governance Forum in Sharm El Sheikh this week, there was a conversation on social networking data. Someone made the point that there are several different types of data, and it would be useful to separate them. This is my taxonomy of social networking data.
- Service data. Service data is the data you need to give to a social networking site in order to use it. It might include your legal name, your age, and your credit card number.
- Disclosed data. This is what you post on your own pages: blog entries, photographs, messages, comments, and so on.
- Entrusted data. This is what you post on other people's pages. It's basically the same stuff as disclosed data, but the difference is that you don't have control over the data -- someone else does.
- Incidental data. Incidental data is data that other people post about you. Again, it's basically the same stuff as disclosed data, but the difference is that 1) you don't have control over it, and 2) you didn't create it in the first place.
- Behavioral data. This is data that the site collects about your habits by recording what you do and who you do it with.
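The list above can be sketched as a small data structure. This is only an illustration: the names `DataType` and `classify_post` are hypothetical, and the rule for content that someone else posts on your own page is an assumption, since the taxonomy does not name that case.

```python
from enum import Enum


class DataType(Enum):
    SERVICE = "service"        # given to the site in order to use it
    DISCLOSED = "disclosed"    # you post it on your own pages
    ENTRUSTED = "entrusted"    # you post it on someone else's pages
    INCIDENTAL = "incidental"  # someone else posts it about you
    BEHAVIORAL = "behavioral"  # the site records it from your activity


def classify_post(me: str, author: str, page_owner: str) -> DataType:
    """Classify a post that concerns `me` by who wrote it and where it sits."""
    if author != me:
        # Assumption: anything you did not create is treated as incidental,
        # even if it happens to sit on a page you own.
        return DataType.INCIDENTAL
    if page_owner == me:
        return DataType.DISCLOSED   # your content, your page
    return DataType.ENTRUSTED       # your content, someone else's control
```

Service and behavioral data never pass through `classify_post`, because they are not posts at all: one is handed to the site at sign-up and the other is recorded by the site itself.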
Different social networking sites give users different rights for each data type. Some are always private, some can be made private, and some are always public. Some can be edited or deleted -- I know one site that allows entrusted data to be edited or deleted within a 24-hour period -- and some cannot. Some can be viewed and some cannot.
And people should have different rights with respect to each data type. It's clear that people should be allowed to change and delete their disclosed data. It's less clear what rights they have for their entrusted data. And far less clear for their incidental data. If you post pictures of a party with me in them, can I demand you remove those pictures -- or at least blur out my face? And what about behavioral data? It's often a critical part of a social networking site's business model. We often don't mind if they use it to target advertisements, but are probably less sanguine about them selling it to third parties.
As we continue our conversations about what sorts of fundamental rights people have with respect to their data, this taxonomy will be useful.
EDITED TO ADD (12/12): Another categorization centered on destination instead of trust level.
Posted on November 19, 2009 at 12:51 PM
• 46 Comments
Oh, you mean the conference at which members of an NGO had their property confiscated by UN security because they had the temerity to point out that China has a firewall?
As Andreas Weigend (former chief scientist, amazon.com) points out, there is also a huge difference between data given explicitly (2, 3, and 4) and data collected implicitly (5).
I'd say there's also a huge difference between (1,2,3) and (4,5).
As you note, items 2,3,4 are all types of "Disclosed data" - I suggest this be made more explicit by naming as follows:
2. Disclosed data (controlled)
3. Disclosed data (entrusted)
4. Disclosed data (incidental)
There is another type of data you need to add to your list.
That is cross-linked or inferred data.
You may not post on a social network site that you live or work in any particular place, or other such details. However, the time of day you post can narrow down your location, as can other things that give indicators of other places to search.
For instance, you may have once posted (as many admins have) to a newsgroup or been mentioned in corporate news (as many execs have). The time of day you post helps narrow your location down to a time zone. This can then be used to filter out other people with similar names (there are at least six people with my name that I have tracked down ;)
As in "traffic analysis", sometimes it's not the message contents that are important but the times and places it originates and ends at.
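The time-of-day inference can be sketched roughly as follows. It is a toy heuristic, not a description of any real tool: the function name `guess_utc_offset` is hypothetical, and it assumes that the quietest stretch of posting hours is when the person sleeps, starting at about 01:00 local time.

```python
from collections import Counter


def guess_utc_offset(post_hours_utc, sleep_start_local=1, window=6):
    """Guess a UTC offset from the hours (0-23, in UTC) at which someone posts.

    Assumption: the quietest `window`-hour stretch is the person's sleep,
    and that sleep starts at `sleep_start_local` o'clock local time.
    """
    counts = Counter(h % 24 for h in post_hours_utc)

    def posts_in_window(start):
        # Total posts in the `window` hours beginning at `start`, wrapping at midnight.
        return sum(counts[(start + i) % 24] for i in range(window))

    quiet_start_utc = min(range(24), key=posts_in_window)
    offset = (sleep_start_local - quiet_start_utc) % 24
    # Report offsets in the range -11..+12 rather than 0..23.
    return offset - 24 if offset > 12 else offset
```

Someone who posts at every hour except 06:00 to 11:00 UTC would be placed at UTC-5 by this sketch. Real posting patterns are far noisier, so the result is only a filter for narrowing candidates, as the comment describes.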
As another example, assume the person trying to track you down has access to credit checking (CCN) and other "marketing target" DBs, and reads on your social site that you have just purchased the latest whiz-bang 90" home movie system with 10-channel surround sound and UWB networking.
There are not many places you could have obtained it, and you may have taken out a credit agreement to buy it, or forgotten to check the "no marketing contact" box on the warranty or other paperwork. Some or all of those details will nail you down cold.
Great taxonomy. I like!
Perhaps worth finding a better term than "incidental disclosure" to describe third-party postings etc. about oneself though; that wasn't an obvious connection to me terminology-wise.
I agree with Clive's comment about "cross-linked"/"inferred" data being different than just "personal" data.
While I am not sure inferences made about data alone (e.g. time zones of postings) warrant a separate class, I do sense there's a substantial difference between a person's disclosed information, and that same information put in a "cross-linked" form with other people's personal information.
I once built a system to gather and cross-link such data, in order to vastly reduce credit card billing fraud, as part of a fraud detection system in a prior e-commerce/telecom position.
Without going into all the details, let's just say that service data that relates just to you ("who you called/who called you") is substantially different in nature from all the data that can be linked to you via all users (named and pseudo-anonymous) of a given service for all time. (Who calls the people who call you, and who do they in turn call? ... a full graph that has distinctly different/greater value when linked with data of many other people using the same/similar service.)
A service-wide graph that links your information with others connected to you in some way (calling/communication patterns, IPs/geographic locations, similar web traffic browsing patterns, credit card aliases used, etc) contains not just your personal information but a potentially exponentially larger amount of context when linked all together.
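A minimal sketch of the "who calls the people who call you" idea, assuming the service's records have already been reduced to a dictionary mapping each user to the set of users they contacted. The function name `contacts_within` and the data shape are illustrative assumptions, not the system described above.

```python
from collections import deque


def contacts_within(graph, start, max_hops):
    """Return {user: hops} for everyone reachable from `start` in at most `max_hops` links.

    `graph` maps a user to the set of users they contacted (treated as directed).
    """
    hops = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if hops[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for contact in graph.get(user, ()):
            if contact not in hops:
                hops[contact] = hops[user] + 1
                queue.append(contact)
    del hops[start]  # the starting user is not their own contact
    return hops
```

Even with a small hop limit the result usually contains far more people than the direct contacts alone, which is the point being made about context growing when records are linked together.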
Cross-linked data can spiral into a mess and get one nowhere, but at times it can be tremendously powerful (cf isolating/identifying friends of Saddam).
Incidental data is more a case of "observational" or "reputational" data.
Bujold described the difference between Honor and Reputation... but what other people post about you is more related to reputation, since it is authored by an observer.
It can be argued that these will have varying relationships to one's "ego"...
@Clive Robinson, GregW
There's a blog on 'de-anonymization' out there (http://33bits.org/) where problems like that are discussed. For example, its author has a paper 'De-anonymizing Social Networks' out (http://randomwalker.info/social-networks/).
Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.
We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.
Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy "sybil" nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary's auxiliary information is small.
@John Campbell,
"It can be argued that these will have varying relationships to one's "ego"..."
Hmm, "personality type" possibly: some people "live inside their own heads" (techie types), others "live inside the heads of others" (your always "networking" managers / execs / politicos / con artists / etc).
Depending on your usage of "ego" you could say the former have none whilst the latter are all ego.
Techie types tend to have "social communications" issues, which tends to get them (unfairly) marked down by others. Whereas your networking types tend to be very good at social communications (and not a lot else), which tends to get them (unfairly) marked up by others.
The desire to be "top dog" etc (ie egotistical) is not in most cases related to ability to do a job or social communications ability (you only have to watch the X-Factor to see when ego/self-belief is badly misplaced).
From the way you phrased your use of "ego" you could also argue that the "ego" has an inverse relationship to ability.
(A point many will have sympathy for as their "networking" brethren steal the credit for their work).
"Our de-anonymization algorithm is based purely on the network topology... ...is robust to noise and all existing defenses,... ...and the adversary's auxiliary information is small."
And people ask me why I pay in cash and don't "twitter" or "facebook" etc etc ;)
All these kinds of information also have differing levels of veracity (insofar as that term means anything any more). The fact that someone associates your name with a photo or a note or a link may or may not mean that it actually has anything to do with you. And information posted on sites may or may not be factual (and may draw, via links, third and fourth parties who have no direct association with a particular social networking site.)
I am amused, for example, to note that the overwhelming number of people following me on Twitter (which I don't really use) are malware bots. Not sure exactly what that says about me, though.
"Perhaps worth finding a better term than 'incidental disclosure' to describe third-party postings etc. about oneself though; that wasn't an obvious connection to me terminology-wise."
I would love a better term, but I can't think of one. I'll take suggestions.
"Oh, you mean the conference at which members of an NGO had their property confiscated by UN security because they had the temerity to point out that China has a firewall?"
Yes, that's the meeting. I hadn't arrived at that point, but everyone there was deliberately not talking about the incident.
"There is another type of data you need to add to your list. That is cross linked or inferred data."
That feels like a subset of behavioral data to me.
Following @uqbar, I'd suggest that 2, 3 & 4 are all variants on disclosed data, and that you're missing the fourth data type in the 2D grid where 'posted by you / posted by someone else' is one axis and 'posted in your area of control / posted in someone else's area of control' is the other.
The important thing about information about you that is posted by someone else in an area outside your control is that (compared to the other three data types) you're less likely to be aware of the disclosure, and so less likely to be able to do anything about it, even if the ability to in some way object to the disclosure exists.
"That feels like a subset of behavioral data to me."
Your type 5 is one half of it (ie traffic analysis on what you do), which may be visible just to the admins of the "social network" site, or to those that have access, restricted or otherwise.
An example of the former is where, for instance, the site owner/admin "outs sockpuppets". The admins have access to data that the site does not normally show (IP address / etc).
An example of the latter has been seen on some "social network" sites. What you think is limited to just a small group (say your family) can become available to many (through their friends lists), either directly (your post and the family member's post) or indirectly (just the family member's half of the post).
The other half of the problem is that the actual data that is cross-linked to is not on your list. In a more general case (ie not just "social networking") it would be a subset of your type 4 data.
That is, it is not on a "social network" site at all, or you have "deleted it" but the site has not, or there are issues to do with metadata.
Examples of the first could be a business web site such as a newspaper's online edition, an e-commerce site (such as Amazon) where you have rated a product, or a "black hat" site that has posted / made available your bank / CC details. Or, as I previously noted, commercially available data on you such as credit rating or marketing DBs.
An example of the second is "orphaned data", where "social network" sites use many servers to build a page. Photos have still been visible on their original URL on the photo server even though the link reference has been removed on the HTML body server.
An example of the third type is unauthorised access due to predictable naming in site URLs that enables a private URL to be fairly easily determined.
One example of this is where a "thumbnail" picture URL could be used to find the high resolution image just by changing the end of the URL (which has commercial implications for those wishing to charge for the high resolution images).
Another is where the URL contains a sequence number either put in by the site software or the user (such as uploading files from a digital camera and not changing the file names in the process).
Then there is the possibility of inferring "missing data". For instance, a site admin might decide to delete one or more entries on an open comments page. The fact that each entry is given a unique serial number may be used to determine how many have been removed from public display, or whether in fact they have been deleted at all or merely had the links removed.
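As a rough sketch of that last inference, assume the visible entries on a page carry consecutive serial numbers, so any gap marks something withheld. The function name `missing_serials` is hypothetical, and on a site that numbers entries globally across many pages a gap may simply be an entry posted elsewhere.

```python
def missing_serials(visible_ids):
    """Return the serial numbers absent between the lowest and highest visible id."""
    present = set(visible_ids)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]
```

For visible ids 101, 102, 105, 106 and 108 this reports 103, 104 and 107 as missing. Whether those entries were truly deleted or merely unlinked would then have to be checked some other way, for example by requesting the entry's URL directly.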
I find your distinction between disclosed data and entrusted data interesting. Specifically, you describe entrusted data as "the same stuff as disclosed data, but the difference is that you don't have control over the data".
I think the term "control" in this context is a little misleading. Once I publish information on my website it's out there in the wild. While I can remove that information from my website, there is very little I can do to erase any trace of it. Copies are made, web-crawlers store indexes of it, and links hold tidbits about it long after I've removed it.
I think I understand the distinction you are trying to make, but I don't have a good way to articulate it without getting caught up in complex IP and ownership questions.
The most surprising part of this list is that nobody seems to be at all interested in security.
"Service data" are the data that the server should never know. And most of the other data should be encrypted to the recipient group.
By analogy it is like communists discussing economy and innovation without even considering markets as an option. They fail by their very approach.
-> Crosslinked Data or not
I am in favour of a separate class for "Crosslinked Data", to especially distinguish it from Behavioural Data, and argue why it differs from the Incidental Data.
Crosslinked Data contains data that is not covered by Behavioural Data, e.g. data from a second community website that can be related to the subject. That is not "behavioural" in any sense but should fit into the crosslink-terminology.
Crosslinked data contains data that is not covered by Incidental Data. If by incidental we mean data which is explicitly linked to a profile by another subject, then crosslinked data covers a larger set of data, namely all the data that can be linked to a profile, with or without anybody consciously linking that information. This especially covers the cases where information is derived by linking different profiles with similar attributes or by transferring attributes from otherwise correlated profiles. (The example of the gay detector as discussed in this blog.)
@Bruce: you give the hint yourself in your phrase:
Great list. I tend to agree with others that 'incidental data' is non-obvious as a label. I think that 'extrinsic data', 'third-party data' or 'external data' work better for me.
@Chris Travers: I readily caught on to the intent there. The control is about what happens leading up to it being published. Entrusted data is handled by someone else as it moves towards being published. Disclosed data doesn't change hands.
Alternatively, assume a forum perfectly obfuscates its content from users who are not logged-in members. Control is now about changing whether this content remains private or goes public. If you run the forum, the information you put there is disclosed. If not, it's entrusted. That's the difference, as I see it.
I don't like "Incidental Data" as a term. "Third-party contributed Data" is the only sensible alternative I can come up with, but I don't like it either.
An interesting thing to try and talk through is the reliability of this data. Is the Disclosed Data more reliable about a person than the Incidental Data?
Maybe Incidental Data needs further subdivision into data contributed by friends, enemies, and family, which is mostly emotionally based, and unequivocal data, which tends to be legally defined as factually accurate and given by authoritative sources, e.g. a university verifying that they did admit somebody into a degree, or an online gaming system verifying a high score that a person clocked up - or is that behavioural?
Also, keep in mind that there are shades of control. I'd argue that some social networks give you control over others' contributions about you. So maybe the definition for Incidental Data should be changed to talk about less control, or about data ownership.
Is there really a difference between disclosed and entrusted? You're entrusting your data to the social network site after all and they can do whatever they want with it (and they do).
"If you post pictures of a party with me in them, can I demand you remove those pictures -- or at least blur out my face?"
You should have thought of that before you took a hit of that bong, Bruce :)
"You should have thought of that before you took a hit of that bong, Bruce :)"
It's a point that politicos are getting paranoid about (google "jackie spliff" ;)
However I think Bruce is becoming more "self image" conscious, what with finding out how much it will cost to "tool up" as an "action hero" 8)
Hello all -
I would think that types 2 & 3 would (or *should*) have the same rights as when someone writes a letter; the author holds an implicit copyright to what was written (subject to exceptions for quoted material, etc.).
Type 4 data would be copyrighted again by author and then subject to libel laws.
Type 1 data - well, I would like to see a comprehensive law along the lines of HIPAA or FERPA governing all disclosed data (not just to social networking sites, but to anyone that requires you give your personal information; be it email address, name and phone number, or whatever).
While I can see that someone has an interest in what data is collected about their behavior (type 5), I am a little less sure about what rights they might have to that data, or what control they should have over it. An analogous question is: what right does a person have to the data collected by a private investigator? However, we regulate PI's, so perhaps similar regulation could be created to hold ISPs and social networking sites (and similar - and not just online, social clubs as well) accountable. This would also include investigatory results (which, if I am understanding some of the above comments, could also be called 'crosslinked data').
Would 'second party observations' apply here?
Family and friends are first level contacts.
Third party implies someone outside formal relationship.
Once you get all the rules in place and enforced, it won't be much of a social networking site anymore, will it? Before the internet, social networking was face to face, we used our own memory (brain) for storage, we transmitted data via voice phone or snail-mail, or with grafitti on sidewalks & walls. The difference is the other parties, their intent, and the effort to process.
"I would think that types 2 & 3 would (or *should*) have the same rights as when someone writes a letter; the author holds an implicit copyright to what was written"
As far as I'm aware, under international agreement you have rights to any creative work, unless you have a contractual obligation otherwise (not sure if a site's T&Cs count as a contract in all jurisdictions).
"subject to exceptions for quoted material, etc."
Err no you still have rights over a derived work providing you are playing by the rules.
For instance the composer of a piece of music has rights to their work, the writer of the words has their rights and likewise the performer has rights and if further work is added by a studio they might have rights as well.
"Type 4 data would be copyrighted again by author and then subject to libel laws."
The laws relating to libel and slander etc are a nightmare and are widely different from jurisdiction to jurisdiction. Due to lax rules, people use the UK's laws to suppress comment from other jurisdictions.
"Type 1 data - well, I would like to see a comprehensive law along the lines of HIPAA or FERPA governing all disclosed data (not just to social networking sites, but to anyone that requires you give your personal information; be it email address, name and phone number, or whatever)."
Again, laws differ. In the UK and Europe you have some rights over your personal data. In the US the data effectively belongs to whoever has collected it. I cannot see companies giving up that right without a significant fight.
"While I can see that someone has an interest in what data is collected about their behavior (type 5), I am a little less sure about what rights they might have to that data, or what control they should have over it."
It is ill defined. In theory your behaviour could be classed as a performance and thus subject to copyright. However, I cannot see a judge enforcing it unless there is something "of merit" to the claim.
"An analogous question is: what right does a person have to the data collected by a private investigator?"
Technically a PI is a private citizen working at a gainful occupation (often) as a self-employed person. In some places, however, they are required to take courses and pass qualifications, for which they are licensed up to an equivalent of a police officer.
"However, we regulate PI's, so perhaps similar regulation could be created to hold ISPs and social networking sites (and similar - and not just online, social clubs as well) accountable."
In Europe they are talking about taking the "common carrier" ("common bearer") privilege away and making them the equivalent of publishers...
"This would also include investigatory results (which, if I am understanding some of the above comments, could also be called 'crosslinked data')."
The problem with crosslinked, inferred or amalgamated data is that when done correctly it is considerably greater than the sum of its parts.
You have to ask at what point it becomes "a work" in its own right even if it is derived.
Nobody has answers to these questions as yet and it is making large organisations nervous.
"incidental data" could be "referential data"
Disclosed data: created yourself, controlled by yourself
Entrusted data: created yourself, controlled by others
Incidental data: created by others, controlled by others
Where does that leave data created by others, but controlled by yourself,
such as comments posted by others on your blog?
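The grid being described can be written out explicitly. This is just a restatement of the three lines above as a lookup table, with the unnamed cell left empty; the names `GRID` and `unnamed_cells` are hypothetical.

```python
# Keys are (created_by, controlled_by); values are the taxonomy's name for that cell.
GRID = {
    ("you", "you"): "disclosed",
    ("you", "others"): "entrusted",
    ("others", "others"): "incidental",
    ("others", "you"): None,  # no name in the taxonomy, e.g. comments on your blog
}


def unnamed_cells(grid):
    """Return the (created_by, controlled_by) pairs that the taxonomy leaves unnamed."""
    return [cell for cell, name in grid.items() if name is None]
```

Service data and behavioral data do not appear in the grid at all, since neither is something a user posts.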
@SvdB: I follow you there. If we're talking semantics, most social networking sites give you control over your @tag (ie on the Facebook platform, if someone tags you on a photo you can remove the tag, and if someone posts on your wall you can remove the post). That's data posted by others that you can control.
But you can't prevent someone calling you an arsehole on his wall or blog (unless you DMCA his ass), just as you can't prevent the bad rep induced by some Google suggestions (which is content posted by others that you can't control).
@Bruce: Wouldn't you think it interesting to extend this taxonomy to Google, which has many 'social' aspects?
@SvdB: I'm not sure if data about you that's under your control but was created by others is important enough to these conversations to require a name. In most ways, it acts like disclosed data.
2 points to you sir for the link, it is both funny and sad.
I suppose I ought to give the perp a point for providing absolute proof that "On the Internet nobody knows you're a mutt".
However, seeing as he then committed one of the more despicable social crimes (blackmail) against young ladies of a very generous spirit, he gets to be less than a complete zero.
how about the relationship itself. it is actually data about two people. so it is in multiple categories at the same time. it is at the same time disclosed and entrusted. if studied carefully, it can be used for behavioral profiling of both people (and in aggregated form, it can be used to analyze the whole graph, communities, groups etc.)
@Zith: The same could be said about Entrusted vs. Incidental data.
years ago a classmate posted pictures of a party on our class homepage. i asked him to blur out my face. it bothered me, that he did not ask anybody before he had published the pictures.
these days, the problem is: is blurring out someone's face enough? if i don't find your face, i may know your friends, or a party you have visited, and i possibly know some of your clothes (from the pictures at your profile). now i'm able to find you via your friends, your clothes or your activity. it's an all or nothing problem. more precisely, it is a problem to control the incidental data.
the behavioral data is a law enforcement problem.
i think that some of the work that palin and dourish proposed with respect to identity mirrors is relevant to both pictures that are put up of you but also generally with respect to how relational information is distributed and controlled. the authors suggest that privacy practices can be developed if there is a mirroring function which shows all the information that is out there about you. so, not just disclosed data, which is what data protection calls for, but also data from all the other categories that bruce is suggesting. the interesting thing about social networks is that the underlying design often would enable such mirroring e.g., you could be linked in pictures, and this could be the starting point of a negotiation with the person who is putting up the picture. the problem is that the category "personal data" does not enable such practices. this is why this taxonomy is interesting.
so, if we follow palin and dourish behavioral data is not just a matter of law enforcement, but it is also data that users need to be and can be made aware of. i see multiple technical solutions:
- make behavioral analysis results available to the users (this could be sold to the companies on the grounds that users could also contribute to better analysis results, i.e., the user can say, this is not how i would categorize myself or i do not want to be categorized as such. zwick and dholakia have written about this and amazon partially practices this with their recommender system)
- users could collaborate (and this would be easily possible on social networks) to collect and analyze their own data, and try to undermine the data mining practices of the providers. this is even further possible if the companies' data mining practices can be inferred. this can be partially done based on publicly available data (i.e., patents), or simply by assuming that the data that users could collect collaboratively would be a subset of the data, and would allow some counter-analysis. there is a master's thesis on such counter-surveillance practices by james dutrisac and the data mining community have recently published papers about the vulnerability of data mining techniques to attacks. such vulnerabilities could be exploited for the sake of re-categorization and in the worst case corruption of the analysis methods of providers.
if we could explore technologies of feedback and awareness, we would be able to find out, how realistic it would be to develop collaborative models that interact with data mining (if the companies collaborate, positively, if not then maybe not so positively). i think these are interesting research but also practical questions.
Currently I'm doing a research project on social networking and (national) security. I built a wiki to build up a collection of literature sources on this topic. Perhaps readers of Bruce's blog have some good suggestions!
The wiki is located here: http://snl.intodit.com/
There is a brief discussion in one of the LinkedIn anti-fraud groups referring to an article in Computerworld about how Wachovia is gathering personal info on Facebook and using it (supposedly) to prevent credit card fraud (see http://www.linkedin.com/news?...
The practice, while not new, does revive the paradox of ethical information-gathering from public sources for legitimate (security-enhancement) business purposes, versus compromising of personal privacy. In the Computerworld article case, a security expert was reportedly "shocked" to learn while traveling abroad that Wachovia had learned how old his daughter was and a bunch of other personal stuff by trolling Facebook, and had used that info to authenticate him after they had blocked his Visa card from use when he traveled overseas.
Where do we draw the line here? Should third parties be legally prohibited from searching social networks for information that could support their security efforts? They are, after all, despite being universally vilified for nearly blowing up the world financial system, under constant attack from hackers, fraudsters, robbers, embezzlers, etc. Who should decide what is legal and what is not in the weapons arsenal of an institution attempting to protect its assets???
Another category of data is data regarding non-members that can be cross-correlated by the social networking site.
For instance, when Facebook sends out invitations to non-members, it includes the names of 'other people you may know on Facebook' in the email. It generates those names by searching for the non-member's email address in the address books of members. It's a form of profiling over which the non-member has no control, other than recourse to law perhaps.
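The kind of lookup described here can be sketched in a few lines. This is not Facebook's actual implementation, which is not public; the function name `people_you_may_know` and the shape of `address_books` (member name mapped to the set of email addresses in their uploaded address book) are assumptions for illustration.

```python
def people_you_may_know(non_member_email, address_books):
    """Return the members whose uploaded address books contain the given email.

    `address_books` maps a member's name to a set of email addresses (hypothetical shape).
    """
    target = non_member_email.strip().lower()
    return sorted(
        member
        for member, emails in address_books.items()
        if target in {e.strip().lower() for e in emails}
    )
```

The non-member supplies nothing in this sketch. Every input comes from other people's address books, which is why the non-member has no control over the result.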
This practice is also another way that someone using an assumed identity on the site could be 'outed'.
I always wonder when I take vacation pictures if I should blur the other people out or crop them out before I post them on Facebook. I know people do things they should not and are places where they have said they were not. It is not my place to out them in my vacation picture. But it is a lot of trouble to "take care" of others in that way.