Identifying People by Metadata

Interesting research: "You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information," by Beatrice Perez, Mirco Musolesi, and Gianluca Stringhini.

Abstract: Metadata are associated to most of the information we produce in our daily interactions and communication in the digital world. Yet, surprisingly, metadata are often still categorized as non-sensitive. Indeed, in the past, researchers and practitioners have mainly focused on the problem of the identification of a user from the content of a message.

In this paper, we use Twitter as a case study to quantify the uniqueness of the association between metadata and user identity and to understand the effectiveness of potential obfuscation strategies. More specifically, we analyze atomic fields in the metadata and systematically combine them in an effort to classify new tweets as belonging to an account using different machine learning algorithms of increasing complexity. We demonstrate that through the application of a supervised learning algorithm, we are able to identify any user in a group of 10,000 with approximately 96.7% accuracy. Moreover, if we broaden the scope of our search and consider the 10 most likely candidates we increase the accuracy of the model to 99.22%. We also found that data obfuscation is hard and ineffective for this type of data: even after perturbing 60% of the training data, it is still possible to classify users with an accuracy higher than 95%. These results have strong implications in terms of the design of metadata obfuscation strategies, for example for data set release, not only for Twitter, but, more generally, for most social media platforms.
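
A minimal sketch of the "most likely candidate" versus "10 most likely candidates" idea from the abstract, run on synthetic data. It assumes a simple nearest-centroid classifier and made-up numeric metadata fields. The names make_users, sample_tweet, train_centroids, rank_users, and evaluate are hypothetical, and none of this is the authors' actual model, feature set, or data.

```python
import random

def make_users(n_users, n_features, rng):
    # Each hypothetical user has a "profile": a typical value per metadata field.
    return [[rng.uniform(0, 100) for _ in range(n_features)] for _ in range(n_users)]

def sample_tweet(profile, rng, noise=5.0):
    # A tweet's metadata is modeled as the user's profile plus Gaussian noise.
    return [rng.gauss(mu, noise) for mu in profile]

def train_centroids(profiles, rng, tweets_per_user=20):
    # "Training": average the metadata of a few observed tweets per user.
    centroids = []
    for profile in profiles:
        tweets = [sample_tweet(profile, rng) for _ in range(tweets_per_user)]
        centroids.append([sum(col) / len(col) for col in zip(*tweets)])
    return centroids

def rank_users(tweet, centroids):
    # Order user indices from most to least likely (smallest squared distance first).
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(tweet, c))
    return sorted(range(len(centroids)), key=lambda u: dist(centroids[u]))

def evaluate(profiles, centroids, rng, ks=(1, 10), trials=300):
    # Fraction of new tweets whose true author is among the top-k candidates.
    hits = {k: 0 for k in ks}
    for _ in range(trials):
        true_user = rng.randrange(len(profiles))
        ranking = rank_users(sample_tweet(profiles[true_user], rng), centroids)
        for k in ks:
            if true_user in ranking[:k]:
                hits[k] += 1
    return {k: hits[k] / trials for k in ks}

if __name__ == "__main__":
    rng = random.Random(0)
    profiles = make_users(200, 3, rng)
    centroids = train_centroids(profiles, rng)
    print(evaluate(profiles, centroids, rng))
```

Because both scores are computed on the same test tweets, the top-10 figure can never be lower than the top-1 figure, which is the same relationship as the 96.7% and 99.22% numbers quoted above.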

Posted on July 30, 2018 at 6:35 AM • 29 Comments

Comments

me • July 30, 2018 7:08 AM

@schneier
Here is another analysis that identifies people from metadata:
https://labs.rs/en/metadata/
it's about the Hacking Team breach.
From email metadata they found who was the boss, who was selling to North America, and who to other regions,
who comes to work early, who is late,
who went on holiday and when.
they found all of this and much more from metadata alone.

echo • July 30, 2018 8:03 AM

We are decidedly done. Toast. Pickled. Boiled. Up the proverbial without a paddle. Sucked in, chewed up, spat out. Flattened. Bulldozered. Kyboshed.

me • July 30, 2018 10:46 AM

note: i have not read the whole paper, only a part of it and i don't understand how machine learning works.

i don't get the point of the research, i think i'm missing something...
for example they say that they can identify users from the followers count, account creation time and other fields,
because this information is included in each tweet.

but what i think is: well, the account name is also included in each tweet, so you don't even need metadata...
follower count varies over time. if you take 10000 random users, record how many followers each one has,
and 5 minutes later try to find one user by follower count, sure, it works perfectly.

but do it after one month...
follower count doesn't change very much: for example, if i have 1 million followers and someone else has 10, and an account now has 15 followers, it most probably went from 10 to 15 and not from 1 million to 15.
but doing the same over 10000 users or more, it will not work anymore.

but they also say "after perturbing 60% of the training data, it is possible to classify users with an accuracy greater than 95%"
so i may be wrong...

i find the hacking team case more interesting, because an observer might be able to intercept metadata but not the email content, and according to that research even with that you gain a lot of information.
while in the twitter case, as i said, name is there to see for everyone, you don't need metadata.
i would be more interested in an answer to this question:
can they find out if i have multiple accounts?
for example they can see that i have account a with 3 followers and other things... from that info can they find out that i also have account b because the pattern is similar?


anyway there is an interesting definition of privacy:
“Privacy is measured by the information gain of an observer” (Li, Li, and Venkatasubramanian 2007). If an attacker is able to either trace an entry to an account, or an account to a user, that attacker gains information and the user loses of privacy
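
One way to put a number on "information gain" is as a drop in entropy, measured in bits. The sketch below assumes that reading, which may differ from the exact formalization in the cited paper, and the function names are hypothetical. It computes how much an observer learns when going from "any of 10,000 accounts, equally likely" to "exactly this account".

```python
import math

def entropy_bits(probs):
    # Shannon entropy of a probability distribution, in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, posterior):
    # How much the observer's uncertainty shrinks after seeing the data.
    return entropy_bits(prior) - entropy_bits(posterior)

n_users = 10_000
prior = [1 / n_users] * n_users             # observer has no idea who posted
posterior = [1.0] + [0.0] * (n_users - 1)   # observer is certain who posted

# Full identification among 10,000 users is worth log2(10000) bits.
gain = information_gain(prior, posterior)
print(f"{gain:.2f} bits")
```

Under this assumption, each metadata field that halves the set of candidate accounts is worth about one bit, so a handful of moderately distinctive fields is enough to single out one user in 10,000.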

jones • July 30, 2018 11:21 AM

Metadata is more valuable than content in many ways: it's always unequivocal, whereas people speak unintelligibly, in slang, in oblique reference to the offline world, etc.

Humdee • July 30, 2018 11:27 AM

I would point out that the countermeasures they considered were not robust.

Our task is to determine whether doing so is an effective way of protecting user privacy, particularly when obfuscated metadata is released. In this work we focus on two classic obfuscation methods: data randomization and data anonymization.

So they did not look at what efforts an individual user can make to obfuscate their own data, such as using Tor. They simply assumed that the individual data was valid and then looked at ways data supposedly anonymized by the social media provider could be de-anonymized later.

echo • July 30, 2018 12:12 PM

I didn't actually have the first clue what this paper meant. It was pretty indecipherable.

WeskerTheLurker • July 30, 2018 1:21 PM

I understood what the paper was saying, but I got the same impression as @me did - what's the point? If someone has access to this metadata on any OSM site, then they also can just see which account posted it to begin with. It was interesting reading the machine learning bits, but it all seems redundant to me.

Now, if you were able to use it to identify the same user on other social media, e.g. Facebook, Reddit, Tumblr, and so on, that would be substantially more intriguing - and alarming.

-Wesker

Clive Robinson • July 30, 2018 1:35 PM

@ me,

i don't get the point of the research, i think i'm missing something... for example they say that they can identify users from the followers count, account creation time and other fields. Because these informations are included in each tweet.

I'm guessing you've not read very far into the paper. At the start of the second page, halfway down the first paragraph, they say,

    Twitter should be considered only as a case study, but the methods proposed in this paper are of broad applicability. The proposed techniques can be used in several practical scenarios such as when the identifier of an account changes over time, when a single user creates multiple accounts or in the detection of legitimate accounts that have been hijacked by malicious users.

In essence, individuals have styles, habits, and circles of influence that can be recognised by various forms of metadata. Thus these can be measured in various ways and may be sufficient to verify one of the three cases they mention of,

1, Account identifier change.
2, Multiple accounts.
3, Account take over by others.

Thus it could also (4) identify different people using the same account, such as when a person of interest to others actually employs others to run their account.

I can see many reasons why 3 & 4 would be of interest to journalists and the like when an out-of-character tweet is made. Likewise those in the legal profession.

Likewise 2, where a user has a personally identifying account and a second, supposedly anonymous account they may be using for whistleblowing etc.

From what is said in the paper it can be inferred that Twitter was used for the case study because it has a significant amount of metadata attached to each tweet, which in turn allows various types of metadata to be selected for their consistency or sensitivity during testing.

@ All,

The results given in this paper really should be unsurprising to regular readers, as the issue has been previously discussed in terms of identifying network traffic through mix nets etc. I myself have repeatedly pointed out that Tor has significant failings when it comes to Traffic Analysis of Tor traffic, due to the amount of metadata that is allowed to leak for the sake of low latency and other issues.

There are solutions to metadata issues such as Traffic Analysis, but the difficulty is really recognising what form of metadata a hostile observer is using. Worse, trying to obfuscate one set of metadata in and of itself creates other forms of metadata. Thus obfuscation of metadata can be a little like trying to get rid of bubbles when putting wallpaper up: simply pushing down in one place causes other places to rise in predictable ways. The actual solution with bubbles is a pin, such that the trapped air can be reliably removed. Most metadata actually has the equivalent of a pin; however, finding it is usually not that easy.

Humdee • July 30, 2018 1:59 PM

"Worse, trying to obfuscate one set of metadata in and of itself creates other forms of metadata."

Bingo. We have a winner here. Clive's truism is one major reason why the cat and mouse game will never end. We live in a physical universe and everything--at some remove--is connected. So every battle reduces to a contest of deployable resources. Since no one has infinite resources, the net result is a good news/bad news situation. The good news is that everyone has a chance, the bad news is that everyone has a chance.

Miss Cardew • July 30, 2018 7:14 PM

I want to do something different today. I won’t Twitter. Miss Prism, will my metadata be alright ?

justinacolmena • July 30, 2018 8:22 PM

SSL/TLS and similar technologies may hide user authentication and page content, but "What websites do you tend to visit?"

This is highly impertinent personal information which is being cast as "metadata" and cannot be hidden by SSL/TLS alone.

Furthermore this "metadata" is seen as not deserving of a general respect toward human privacy, as if it were somehow not "real" personal data.

justinacolmena • July 30, 2018 8:28 PM

We've already shifted over from saying "the data are" to saying "the data is."

We are not illiterate. This is simply a sign that we have reached the singularity with "Big Data."

It is too highly correlated and interconnected a phenomenon to be construed as plural.

me • July 31, 2018 2:23 AM

@Clive Robinson
actually i have read that part, i have skipped some middle part here and there
> 1, Account identifier change.
> 2, Multiple accounts.
> 3, Account take over by others.

1- that is clear: you change username but account creation time doesn't change, and followers not so much, so you can track a user even if he changes name

2- this is not clear, and it's the more interesting case: how is it even possible to correlate two different accounts from metadata like "account creation time"?
but also other fields...
for example take some "famous" twitter account like "internetofshit" that talks about iot.
he also has a personal account but they are two different things.
one is private so he will have few followers (friends), the other has many random followers.
as i said i have not read the whole paper, read to the end yes, but skipped parts. i would say i have "scanned it" like people do on websites. maybe i should read it again?
but it doesn't seem to explain how this is possible (at a first look)

3- yes, this can help, but look at this case:
https://motherboard.vice.com/en_us/article/vbqax3/hackers-sim-swapping-steal-phone-numbers-instagram-bitcoin
where they stole the "rainbow" account and deleted every single follower.
so you have an account that until day x connected from one place and had a more or less constant number of followers; the day after it has 0 followers and connections from another place. this should be suspicious, but it seems that social networks in general don't care.
there are other cases like this, for example the twitter "n" account.

4- identifying multiple people using the same account is possible, i think, if for example two people are in different states, so ip (not listed in their metadata list), timestamp, and geotagging (if enabled) are obviously different.

Weather • July 31, 2018 3:11 AM

It takes the inverse of the byte table, say you have
0011010
0101001
1001000
But it had each value, like you said climbing a hill, low value go down to the right then left and down,for sha 256 just modified it, why they made sha512 shorten to 256 as a temporary messaure, I don't know how I do half the thing I do, technically I am classed as a retard

albert • July 31, 2018 11:12 AM

I ask the rhetorical question, is it absolutely necessary to use 'social' media like twittre, fascebook, etc.?

Re: 'metadata': It appears that today's metadata is many orders of magnitude greater and more revealing than the telephone records and letters of olde.

Isn't it time to classify metadata as protected and requiring a search warrant for collection?

. .. . .. --- ....

PeaceHead • July 31, 2018 12:34 PM

I'm still reading the article.
The main idea does not surprise me.
The whole internet is pretty much a gigantic datamining operation. Here's looking at you, DARPA: With great power comes great responsibility.
And I didn't get that idea from RT either. (Although respect to them for publishing that possibility also several years ago).

Next topic:

When can we discuss security implications of lookalikes, soundalikes, biometric false positives, and real-life doppelgangers?

I will certainly tune in for that (and hopefully before CRISPR hits the fan).

A lot of us people of this world need to stop fighting each other and consider the 3rd, 4th, 5th, and 6th party instigators and provocateurs who profit from our perpetual disagreements.

The Allies never quite recovered from WWII because of Operation PAPERCLIP.
Everything has a history and a context.
It's time to stop promoting NAZI's to the top of every intellectual and financial food chain.
(just for example)

May Peacefulness Prevail Within All Realms of Existence.
The Golden Rule Does NOT Work With Sociopaths!!!!!!!!!!!!!!

user12099 • July 31, 2018 1:45 PM

Isn't the point of the study not to defend against someone who has access to the platform itself but to remove the veil that the NSA (and others) use:

"We only collect metadata which can't be used to identify you".

If state actors are intercepting packets, not actually hooking into the platform itself, while pretending that metadata is harmless, at the very least this paper suggests otherwise.

David Leppik • July 31, 2018 4:06 PM

This is something the US Census has been dealing with for a long time. Their job is explicitly to make census data as useful as possible while keeping individuals anonymous for 100 years.

They don't provide individual census data, they provide aggregate data down to the "census tract," which is an area roughly comparable to a neighborhood. A tract may be only a block or two in an urban area, or span many miles in a rural area.

For each census tract, they try to identify individuals. For example, there may be only one person of a particular combination of race and color in that tract. They then randomly shift characteristics of that individual to random similar census tracts nearby. For example, if there is only one black person in a county, they may swap that person's race with someone else in another census tract in that county, while keeping that person's income, marital status, etc. the same. They repeat until they can no longer identify any individuals.

Essentially they try to de-anonymize the data, and mess with it (through aggregation and "diffusing") until they can't de-anonymize it.
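
A rough sketch of the swapping idea described above, on toy records. It is only an illustration of the comment: the field names (tract, race, income) and the functions find_unique and swap_uniques are hypothetical, it does a single pass rather than repeating until nothing is identifiable, and it is not the Census Bureau's actual procedure.

```python
import random
from collections import Counter

def find_unique(records, keys):
    # Indices of records whose combination of `keys` occurs exactly once.
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [i for i, r in enumerate(records)
            if counts[tuple(r[k] for k in keys)] == 1]

def swap_uniques(records, field, keys, rng):
    # One pass: for every record that is unique on `keys`, swap its `field`
    # value with a randomly chosen record from a different tract.
    out = [dict(r) for r in records]  # work on a copy, leave the input intact
    for i in find_unique(records, keys):
        others = [j for j, r in enumerate(out) if r["tract"] != out[i]["tract"]]
        if not others:
            continue  # nobody to swap with
        j = rng.choice(others)
        out[i][field], out[j][field] = out[j][field], out[i][field]
    return out

records = [
    {"tract": "A", "race": "x", "income": 1},
    {"tract": "A", "race": "y", "income": 2},
    {"tract": "A", "race": "y", "income": 3},
    {"tract": "B", "race": "y", "income": 4},
    {"tract": "B", "race": "y", "income": 5},
]
swapped = swap_uniques(records, "race", ("tract", "race"), random.Random(1))
```

The swap leaves the overall totals for the swapped field unchanged and keeps each record's other attributes in place, so aggregate statistics stay usable while the one distinctive combination no longer points at the original person.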

Phaete • August 1, 2018 5:51 AM

Nice clear article, i like it.

96.7% seems a lot, but it means that 17 million tweets out of 500 million tweets a day are misattributed. (In reality it will be more, as they only tried a 10k userbase.)

It's a useful tool to include in any forensics, but you need to combine it.
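
The arithmetic behind that estimate, as a small sketch. The 500 million tweets per day figure is taken from the comment above, not from the paper, and applying an accuracy measured on 10,000 users to all of Twitter is an assumption; the function name is hypothetical.

```python
def expected_misses(daily_tweets, accuracy):
    # Number of tweets per day that would not be attributed correctly.
    return round(daily_tweets * (1 - accuracy))

DAILY_TWEETS = 500_000_000  # commenter's figure, not a number from the paper

top1_misses = expected_misses(DAILY_TWEETS, 0.967)    # single best guess
top10_misses = expected_misses(DAILY_TWEETS, 0.9922)  # author within top 10 guesses
print(top1_misses, top10_misses)
```

At 96.7% accuracy this gives 16.5 million misses a day, which rounds to the 17 million quoted; with the top-10 figure of 99.22% it drops to 3.9 million.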

Clive • August 1, 2018 6:49 AM

@Phaete

I don't quite agree with your assessment of the research results. It doesn't mean that 17 million out of 500 million daily tweets are misattributed; it means that the algorithm can't reliably validate the authenticity of the author for 17 million tweets a day.

The obvious use case for this would be in the detection of troll factories. If there are people abusing Twitter to generate, for example, "trending" tweets, then this might be a powerful defense. In that scenario I think the error rate becomes one of false positives [identifying someone as a suspicious user when they are legitimate]. However, the most appropriate use case for the suspicious posts would be human review, so narrowing the field by ~ 97% seems like a good place to start.

Clive Robinson • August 1, 2018 7:10 AM

@ Clive,

As there are two of us named Clive, it would help if you added your last name or another initial to stop others getting confused between the two of us.

echo • August 1, 2018 9:42 AM

@Clive Robinson

There is a big clive on youtube. Security related content posted in squid thread.

Skizzo • August 1, 2018 10:09 AM

@Clive Robinson And what if his last name begins with an R? Or maybe it's even the same as yours?! Maybe you should instruct your followers on how to address you in order to avoid confusing you, as opposed to telling a new visitor they should change their username. Or maybe change your username...Old Clive perhaps?

Phaete • August 1, 2018 11:04 AM

@Clive

Not attributable or misattributed, not much of a difference (in binary both would be 0)
Though useful, i'm not impressed.
Now if they are combining it with browser fingerprinting i'm sure they could do a lot better.

PeaceHead • August 1, 2018 12:17 PM

@David:

Thanks for that point about the census bureau.
There's a great book published in the mid 1990s, yet still relevant.
It's entitled "The Truth About Where You Live". It also discusses the census and how the UK census techniques tend to be more statistically valid. The book successfully defends the case that financial statistics have greater usefulness than linguistic and racial/ethnic categories that anthropologists, biologists, and DNA experts agree aren't even scientifically valid.

Peace be with ya, David.
And you really need a better pseudonym.
Tartan and burberry makes you look like chav.
And you are certainly NOT a chav.
The colour hurts.

Peace be with Camp David, as well.

P.S.-The slogan "If ya can't beat em, join em." didn't work as a 1970s sitcom ending and it doesn't work as a successful modus operandi either. Rachel Barton told my friends that I didn't want to have anything to do with them and then she told me as much. Rachel Barton lied to my friends and alienated them from me. I was busy struggling against a mean culture by then. But I didn't give up on my better characteristics and I didn't give up on my friends either. Hopefully the adversity attacking me didn't get to my friends and it won't in the present nor future.

Rachel Barton never did anything to me nor any of my friends as far as I know.
I have never even ever attended a Rachel Barton concert. I don't like heavy metal nor violins nor violence.

Rachel Barton has nothing to do with this text whatsoever.
Peace be with you Rachel Barton, cellos sound wonderful.

PeaceHead • August 1, 2018 10:09 PM

Back on topic before i shut up for a longer while...

These recent newscasts (they aren't newscasts) discussing the alleged fake Faceblank accounts (no free advertising for them), are dodging a significant issue:

It's been published that there are organizations and individuals for hire who create fake identities and fake datatrails and fake online presences for a few specific purposes:

1) to hide embarrassing online personal data by burying the results with more recent manufactured and/or handpicked and planted results

2) to disguise the identities of anybody who requests and/or requires their identities to be hidden (obfuscated). This is primarily an Intelligence Organization thing, but could also include people within Witness Protection Programs as well as any other type of "undercover" person(s), personas, and personnel.

3) when done en-masse, it might be a standardized technique to shade and veil several members of an organization, rather than just individuals one at a time. Their motivations are not implied nor required to be political at all. The only motive implied is the desire for a manageable synthetic identity and/or the pressurized near-erasure of otherwise embarrassing or disadvantaging online data which is strategically distracted from.

4) In modern times, there are a lot more covert (and overt) intelligence organizations than in the past. This includes a lot of corporate and freelance organizations that are employed for the purposes of corporate espionage and also subcontracted individuals and organizations many of which aren't even necessarily military in origin, yet which utilise some subset of MILINT techniques to accomplish corporate and/or ideological goals. Some of these groups and/or their related organizations could even be religious (or of course political).

5) A few such fake ID's and related online personas and activities and organizations are probably designed to shield legitimately protected individuals and groups from invasion of privacy because such individuals and groups may work with or for or be related to or in company of organizations involved with extremely important and/or high-risk activities such as the types which involve NDA's (Non-Disclosure Agreements) and manifold types of security clearances.

In other words, people who deal with BIG SECRETS and/or BIG RISKS or are merely related to and/or associates of those people would have synthesized and maintained fake profiles and related fake activities and several other facets of faked online (and offline) material.

This is not entirely bad whatsoever.

So for the mainstream media to propagate the misleading premise that it's all Russian Hackers or Iranian Hackers or Chinese Hackers or Election Hackers is totally missing a major technical point.

Similarly, one of the most prominent repeat offenders when it DOES come to election disruptions is typically US Republican extremists combined with some non-extremists who simply want to impose more filters against election irregularities regardless of whether or not those same areas are affected by Gerrymandering. US Republican extremists are notorious for Gerrymandering. It's not even a political controversy, it's part of typical American history and culture. And because of this well-documented phenomenon, it's a scientifically valid claim due to its measurable qualities.

So again, in terms of metadata and "fingerprinting" or "watermarking" of people's online livelihoods, we need to be more inclusive of reality, because the mainstream propaganda machines are not in service of pure rational thought.

In terms of the capability to identify when accounts are hijacked, the science of attribution is still valuable, and metadata is a lot like ergonomics, and therefore falls within the realm of forensics.

So yeah, we are back to law enforcement again.
Happy 110th Birthday FBI.

Statistical analysis of metadata whether by machine and/or by person(s) is still within the realm of forensic sciences. This is not an alien topic to me. Yeah I'm a musician, but I focus on electronic music. In its essence, it's just having fun with dataflow. Of course I think about metadata no matter how little of it I use within my tunes nor how much I spill while downloading and uploading.

And the phone networks run significantly upon the internets as well, and that's not even explicitly including VOIP and Skype. ATM machine metadata could be thrown into that as well as photocopies and printed POS/Visa/Mastercard receipts. This isn't really new stuff.

EVERYTHING PEOPLE DO IS EVIDENCE.
And there's no such thing as the perfect crime.

I'm thankful that we've come full circle yet again.
Sorry I'm so annoying about this stuff.
There's no law requiring that a person who studies communication theory and practice fit the stereotype of a person who studies communication theory and practice.

And I'm not even very good at it either.
I decided a long time ago that I didn't want to get good at anything that I didn't truly want to do or where the occupational hazards were too high.

Sorry to make this stuff so personalised, but alas, that's my datastream, my metadatastream, my deliberate linguistic ergonomics (for now), and my example.

The internetworks are still (d)evolving. It's not like this stuff is all abstract and academic and not happening in realtime as we type and talk just because we find it interesting and/or disturbing.

Thankfully I didn't go to MIT and become a hacker whether malicious or beneficent.
My personality is creative random abstract and while talking about myself seems selfish, it also helps to keep some focus off of talking about specific vulnerabilities to the detriment of the safety of lives.

I've been a "mascot" and a "village idiot" for enough months that a few more hours or days doesn't bother me while people get a clue to back the F off of me. Move aside and let the man go through. Super bon-bon-bon-bon bon.

GNU IS NOT UNIX = G.N.U. and any communication can be fractal as well.
The sooner we get used to unusual and atypical ways of communicating the sooner we won't be as fooled when others do the same.

But I did promise Brexuc that I wouldn't spam the site. So yeah, I'm at my limit. Bye for the next several months. I will still be purchasing "Click Here to..." and I am still against the IoT.

Peace be with all y'all.
Use the word salad carefully. I'm not blind. And I'm not anybody's family dog. What's a pronoun? What's E-Prime? To be or not to be? B-good. And so will I. Surveil the grid vandals so don't get sent up a creek without a paddle.

Nobody asked, so I'll tell ya, I'm not professional. But you already knew that, right?
Everybody has a beginning. Non-affiliated too. This is not my Faceblank.

exit(0) Peace to the future AI's parsing these posts. Good luck, from a humanoid human with a healthy brain and enough intellect to respect the evasive existences of the cephalopods.

ICARUS.

vext • August 4, 2018 11:00 PM

I’m curious as to whether this changes your general opinion or advice re: using WhatsApp or similar apps, (even signal included) given the trust granted to the platform operators and the metadata they still have access to, even if end to end encryption is turned on.


Photo of Bruce Schneier by Per Ervland.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.