Identifying People by Their Browsing Histories

Interesting paper: “Replication: Why We Still Can’t Browse in Peace: On the Uniqueness and Reidentifiability of Web Browsing Histories”:

We examine the threat to individuals’ privacy based on the feasibility of reidentifying users through distinctive profiles of their browsing history visible to websites and third parties. This work replicates and extends the 2012 paper Why Johnny Can’t Browse in Peace: On the Uniqueness of Web Browsing History Patterns[48]. The original work demonstrated that browsing profiles are highly distinctive and stable. We reproduce those results and extend the original work to detail the privacy risk posed by the aggregation of browsing histories. Our dataset consists of two weeks of browsing data from ~52,000 Firefox users. Our work replicates the original paper’s core findings by identifying 48,919 distinct browsing profiles, of which 99% are unique. High uniqueness holds even when histories are truncated to just 100 top sites. We then find that for users who visited 50 or more distinct domains in the two-week data collection period, ~50% can be reidentified using the top 10k sites. Reidentifiability rose to over 80% for users that browsed 150 or more distinct domains. Finally, we observe numerous third parties pervasive enough to gather web histories sufficient to leverage browsing history as an identifier.
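To make the uniqueness figure concrete, here is a minimal sketch of how "distinct profiles, of which N% are unique" could be computed. It assumes a profile is simply the set of domains a user visited; the function name `unique_fraction`, the `top_sites` parameter, and the toy `.example` domains are all hypothetical and not taken from the paper.

```python
from collections import Counter

def unique_fraction(histories, top_sites=None):
    """Fraction of distinct profiles that belong to exactly one user.

    histories: dict mapping a user id to the set of domains visited.
    top_sites: optional set; if given, profiles are truncated to it.
    """
    profiles = []
    for domains in histories.values():
        # A profile is the (optionally truncated) set of visited domains.
        profile = frozenset(
            d for d in domains if top_sites is None or d in top_sites
        )
        profiles.append(profile)
    counts = Counter(profiles)
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)

# Toy data: u2 and u3 share a profile, so it is distinct but not unique.
histories = {
    "u1": {"news.example", "mail.example", "shop.example"},
    "u2": {"news.example", "mail.example"},
    "u3": {"news.example", "mail.example"},
    "u4": {"news.example", "wiki.example"},
}

print(unique_fraction(histories))
print(unique_fraction(histories, top_sites={"news.example", "mail.example"}))
```

In the toy data, truncating to two top sites collapses u1 into the profile shared by u2 and u3, so uniqueness drops. The paper's finding is that with real histories this drop is small even at 100 top sites.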

One of the authors of the original study comments on the replication.

Posted on August 25, 2020 at 6:28 AM • 20 Comments

Comments

echo August 25, 2020 7:07 AM

This paper sounds pretty much on the ball. I know I stand out like a sore thumb. While laws exist in theory, in practice I find evasion seems more the norm than not, whether at technical, policy, or governance levels. It’s very exhausting dealing with this.

Phaete August 25, 2020 9:12 AM

People are ‘creatures of habit’
Put a GPS tracker (phone) on 50k people and you can do the same with their position history.

Record their communications and you can identify them by speaking patterns.
Record their faces and you can identify them
Record their biometrics and ….

All just variations on the theme of “collect enough about feature X to distinguish individuals”.

It’s intrinsic, the features that make us “unique” can be used to distinguish us.

I’m far more concerned about HOW that data gets used than the fact that collecting the data is possible.

joker67 August 25, 2020 11:12 AM

I’m far more concerned about HOW that data gets used than the fact that collecting the data is possible.

Well, that data shouldn’t be collected in the first place.

Even if it is possible. Even if the data weren’t analysed.

The data shouldn’t be stored, because, as the title says: we have the right to browse “in peace”. When I read a book, it doesn’t store the pages I looked at. Nor should websites.

Johnny August 25, 2020 12:08 PM

The easiest way to combat uniqueness is obfuscation and noise.

The only step forward with privacy is obfuscation. Collection is clearly already compromised, so unpatterned noise must be added to the collection process.

j.c. August 25, 2020 1:39 PM

@Phaete

People are ‘creatures of habit’

Plenty of “bad habits” and “vices” browsing online, for sure, but not just that. People develop “routines” and an “economy of thought” for dealing with the constant overload of information online, including obnoxious and unwanted marketing.

Fix a hot beverage like tea or coffee or cocoa in the same way at the same time every morning, that sort of thing. Commute by car or bike or walk the same way to and from work every day. These habits too are vices.

People are stuck in a rut, trapped by their own routines.

A lot of it is news. People check the national news, local news, news in their specific field of work. Their ways of thinking become old, and they need to learn new ways of thinking to save their own lives.

Michael S August 25, 2020 2:53 PM

The easiest way to combat uniqueness is obfuscation and noise.

It is one of the ways.

In case of Internet browsing, using Tor (www.torproject.org) is a good defense against identifying people, because a new circuit is used to access each domain, and it changes every 10 minutes.

David Leppik August 25, 2020 3:04 PM

Since it is based on only 14 days’ worth of data, it is impossible to say how stable the browsing history is over time. I suspect it is fairly stable over the course of a year, but we don’t know from this.

A typical threat model is that after a tracker is deleted, a company such as Alphabet or Facebook creates a new tracker and links the old one to the new one. The paper suggests that this can be done half the time after 50 distinct domains are visited, and 80% of the time after 150 domains are visited. But that’s assuming there are only 50,000 people.

The conclusion is based on a Monte Carlo simulation of re-identification attempts. Crucially, they define re-identification as finding a unique match between two known users. That is, they ignore the possibility of a false positive matching a known user to a new user for whom they have no history. The fewer the users, the less it takes to find a unique match. As they point out, if they only have 1 user, it automatically matches.
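A rough sketch of what "re-identification as a unique match between two known users" can look like in code. This is an illustration under stated assumptions, not the paper's procedure: it assumes Jaccard similarity between domain sets as the comparison, and the names `jaccard`, `reidentify`, and the toy domains are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def reidentify(week1, week2):
    """Fraction of week-2 users whose single best week-1 match is themselves.

    A tie for the best score counts as a failure, because the match
    is then not unique.
    """
    correct = 0
    for user, profile in week2.items():
        scores = {cand: jaccard(profile, hist) for cand, hist in week1.items()}
        best = max(scores.values())
        winners = [cand for cand, s in scores.items() if s == best]
        if winners == [user]:
            correct += 1
    return correct / len(week2)

week1 = {
    "a": {"news.example", "mail.example", "shop.example"},
    "b": {"news.example", "mail.example", "wiki.example"},
    "c": {"news.example", "video.example"},
}
week2 = {
    "a": {"news.example", "mail.example", "shop.example", "extra.example"},
    "b": {"news.example", "mail.example"},
    "c": {"news.example", "video.example", "wiki.example"},
}

print(reidentify(week1, week2))
```

Here user b's second-week profile ties between a and b, so only two of three users are re-identified. Note that every candidate in this sketch is a known user; a newcomer with no first-week history would still receive a "best match", which is the false-positive case described above.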

That’s not true re-identification, nor do they claim their results scale to millions of users. However, their data is suggestive. They say “Roughly speaking, a 10-fold reduction in the number of users increases reidentifiability by 10%.” Conversely, if they increase the population to 10 million, that 50%/80% re-identification rate should drop to 30%/60%.
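The extrapolation above can be checked with a few lines of arithmetic. This sketch assumes the quoted "10%" means 10 percentage points per factor of ten in population, and that the rule keeps holding well outside the studied range; both are assumptions, and `extrapolate` is a hypothetical helper.

```python
import math

def extrapolate(rate_pct, base_users, target_users, points_per_decade=10.0):
    """Apply a 'N percentage points per 10x users' rule of thumb.

    Growing the population lowers the rate; shrinking it raises the rate.
    """
    decades = math.log10(target_users / base_users)
    return rate_pct - points_per_decade * decades

# Scale from ~50,000 users to 10 million (a 200-fold increase).
for rate in (50, 80):
    print(rate, "->", round(extrapolate(rate, 50_000, 10_000_000), 1))
```

A 200-fold increase is about 2.3 factors of ten, so the rule gives roughly 27% and 57%, which the comment rounds to 30%/60%.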

So it’s far from perfect. That said, the real threat model is that someone could be identified using browser history along with geolocation, OS/browser version, and other clues. In that case it’s perfectly adequate, especially for advertising where they don’t really care about false positives. More nefarious uses, such as population-wide surveillance (e.g. Uyghurs in China), may not match these results due to different websites being available.

But consider if domain names couldn’t be used to identify individuals. That would imply that all the individuality of our web browsing would be within a small number of domains. That is, I can make myself less distinctive by not visiting Schneier.com, and instead get my security news from Facebook.com. That scenario would simply increase Facebook’s ability to track me, while cloaking me from trackers that Facebook denies.

lurker August 25, 2020 3:55 PM

The original 2012 paper used user-agent strings to identify the OS and browser they wanted to analyze. This newer paper is targeting Firefox. My maths are not good, but it looks like a biassed sample selection…

Clive Robinson August 25, 2020 6:37 PM

@ Johnny, Michael S,

The easiest way to combat uniqueness is obfuscation and noise.

It’s also the least reliable.

Noise that is not synchronous to the desired signal averages out extremely quickly.

Depending on what precisely you mean by “obfuscation”, it likewise can be stripped off.

The way to combat the uniqueness of habits, is not to have recognisable habits. Then whilst everything is unique it has no auto-correlation function thus the signal is noise and if averaged rapidly collapses towards zero as noise does.
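The averaging argument can be illustrated with a toy simulation. It assumes a user who visits the same few habitual domains every day and adds randomly chosen decoy visits as noise; the function `observed_days`, the pool size, and the `.example` domains are all hypothetical choices for illustration.

```python
import random
from collections import Counter

def observed_days(habits, decoy_pool, days, decoys_per_day, seed=0):
    """Count on how many days each domain was seen by an observer."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    counts = Counter()
    for _ in range(days):
        # Each day: the habitual domains plus fresh random decoys.
        visited = set(habits) | set(rng.sample(decoy_pool, decoys_per_day))
        counts.update(visited)
    return counts

habits = ["news.example", "mail.example", "forum.example"]
decoy_pool = [f"decoy{i}.example" for i in range(1000)]
days = 14

counts = observed_days(habits, decoy_pool, days=days, decoys_per_day=20)

# Domains seen on every single day stand out from the random decoys.
recovered = {domain for domain, n in counts.items() if n == days}
print(sorted(recovered))
```

Even though decoys outnumber habitual visits each day, the decoys rarely repeat, so counting across days recovers the habits. Noise that is correlated with the habits, or habits that do not repeat, would be harder to strip this way.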

But at the end of the day the best way is not to provide any kind of output. So no communications direct or otherwise means no signal and no noise, so nothing to measure or apply signal processing or AI numeric techniques to…

Are there ways to have the benefits of having habits but not generate a signal?

The simple answer is yes you get the sources to use some kind of “Broadcast model”. In the past I’ve discussed how to do some of it as part of a “Fleet Broadcast” system.

However Tor is not suitable for this for a number of reasons, and with certain elements trying to become hundreds of “exit nodes”… It is clear that they have a confident way to gain advantage of being in control of such exit nodes…

Singular Nodals August 25, 2020 7:17 PM

How you browse …

… and how you game

arstechnica.com/gaming/2020/08/sony-could-detect-playstation-users-based-on-how-they-hold-a-controller/

echo August 26, 2020 7:54 AM

@Clive

But at the end of the day the best way is not to provide any kind of output. So no communications direct or otherwise means no signal and no noise, so nothing to measure or apply signal processing or AI numeric techniques to…

Are there ways to have the benefits of having habits but not generate a signal?

The simple answer is yes you get the sources to use some kind of “Broadcast model”. In the past I’ve discussed how to do some of it as part of a “Fleet Broadcast” system.

So basically pay cash in your local highstreet and watch television. If not possible some form of “cut out” would be required like a local payment processor or a distributed technology universal local cache.

Most of the silliness is, as far as GDPR is concerned, unlawful. Even if GDPR was strictly adhered to by default, any nation state would be able to monitor transactions and traffic, grab this by the backdoor, and feed the espionage back to its own corporations. I have no idea what the exact ratio is between defence and espionage but suspect espionage is at the higher end of expectations more than the lower end. The fact most consumer space surveillance is “legal” (for definitions of legal) doesn’t mean there isn’t a permissive link from somewhere at the top.

It naturally follows that a faux meritocracy built on “fiscal led policy” which subverts human rights and the public interest is by definition a security problem. Hairy terrorists with AK-47s make a nice distraction from the Savile Row suits with grey hair who are very camera shy. But who does more damage?

Of course, if you’re rich you can afford to buy the services of a “cut out” and obfuscate even more with a reassuringly high service charge to avoid giveaway variable billing charges. Do such services exist? I have no idea.

If we’re being silly I suppose you could hire a stooge from the local “alibi” company to provide a “legend”.

Michael S August 26, 2020 8:50 AM

@Clive Robinson

However Tor is not suitable for this for a number of reasons

Why? I thought that using Tor with JavaScript disabled was a good idea to significantly lower the risk of being tracked and profiled.

It anonymizes the IP address. It blocks tracking cookies. It assigns a separate, randomly generated circuit for every domain the computer connects to. It changes the circuits every 10 minutes. In addition, disabled JavaScript makes it impossible to use tracking scripts on websites.

Is the issue of compromised exit nodes a problem here?

too friendly neighbour August 26, 2020 9:21 AM

@Michael S

Imagine that you are live broadcasting videos. And every time you stream, there is only one user at a time watching, regularly showing up at 6 pm and leaving at 7 pm. The user name is always different.

Even though I have no proof that it’s always the same person or program enjoying your videos, greeting “welcome back” would most likely be a correct statement.

Tor is great if you hide among many many persons. If you go to sites which are not frequented that often or if you go to very specific pages, the connection would still show some uniqueness.

c1ue August 26, 2020 10:10 AM

I’m shocked, shocked to find that gambling is going on in here!
If the majority of people are uncovered by their browser setup (a la amiunique.org), adding in browser history just makes it even more accurate.
As a digital forensics practitioner – the cookies are more than enough. That’s what Google uses anyway to fingerprint you.

Mr. McGuire August 26, 2020 11:05 AM

  • I want to say one word to you. Just one word.

  • Yes, sir.

  • Are you listening?

  • Yes, I am.

  • Traffic analysis.

Clive Robinson August 26, 2020 3:22 PM

@ echo,

Do such services exist? I have no idea.

Oh yes they exist alright, not only do you pay them but usually they get “trade discounts” and the like you don’t.

The more upmarket ones call themselves “Personal Shoppers”.

But due to COVID there are a whole bunch of people doing all sorts of shopping for other people in the name of “helping out” they do “shopping runs” for people with disabilities or other reasons why they can not leave home.

But whilst there are big Internet companies now doing “Personal Shopping” for people and they keep all sorts of records, there are others that are local and are more than happy to take cash.

A friend has a cleaner and she is more than happy to buy stuff for him on her credit card as she gets points and cash from him and she’s also kind enough to get some stuff for me and others. She’s joked about where she’s going to go on holiday with the cashback etc.

So yeah, ask around; you might be quite surprised who will help you, and others will do it for a small payment.

But failing that it’s actually not that expensive to set up a limited company in the UK; the down side is all the crap paperwork you have to do that goes with it. Once you have a company the biggest hurdle is getting it a bank account, which is why PayPal has been pushing its highly dodgy services at start up sole traders and the like. Based on PayPal’s behaviour to date that is a major disaster building…

But interestingly, when you have a business you can look at other jurisdictions. Eire looks favourably on small businesses and so does Estonia, which also issues e-IDs that are effectively the equivalent of an EU member passport.

Though the ultimate “no questions asked” financial vehicle has to be a UK Limited Liability Partnership. These were set up at the request of Accountants and Lawyers so they could have partnerships but without the “total liability” aspect partnerships usually have. Needless to say, if you can get to see “The Panama Papers” or similar, such as Private Eye’s rogues’ gallery of property ownership, the number of LLPs is surprisingly high, and the number of those that claim not to be “actively” doing business in the UK is again surprisingly high.

At one time the Isle of Man was a very useful place, but they closed that to “new entrants”; the Channel Islands similarly, which also had a useful VAT back door on items below 20GBP (supposedly a concession to flower sellers, but abused really hard by companies selling the likes of entertainment CDs and DVDs a few years back).

There are many many ways for those “in the know” the hard part is “being in the know”. What you need to know though is that such systems are being more and more restricted to “new entrants” so off shore trust funds before a certain date carry on as is you just can not start new ones.

A prime example: back in the 1980s a UK Building Society was opening card accounts without any proof of ID. You could also get a UK driving licence back then, again without proof of ID, and likewise getting a new national insurance number in a new name was of no difficulty. I know of several people that did it for legitimate reasons (getting away from abusive partners etc). These days you can not “get away” that way any more, which might account for the increase in people suffering fear and having to hide other ways. Which is why there are some charities that help such people lead a “cash only” life style and give them postal addresses at the other end of the country etc.

Clive Robinson August 26, 2020 3:38 PM

@ Michael S.,

Is the issue of compromised exit nodes a problem here?

Yes they are a problem, for one thing they can do a “Man in the middle” attack on your HTTPS and get to see the plaintext.

Which is possibly why a certain criminal element is trying to put up as many Tor exit nodes as they can (they are related to cryptocurrency theft).

But there are a whole load of other issues involved not least is “low latency” and no “store and forwards” nodes, oh and clients and servers are both outside the network making them easy to identify.

@ Mr. McGuire,

Traffic analysis.

Yup there’s that, as well as any traffic going to Tor input nodes is painting a big fat target on your back. It’s almost the same as sneaking around at night in black plimsolls, black trousers, a black mask, a black and white striped jumper, and carrying a big black bag over your shoulder with written across it in white “SWAG”… It’s just asking to be investigated…

echo August 26, 2020 8:31 PM

@Clive

Ah, yes. I’m aware of 90% of that.

Some loopholes are caused by the “illusion of certainty”. That is, a point of view is formed, sometimes not on the full evidence, by self-proclaimed tough talking and decisive people of action who then go all in. People mock MBAs and the like but they are not the only ones who do this by a long chalk. This means various legislators and judges, many of whom may have retired or gone on to other things, leave this behind and the unthinking machine of state carries on like a mechanical toy.

One difference between the US and UK is the US tends to go more “all in” on institution and corporation building, hence the CIA and US Marine Corps. The UK tends to smaller structures which rely more on ad-hoc, so on a mission to mission basis achieves similar or sometimes superior results for the same or lower cost.

I’m lazy and have ethics which explains in part why I am not rich.

John August 31, 2020 8:45 AM

@joker67,

When I read a book, it doesn’t store the pages I looked at.

Unless, of course, it’s an e-book read on an Amazon device.

Stuart August 31, 2020 10:00 AM

Well, that data shouldn’t be collected in the first place.

Even if it is possible. Even if the data weren’t analysed.

The data shouldn’t be stored, because, as the title says: we have the right to browse “in peace”. When I read a book, it doesn’t store the pages I looked at. Nor should websites.

I want my user-agent to collect this data for me, and store it in the cloud. I find Google’s Timeline extremely useful, as I do their search history feature. If it were possible to unobtrusively log which pages of which books I read at which time, I would definitely want that.

The problem is not with collecting the data and making it available to the person who the data is about. It’s about giving access to that data, or derivatives thereof, to anyone who comes with either advertising money or search warrants.

