Side-Channel Attacks on Encrypted Web Traffic

Nice paper: “Side-Channel Leaks in Web Applications: a Reality Today, a Challenge Tomorrow,” by Shuo Chen, Rui Wang, XiaoFeng Wang, and Kehuan Zhang.

Abstract. With software-as-a-service becoming mainstream, more and more applications are delivered to the client through the Web. Unlike a desktop application, a web application is split into browser-side and server-side components. A subset of the application’s internal information flows are inevitably exposed on the network. We show that despite encryption, such a side-channel information leak is a realistic and serious threat to user privacy. Specifically, we found that surprisingly detailed sensitive information is being leaked out from a number of high-profile, top-of-the-line web applications in healthcare, taxation, investment and web search: an eavesdropper can infer the illnesses/medications/surgeries of the user, her family income and investment secrets, despite HTTPS protection; a stranger on the street can glean enterprise employees’ web search queries, despite WPA/WPA2 Wi-Fi encryption. More importantly, the root causes of the problem are some fundamental characteristics of web applications: stateful communication, low entropy input for better interaction, and significant traffic distinctions. As a result, the scope of the problem seems industry-wide. We further present a concrete analysis to demonstrate the challenges of mitigating such a threat, which points to the necessity of a disciplined engineering practice for side-channel mitigations in future web application developments.

We already know that eavesdropping on an SSL-encrypted web session can leak a lot of information about a person’s browsing habits. Because page requests and page downloads differ in size, an eavesdropper can sometimes infer which links the person clicked on and what pages he’s viewing.

This paper extends that work. Ed Felten explains:

The new paper shows that this inference-from-size problem gets much, much worse when pages are using the now-standard AJAX programming methods, in which a web “page” is really a computer program that makes frequent requests to the server for information. With more requests to the server, there are many more opportunities for an eavesdropper to make inferences about what you’re doing—to the point that common applications leak a great deal of private information.

Consider a search engine that autocompletes search queries: when you start to type a query, the search engine gives you a list of suggested queries that start with whatever characters you have typed so far. When you type the first letter of your search query, the search engine page will send that character to the server, and the server will send back a list of suggested completions. Unfortunately, the size of that suggested completion list will depend on which character you typed, so an eavesdropper can use the size of the encrypted response to deduce which letter you typed. When you type the second letter of your query, another request will go to the server, and another encrypted reply will come back, which will again have a distinctive size, allowing the eavesdropper (who already knows the first character you typed) to deduce the second character; and so on. In the end the eavesdropper will know exactly which search query you typed. This attack worked against the Google, Yahoo, and Microsoft Bing search engines.
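
To make the inference concrete, here is a minimal sketch in Python, assuming the eavesdropper has already profiled the site by replaying candidate prefixes and recording the suggestion-response size for each next character. The profile table, the sizes, and the function names are illustrative, not from the paper:

```python
# Hypothetical sketch of the autocomplete attack: recover a query character
# by character from the sizes of the encrypted suggestion responses. The
# attacker is assumed to have profiled the site beforehand.

def recover_query(observed_sizes, size_table):
    """observed_sizes: encrypted-response sizes, one per keystroke.
    size_table: prefix -> {next_char: expected_response_size}."""
    prefix = ""
    for size in observed_sizes:
        matches = [c for c, s in size_table.get(prefix, {}).items() if s == size]
        if len(matches) != 1:
            break  # size collision or unprofiled prefix: inference is ambiguous
        prefix += matches[0]
    return prefix

# Toy profile: an eavesdropper who saw responses of 812 then 677 bytes
# learns the user typed "ba".
profile = {
    "":  {"a": 730, "b": 812, "c": 694},
    "b": {"a": 677, "e": 701, "o": 745},
}
print(recover_query([812, 677], profile))  # -> "ba"
```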

Many web apps that handle sensitive information seem to be susceptible to similar attacks. The researchers studied a major online tax preparation site (which they don’t name) and found that it leaks a fairly accurate estimate of your Adjusted Gross Income (AGI). This happens because the exact set of questions you have to answer, and the exact data tables used in tax preparation, will vary based on your AGI. To give one example, there is a particular interaction relating to a possible student loan interest calculation, that only happens if your AGI is between $115,000 and $145,000—so that the presence or absence of the distinctively-sized message exchange relating to that calculation tells an eavesdropper whether your AGI is between $115,000 and $145,000. By assembling a set of clues like this, an eavesdropper can get a good fix on your AGI, plus information about your family status, and so on.
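
A sketch of the clue-assembly step: each bracket-specific exchange that appears (or fails to appear) on the wire tells the eavesdropper whether the AGI lies inside that bracket, and intersecting the clues narrows the estimate. Apart from the student-loan bracket mentioned above, the numbers below are made up for illustration:

```python
# Illustrative sketch: narrow an Adjusted Gross Income estimate by combining
# bracket clues, where each clue says whether a bracket-specific message
# exchange was seen on the wire.

def narrow_agi(clues, step=1_000, max_agi=500_000):
    """clues: list of ((low, high), seen) pairs. seen=True means the
    distinctively-sized exchange for that bracket was observed."""
    candidates = set(range(0, max_agi + step, step))
    for (low, high), seen in clues:
        bracket = {a for a in candidates if low <= a <= high}
        candidates = candidates & bracket if seen else candidates - bracket
    return min(candidates), max(candidates)

low, high = narrow_agi([
    ((115_000, 145_000), True),   # student-loan-interest exchange observed
    ((0, 130_000), False),        # some other bracket's exchange was absent
])
print(low, high)  # -> 131000 145000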

For similar reasons, a major online health site leaks information about which medications you are taking, and a major investment site leaks information about your investments.

The paper goes on to talk about mitigations (padding page requests and downloads to a constant size is the obvious one), but they’re difficult and potentially expensive.
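
As a rough illustration of the padding idea (a generic sketch, not the paper’s scheme): rounding every payload up to a fixed bucket size makes responses of similar length indistinguishable on the wire, at the cost of extra bandwidth.

```python
# Generic sketch of size padding: round every payload up to the next
# multiple of a fixed bucket size before encryption, so responses of
# similar length look identical on the wire. A real scheme also needs a
# length field so the receiver can strip the padding, and the bucket size
# trades bandwidth overhead against leakage.

BUCKET = 1024  # arbitrary bucket size

def pad_to_bucket(payload: bytes) -> bytes:
    padded_len = -(-len(payload) // BUCKET) * BUCKET  # ceiling to bucket
    return payload + b"\x00" * (padded_len - len(payload))

print(len(pad_to_bucket(b"x" * 677)))   # -> 1024
print(len(pad_to_bucket(b"x" * 1200)))  # -> 2048
```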

More articles.

Posted on March 26, 2010 at 6:04 AM

Comments

igloo March 26, 2010 6:36 AM

“Plus ça change, plus c’est la même chose!” With direction finding replaced by IP addresses, traffic size and volume have been a significant side channel since WWI. Back then, traffic size and patterns were used to discover troop movements and future attack points. “Those who ignore history are condemned to repeat it,” or something like that.

Clive Robinson March 26, 2010 7:09 AM

Hmm,

How long have I been saying on this blog that “side channels” are the major risk to security?

Now let me think… is it five and a half years?

The big threats to security are not breaking crypto systems but,

1, Side channels (time, size, sequence, temporal).

2, Poor, if not nonexistent, protection against EmSec-style fault injection.

3, Poor, if not outright predictable, random numbers instead of true random data.

Oh and the ones that code breakers do love,

4, Fixed format data in file formats etc.

5, Use of incorrect crypto modes for the storage of data etc.

Oh, by the way, have a look at Peter Gutmann’s work on what he calls “Malware as a Service” (MaaS).

Adam March 26, 2010 9:31 AM

I know it’s a simplistic measure, but you only have to look at your spam and snail-mail junk to see there is a huge amount of data leakage today.

AppSec March 26, 2010 9:32 AM

@Clive:
Nice list. Though I’m thinking 4 is a spin-off of number 1: file formats are just an extension of data size and sequencing.

I am constantly amazed at how “out of the box” people can think and what and when certain pieces of data are relevant.

I mean, if you look at the transmission size, the only time it is relevant is if someone has an understanding of the application. I’m trying to think if there is any value this data has if the person intercepting the data has no concept of what is going on in the application.

Richard March 26, 2010 9:40 AM

This paper is impressive and inspiring, particularly their investigation of those top-of-the-line web applications, and their concluding statement:

The web industry has decisively moved into the era of software-as-a-service. Given this unquestionable context, we envision that research on disciplined web application development methodologies for controlling side-channel leaks is of great importance to protection of online privacy

Clive Robinson March 26, 2010 9:58 AM

Something else that has been coming to the boil for the past few weeks with SSL etc.

Your browser hides the root CA list from you, and this list holds something like 100+ CAs…

Now, something else your browser does not do is tell you when a certificate has changed, or when it uses a new CA, as long as that CA is on the list…

OK most users don’t care, but what about when it is used to spy on you?

Have a look at,

http://www.crypto.com/blog/spycerts/

Robert March 26, 2010 10:12 AM

It’s strange how much metadata and side-channel attacks have in common. The success of both relies on understanding the nature of the data being requested: things like data formats, timing relationships, and user requirements for the data. What’s also surprising is the number of IT professionals who simply don’t “get it”.

I remember a friend back in the dotcom boom days who ran a blog filled with rumors about potential high-tech M&As (takeovers). The guy made a fortune by doing reverse pings on search queries where the IP address pointed back to the legal departments (representatives) of companies involved in takeovers. The reason it worked is that the lawyers would search his site to see if there had been any information leaks (insiders spreading rumors). But it turns out that the manner in which most lawyers searched and documented their results (due diligence) differed significantly from the average user’s site-use “signature”. So the “metadata” (how they used the site) confirmed that they were lawyers, and the reverse ping confirmed that they were with the companies they had made queries about. Together, this data accurately pointed him to companies where an M&A was occurring. The dates of the queries also gave him accurate information on when to expect the announcement.

What’s strange is that as it became very clear he was getting tipped off on mergers, the lawyers doubled their web-search efforts to identify any possible leaks.

In the process of trying to secure all communications, the bankers and lawyers developed a system designed to upgrade and harden the companies they were working with. They called in IT security professionals to lock down their own email systems, and all kinds of VPNs got set up and tested between the banks and their customers. What nobody seemed to understand was that the process of making these changes (in and of itself) creates an identifiable web signature. So in all probability this process “signature” was also being tracked and used as actionable intel.

arctanck March 26, 2010 11:51 AM

I had heard about such attacks, but didn’t really understand how they were done. The papers are very well written.

Does this mean that people living in China who connect to proxy servers via VPN to do their searches, and think that they are safe from government or ISP monitoring, are also vulnerable?

Most ordinary Chinese users will probably not be too bothered. But those who want to search for and look at government-censored material should rightly be concerned.

Peter A. March 26, 2010 11:53 AM

@Clive Robinson:

That’s why I memorize a few digits of the fingerprints of important sites’ certificates (banks etc.) and check them every time I log in. It’s only two clicks away. I also happen to remember which CA issued the certificate and roughly when the certificate is going to expire. When anything changes, I get to double-check it.

Once, a bank changed its cert in the middle of the validity period, to one from a different CA, for no apparent reason. It happened to be legitimate, but without the precautions I take I wouldn’t have noticed, as the browser doesn’t alert you about it.

While it doesn’t protect from all malware and fraud schemes, I think it is a habit worth developing.
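
Peter A.’s habit can also be scripted. A minimal sketch using Python’s standard library to fetch a site’s certificate and print a SHA-256 fingerprint for comparison against a previously noted value (the host name is just an example):

```python
# Sketch of Peter A.'s check in script form: fetch a site's certificate and
# print its SHA-256 fingerprint, to be compared against a value noted down
# earlier.
import hashlib
import socket
import ssl

def cert_fingerprint(host: str, port: int = 443) -> str:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)  # certificate in DER form
    return hashlib.sha256(der).hexdigest()

print(cert_fingerprint("www.example.com"))  # compare to the noted fingerprint
```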

Brian March 26, 2010 11:54 AM

I never knew that encrypting an “A” vs. a “B” would give a different number of bytes…

I can see that for a response. Say Google always returned the same top-five suggestion list for an “A”; you could infer the response. But that could be thwarted with personalized results: a list drawn from recent search history would effectively be useless to the attacker. Of course, random-length padding would eliminate that problem, and probably wouldn’t be very costly at all.

I don’t see that working over Wi-Fi, though, as you don’t know what the destination is.

@Robert

That’s pretty epic. It’s something I have never thought about before. It’s why I read this blog: to get fantastic new insights like that.

JohnJS March 26, 2010 12:39 PM

@ Peter A.
If you use Firefox, the “Certificate Patrol” add-on accomplishes the same thing.

DC March 26, 2010 1:09 PM

Anything that makes software-as-a-service and cloud computing less desirable is just fine with me. The first is an attempt to get paid forever for one-time work, which is not terribly moral. The second, more often than not, is simply an attempt to be paid forever for knowing slightly more than a customer might.

Replace that “might” with “should”….

I find it the height of folly to trust those guys with anything important, and if it’s not important, why do it at all? If they mess up in some way, you don’t even have a particular person’s butt on the line (someone you can fire); you’d just get shunted from call center to call center with your complaint falling on the floor. So you’d have to have a local backup anyway, just to prevent loss, and that doesn’t address security at all. If you’ve got to do that, then I fail to see any advantage to the “cloud” for nearly all uses.

moo March 26, 2010 1:23 PM

@Robert: Very interesting!

@Brian: It might help to think of it as a honeypot for lawyers involved in M&A…

Clive Robinson March 26, 2010 2:31 PM

@ AppSec,

‘Though I’m thinking 4 is a spin-off of number 1: file formats are just an extension of data size and sequencing.’

Sort of. The difference is that 4 & 5 actually come into play when looking into the encrypted data itself, as opposed to looking at a block of encrypted data from the outside.

An example being an MS Word or PDF file: certain bytes stay the same or have a very, very limited range. Thus if the file is encrypted byte by byte (stream cipher etc.) without chaining, the keystream can be found. As we know from WEP, that can be fatal, in that the state of the S array can be worked backwards…

Worse, if two copies of the same file are stored on a thumb drive against the same keystream and starting point, but differ partway through, you can easily work out not only what the plaintexts are but also a big chunk of the keystream from that point, and in some cases “saw buck” your way backwards from the context…
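
A toy demonstration of the keystream-reuse problem Clive describes: fixed format bytes reveal keystream bytes directly, and XORing two ciphertexts encrypted under the same keystream cancels the keystream entirely. The keystream and “files” below are invented for the example:

```python
# Toy demonstration of keystream reuse: fixed header bytes reveal the
# keystream at those positions, and XORing two ciphertexts made with the
# same keystream cancels it, leaving the XOR of the plaintexts.
import itertools

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

keystream = bytes(itertools.islice(itertools.cycle(b"\x5a\xc3\x99\x17"), 64))

doc_v1 = b"%PDF-1.4 secret draft: offer $10M"  # fixed-format header
doc_v2 = b"%PDF-1.4 secret draft: offer $12M"  # same file, edited later

ct1, ct2 = xor(doc_v1, keystream), xor(doc_v2, keystream)

# Known header bytes recover the keystream for those positions:
print(xor(ct1[:8], b"%PDF-1.4") == keystream[:8])  # -> True

# The keystream cancels across the two ciphertexts:
print(xor(ct1, ct2) == xor(doc_v1, doc_v2))        # -> True
```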

‘I am constantly amazed at how “out of the box” people can think and what and when certain pieces of data are relevant.’

Simple rule: if you don’t know whether a piece of data is relevant or not, assume it is until you know definitely otherwise (i.e., better safe than sorry).

‘I mean, if you look at the transmission size, the only time it is relevant is if someone has an understanding of the application.’

It is usual to assume your enemy “knows the system”, but even when they don’t, they can cross-reference a user’s previous response times to take a guess at whether it’s text or a picture, etc. Repeated blocks that do not change size might well be logos, etc.

Look at it as using “traffic analysis” to enumerate the system; regularity is the friend of your enemy. And as we know, with CSS etc. you can expect a lot of regularity in a site’s web pages. So even if an attacker only gets an error page, they might learn more than you expect.

‘I’m trying to think if there is any value this data has if the person intercepting the data has no concept of what is going on in the application.’

In a one-off use, no; but if they observe repeated uses, they can build up patterns that would reveal departures from the usual activity.

There is a story about how all the pizza delivery boys around the Pentagon knew the first Gulf War was about to kick off, simply from the times at which people were ordering pizza and the quantities involved.

All unknown information will become useful with time, if only to show it’s irrelevant to what you are looking for.

privy396 March 26, 2010 6:30 PM

Interesting work!
However, I am really wondering to what extent the “query word leaks” attack works. I wish they had done some more tests and provided experimental results.

There are at least two scenarios I can think of where the attack does not work well:
(1) Google signed-in users get personalized suggestions (from their web history)… and these entries would be hard(er) to predict (personalization, in this case, helps privacy ;-))…
(2) If a user types quickly, the number of AJAX requests can be reduced (i.e., one request might cover 2-3 letters)… and this, again, will make the guessing more difficult!

There is recent work related to Google’s query suggestions… it shows how a user’s search history can be inferred from his web searches, and more…
Have a look at “Private Information Disclosure from Web Searches (the case of Google Web History)”, available at:
http://planete.inrialpes.fr/projects/private-information-disclosure-from-web-searches/

Mr. March 26, 2010 7:13 PM

“Because page requests and page downloads differ in size, an eavesdropper can sometimes infer which links the person clicked on and what pages he’s viewing.”

So we could even identify which sites were browsed over a VPN, not only by download size but also by timings, hits on site counters, etc.

Mr. March 26, 2010 7:19 PM


And we could also identify which sites were browsed by which computer inside the VPN, from dependencies on processor power, browser type, hard-drive caches, etc.

Robert March 26, 2010 9:57 PM

@moo
You’re correct that he intentionally created a honeypot for M&A lawyers, but this is only half the story. What is more important, to my mind, is that the very process by which we gather and verify information (“due diligence” in lawyer-speak) leaks the exact information we are trying to protect.

I can suggest a dozen ways to use proxies to hide the site making the query, or even to create “misinformation” pointing to a competitor, BUT none of this would change the basic use signature of a lawyer gathering and documenting “all” the information at a particular point in time. This unusual site-use signature will always stand out. What is also important is that I don’t necessarily need to be running the “honeypot” site to observe this unusual site-use signature. There is a lot of information buried in Google Analytics that is only understood (and actionable) when viewed in the context of a given company’s current business activities.

Similarly, when external security experts are called in to log all the emails etc. and “harden” all IT systems ahead of a takeover, it does not take a very bright IT worker to realize that these companies usually announce a takeover a month after his involvement. Also, what happens when all the net bots that have been established in a given company suddenly get shut down? There is always information leaked by this change in the status quo.

My basic point is that there is actionable information in the very process by which a business plan progresses, and no amount of fancy security can hide it.

parnoid March 27, 2010 4:15 AM

Remove all CAs from browser #1, the one that is to be secure. Then use another browser, #2, with its own CA list, to get the cert. Then take that cert and put it into browser #1. One is either compromised or not. One can also compare fingerprints with other sources (people) in radically different locations.

Nick P March 28, 2010 1:39 PM

@ paranoid

That solution has no value. CAs are standardized to the point that retrieving and verifying one is easy enough. The real problem is that SSL and a valid certificate don’t prove anything about your security. That problem requires more difficult solutions and tradeoffs: a performance, usability, or application-availability tradeoff is common with medium- to high-assurance software. Just look at the fact that there is a Linux-compatible “trusted” OS, XTS-400/STOP, yet the DoD buys Windows and Linux machines (read: holes through and through) in bulk and only uses the STOP OS for guards. So not only are SSL/certs of very limited use, but the better options aren’t very marketable.

GregW March 29, 2010 1:30 PM

I always wondered whether SSH (when using a public/private key pair with a password for server authentication), and specifically PuTTY, leaked a similar sort of timing information about keystrokes when entering the SSH/server password, since that always occurs at the beginning of a session, and I opened dozens of sessions per day over a period of many years, each time typing the same password.

(If there was a packet sent per keystroke, the timing of the packets would set some bounds on the combinations of keystrokes potentially involved in the password given some knowledge of either generic, or user-specific, typing at normal/fast speeds.)

I figured it was an obvious enough attack that brighter minds than mine had dealt with it, but reading this discussion of SSL vulnerabilities (admittedly involving word suggestions, which don’t occur in password scenarios), I’m less confident. I just looked up the issue, and it looks like the vulnerability may vary per implementation (RFC 4251, section 9.3.9), rather than being prevented by the protocol itself. Does anyone know more about this topic?

(Now that I think about it, knowledge about a typist and their speed/patterns is now more visible on the network, with these per-character AJAXy apps everywhere. This information, gathered from non-secured network communications or from the SSL side channels described in this paper, could be combined and then used to attack your SSH passwords, since the network knows more about you as a typist.)

I never did run across an SSH client that would buffer input lines the way one old telnet client I used to use did, but that would have been a handy countermeasure to this sort of threat.
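
For what it’s worth, the measurement step GregW describes is straightforward. A sketch with invented timestamps: in interactive mode SSH sends roughly one packet per keystroke, so packet timing yields inter-keystroke latencies, which published work (Song, Wagner, and Tian, 2001) uses to rank likely key pairs:

```python
# Sketch of the timing measurement: gaps between per-keystroke packet
# timestamps approximate inter-keystroke latencies. Timestamps are invented.

timestamps = [0.000, 0.142, 0.275, 0.521, 0.640]  # packet capture times (s)

intervals = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
print([round(i, 3) for i in intervals])  # -> [0.142, 0.133, 0.246, 0.119]

# An unusually long gap often marks an awkward key pair (e.g. a shifted
# character), which shrinks the search space for the typed password.
```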

PeterW April 15, 2010 9:55 AM

@GregW use SSH public key auth and disable password-based auth if you care about your security.
