Data Leakage from Encrypted Databases

Matthew Green has a super-interesting blog post about information leakage from encrypted databases. It describes the recent work by Paul Grubbs, Marie-Sarah Lacharité, Brice Minaud, and Kenneth G. Paterson.

Even the summary is too much to summarize, so read it.

Tags: academic papers, databases, encryption

Posted on March 1, 2019 at 5:59 AM • 12 Comments

Comments

Jon • March 1, 2019 8:37 AM

Fascinating article.

Had a thought, though, about implementing ‘fuzzy’ database records as a counterattack.

Imagine:
a) The database contains some small percentage (5%?) of records that are ‘fuzzy’ records – they will occasionally (not always!) return from unrelated queries.
b) Query specifics are encrypted (e.g. range), and identical plaintext queries are NOT encrypted into the same ciphertext!
c) The client, when decrypting the returned records, can easily recognize a ‘fuzzy’ record and discard it.

Eventually an attacker could by probability determine which records are ‘fuzzy’ and ignore them, but the hope here is that by adding a small amount of ‘fuzzy’ records it makes the number of samples required very much larger for the same error rate.

And, of course, there’s overhead.

Anyhow, I’m very sure I don’t know how to do the math on this idea. Have fun.

Phaete • March 1, 2019 8:48 AM

Good stuff.

It’s refreshing to say that he has a good writing style.
Good explanations, some humour to keep awake and politically neutral on a topic like this.
I couldn’t find much wrong with the article, just good stuff.

Impossibly Stupid • March 1, 2019 11:43 AM

It’s a well-written article, but it probably isn’t anything new to most people here. It basically boils down to the fact that you can do signal analysis on the available metadata. The more easily you can correlate the query with the response, the more easily you can nullify the value of the encryption. Keeping “live” data secure is inherently a hard problem.

AJWM • March 1, 2019 12:57 PM

@Jon

Not quite the same idea, but I advocate lying, er, making creative entries on forms whenever possible (and when not screwing up the reason you’re filling in the data in the first place). For instance, websites (hello, Facebook) don’t need to know my real birthday. At best they need to know that I’m over 13 or 18. Fuzz the data yourself.

Obviously, not applicable where prohibited by law. Don’t lie on your tax returns, folks.

RealFakeNews • March 1, 2019 3:23 PM

As always, does a site REALLY need that info? If not, but they make it mandatory to supply it, then make something up.

Facebook keeps asking for my mobile number, but I’ll never give it. If it ever becomes mandatory, I’ll buy a burner-phone (cash, of course).

As for protecting a live database – that’s easy. Given most sites require a unique log-in, use that to know which records to decrypt.

The real problem is companies want full access to all records all the time for data-mining and as we all know, that is the polar-opposite to what real security offers.

Until we’re honest about why companies “need” access to this data, then we will always be chasing ghosts.

If it is so important that companies have access to everything, why is it so hard to double-store the data? Once for the user, and once for the company?

I’m reminded of all the discussions of securing communications while allowing the Government access:

Encrypt the data twice: once using the key of the intended recipient and send, then a second time using the Government key and send.

"data base" guy • March 1, 2019 4:25 PM

For those who didn’t read the summary article: this pertains to databases that don’t touch the decryption keys (the keys are only ever client side):

Of course it depends on how big the data set is as to how well this would work… but databases and their search indexes can be segregated into different parts based on access restrictions and decryption keys… download just the parts you need… query them locally, and multiple results from different sources aggregated back together for display to the user (just the top one(s) from each set using a common relevance ordering, not looking up the entire result set)… without a ton of trouble…

This obviously wouldn’t work for something as large as searching google indexing the whole web, but since that’s all public data, it doesn’t need to per se… you only need it for truly private/secret data, which under normal circumstances where each user owns their own data is a far smaller set of data… (the logs of who searched what would be candidates, for example… so if that data was kept client-side-readable/searchable-only then their algorithm for pulling relevant ads based on those previous searches would also need to be client-side)

The issue is when a large company with lots of data (like, say, facebook) considers itself to be the owner of all data it stores, and not each of its users owners of their own data… (this is how US law effectively treats it anyway with its “third party” doctrine)… then a facebook admin (or whoever at facebook) needs to download everything to do any query, and that’s not really practical.

Also companies don’t like this idea of segregation of private data because they like to operate under the principle of “collect it all… just in case it proves useful someday and turns into a cash cow…” so this necessitates a lot of people and processes having full access to everything…

I guess my point is the main problem isn’t in the technology possible with data storage, it’s in the human philosophy behind how the data storage is done and what its ultimate purpose is… (i.e. to serve the company, not the customer or user)

Sancho_P • March 2, 2019 12:06 PM

Very interesting, I think I could understand most of it, thank you.

The presumption here is that an attacker can read query and the corresponding reply for some time, albeit both encrypted, and has some knowledge of the context.
If the attacker could replay or generate their own query – game over, no doubt.
Btw. I wonder why e.g. “Salaries” in the query wasn’t encrypted (like the range was)?

I’m not sure if the eavesdropper must be inside the db server to attack, I think a certain client connection (or two …) would be enough.

However, isn’t that much more complicated as to (repeatedly, continuously) watch a client’s TLS connection to a well known public server / service? The server’s reply for many pages / queries is known in plain (and can be challenged endlessly from all over the world).
For a dedicated attacker, how long would it take to read my connection in plain, to e.g. @schneier.com, just by exploring the metadata? (not that it would make much sense here)

Or is this scenario very different from the db-leakage?

Sancho_P • March 2, 2019 12:10 PM

@”data base” guy

I assume each client has it’s particular key which is translated at the server to a centralized db key?

The download + search has limits also with time critical queries, e.g. the “US Terrorist Watch List” and their 1400+ clients, plus there would be 1400+ vulnerable systems to leak all db knowledge.

However, I agree, the main problem is human “philosophy” (or the lack of), call it business.

Me • March 4, 2019 1:38 PM

@”data base” guy

“Of course it depends on how big the data set is as to how well this would work”

Sadly, no, no it would not. They show that (given certain information and assumptions) the speed with which the data is exposed is proportional only to the degree of tolerance an attacker is willing to have for inaccuracies, or data omitted (specifically not knowing the data at the extremes of the range).

Either way, however, they were able to get good accuracy after only a few hundred queries were observed. Note also that this was not about a specific implementation that was leaky, but the attack was based only on theoretical properties of a perfect encrypted database (one that allows ranged queries).

Overall, this confirmed my initial thought when I heard about these things that they sound interesting, but I doubt they can be made practical.

Jon • March 5, 2019 2:53 AM

@ Me

That was sort of what I was getting at in my post – taking one of their assumptions out and shooting it.

One of their assumptions was (I think) that:
“The same query always returns the same results”.

Therefore, if the database returns different results, it must then be a different query. But that’s not true, if you have ‘fuzzy’ database entries. By breaking the assumption, I hope to break the proof.

Another idea I had, to break that assumption, was a ‘sticky’ database. Every query will return, encrypted, some records that don’t actually match the query – they sort of ‘stuck’ to the trawl the database engine was doing.

When returned to the client, which has the original query in plaintext as well as the results in plaintext, the ‘sticky’ returns can be easily scraped off.

For even better security, you’d want also a ‘slippery’ database, where some records that do match the query do not get returned, but that might be awfully hard to sell to the customer.

It’s a bit like breaking cryptography. Instead of attacking the math, attack the implementation. Instead of attacking the proof, attack its assumptions…

PS – @ AJWM: Agreed. Poisoning databases is heap good fun. Canary traps are fun too – A trivial, unique, mis-spelling in one field and when it gets back you know exactly where it came from…

Gweihir • March 5, 2019 5:58 PM

Ah, yes. the perennial quest for things that may well be impossible. Nice article, boils essentially down to “if you can observe people using databases, you can deduce what is in it”. Not much of a surprise, really.

KeithB • March 7, 2019 4:08 PM

AJWM:
Obviously, not applicable where prohibited by law. Don’t lie on your tax returns, folks.

Or your SF-86. 8^)

Schneier on Security

Data Leakage from Encrypted Databases

Comments

Leave a comment Cancel reply