Story of the ZooKeeper Poison-Packet Bug

Interesting story of a complex and deeply hidden bug—with AES as a part of it.

Tags: AES, Apache, vulnerabilities

Posted on May 25, 2015 at 9:20 AM • 20 Comments

Comments

Snarki, child of Loki • May 25, 2015 9:38 AM

The linked article gets all the way to the workaround, but doesn’t quite nail the culprit.

Was the hardware AES-NI instruction, or the Linux aesni-intel module, the source of the bug?

Chris S • May 25, 2015 11:46 AM

@snarki;

That’s actually part of the interesting, if frustrating, result. They still aren’t quite sure. They know the workaround works, but as they explicitly state “a proper fix still regrettably eludes us”.

Further comments suggest that it’s actually the Xen virtualization layer which may not accurately save and restore state on VM switches; this is further indicated because as of Xen 4.4, the problem ceases.

Kudos to them for getting this far, AND publishing this.

Curious • May 25, 2015 12:19 PM

Even though most internet tech stuff is beyond me, I still thought the linked article was a fascinating read. 🙂

Btw, someone in the comment field claimed something that I found bewildering and at the same time a fascinating piece of information:
“For something that not a lot of network people seem to know, 2 single bit errors exactly 16 bits apart will calculate to the same TCP checksum.”

Alex • May 25, 2015 12:35 PM

Fascinating.
Is the native Intel AES-NI instruction bogus?
Or is the AES-NI software emulation in Xen incorrect?
(see https://communities.intel.com/message/107795 “AES-NI is not available in a virtualized environment”)

Curious • May 25, 2015 12:46 PM

Not being a computer engineer or anything, thus not having a basic understanding of the whole thing, forgive me for asking:
Am I to understand that, if the standard is to blame for deeming a second checksum to be literally “unnecessary”, could this ‘bug’ also be deemed an ‘exploit’?

What is interesting about this notion of an exploit is that the problems arising from this ‘bug’ might, as I imagine it, quietly manifest themselves in other situations (with Windows, for example).

Bewildered by my ignorance in these matters, I am still wondering:
Would a bug involving the TCP checksum under IPsec have anything to do with using crypto on the internet in general?

My simple mind recalls “IPsec” being a part of IPv6, but I wouldn’t think IPsec was a part of everyday stuff on the internet.

Anura • May 25, 2015 1:55 PM

@Curious

I’m not sure how you could exploit it, as you would have to manipulate the packet before it is encrypted. Tampering with it over the wire is just going to cause the MAC to fail to validate. The TCP checksum is designed only to detect accidental errors, and yeah, it’s probably insufficient even for this purpose; with only 65536 possible values, a bad packet will go undetected quite often. The checksum is designed to be as fast as possible, and thus only does addition.

Ray Dillinger • May 25, 2015 2:07 PM

It depends on your programming environment, but I always start by assuming that whatever the code I’m writing must interact with is an application written by Satan.

Thus I wind up with lots of checks for things that “Can’t happen” if some other system is working correctly, and most of the same people who call me a “raging hair-triggered paranoid doing unnecessary work for no reason”, are later astonished how often those checks fail.

I probably am paranoid in fact; I assumed from the outset that Intel would get the AES instruction wrong. I assumed this because of a combination of two reasons: First, it was no longer source code where everyone could see how it was done and verify that it was right. Second, there are powerful entities who are motivated to ensure that they fail in subtle ways that are exploitable but won’t usually be noticed. Sometimes I hate being right, but color me completely unsurprised.

So, no, in my universe you don’t skip a checksum just because it’s also supposed to have been done somewhere else. You don’t fail to handle an exception because the exception won’t happen if the system you’re interacting with does its job. You use unsigned if there’s any chance of overflow because signed overflow is undefined behavior… and so on. And code takes several times longer to write, and in the short run I’m less productive than a lot of programmers. In the long run I like to think I spend a smaller fraction of my career chasing down crap like this.

Anura • May 25, 2015 2:29 PM

@Ray Dillinger

If Intel got the code wrong, it would be a lot more noticeable; anything using AES in GCM or CCM modes or using AES with CBC-MAC would end up failing regularly with any implementation using AES-NI instructions. The fact that it only happens on virtualized environments with certain versions of Xen seems to indicate it is not a problem with the hardware implementation itself.

It's a conspiracy! • May 25, 2015 3:08 PM

Plot twist: This is actually an Intel AES-NI backdoor malfunctioning!

Curious • May 25, 2015 3:18 PM

Somehow I was reminded of a Batman comic when reading the linked article. The comic might have been a spinoff from a Batman movie. Anyway, iirc the Joker poisons people by selling consumable products; the consumables are not poisonous by themselves but become toxic when used in combination with other consumables. 😛

Buck • May 25, 2015 4:44 PM

@Curious

Ha! I’d be willing to wager that the Joker had some clowns on the inside of those grocery store ‘loyalty’ card databases… ^_^

(Sorry, way off topic! Uhhh… yeah, this particular bug/exploit is Xen-based, not Intel initiated.)

Snake Oil Alert • May 25, 2015 7:40 PM

I’m going to post this in the next Squid Blog but whilst we’re on-topic you can all have a good laugh at this.

I was researching Sandisk Secure Access Software when I discovered the supplier: EncryptStick (http://www.encryptstick.com/). My suspicions were aroused when I saw this claim:

“You have the option to encrypt your vaults with 128, 256, or 512-bit AES ciphers. Our encryption is registered and government approved, and is FIPS 140-2 compliant. In an upcoming release, we will offer a 1024-bit encryption option.”

The official AES standard is 128, 192 or 256 bits and not 128, 256, 512 or 1024 bits… a definite sign of snake oil. They say their product is FIPS “compliant” but not “certified”, presumably to mislead gullible purchasers.

Worse still, I found who I believe to be the parent company, offering a piece of software called TurboCrypt, which supposedly offers “1024 bit Polymorphic Encryption and 4×256 bit AES”. They showcase their “Polymorphic Giant Block Size Cipher” and even have a pseudo-scientific paper to support their wild claims.

Their “Polymorphic Medley Cipher V.2 – AES, Twofish, Serpent, Cast-256, RC6, SEED, Camellia and Anubis cascaded with keyed cipher selection” source code is here 🙂

Has anybody heard of their founder C.B. Roellgen?

Jacob • May 25, 2015 11:00 PM

@Snake Oil Alert

Even if you tried, I doubt that you could concoct such a low server security grade as seen at turbocrypt.com:

https://www.ssllabs.com/ssltest/analyze.html?d=www.turbocrypt.com&s=80.242.134.246

Mike Amling • May 25, 2015 11:41 PM

@Curious
“For something that not a lot of network people seem to know, 2 single bit errors exactly 16 bits apart will calculate to the same TCP checksum.”

If I understand it correctly, only in some cases will 2 single-bit errors 16 bits apart leave the checksum unchanged. If bit A is the same as bit B which occurs 16 bits later, then changing them both will change the checksum. But if they’re different, then changing them both will leave the checksum unchanged.

The TCP checksum is just a sum (albeit ones-complement). If you exchange one 16-bit word of the TCP payload with any other, the checksum remains unchanged. If you exchange 1 or more bits of any 16-bit word with the corresponding bits of any other 16-bit word, the checksum remains unchanged.
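The cases described above can be checked with a short sketch. This is not code from the linked article; it assumes the RFC 1071 style 16-bit ones-complement sum over big-endian words, leaves out the TCP pseudo-header for brevity, and the names `internet_checksum`, `flip_bit` and the sample `payload` are made up for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """Complement of the 16-bit ones-complement sum of big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit flipped; bit 0 is the MSB of byte 0."""
    out = bytearray(data)
    out[bit_index // 8] ^= 0x80 >> (bit_index % 8)
    return bytes(out)


# Hypothetical payload: the 16-bit words 0x1234, 0x5678, 0x9abc, 0xdef0.
payload = bytes.fromhex("123456789abcdef0")

# Bits 1 and 17 differ (0 in 0x1234, 1 in 0x5678): flipping both amounts to
# exchanging that bit between the two words, so the sum is unchanged.
swapped_bits = flip_bit(flip_bit(payload, 1), 17)

# Bits 0 and 16 are both 0: flipping both adds to the sum twice,
# so the checksum changes.
same_bits = flip_bit(flip_bit(payload, 0), 16)

# Exchanging whole 16-bit words (here the first and the third) changes nothing.
swapped_words = payload[4:6] + payload[2:4] + payload[0:2] + payload[6:8]

for name, data in [("swapped_bits", swapped_bits),
                   ("same_bits", same_bits),
                   ("swapped_words", swapped_words)]:
    print(name, internet_checksum(data) == internet_checksum(payload))
```

Only the middle case is caught by the checksum; the other two corruptions pass through unnoticed.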

rgaff • May 26, 2015 12:05 AM

@Jacob How is it possible to get such a grade? That has to take some real serious effort to get that low!

G • May 26, 2015 1:13 AM

Interesting. Just to comment on the upper layer in this stack, having written a couple of data replication subsystems for enterprise storage systems and the like, I have to say, not checksumming your messages yourself (depending on TCP to do it for you? Really?), and passing raw RPC length fields straight to malloc (or the Java equivalent here) is pretty noobish. Even if nothing in the infrastructure or network clobbered that value, the client may have had a bug and emitted a message with a bogus value there. That’s all it takes.
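As a rough illustration of that point, here is a minimal sketch of an application-level frame that carries its own checksum and refuses implausible length fields before using them. It is not ZooKeeper's wire format; the header layout, the `MAX_PAYLOAD` cap and the function names are assumptions chosen for the example, and CRC-32 from Python's `zlib` stands in for whatever checksum a real protocol would pick.

```python
import struct
import zlib

MAX_PAYLOAD = 1 << 20            # hypothetical 1 MiB cap on a single message
HEADER = struct.Struct(">II")    # assumed layout: payload length, CRC-32 of payload


def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length and its own CRC-32."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large")
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_frame(frame: bytes) -> bytes:
    """Validate the length field and checksum before trusting the payload."""
    if len(frame) < HEADER.size:
        raise ValueError("truncated header")
    length, crc = HEADER.unpack_from(frame)
    # Reject a bogus length before it can drive any allocation or read.
    if length > MAX_PAYLOAD:
        raise ValueError(f"implausible length field: {length}")
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    # Do not rely on TCP alone: verify the message's own checksum.
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


frame = encode_frame(b"hello")
print(decode_frame(frame))
```

A stricter variant would also cover the length field itself with the checksum; in this sketch a corrupted length is caught only indirectly, by the cap, the truncation check or the CRC mismatch.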

This type of problem is not as hard to debug as one might conclude from this article. The main thing is it’s not all that hard to repeat, and it’s easy to detect (corrupt packets) when it happens. This allows you to dig and iterate. The main cool thing about this story is that they persisted (didn’t just settle for the first workaround they discovered), and documented it for the rest of the world.

dm • May 26, 2015 4:44 AM

Roellgen sounds like a kook, but that may just be the result of the language difference. Over the years I have worked with many foreigners for whom English is a second language, and they often sound just like Roellgen, even when they really do have something to offer.

I don’t want to impose any opinions about his work — I do have my own opinion. But for an objective assessment, he mentions that his “Polymorphic Encryption” is a patented technique.

I work on the side as a “Scientific Advisor” to Patent Law firms. If his method is really patented, then all of the information needed to replicate it must also be published in the patent disclosure. Hence, it cannot be held as a secret sauce.

It should therefore be easy enough to discern whether or not he is a kook…

Snake Oil Alert • May 26, 2015 7:49 AM

@Jacob – I didn’t even look at his website security – ‘F’ is terrible. Last time I checked my own site I had an A+ rating and that is easily achievable with modern certificates/server programming. How a security company can get such a poor rating is beyond me.

@dm – I don’t think it is the language difference – I can understand what he’s saying but his logic doesn’t add up. It’s his whole explanation of standards and (unrelated) mathematical concepts in his ‘literature’. He seems to have copied and pasted explanations of cryptographic concepts and changed them slightly to fit his idea.

His use of phrases like “No Government Agency in this world can ever break TurboCrypt”, “Royalty-free use for any legal civil application” and his $7,000 bounty for cracking his cipher complete with corny pictures makes me laugh.

He then tries to debunk established ciphers like Rijndael (a.k.a. AES) despite the fact he uses it within his own products! If Bruce Schneier had something interesting to say about AES we’d all listen – he’s a well established expert with demonstrable work/proof. Mr Roellgen seems more akin to a charlatan than an expert.

The software EncryptStick (developed by Roellgen) which SanDisk are using for their encrypted USB sticks is trivial to defeat because of its poor implementation.

@Schneier – If you see this post I’ve just found your old Crypto-Gram newsletter on Snake Oil and Mr Roellgen seems to hit almost all of your warning signs. Your other article on the Fallacy of Cracking Contests is also very pertinent here. If you have a ‘Rogues Gallery’ you might want to consider awarding Mr Roellgen a place.

Petrov • May 27, 2015 10:01 AM

Comment:

I was interested in this because the title seemed to promise a security issue either put in by nation states or possibly previously found by them, since very hard-to-find security issues with a critical rating are among the most valuable to such parties. However, while it does sound like it is likely exploitable, and it was extremely difficult to find, and it likely would have been exploitable at a critical level using a high value (encrypted) protocol… it requires a number of ‘stars to be aligned’ for it to actually work.

So, unlikely to have been found before (though the developer indicated some manner of crash, he did not seem to have had solid enough data beyond a ‘pretty good suspicion’ of the culprit) (developers and especially ones tending bug databases tend to find out these sorts of issues as they have, effectively, such a large citizen qa team)… and unlikely to have been put there by anyone intentionally. (There are plenty of ways to have done this level of attack so there was not a ‘many stars aligned’ problem, and have low likelihood of detection due at the least to the complexity of the underlying encryption and packet handling systems required.)

Interesting story, though, and well told. I feel for them going that far to track out the problem.

Jeroen • June 15, 2015 2:55 PM

Hi Guys,
I would like to comment on the things I read here in relation to EncryptStick.
TurboCrypt and EncryptStick are totally unrelated and Roellgen is certainly not a founder of EncryptStick, nor does he have any relation with the current product or the company.
EncryptStick used the PMC cipher many years ago but it was replaced by the FIPS OpenSSL AES implementation.
SanDisk SecureAccess uses the OpenSSL AES FIPS version. Not a single bit of PMC.
Need more info? just email support@encryptstick.com and you can get the facts.

We are trying to build the best security product we can for a broad audience. Associating us with TurboCrypt or Roellgen does not help.
Like always, don’t take my word for it, do the hard work and check the facts.

Jeroen

Schneier on Security
