SHA-3 Second Round Candidates Announced

NIST has announced the 14 SHA-3 candidates that have advanced to the second round: BLAKE, Blue Midnight Wish, CubeHash, ECHO, Fugue, Grøstl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, and Skein.

In February, I chose my favorites: Arirang, BLAKE, Blue Midnight Wish, ECHO, Grøstl, Keccak, LANE, Shabal, and Skein. Of the ones NIST eventually chose, I am most surprised to see CubeHash and most surprised not to see LANE.

Here’s my 2008 essay on SHA-3. Here’s NIST’s SHA-3 page. And here’s the page on my own submission, Skein.

Posted on July 24, 2009 at 12:15 PM19 Comments

Comments

Grande Mocha July 24, 2009 12:58 PM

I am also surprised that LANE is not on the list. I thought that the LANE submission documentation was very thorough and contained excellent analysis. Perhaps NIST felt that they had enough AES-based submissions, though. Overall, this seems to be a good choice of second round candidates.

Congratulations, Skein Team, on making it to the next round!

Ghoti July 24, 2009 1:14 PM

One may speculate that NIST saw 3 solid AES-based submissions (ECHO, LANE, Grøstl) with very similar performance characteristics and security analysis, and that one of them was dropped more or less arbitrarily. A pity. I think LANE would have made a fine standard.

At this point I guess it’s time to start placing bets for top 5. My gut feeling says that unless breakthroughs are made, Skein, CubeHash, Grøstl and Keccak will be with us also in a year.

Johannes Berg July 24, 2009 3:36 PM

Guess I should dig out our altivec improved skein, it made a difference on 32-bit powerpc.

Randall July 24, 2009 5:56 PM

Re: CubeHash, djb mentioned something about CubeHash8/32 at the end of his slides at one of the conferences, right? He’s allowed to tweak the algorithm until September 15, so he might make it a more realistic competitor by lowering the insane safety factor.

I’ll repeat my earnest wish to see a tree mode of Skein as the default. The next generation of Intel processors will have 256-bit SIMD, meaning four Threefish encryptions at once, per core. Ain’t nothin’ to sneeze at.

memory July 24, 2009 8:17 PM

But wouldn’t the overhead of splitting up the job among four processors swallow up most or all the advantage of getting them going at once? It seems like the parallel processing advantages of tree hashes only starts paying off when you have huge messages and large numbers of processors.

Randall July 24, 2009 10:47 PM

@memory: with a SIMD instruction set like SSE2, you don’t deal with multiple processors. One processor performs an identical task on multiple chunks of data in parallel: “XOR these four words onto these four” or “add these four words to these four.” It’s used in the real world: in the Crypto++ implementation of Salsa20, for instance, and in some SHA-3 candidates including CubeHash and SIMD.

I sent the Skein team some code implementing Skein in SSE2 that does two Threefish encryptions at once and throws one away; it takes about 20 cpb. (That’s much slower than their best 64-bit implementation (6 cpb), but faster than other 32-bit implementations, which is why it’s useful. (Also, I owe the Skein team some cleanup on that code…))

If my code weren’t throwing the results of one encryption away, it would be getting 10 cpb on 32-bit machines. You could exploit that speed in a tree hash or when, say, using Threefish to encrypt in counter mode. If you scale that up to 256-bit SIMD in Intel’s next-gen processors and do four encryptions at once, then you’d be getting 5 cpb. On the other hand, those processors might have other improvements that make the non-SIMD implementation of Skein faster.

GPUs go way beyond 256-bit SIMD. Intel’s Larrabee GPU (due next year? vaporware?) is supposed to have 512-bit SIMD in each core — eight encryptions at once — and existing GPUs can execute the exact same code on many sets of registers at once really efficiently.

I wouldn’t assume that GPGPU or far-future multicore speedups are all hogwash — people will continue to want to process data more quickly and clock rates aren’t going to keep on doubling forever, so we’ll eventually figure out how to make that kind of parallel computation more practical. But SIMD is something that you can use now. (Parallel ASIC implementations are also a short-term practical win.) If nothing else, it’s an easy way to nearly double Skein’s speed under 32-bit OSes!

Randall July 24, 2009 11:00 PM

@memory: Some key qualifications to what I just said.

1) SSE2 Skein would be slower than the current 64-bit implementation of Skein, even for a parallelized Skein. That’s because SSE2 tragically lacks a rotate instruction. That’s not guaranteed to change w/the next gen of processors and their larger SIMD register sizes.

2) Larrabee and GPGPU are promising but unproven.

3) 5 cpb is a common-sense estimate; these future processors don’t exist so I can’t benchmark.

Still. Tree mode: interesting.

will July 25, 2009 2:00 AM

@Randall – very interesting; got a blog or article somewhere in more detail for us SIMD newbies to think about?

Randall July 25, 2009 10:55 PM

@will: Nothing of my own, but here are some links:

Intro:
http://tirania.org/blog/archive/2008/Nov-03.html — Miguel de Icaza explains SIMD while talking about support for it in Mono

Reference:
http://msdn.microsoft.com/en-us/library/y0dh78ez(VS.80).aspx — Microsoft’s reference on the SSE2 compiler intrinsics, which are what you’ll actually use to code SSE in C++ — I used this heavily when trying to code with SIMD
http://softpixel.com/~cwright/programming/simd/sse2.php — concisely lists what instructions there are

Fun speculation on SIMD and the future:
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions — Intel’s 256-bit successor to SSE2, also adopted by AMD
http://en.wikipedia.org/wiki/Intel_Sandy_Bridge_(microarchitecture) — first processor slated to implement AVX
http://en.wikipedia.org/wiki/Larrabee_(GPU) — GPU promising many processor cores with 512-bit SIMD; might be vaporware
http://en.wikipedia.org/wiki/GPGPU — general-purpose computing on (highly SIMD) GPUs
http://www.nvidia.com/object/cuda_home.html — NVIDIA’s site on GPGPU

Relevant Wikipedia articles:
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
http://en.wikipedia.org/wiki/SIMD
http://en.wikipedia.org/wiki/Vectorization_(computer_science)

Some software using SSE2:
http://www.cryptopp.com/docs/ref551/salsa_8cpp-source.html — Wei Dai’s implementation of Salsa20
http://cubehash.cr.yp.to/emmintrin.c — djb’s implementation of CubeHash

SHA-3 and SIMD:
http://www.di.ens.fr/~leurent/simd.html — A SHA-3 candidate built to work well with SIMD instruction sets
http://icsd.i2r.a-star.edu.sg/staff/hongjun/jh/ — JH is also meant to be SIMD-friendly
(Other SHA-3 candidates may also benefit — BLAKE, being based on ChaCha, can probably be sped up. Honestly, I haven’t thought through vectorizing them all.)

dongs July 26, 2009 5:55 PM

@Randall

AMD’s version of AVX, the 256-bit wide successor of Intel’s SSE[X], called XOP, will have SIMD rotations.

Randall July 26, 2009 11:04 PM

@dongs: Looks like rotations are only there for the 128-bit registers — see the “Shift/rotate…” bullet point at http://forums.amd.com/devblog/blogpost.cfm?catid=208&threadid=112934 .

(Would be awesome if there were rotates on 256-bit regs, though. If you see something more specific than that post, would love to hear.)

Still, the current SSE2 Skein/CubeHash spend a lot of time rotating, so 128-bit registers with rotates may be faster than 256-bit registers without.

Randall July 26, 2009 11:06 PM

@moderator: I submitted a big old list of links for @will as a comment, and MT probably blocked it as spam. Since will asked and it’s relevant (has to do with crypto performance, which is important to SHA-3), I’d love to see it unblocked. Or e-mail me and I’ll put up a post on Blogger somewhere.

dongs July 28, 2009 5:49 PM

It appears you’re right, it is 128-bit only.

On a related note, I’ve written and experimented a little with Threefish a few months back (it’s in portuguese):

http://blol.org/932-threefish-part-ii

In 32 bit x86, SSE2 helps a lot with the 64 additions. The end result was 256bit key/block Threefish at ~9 cpb in CTR mode.

Anas February 5, 2010 1:11 PM

There is any one made the Threefish Encryption and Decryption algorithms in VHDL??
IF yes, can you send the codes me, im a student making a study on this algorithm.
And i’ll make in sure copy rights on them for your name.
Regards
anas_bataineh86@yahoo.com

Leave a comment

Login

Allowed HTML <a href="URL"> • <em> <cite> <i> • <strong> <b> • <sub> <sup> • <ul> <ol> <li> • <blockquote> <pre> Markdown Extra syntax via https://michelf.ca/projects/php-markdown/extra/

Sidebar photo of Bruce Schneier by Joe MacInnis.