Wanted: Skein Hardware Help

As part of NIST’s SHA-3 selection process, people have been implementing the candidate hash functions on a variety of hardware and software platforms. Our team has implemented Skein in Intel’s 32 nm ASIC process, and got some impressive performance results (presentation and paper). Several other groups have implemented Skein in FPGA and ASIC, and have seen significantly poorer performance. We need help understanding why.

For example, a group led by Brian Baldwin at the Claude Shannon Institute for Discrete Mathematics, Coding and Cryptography implemented all the second-round candidates in FPGA (presentation and paper). Skein performance was terrible, but when they checked their code, they found an error. Their corrected performance comparison (presentation and paper) has Skein performing much better and in the top ten.

We suspect that the adders in all the designs may not be properly optimized, although there may be other performance issues. If we can at least identify (or possibly even fix) the slowdowns in the design, it would be very helpful, both for our understanding and for Skein’s hardware profile. Even if we find that the designs are properly optimized, that would also be good to know.

A group at George Mason University led by Kris Gaj implemented all the second-round candidates in FPGA (presentation, paper, and much longer paper). Skein had the worst performance of any of the implementations. We’re looking for someone who can help us understand the design, and determine if it can be improved.

Another group, led by Stefan Tillich at University of Bristol, implemented all the candidates in 180 nm custom ASIC (presentation and paper). Here, Skein is one of the worst performers. We’re looking for someone who can help us understand what this group did.

Three other groups—one led by Patrick Schaumont of Virginia Tech (presentation and paper), another led by Shin’ichiro Matsuo at National Institute of Information and Communications Technology in Japan (presentation and paper), and a third led by Luca Henzen at ETH Zurich (paper with appendix, and conference version)—implemented the SHA-3 candidates. Again, we need help understanding how their Skein performance numbers are so different from ours.

We’re looking for people with FPGA and ASIC skills to work with the Skein team. We don’t have money to pay anyone; co-authorship on a paper (and a Skein polo shirt) is our primary reward. Please send me e-mail if you’re interested.

Posted on September 1, 2010 at 1:17 PM • 42 Comments

Comments

Grande Mocha September 1, 2010 1:40 PM

I suspect that you won’t see great performance coming out of FPGA models of Skein. I don’t know of any existing FPGA products which have 64-bit adders as basic building blocks, so the 64-bit adders in Skein are being built from the existing smaller adders plus some FPGA cells needed for the routing and carry propagation.
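To make that concrete, here is a quick Python sketch (a behavioral model, not FPGA code; the 8-bit slice width is just an illustrative assumption) of a 64-bit add composed from narrower adder slices. Each slice has to wait for the carry-out of the slice below it, which is exactly where the critical path comes from:

```python
def add64_from_slices(a, b, slice_bits=8):
    """Model a 64-bit add built from narrower adder slices.

    Each slice adds one chunk of the operands plus the carry-in from
    the slice below, so the carry ripples through 64/slice_bits
    stages in series.
    """
    mask = (1 << slice_bits) - 1
    result, carry, serial_stages = 0, 0, 0
    for i in range(0, 64, slice_bits):
        s = ((a >> i) & mask) + ((b >> i) & mask) + carry
        result |= (s & mask) << i
        carry = s >> slice_bits       # ripples on to the next slice
        serial_stages += 1
    return result & ((1 << 64) - 1), serial_stages
```

With 8-bit slices the carry chain is eight stages deep; wider hard-macro adders shorten that chain, which is one reason a hand-optimized ASIC adder wins so decisively.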

On the other hand, your ASIC models were probably using Intel’s highly optimized 64-bit adders (one member of the Skein team is an Intel employee, right?), so those numbers are going to be substantially faster than anything that comes from an FPGA.

James September 1, 2010 1:43 PM

I find the discrepancies interesting because they highlight how large an effect the implementation has, both on performance and on possible security problems. It seems to validate everything that has ever been said about the implementation being critical to security.

awalls September 1, 2010 4:28 PM

Looking at GMU’s much longer paper, table 4.2 shows that the high-throughput algorithms all have a good ratio of bits processed at a time (block size) to the multiplier on N or T (corresponding to the iterations needed to do the rounds).

Another question to ask is why the SIMD designs did so poorly in aggregate; the authors give no reason in table 4.3.

I will note in table 4.3 Skein’s performance was the most predictable with little or no relative difference from the predictions. It looks like the automated design tools simply may not have been able to optimize the Skein implementation beyond the design in section 3.17.

Section 2.6 has an interesting comment on the design choice:
“The Controller is implemented
using three main Finite State Machines, working in parallel, and responsible for the Input,
Main Processing, and the Output, respectively. As a result, each circuit can simultaneously
perform the following three tasks: output hash value for the previous message, process a
current block of data, and read the next block of data. The parameters of the interface are
selected in such a way that the time necessary to process one block of data is always larger
or equal to the time necessary to read the next block of data.”

That design decision seems like it might favor algorithms that work on long input blocks and algorithms whose rounds can be unrolled easily (at least for throughput performance).

(Or maybe that was all just obvious…)

awalls September 1, 2010 4:57 PM

Huh, the ETH Zurich paper showed how to get better throughput/area for Skein: don’t unroll the rounds at all.

It appears to show that Skein is a good choice for area-constrained applications (where one doesn’t mind lower throughput), since its compression function is so regular.

User1 September 1, 2010 6:21 PM

The GMU team aren’t performing any FPGA-specific optimizations (“we have decided to opt out of using any dedicated resources”) and so are comparing 2001-era implementations.

Since they aren’t attempting to extrapolate the FPGA results to an ASIC equivalent (dangerous to do in any case), their results are fairly meaningless. I imagine Skein would make good use of embedded DSP slices, as well as some non-vanilla fast adder implementations.

Doug Coulter September 1, 2010 8:28 PM

If I understand the above, a problem is these wide adders. As someone once tasked with making very fast synchronous counters (a similar problem), I can say: it gets bad fast with carry propagation, and you wind up with a pretty deep lookahead tree if you do it that way. Very resource intensive.

However, if you could, for example, use the output in chunks as they become available (e.g. the first byte is ready quickly, then the second, and so on), then you might get a good speedup. It depends on whether subsequent stages can also roll with this.

Pipelining, but “sideways”.

Clive Robinson September 1, 2010 10:39 PM

@ Bruce,

As the Skein group’s results are considerably better than the others’, perhaps instead of asking what others did wrong you should ask what you did right.

I’m assuming you did not just do a vanilla design but used some optimizations in the ASIC design.

Understanding what the effects of those optimisations are might well answer the questions.

Also, what are the criteria for the various designs, i.e. speed, throughput, silicon real estate, etc.?

Also are you measuring like with like?

In a highly pipelined design, end-to-end propagation time can be of minor consequence when you have a hundred or more identical operations to perform in series, which is often how a hash is used.

That is, the (delay) time for a single operation to get through may be long, but the time between operations being output is very, very short; so the start and stop delays caused by the long latency are minor in comparison to the total time to do a hundred or so operations.
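The arithmetic behind that point is simple; here is a toy sketch (the numbers in the example are illustrative, not taken from any of the papers):

```python
def pipelined_total_time(n_ops, fill_latency, initiation_interval):
    """Total time for n operations through a pipeline: the long
    end-to-end fill latency is paid only once, after which one
    result emerges every initiation interval."""
    return fill_latency + (n_ops - 1) * initiation_interval
```

For instance, a pipeline with a 50-cycle latency but a 1-cycle initiation interval finishes 100 back-to-back operations in 149 cycles, not 5000: the latency is almost completely amortized.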

As Doug Coulter notes sometimes you have to think sideways when it comes to carry forward.

I once had to work on a project that involved multiple additions of very large integers. Mad as it might sound, the solution was to build a bunch of very low gate count “serial adders” and cascade them so they were all adding together, but delayed by one bit. Then we rolled this “weird optimization” back through the design so that the first adder would start on the next addition before the last adder was finished with the previous one. This way the adder array was working at maximum throughput all the time, and the delay between results was one bit-addition time.
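A minimal Python model of one of those serial adder cells (LSB first, carry held in a one-bit register; purely illustrative, and the cascading Clive describes would chain several of these with one-bit delays):

```python
def serial_add(a_bits, b_bits):
    """Bit-serial adder: a single full-adder cell consumes one bit of
    each operand per clock, LSB first, keeping the carry in a one-bit
    register instead of rippling it through a wide combinational chain."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)               # sum bit this clock
        carry = (a & b) | (carry & (a ^ b))     # carry register update
    return out
```

The gate count is tiny (one full adder plus a flip-flop per stream), which is what makes it practical to replicate the cell many times and keep the whole array busy.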

JoachimS September 2, 2010 1:20 AM

Aloha!

(First: Bruce, the link for the Intel paper is wrong; it points to the PDF with the slides.)

Would it be possible for the Skein/Intel team to release their RTL code (Verilog, SystemVerilog or VHDL)? This would allow us HW guys to try and map the design to FPGAs and do a better apples to grapes comparison.

Or is the implementation so hardwired (literally, a tech-specific gate-level design) to the target technology that it wouldn’t be meaningful?

AFAIK no FPGA vendor today includes hard macros for fast adders beyond 48 bits; normally they are 36 bits at most.

Cheers!

Robert September 2, 2010 3:08 AM

Bruce,
You might be looking in the wrong spot for the differences, because the sources of delay at 32 nm are very different from those in an FPGA or even a 180 nm ASIC.

Below 90nm most of the delay is due to routing resistance and parasitic ILD (inter level dielectric) capacitance. I assume the metal is copper and the ILD is probably very low K material, (black diamond maybe)

By contrast in a 180nm implementation delay will be determined mainly by the gate delay.

If the routing has not been optimized, then the speed, area and power differences could be substantial and depend a lot on the exact details of the synthesis flow.

That said, the differences are substantial, much more than I would expect. So I’d suggest you extract the worst-case signal path delays and see if there is at least some agreement at this level.

wb September 2, 2010 6:15 AM

I’m no expert, and I’ve never even designed an FPGA, but there are many adder designs which can be much faster, especially for 64+ bits. The English Wikipedia has nice articles about them. I also found a way to add numbers without O(log n) carry propagation at all, but it requires first converting the numbers to a special representation (which is slow). Search for the sparse skew binary system.

As for comparing hardware of different candidates, remember to keep as many conditions as possible constant (clocking, transistor sizes, etc.) to make the analysis simpler.

Zygmunt Lozinski September 2, 2010 6:25 AM

Yes: I believe it will get you an Erdős number of 5.

How:
Bruce coauthored with Ross Anderson (Econ of Info Sec)

Ross Anderson coauthored with Torleiv Kløve MR1658419 (99i:94039)

Torleiv Kløve coauthored with Charles J. Colbourn MR2094885 (2005e:94316)

Charles J. Colbourn coauthored with Paul Erdős MR0830680 (87k:05104)

Volker Hetzer September 2, 2010 6:55 AM

Hi!
Back when we did chip design, the main thing in a pipelined design was the placement of the flip-flops, i.e. the pipeline stages.
If you do all of the adders and carry logic in one pipeline stage, it’s the bottleneck.
Extend the pipeline by a few stages, say so that each addition happens in a separate clock cycle, and the throughput ought to increase.

There should be tools that measure how much time is left in each pipeline stage at a given clock frequency.

Just rewriting the Skein code into VHDL or Verilog is definitely not going to cut it.

Martin Thompson September 2, 2010 9:05 AM

A brief look at the GMU paper shows they are explicitly not using any FPGA features other than slice logic. I know it makes comparisons hard, but it leaves area and performance on the table. It might also show a significant difference in the suitability of the various FPGA families for the different functions. Some hashes may be equally efficient across families (which is probably a good thing).

Roman September 2, 2010 9:08 AM

Is source code (e.g. VHDL files + project) available for any implementations mentioned in the post?

Neil in Chicago September 2, 2010 1:07 PM

“I helped debug Bruce Schneier and all I got was this lousy T-shirt”?
(yes I can see you grimacing, but I still couldn’t resist.)

Alan September 2, 2010 2:09 PM

Fascinating discussion. I learn so much here (most of this is far over my head!).

I just can’t resist expressing my admiration for Clive Robinson’s description of the “mad adder”. =)

Doug Coulter September 2, 2010 6:13 PM

Me either — and we did some of that as well, Clive took it farther than I was willing to risk saying out loud here, as some wouldn’t “get it” — serial being thought of as slow by most. But he’s exactly right. Sounds like we may have had quite similar jobs sometime in the past…

Clive Robinson September 2, 2010 9:49 PM

For those who want to dig a little deeper into the subject of verification of hardware designs for arithmetic functions, have a look at:

http://www.russinoff.com/libman/index.html

David M. Russinoff has done the rounds (and written the book 😉). He works at AMD, where he is involved in the AMD64 design.

Anon September 3, 2010 12:50 AM

Standardizing the tree-hashing mode wouldn’t improve the gate count/performance attributes, but it would let hardware designers scale up their chips to get whatever throughput an app needs for long messages. Gate area keeps getting cheaper!

That train left long ago, I guess, but might as well make another pitch for parallelism. 🙂

David September 3, 2010 4:13 AM

“Would co-authorship earn the contributor an Erdős number?”
“You’d get a 4. I’m a 3, through Ron Rivest.”

one wonders (nawww…) if Bruce would add as co-author anyone who’d posted here…

~smile~

Jurjen September 3, 2010 4:28 PM

Just a thought:
The well known way to speed up long carry computations for additions is “carry save addition”, where you remember the carries instead of rippling them through.
These words are easy to rotate, and maybe there is a trick to combine this with XOR in some way. This might speed up things a bit.
(Sorry for putting this under the wrong message first.)
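For the curious, the core of carry-save addition fits in two lines of Python (a 3:2 compressor; this is an illustrative sketch of the general technique, not of any specific Skein design):

```python
def carry_save(a, b, c):
    """3:2 carry-save compressor: reduces three addends to a sum word
    and a carry word using only bitwise logic, with no carry
    propagation (every bit position is computed independently).
    One ordinary carry-propagating add at the very end resolves
    the deferred carries: sum_word + carry_word == a + b + c."""
    sum_word = a ^ b ^ c                          # per-bit sum, carries ignored
    carry_word = ((a & b) | (b & c) | (a & c)) << 1  # per-bit majority = carry
    return sum_word, carry_word
```

Because no bit depends on any other, the compressor’s delay is one full-adder level regardless of word width, which is why long chains of additions (as in multiplier trees) use it.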

JoachimS September 4, 2010 12:50 AM

Aloha!

Bruce (not Schneier): You are right that delays at 180 nm differ from those at 32 nm; wire delay totally dominates at deep sub-micron. But modern FPGAs are manufactured using 65 nm, 45 nm and even smaller geometries. The difference between ASIC and FPGA is that the wire delays are known and handled by the FPGA place & route tools.

J: No sorry, VHDL is not my friend. SystemVerilog is.

The important point I tried to make is that in order to really get to the gist of the difference, having the same architecture/design would be very helpful. Then it may well turn out that:

(1) Skein cannot be efficiently implemented on a current FPGA device. The reliance on 64-bit operators might kill a high-performance implementation.

(2) Skein can be implemented efficiently, but requires a different architecture. Bit slicing (as someone else pointed out) etc might be the key here.

But please, Skein + Intel team, if you want help, please help by providing any RTL. Put it under the GPL or Creative Commons Attribution-ShareAlike-NonCommercial if you like, as long as we can do experiments.

Cheers
JoachimS

Greg September 4, 2010 6:27 AM

I must confess that I am surprised. Is 100% FPGA really relevant? I haven’t worked on anything without a DSP on board for the last few years, and I haven’t seen any of these DSPs without a 64-bit adder path.

Why 100% FPGA? For chip pin applications I would think gate count is more important?

Finally, wasn’t the requirement for the new hashes to be optimized for 64-bit machines? Then wouldn’t the other hashes suffer from the same problem? (Of course, many seem to use AES.)

Clive Robinson September 4, 2010 10:12 AM

@ Jurjen,

“The well known way to speed up long carry computations for additions is “carry save addition”, where you remember the carries instead of rippling them through. These words are easy to rotate, and maybe there is a trick to combine this with XOR in some way. This might speed up things a bit.”

Have you been looking at Dadda and Wallace trees in fast multipliers?

http://www.cerc.utexas.edu/~whitney/pdfs/spie03.pdf

Or “overturned stairs”,

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.4204&rep=rep1&type=pdf

http://portal.acm.org/citation.cfm?id=141489

Miro September 5, 2010 4:41 PM

My first guess would be that all ARX candidates benefit highly from Intel’s 32 nm technology. It definitely works better for implementing efficient adders than all other (university-available) libraries.

I understand that Intel does not want to release their RTL code, but another verification can easily be done. Most of the above-mentioned groups provide their code online or, at least, are willing to provide it on request. Why not evaluate its performance using Intel’s technology? By doing that you would at least be able to locate the cause of this annoying discrepancy.

Simon @Helion September 6, 2010 9:52 AM

Hi Bruce

Having implemented the FIPS 180-3 hash functions in FPGA, it seems there are two critical issues affecting the performance of Skein in FPGA:

  1. It seems that at best for a straight “one round per clock” implementation there are two 64-bit adds in the critical path of the Threefish round function. From an FPGA perspective this is not a good thing and is not much better than the critical path of the SHA-384 or SHA-512 functions.
  2. The raw bits per clock number (block size (in bits)/number of rounds per block) is key to the throughput of any hash function. For SHA-256 it is 8 bits/clock and for Skein-256 it is 256/72 = 3.56 bits/clock, so the latter is inherently more than twice as slow. Similarly, Skein-1024 has the same performance as SHA-384 or SHA-512 i.e. 12.8 bits/clock.
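Simon’s figures can be checked with a one-liner (block sizes and round counts taken from his comment: SHA-256 processes a 512-bit message block in 64 rounds, Skein-256 a 256-bit block in 72 rounds, Skein-1024 a 1024-bit block in 80 rounds):

```python
def bits_per_clock(block_bits, rounds):
    """Raw throughput metric for a straight one-round-per-clock
    implementation: message bits absorbed per clock cycle."""
    return block_bits / rounds

# SHA-256:    bits_per_clock(512, 64)   -> 8 bits/clock
# Skein-256:  bits_per_clock(256, 72)   -> ~3.56 bits/clock
# Skein-1024: bits_per_clock(1024, 80)  -> 12.8 bits/clock
```

The metric ignores clock frequency, so it only ranks designs fairly when their critical paths are comparable, which is exactly the caveat in Simon’s first point.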

In my experience, even with judicious use of low level FPGA implementation techniques it would not be possible to achieve more than 200MHz in any current FPGA technology.

All the best
Simon

Grande Mocha September 6, 2010 11:31 AM

@ Laptop AC Adapter: “… wasn’t the requirement for the new hashes to be 64bit machine optimized”

Actually, no. NIST did not explicitly require the SHA-3 candidates to be optimized for 64-bit machines. However, NIST did choose the Intel Core 2 as the reference platform, which seemed to imply a preference (at least in my mind) for 64-bit platforms.

At this stage of the game, it looks like the arguments are going something like this: all of the 2nd round candidates are “fast enough*” on big hardware, so we’ll concentrate on the embedded platforms.

  • The “fast enough” criterion is a little murky, because there are candidates such as CubeHash which are hedging their bets by having a “formal” mode which is probably secure but very slow, versus a “recommended” mode which is faster but arguably less secure.

Anon September 6, 2010 7:02 PM

@Simon: Giving the key schedule its own pipeline stage gets you one addition in the critical path — 25% more clocks, but potentially cuts your clock period enough to be worth it. I’m sure there are tricks that get you well beyond 200MHz.
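The 25% figure follows from Threefish’s structure (a subkey injection every four rounds); here is a sketch of the cycle accounting, assuming the 72 rounds of Threefish-256/512 and a one-round-per-clock datapath:

```python
def cycles_per_block(rounds=72, subkey_interval=4, subkey_own_cycle=False):
    """Clocks per block for a one-round-per-clock Threefish datapath.
    Giving each subkey injection its own pipeline stage adds
    rounds/subkey_interval extra clocks, but removes the second
    64-bit adder from the critical path."""
    extra = rounds // subkey_interval if subkey_own_cycle else 0
    return rounds + extra
```

That gives 90 clocks instead of 72, the 25% overhead Anon mentions; it pays off whenever the shorter critical path lets the clock run more than 25% faster.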

Anyway, the long-run solution for high throughput is parallelism, either from tree hashing or inherently parallel apps (speeding up thousands of SSL sessions). As long as an algorithm doesn’t do anything truly awful — huge multiplications, random 8×64 S-boxes — there is a way.

Simon @Helion September 7, 2010 5:56 AM

@Anon

I had assumed the subkey would be pipelined and was only considering the critical path of the round function itself as shown in the paper by Men Long.

For a “straight” one round per clock implementation it seems you have to perform a mix, permute and subkey add every fourth round. As such the critical path of this approach contains two 64-bit additions. I assume this will be the plain vanilla approach used in the various trials Bruce referred to.

I confess I have not had long to look at the Skein algorithm and there may indeed be ways of obviating this. However, even the presence of a single 64-bit add is a serious limitation on performance in certain FPGA technologies, in particular those commonly used in Mil-Aero and Space applications which do not contain hardwired carry chain logic.

ML September 7, 2010 1:01 PM

Not knowing the details of the different implementations, I would say that unrolling 4 or 8 rounds, although it simplifies the logic, hurts the maximum clock frequency and thus performance. So on top of all the discussion regarding the critical path of 64-bit adders, I would say that a performance-optimized FPGA implementation of Skein should combine unrolling with pipelining.

Anon September 7, 2010 2:22 PM

@Simon: I’m looking at Long’s paper, figure 10, and it shows two parallel Skein MIXes, then another row of WireShifts and the subkey addition for the key schedule.

You can see that the second row of additions is subkey additions because those adders get their inputs from the box at top right that generates the subkey.

If you look back at the Skein paper, you can see they describe rounds without the key schedule as 2, 4, or 8 additions in parallel (depending on block size).

So you could take Long’s design and add control logic so it does 4 regular rounds then a subkey addition, or just unroll 4 or 8 rounds as separate pipeline stages with subkey addition “rounds” added in the right places, if you have the gate area and can use the extra parallel messages.

(I had forgotten that each key schedule itself involves two serial additions (one to combine key and tweak into subkey, one to add key and block). The addition to get the subkey could happen in parallel with the MIX in the previous round, just like it happens in parallel with the mixing in Long’s current design.)

Anon September 7, 2010 2:29 PM

@Simon: D’oh — I just noticed, Long actually mentions this idea in section 3.3.4 of his paper and estimates it might ~double clock frequency and reduce overall latency 37%, but says implementation is “a work in progress.”

(He takes the idea a very reasonable step further and reuses adders from the MIXes in the subkey addition, to save gate area.)

Clive Robinson September 8, 2010 5:00 AM

@ --Andrew,

Both of the paper’s authors work in the faculty of engineering at the University of Calabria, Italy.

http://www.unical.it

Their email addresses and other contact details are readily available on the net (just Google Stefania Perri, or go to the university website portal for faculty staff in DEIS).

Searching on Ms Perri’s email address pulls up a whole load of papers she and Paolo have worked on. Most concern matters related to circuit design for FPGAs etc. in the area of communications.

--Andrew September 9, 2010 12:25 AM

Thanks Clive, I’ll take a look.

I was also wanting to review the differing implementations that Bruce cited from a system perspective. See what other elements besides the Threefish pipelining and adders might be causing issues/consuming clocks and compare it to Intel’s ASIC implementation. For example, the Claude Shannon Inst. implementation looked like it used a lot of cycles just padding short messages (Table 3 in their Results section). I’d be curious if other implementations had similar characteristics and I’d also be curious if the test data set leaned heavy that way.
