Schneier on Security
A blog covering security and security technology.
November 19, 2008
Skein and SHA-3 News
There are two bugs in the Skein code. They are subtle and esoteric, but they're there. We have revised both the reference and optimized code -- and provided new test vectors -- on the Skein website. A revision of the paper -- Version 1.1 -- has new IVs, new test vectors, and also fixes a few typos in the paper.
Errata: Version 1.1 of the paper, reference, and optimized code corrects an error in which the length of the configuration string was passed in as the size of the internal block (256 bits for Skein-256, 512 for Skein-512, and 1024 for Skein-1024), instead of a constant 256 bits for all three sizes. This error has no cryptographic significance, but affected the test vectors and the initialization values. The revised code also fixes a bug in the MAC mode key processing. This bug does not affect the NIST submission in any way.
NIST has received 64 submissions. (This article interviews one of the submitters, who is fifteen.) Of those, 28 are public and six have been broken. NIST is going through the submissions right now, making sure they are complete and proper. Their goal is to publish the accepted submissions by the end of the month, in advance of the Third Cryptographic Hash Workshop to be held in Belgium right after FSE in February. They expect to quickly make a first cut of algorithms -- hopefully to about a dozen -- and then give the community about a year of cryptanalysis before making a second cut in 2010.
Lastly, this is a really nice article on Skein.
These submissions make some accommodation to the Core 2 processor. They operate in "little-endian" mode (a quirk of the Intel-like processors that reads some bytes in reverse order). They also allow a large file to be broken into chunks to split the work across multiple processors.
However, virtually all of the contest submissions share the performance problem mentioned above. The logic they use won't optimally fit within the constraints of an Intel Core 2 processor. Most will perform as badly as or worse than the existing SHA-1 algorithm.
One exception to this is Skein, created by several well-known cryptographers and noted pundit Bruce Schneier. It was designed specifically to exploit all three of the Core 2 execution units and to run at a full 64-bits. This gives it roughly four to 10 times the logic density of competing submissions.
This is what I meant by the Matrix quote above. They didn't bend the spoon; they bent the crypto algorithm. They moved the logic operations around in a way that wouldn't weaken the crypto, but would strengthen its speed on the Intel Core 2.
In their paper (PDF), the authors of Skein express surprise that a custom silicon ASIC implementation is not any faster than the software implementation. They shouldn't be surprised. Every time you can redefine a problem to run optimally in software, you will reach the same speeds you get with optimized ASIC hardware. The reason software has a reputation of being slow is because people don't redefine the original problem.
That's exactly what we were trying to do.
EDITED TO ADD (11/20): I wrote an essay for Wired.com on the process.
Posted on November 19, 2008 at 6:14 AM
• 40 Comments
Why doesn't NIST publicly reveal all the submissions?
I find it bizarre that they describe "little endian" mode as "a quirk of Intel processors" that stores bytes in the "reverse order".
Little or big endian is an arbitrary choice by the CPU designer--it's not as if big endian is the "proper" way of doing it, or something! x86 is the most widely-popular little endian design, but by no means the only one. I've always found it to be more natural than big endian... and also, this "quirk of Intel processors" is not a new thing in Core 2's, it has been that way since the 8080 and in the desktop world, the vast majority of machines have always been little endian.
To me, the surprise is that so many crypto algorithms are specified in a way that naturally favors big endian hardware instead of little endian! I wonder what the historical reason is for this.
The submitters are able to reveal their submissions themselves, if they want to. There is little point in NIST showing submissions that they are going to reject out of hand before consideration even starts (especially if they are embarrassingly amateurish in addition to being flawed. No need to embarrass anybody.)
(If you were wondering why they would reject submissions out of hand, well... they would only jettison submissions that obviously don't function as advertised, or are a copy-and-paste of some other well-known hash function, or something like that. So that the crypto community doesn't waste any time looking at them, and can spend that time trying to analyse and break the rest of the submissions).
A follow-up about little endian... I guess which one seems more "natural" depends on how you number the bits in a word. Big-endian people are used to numbering from MSB=0 to LSB=n-1, while little-endian people use the opposite numbering (LSB=0 to MSB=n-1). So both camps are used to bits 0..7 of a word being the "first byte" of the word (i.e. at the lowest memory address) on the kind of machine they are used to. When you switch over to the other kind of machine, either you adjust your thinking to use the other bit numbering scheme, or you end up prejudiced against it like the author of that article.
It's amusing that skein's designers are called well-known cryptographers, except for Bruce, who is called a noted pundit.
One question, however. Does optimizing for little endian end up sacrificing performance on some embedded processors and/or big-endian CPUs?
> Little or big endian is an arbitrary choice by the CPU designer--it's not as if big endian is the "proper" way of doing it, or something!
The only reason I can think to consider big endian more "normal" or common is that network byte order is big endian, so anything transferred over an IP network needs to be reordered on a little endian machine.
You can find the 15-year-old kid's submission here: http://ehash.iaik.tugraz.at/wiki/Ponic
I know it's cheap to rip on a 15-year-old kid, and he's on the bright side of ordinary for his age, but I don't think NIST or other submitters are going to be wasting time "tearing apart" his submission as he hopes. I agree with moo that if this is representative of many of the 60 submissions then it's probably a good thing that NIST isn't publishing them until they've been weeded.
It's also amusing, if not flattering to young Peter, that the NIST manager says that some of the submissions look like they were written by a *13*-year-old kid.
Big endian is more normal, since it is closer to how we read numbers in real life. If we see "12 34 56 78" in a hex dump, it's more natural to read it as 0x12345678 than 0x87654321. Therefore, anybody who doesn't routinely work with little-endian hex dumps or fancy assembly language routines will favor big endian. It's like starting counting with 0 rather than 1, an unnatural habit that comes about through long exposure to computing quirks.
@moo et al. on Endianness.
Not such a surprising statement, actually.
A little more inflammatory than it needs to be, perhaps. It is only a slightly less touchy subject than, say, which text editor is better :), but I think that is really more a sign of an attempt to make the description accessible.
Network endian is Big Endian, and generally that is the de-facto choice of almost all standards that grew up in respectable standards bodies, because it's portable and it's the choice other people made. Every sane platform has a to-network-endian (big) and a to-native/host-endian function. So every programmer can easily produce a big-endian number on any sane platform without using non-standard APIs.
Now, some protocols use a magic number to be endian agnostic. This is a decent idea for many things like filesystems. Detect the endianness and transparently do endian conversion on the fly. This allows better performance for the poor oppressed little endian masses. :)
Little endian and big endian are just as good from a numerical perspective of course.
However, I will not say the same about VAX endian. :)
> Big endian is more normal, since it is closer to how we read numbers in real life
Well, at least for people who read and write from left to right.
While what you say is true on some level, the left-to-right placement of bits from (0..n) _could_ be argued to be a choice as well.
Arabic numerals do put the most significant place leftmost. So I suppose that makes them at least Left-to-right big-endian :-P. (Suddenly I wonder about languages that read right to left with left-to-right big-endian numbers embedded.)
I could see arguing that zero padding / cast aliasing of larger numbers on little endian better fits our simplistic programming beliefs: that pointers are all alike, just addresses into memory, which is just an array of bytes; that pointers are really just variables; and that I can add small numbers to large numbers and nothing magic happens when I read 8 bits vs. 16 bits from the same memory / variable / pointer.
So, like reading two bytes and four bytes ...
( Left to Right = 0..n like normal cs people )
( little endian -- no magic, the narrow read still works )
[ 10000000 ] = 1
[ 10000000 ] [ 00000000 ] = 1
( big endian -- the one-byte read no longer matches )
[ 00000000 ] = 0
[ 00000000 ] [ 00000001 ] = 1
So, I am _not_ arguing LE is better than BE or vice versa. I don't think it matters at all. Finite arithmetic in base 2 is not like a normal number we write down, IMO.
If anything, paper is mostly read like a serial protocol within a page, and not really anything like endianness in a CPU or in memory.
So in short, my brain isn't little or big endian
I would guess that the zeroth cut is on submitters who didn't satisfy all the submission requirements, including the boring boilerplate, and the exact section-by-section requirements of the algorithm specification & supporting documentation, and so on. Anyone who sent something in that doesn't conform to all the published requirements will have their proposal returned as "non-responsive" to the call for proposals. It won't even be seen by an actual cryptologist.
Interestingly, I see that in the "Reference Platform and Compiler" section of the requirements, the following text appears:
"Reference Platform: Wintel personal computer, with an Intel Core 2 Duo Processor, 2.4GHz clock speed, 2GB RAM, running Windows Vista Ultimate 32-bit (x86) and 64-bit (x64) Edition."
"Wintel"? Someone has a sense of humor. That's not an industry term, but rather a bit of derisive jargon that tends to leave Intel engineers spluttering about "typos".
Note however that Arabic numerals, being Arabic, were originally developed in a R-L setting and are therefore historically LE.
I don't understand if that's what bitmonger said, but little-endian has the nice property that given a pointer p to a 64-bit value, (uint32_t)*p == *(uint32_t *)p. (Well, it's not strictly valid C, but you get the idea.) That's why porting big-endian code to little-endian is less likely to introduce bugs (except for programs that do not use ntohl/ntohs properly) than porting little-endian code to big-endian.
Arabic numerals actually originate from south Asia where the script is from left to right. So historically they are what, big-endian*?
* Yes, I notice the pun and I'm trying to avoid it. Gah!
Arabic numerals were in fact invented in India, where native scripts are L-R. Their orientation has been kept in Arab despite the R-L orientation of Arabic script and the big-endianess of spoken arabic.
Interestingly, German is little-endian for numbers smaller than 100 (98 is said 8-und-90), a construction which has survived in English for the "teens" (nineteen=9+10, and my English is bad enough to let me sometimes confuse nineteen and ninety).
Fred, who finds it natural to say 4-20-10-8 to pronounce 98
If you have to care about endianness in C, then you're doing something fundamentally wrong. Most operations work at the value level, which has a well-defined binary representation; as long as you avoid working on the memory storage, you get full portability.
E.g. instead of sending an int over the network with:
ui32 = htonl(ui32);
SendBytes((unsigned char *)&ui32, 4);
you can do:
unsigned char a[4];
a[0] = (ui32 >> 24) & 0xff;
a[1] = (ui32 >> 16) & 0xff;
a[2] = (ui32 >> 8) & 0xff;
a[3] = ui32 & 0xff;
SendBytes(a, 4);
which is a bit more code (but you put this in a function, anyway, instead of sprinkling portability-sensitive code everywhere, right?) and a bit less efficient (which doesn't matter for slow network communication, first prove the bottleneck by measurement before optimizing it), but it has well defined behaviour and will work on any platform with a standard-conforming compiler regardless of memory storage and padding.
@pb, yes your very clear explanation is what I was getting at... :)
My point was that the difference is mostly immaterial. Neither one is really more natural, IMO.
Why do physicists draw right handed 3-d graphs and mathematicians draw left handed 3-d graphs?
Because it makes no difference _except_ some people want things done their way or at least want a reason better than: "'cause we do", but sometimes: "'cause we do", is the only real argument. :P
I feel much better about not making my submission now that I know there were a lot of amateur submissions.
On the bright side, I get to feel clever for seemingly stumbling in a direction that smart folks thought was promising -- I was building on Daniel Bernstein's Salsa20, so, like Skein, I was working with a minimalist (and parallelizable) round function with only additions, XORs, and rotations, but fast diffusion, lots of rounds, and a large block size (in my case, 64 32-bit words), and I was building a combined stream/hash module from it.
And, without actually submitting it, I don't have to deal with the pesky reality that, if it wasn't ripped up summarily, the best *possible* outcome would have been that I'd design the awkward little brother of Skein -- less flexible in its mode of operation (no tree mode, only one block size, etc.), maybe slower (I was banking on processing four 32-bit words at a time using SSE2 instructions, but notably, it takes four instructions to rotate in SSE2), and probably less secure (certainly with much less analysis in the submission paper) -- and I'd mostly owe whatever redeeming graces it had to Salsa20's design, not to original innovations of mine.
Here's more background on endianness of bytes:
Most architectures are bitwise little endian even if they're bytewise big endian (IBM being an obvious exception). Bitwise little endian makes sense since bit N is 2^N. Also, bit 0 is always bit 0, unlike IBM where bit 0 in 32 bit mode is bit 32 in 64-bit mode and the LSB is either bit 31 or 63 depending on mode.
French, I take it? 98 is quatre-vingt-dix-huit (four-twenty-ten-eight) in French, and that's the only language I know of that uses that construction.
As a native English speaker who took both French and German in high school, and is doing some Japanese now, the spoken forms of numbers seem to be among the least constant of constructions across languages.
Back a bit more on topic, the last paragraph of the second quote reminds me of DES, where a level of bit-rearrangement was added to the spec for no reason other than to make software implementations slower than hardware implementations. It added nothing to the cryptographic strength of the algorithm, but bit rearrangement is O(0) in hardware and O(n) in software...
@Randall: you submitted a hash function based on Salsa20? Note that BLAKE is based on Salsa20's variant ChaCha:
The Perl bindings are available on CPAN: http://search.cpan.org/dist/Digest-Skein
I've included two patches; one to fix building on Win32, another for Dragonfly (I don't know if the Dragonfly one works -- CPAN testers haven't tested it at the time of writing).
More information and links to the patches: http://radek.cc/2008/11/16/digest-skein/
Note: the code in NIST/CD/KAT_MCT/src has not been updated in Skein_NIST_CD_111008.zip; nor have the *MsgKat*.txt files.
@Bryan: “Four score and seven years ago...”
It's not really modern English, though.
@jpa: I didn't submit my design, thankfully -- most likely it would have ended up on the scrap heap of badly broken amateur submissions (I'm not a cryptographer or mathematician), and even if it didn't I'd merely have achieved a less-flexible, less-efficient, less-well-analyzed cousin of BLAKE or Skein.
The design, such as it is, is here: http://twotwotwo.googlepages.com/aperture . It's really only loosely derived from Salsa20; check it out if you're curious. It's probably badly breakable.
Excellent work designing and analyzing BLAKE and producing C, VHDL, *and* PIC assembler with a four-person team!
No promises, but I might be able to come up with an SSE2 implementation of BLAKE-64 (and/or Skein or other functions). It seems like that could especially help with 32-bit-mode performance, because SSE can do 64-bit operations (often on two words at once) regardless of what mode the processor is in.
@Endo and the Endians:
Your point is right... it's the reason I entertain Big Endian as the "right" order. Big Endian is consistent, while little endian always has a discontinuity around the byte level.
To demonstrate, let's count up in hex, starting with bit 0.
Now, for little endian to be consistent, the little endian form should be
but it's not. Since the bits in the byte are always in big endian order, you get
Not as clear, is it?
On a similar note, I found that when you use multicharacter literals as values for integers, C/C++ seems to assume big endian:
int x = 'reco';
const char* d = (const char*)&x;
printf("%.4s\n", d);
That prints 'orec' when I tried it.
Bruce - there's a typo on page 39 (section 8.1, under "variable internal state"):
"Skein-256, the low-end version, which we consider more that adequate [...]"
That should be "more than adequate".
(I wish I could offer more meaningful feedback than that, of course, but while I find it quite interesting to read about cryptography, I don't actually know anything about it myself, so I can't. Alas, poor Yorick^WMuffin.)
I thought little endian is more "natural" hardware-wise because lower memory address means lesser significant bytes, just like lower "bits" means lesser significant bits. The problem is, when the memory content is displayed for people to see, it's displayed in "wrong order" so people tend to think big endian is more natural.
For example, a 32 bits number in little endian, if inside the range of a 16 bits number, can be read as a 16 bit number at the same address. It's not possible if it's in big endian.
@ Ping-Che Chen,
"For example, a 32 bits number in little endian, if inside the range of a 16 bits number, can be read as a 16 bit number at the same address. It's not possible if it's in big endian."
You are of course assuming that you are dealing with the limited case of unsigned ints...
There are three number format things to consider:
1. The location of the sign bit.
2. Where the "point" is (i.e. Int / Float / Fixed Point storage).
3. If an exponent is included and how.
That is Unsigned Ints are of the format,
Fixed point (as used in DSPs),
All of which makes BE-v-LE an open argument when number representation is taken into account.
On top of this there is "byte alignment", and whether negative numbers are ones' or twos' complement, making things harder.
Then there is the actual data width; not all CPUs are 8/16/32/64 bits, some are 12 bits or other widths, and some CPUs have instructions for BCD or other data-specific handling (i.e. to deal with AD/DA converters etc.)...
All in all, it is one of the reasons that the C spec makes no guarantees on number conversions except in very specific cases.
@Randall: you'll regret not submitting it, when a 15-year-old had the stones to do so. It would have cost you nothing to see if your design would have survived.
@rubbernecker: the kid might be on the bright side of ordinary but he submitted something. what did you do?
"On a similar note, I found that when you use multicharacter literals as values for integers, C/C++ seems to assume big endian:
int x = 'reco';
const char* d = (const char*)&x;
That prints 'orec' when I tried it."
That's because C/C++ doesn't define how it is stored. It will depend on the optimisation of the C/C++ compiler and the hardware you are using.
It is funny. When people debate about endianness, there is always a mention of "left" and "right". Apparently, the way we write should influence the way data are organized in a computer.
"It is funny. When people debate about endianness, there is always a mention of "left" and "right". Apparently, the way we write should influence the way data are organized in a computer."
I understand why you might feel that way but data without interpretation is noise.
"In signal processing or computing it can be considered data without meaning;"
Strictly speaking there is no data in computer hardware.
"This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables."
Hardware people are the ones who build the devices that do the measuring. These people need to understand how their designs work and so they lean on their worldview to help them more easily understand their creations and be assured of their correct operation.
That correct operation allows software people the illusion that they are performing well defined operations on data. Like the hardware people, software people leverage their worldview to ensure their creations are also correct.
So people are required at multiple levels to understand what computers are doing and are thus inexorably linked to the interpretation of data. As long as this is true, customs will also be tied to the interpretation of data.
@christopher: Not true that "it would have cost me nothing" to submit, because near the end of October I was deciding between doing loads of programming to support a US presidential campaign (my job) and doing loads of paper-writing and coding to produce a submission CD. (Alas, an idea and a complete submission are two different things.) I narrowly decided to do my job.
Candidly, I'll give you that if I'd started putting together materials earlier, or just skimped on my job, it probably would have been great fun to join the bake-off. It'd make a much better story for comment threads, anyway. :)
But I also believe what I said above: even if sound, my idea wouldn't've contributed much to finding SHA-3; other submitters covered the ground I was aiming to cover, and they did some things better. The number and content of the submissions helped convince me of this.
Instead of relitigating all that, though, maybe I can do something productive now, like contribute fast implementations of some of these algorithms. And work on that day job.
Somebody already mentioned this, but I'm late on catching up and it bears repeating:
>One exception to this is Skein, created by several
> well-known cryptographers and noted pundit Bruce Schneier.
Oh, did I laugh at this. Bruce, does this mean that you've graduated from being a mere cryptographer and arrived into the big world of punditry, or you've jumped the shark?
Or was Graham just giving Bruce a tweak in response to some Security Guru in-joke? Is there a "bar at a crypto conference" backstory hidden in here?
I was also wondering why Skein performs so much better than the other SHA-3 candidates. I see mainly two reasons for it:
- less internal state, because it's not a wide-pipe design, thus requiring fewer memory accesses (even though those would almost entirely be served from the L1 cache) since more variables can be held in registers
- no s-box lookups. Many other crypto algorithms have plenty, and, what is especially bad, mandatory consecutive s-box lookups
I want to focus on the latter one. Even though s-box lookups should be dismissed purely due to the fact that they allow timing-based attacks, there are still plenty of uses for them outside of cryptography.
(getting quick answers for simple approximations or quick starting values for fixed-point/newton-styled iterations, especially when doing multidimensional ones. Take the insidious trigonometric lookup for 3D-rotations as an example. There are plenty of others)
In fact, they are so useful, that I think there should be a special processor extension that allows it to perform multiple, but rather small-tabled and simple constant-timed(like real time constant, required for the security of cryptography) lookups in parallel. Like a 4|8 x 8bit -> 16|32bit lookup. The required tables should be stored in a superfast RAM (like SRAM), comparable to onchip L1-cache memory, that unlike L1-caches must be visible and accessible. (marginal note: should all of the paralleled lookups work on the same table or should each basic lookup have its own table?)
The biggest engineering challenge in implementing this in current x86-like processors would be allowing complex instructions to access some kind of on-chip local storage besides the regular registers, something that has never been exposed in the x86 interface.
Corollary: fortunately, this kind of access wouldn't require much of a change to a Cell SPE-like design, where one has direct access to the local storage per se; it would only require some new instructions to make it work a little faster by eliminating some (actually quite a lot of) housekeeping shift-and-mask instructions.
I found an inconsistency in the test vectors of the new Skein download. The file ShortMsgKAT_512.txt still contains the old (1.0) results while the files skein_golden_kat.txt and skein_golden_kat_short.txt contain the new (1.1) results. Example: the digest for len = 0 in ShortMsgKAT_512.txt starts with 5F, while the digest in the file skein_golden_kat.txt it starts with D3 (:Skein-512: 512-bit hash, msgLen = 0 bits, data = 'zero'). Could you have a look at that please?
I am working together with Maarten Bodewes on Java versions of Skein. I wrote a simple (one class) and fast implementation of Skein 512-512, while he wrote a complete and modular implementation.
FYI: Doug Whiting fixed the inconsistency in the test vectors.
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.