The main source of inspiration was Bosselaers, Govaerts, and Vandewalle's paper Fast Hashing on the Pentium [BGV]. Bosselaers' note Even Faster Hashing on the Pentium [BOS] encouraged me to pursue more deeply an independent discovery I had made for speeding up the hash functions even further.
Peter Gutmann helped me nail down a better interface to the C functions, and confirmed that my idea of using small work buffers on the stack to improve speed should be OK from a security point of view. An e-mail correspondence with Peter inspired me to make major portability improvements to this library, and to include pregenerated assembly files. Peter also suggested a workaround for a bug that I discovered in GNU's assembler, as. (This bug turns out to be present in MSVC++ version 4.0 as well, but has been worked around.)
Thanks to Peter Gutmann and Eric Young for pointing me in the direction of Intel's program VTune. I am trying to time my 30-day free evaluation period to coincide with 30 days when I will have time to look at the last few percent of difference from Bosselaers' implementation.