To try to squeeze out a bit more performance I attempted some compiler-style optimizations by hand. Unfortunately, due to the sheer complexity of the algorithm, I was unable to find any other logic to simplify. I tried some loop unrolling so that the compiler has a little less work to do; some examples are below:
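To illustrate the idea, here is a minimal sketch of unrolling a finalcount[]-style loop. It assumes the loop has the shape commonly seen in public-domain SHA-1 code, where a 64-bit message length stored as two 32-bit words is written out big-endian. The `demo_ctx` struct and all three function names are hypothetical stand-ins made up for this example, not the actual sha1.c code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the part of the SHA-1 context that holds
 * the message length as two 32-bit words (assumed layout). */
typedef struct {
    uint32_t count[2];
} demo_ctx;

/* Looped form: bytes 0..3 come from count[1], bytes 4..7 from count[0],
 * most significant byte first. */
void finalcount_loop(const demo_ctx *ctx, unsigned char out[8])
{
    unsigned int i;
    for (i = 0; i < 8; i++) {
        out[i] = (unsigned char)((ctx->count[(i >= 4 ? 0 : 1)]
                                  >> ((3 - (i & 3)) * 8)) & 255);
    }
}

/* Hand-unrolled form: the same eight stores with the index arithmetic
 * worked out ahead of time. */
void finalcount_unrolled(const demo_ctx *ctx, unsigned char out[8])
{
    out[0] = (unsigned char)((ctx->count[1] >> 24) & 255);
    out[1] = (unsigned char)((ctx->count[1] >> 16) & 255);
    out[2] = (unsigned char)((ctx->count[1] >> 8) & 255);
    out[3] = (unsigned char)(ctx->count[1] & 255);
    out[4] = (unsigned char)((ctx->count[0] >> 24) & 255);
    out[5] = (unsigned char)((ctx->count[0] >> 16) & 255);
    out[6] = (unsigned char)((ctx->count[0] >> 8) & 255);
    out[7] = (unsigned char)(ctx->count[0] & 255);
}

/* Returns 1 if both forms produce identical bytes for ctx, 0 otherwise. */
int finalcount_forms_agree(const demo_ctx *ctx)
{
    unsigned char a[8], b[8];
    finalcount_loop(ctx, a);
    finalcount_unrolled(ctx, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `finalcount_forms_agree` on a few contexts is a cheap sanity check before running the full test vectors. The unrolled form is four times as many lines for the same eight stores, which is the length cost of doing this by hand.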


I made a graph to demonstrate the minute differences this makes in the test vectors below:

The gain is at most a few milliseconds, and it comes only from the finalcount[] array: the digest array produces errors if it is not filled in a loop, as do the other for loops in the code. To test this I simply altered the sha1.c code and ran the makefile to see whether the vectors passed or failed.
As mentioned, this is a compiler optimization; in other words, the compiler already does it for us, especially at the -O3 level where the benchmarking was done. I would not normally recommend pushing this change upstream, given the insignificant time saving and the complexity and length of the code that would need to be written. It does make a slight difference that should be noted, so I recommend it only if the absolute most performance must be squeezed out of the function. At that point, however, inline assembler should be considered.