[PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather

Mon Aug 14 18:24:05 CEST 2023

On 14.8.2023 5.47, Jacob Bachmeyer wrote:
> Jussi Kivilinna wrote:
>> * cipher/twofish-avx2-amd64.S (do_gather): New.
>> (g16): Switch to use 'do_gather' instead of VPGATHER instruction.
>> (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack
>> for 'do_gather'.
>> -- 
>>
>> As VPGATHER is now slow on majority of CPUs (because of "Downfall"),
>> switch twofish-avx2 implementation to use manual memory gathering
>> instead.
>>
>> Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
>> microcode):
>>
>> Before:
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      7.00 ns/B     136.3 MiB/s     28.62 c/B      4089
>>         ECB dec |      7.00 ns/B     136.2 MiB/s     28.64 c/B      4090
>>
>> After (~3.1x faster):
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4090
>>         ECB dec |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4089
>>
>> Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):
>>
>> Before:
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      1.91 ns/B     499.0 MiB/s      8.98 c/B      4700
>>         ECB dec |      1.90 ns/B     500.7 MiB/s      8.95 c/B      4700
>>
>> After (~6% faster):
>>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>>         ECB enc |      1.78 ns/B     534.7 MiB/s      8.38 c/B      4700
>>         ECB dec |      1.79 ns/B     533.7 MiB/s      8.40 c/B      4700
> 
> Obviously, do_gather is bouncing the data around in the cache, but the fact that this change is a performance improvement on a processor not affected by "Downfall" strongly suggests that using VPGATHER may have been suboptimal from the start.  Can you do a third test on the i3-1115G4 with older microcode?  Would this patch have actually improved performance in all cases?
> 

VPGATHER used to be faster than manual gather on Intel CPUs starting with Skylake. Old results on this i3-1115G4 show ~6.5 c/B for Twofish-CTR. Interestingly, older Intel CPUs with AVX2 had a slower VPGATHER implementation, and those are not affected by "Downfall". On AMD CPUs, VPGATHER has been slower, getting a tiny bit faster from generation to generation. With Zen4, gather performance was finally good enough that the twofish-avx2 implementation beat the twofish-3way-asm implementation, so I enabled the HWF_INTEL_FAST_VPGATHER HW-feature for AMD Zen4+ CPUs.
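
For readers unfamiliar with the term, "manual gather" here means replacing one vector gather instruction with a series of ordinary indexed loads. The following is only a rough C sketch of that idea, not the 'do_gather' assembly from the patch: the function name manual_gather8, the 256-entry table and the 8-lane width are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): fetch 8 table entries,
 * each selected by the low byte of the corresponding index word.
 * A single VPGATHERDD would do the same 8 lookups in one instruction. */
void manual_gather8(const uint32_t table[256],
                    const uint32_t idx[8],
                    uint32_t out[8])
{
    assert(table != NULL && idx != NULL && out != NULL);

    for (size_t i = 0; i < 8; i++) {
        /* One plain scalar load per lane; masking keeps the index
         * inside the 256-entry table. */
        out[i] = table[idx[i] & 0xff];
    }
}
```

Which variant is faster depends on the microarchitecture and microcode, as the benchmark numbers in the commit message show.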

> Was using VPGATHER a waste of time the whole time?  Do we need to be more skeptical about new SSE/AVX/etc. opcodes in the future?
> 

I don't think so; VPGATHER really was quite a bit faster on Intel Skylake+ CPUs. As for being skeptical, I think the problem is not so much with specific opcodes as with optimizations that have been, or get, baked into microarchitectures.

-Jussi

> 
> -- Jacob
>