[PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather

Mon Aug 14 04:47:44 CEST 2023

Jussi Kivilinna wrote:
> * cipher/twofish-avx2-amd64.S (do_gather): New.
> (g16): Switch to use 'do_gather' instead of VPGATHER instruction.
> (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack
> for 'do_gather'.
> --
>
> As VPGATHER is now slow on majority of CPUs (because of "Downfall"),
> switch twofish-avx2 implementation to use manual memory gathering
> instead.
>
> Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
> microcode):
>
> Before:
>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>         ECB enc |      7.00 ns/B     136.3 MiB/s     28.62 c/B      4089
>         ECB dec |      7.00 ns/B     136.2 MiB/s     28.64 c/B      4090
>
> After (~3.1x faster):
>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>         ECB enc |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4090
>         ECB dec |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4089
>
> Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):
>
> Before:
>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>         ECB enc |      1.91 ns/B     499.0 MiB/s      8.98 c/B      4700
>         ECB dec |      1.90 ns/B     500.7 MiB/s      8.95 c/B      4700
>
> After (~6% faster):
>  TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>         ECB enc |      1.78 ns/B     534.7 MiB/s      8.38 c/B      4700
>         ECB dec |      1.79 ns/B     533.7 MiB/s      8.40 c/B      4700
>

Obviously, do_gather is bouncing the data around in the cache, but the
fact that this change improves performance on a processor not affected
by "Downfall" strongly suggests that using VPGATHER may have been
suboptimal from the start.  Could you run a third test on the i3-1115G4
with older, pre-mitigation microcode?  Would this patch have improved
performance in all cases, even there?
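
To make the comparison concrete, here is a rough C sketch of what I
understand "manual gather" to mean.  It is not the patch's do_gather
macro, which is amd64 assembly; manual_gather8, table and idx are names
made up for illustration, and the 8-lane, byte-indexed layout is an
assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): emulate an 8-lane, 32-bit
 * gather with one ordinary scalar load per lane, which is the result a
 * single VPGATHERDD would otherwise produce in hardware. */
static void manual_gather8(uint32_t out[8], const uint32_t *table,
                           const uint8_t idx[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = table[idx[i]];   /* plain load, no gather instruction */
}
```

Each lane is an ordinary load, so nothing here relies on the gather
instruction that the "Downfall" microcode slows down.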

Was VPGATHER ever worth using here at all?  Do we need to be more
skeptical about new SSE/AVX/etc. opcodes in the future?

-- Jacob