[PATCH] twofish-avx2-amd64: replace VPGATHER with manual gather
Jacob Bachmeyer
jcb62281 at gmail.com
Mon Aug 14 04:47:44 CEST 2023
Jussi Kivilinna wrote:
> * cipher/twofish-avx2-amd64.S (do_gather): New.
> (g16): Switch to use 'do_gather' instead of VPGATHER instruction.
> (__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack
> for 'do_gather'.
> --
>
> As VPGATHER is now slow on majority of CPUs (because of "Downfall"),
> switch twofish-avx2 implementation to use manual memory gathering
> instead.
>
> Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
> microcode):
>
> Before:
> TWOFISH |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
> ECB enc |      7.00 ns/B     136.3 MiB/s     28.62 c/B      4089
> ECB dec |      7.00 ns/B     136.2 MiB/s     28.64 c/B      4090
>
> After (~3.1x faster):
> TWOFISH |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
> ECB enc |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4090
> ECB dec |      2.20 ns/B     433.7 MiB/s      8.99 c/B      4089
>
> Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):
>
> Before:
> TWOFISH |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
> ECB enc |      1.91 ns/B     499.0 MiB/s      8.98 c/B      4700
> ECB dec |      1.90 ns/B     500.7 MiB/s      8.95 c/B      4700
>
> After (~6% faster):
> TWOFISH |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
> ECB enc |      1.78 ns/B     534.7 MiB/s      8.38 c/B      4700
> ECB dec |      1.79 ns/B     533.7 MiB/s      8.40 c/B      4700
>
Obviously, do_gather is bouncing the data around in the cache, but the
fact that this change is a performance improvement even on a processor
not affected by "Downfall" suggests that using VPGATHER was suboptimal
from the start. Can you run a third test on the i3-1115G4 with older
(pre-mitigation) microcode? Would this patch have improved performance
in all cases, even without the mitigation?
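
For anyone following along without the patch open, this is roughly what
I understand a "manual gather" to mean. It is only a sketch in C, not
the do_gather macro from twofish-avx2-amd64.S: the names manual_gather8
and LANES, the 8-lane width, and the 256-entry table are my assumptions
for illustration. Each index gets an ordinary scalar load, and the
results pass through a small scratch buffer (standing in for the stack
area the patch prepares) before being handed back as one block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8 /* assumed lane count: eight 32-bit lanes of one 256-bit register */

/* Hypothetical helper, not taken from the patch: emulate an 8-lane 32-bit
 * gather with plain scalar loads instead of a single VPGATHER instruction. */
static void manual_gather8(uint32_t out[LANES], const uint32_t table[256],
                           const uint32_t idx[LANES])
{
    uint32_t scratch[LANES]; /* stands in for the stack slot a vector reload would read */

    for (int lane = 0; lane < LANES; lane++) {
        /* One independent scalar load per lane; the mask keeps the index
         * inside the assumed 256-entry table. */
        scratch[lane] = table[idx[lane] & 0xffu];
    }

    /* In vector code this would be a single reload of the scratch area. */
    memcpy(out, scratch, sizeof scratch);
}

/* Self-check: the manual gather must agree with direct table lookups. */
static int manual_gather8_selfcheck(void)
{
    uint32_t table[256];
    const uint32_t idx[LANES] = {0, 255, 17, 3, 128, 64, 1, 200};
    uint32_t out[LANES];

    for (uint32_t i = 0; i < 256; i++)
        table[i] = i * 0x01010101u;

    manual_gather8(out, table, idx);

    for (int lane = 0; lane < LANES; lane++)
        assert(out[lane] == table[idx[lane]]);
    return 1;
}
```

The scalar loads are independent of each other, so an out-of-order core
can overlap them. That may be part of why the replacement is not slower
even on zen4, though I have not measured that myself.
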

Was VPGATHER never worth using in the first place? Do we need to be
more skeptical about new SSE/AVX/etc. opcodes in the future?
-- Jacob