[PATCH 4/5] aria-avx512: small optimization for aria_diff_m
Taehee Yoo
ap420073 at gmail.com
Wed Feb 22 13:07:43 CET 2023
Hi Jussi,
On 2023-02-21 2:38 AM, Jussi Kivilinna wrote:
> Hello,
>
> On 20.2.2023 12.54, Taehee Yoo wrote:
>> On 2/19/23 17:49, Jussi Kivilinna wrote:
>>
>> Hi Jussi,
>> Thank you so much for this optimization!
>>
>> I tested this optimization in the kernel.
>> It works very well.
>> In my machine(i3-12100), it improves performance ~9%, awesome!
>
> Interesting.. I'd expect alderlake to behave similarly to tigerlake.
> Did you test with version that has unrolled round functions?
>
> In libgcrypt, I changed from round unrolling to using loops in order
> to reduce code size and to allow code to fit into uop-cache. Maybe
> speed increase happens since vpternlogq reduces code-size for unrolled
> version enough and algorithm fits into i3-12100's uop-cache, giving
> the extra performance.
>
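As a quick sanity check (not part of the patch, just an illustrative model), the vpternlogq immediate 0x96 can be verified to encode exactly a three-way XOR: for each 3-bit input pattern, the instruction selects the corresponding bit of the immediate as its per-bit result.

```python
# vpternlogq computes, per bit position, the imm8 bit at index
# (a << 2) | (b << 1) | c, where a/b/c are the corresponding bits of
# the three source operands. 0x96 = 0b10010110 is the truth table of
# a ^ b ^ c.
IMM8 = 0x96

def ternlog_bit(a, b, c, imm8=IMM8):
    """Result bit for one lane position of vpternlogq."""
    return (imm8 >> ((a << 2) | (b << 1) | c)) & 1

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert ternlog_bit(a, b, c) == a ^ b ^ c
print("imm8 0x96 matches the 3-way XOR truth table")
```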
After your response, I retested it and found that my earlier benchmark
data was wrong.
When I first implemented aria-avx512, the benchmark result was:
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1504 cycles (1024 bytes)
tcrypt: 1 operation in 4595 cycles (4096 bytes)
tcrypt: 1 operation in 1763 cycles (1024 bytes)
tcrypt: 1 operation in 5540 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1502 cycles (1024 bytes)
tcrypt: 1 operation in 4615 cycles (4096 bytes)
tcrypt: 1 operation in 1759 cycles (1024 bytes)
tcrypt: 1 operation in 5554 cycles (4096 bytes)
But the current result (without your optimization) looks like this:
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1443 cycles (1024 bytes)
tcrypt: 1 operation in 4396 cycles (4096 bytes)
tcrypt: 1 operation in 1683 cycles (1024 bytes)
tcrypt: 1 operation in 5368 cycles (4096 bytes)
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1458 cycles (1024 bytes)
tcrypt: 1 operation in 4416 cycles (4096 bytes)
tcrypt: 1 operation in 1723 cycles (1024 bytes)
tcrypt: 1 operation in 5358 cycles (4096 bytes)
And after applying your optimization, the result is:
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1388 cycles (1024 bytes)
tcrypt: 1 operation in 4107 cycles (4096 bytes)
tcrypt: 1 operation in 1595 cycles (1024 bytes)
tcrypt: 1 operation in 5011 cycles (4096 bytes)
tcrypt: testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1379 cycles (1024 bytes)
tcrypt: 1 operation in 4163 cycles (4096 bytes)
tcrypt: 1 operation in 1603 cycles (1024 bytes)
tcrypt: 1 operation in 5098 cycles (4096 bytes)
So the 9% performance gap I mentioned is actually wrong.
I don't know why the results changed... in any case, this optimization
increases performance by 5~7%.
I also tested both the looped and unrolled variants, but I couldn't
measure any performance gap between them.
I don't have enough knowledge about the uop-cache, so I can't provide
a useful analysis focused on it.
Sorry that the previous benchmark result was wrong.
Thank you so much!
Taehee Yoo
> -Jussi
>
>> It will be really helpful to the kernel side aria-avx512 driver for
>> improving performance.
>>
>> > * cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
>> > 3-way XOR operation.
>> > ---
>> >
>> > Using vpternlogq gives small performance improvement on AMD Zen4.
>> > With Intel tiger-lake speed is the same as before.
>> >
>> > Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
>> >
>> > Before:
>> >  ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>> >      ECB enc    |     0.204 ns/B      4682 MiB/s     0.957 c/B      4700
>> >      ECB dec    |     0.204 ns/B      4668 MiB/s     0.960 c/B      4700
>> >      CTR enc    |     0.212 ns/B      4509 MiB/s     0.994 c/B      4700
>> >      CTR dec    |     0.212 ns/B      4490 MiB/s     0.998 c/B      4700
>> >
>> > After (~3% faster):
>> >  ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
>> >      ECB enc    |     0.198 ns/B      4812 MiB/s     0.932 c/B      4700
>> >      ECB dec    |     0.198 ns/B      4824 MiB/s     0.929 c/B      4700
>> >      CTR enc    |     0.204 ns/B      4665 MiB/s     0.961 c/B      4700
>> >      CTR dec    |     0.206 ns/B      4631 MiB/s     0.968 c/B      4700
>> >
>> > Cc: Taehee Yoo <ap420073 at gmail.com>
>> > Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
>> > ---
>> > cipher/aria-gfni-avx512-amd64.S | 16 ++++++----------
>> > 1 file changed, 6 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S
>> > index 849c744b..24a49a89 100644
>> > --- a/cipher/aria-gfni-avx512-amd64.S
>> > +++ b/cipher/aria-gfni-avx512-amd64.S
>> > @@ -406,21 +406,17 @@
>> > vgf2p8affineinvqb $0, t2, y3, y3; \
>> > vgf2p8affineinvqb $0, t2, y7, y7;
>> >
>> > -
>> > #define aria_diff_m(x0, x1, x2, x3, \
>> > t0, t1, t2, t3) \
>> > /* T = rotr32(X, 8); */ \
>> > /* X ^= T */ \
>> > - vpxorq x0, x3, t0; \
>> > - vpxorq x1, x0, t1; \
>> > - vpxorq x2, x1, t2; \
>> > - vpxorq x3, x2, t3; \
>> > /* X = T ^ rotr(X, 16); */ \
>> > - vpxorq t2, x0, x0; \
>> > - vpxorq x1, t3, t3; \
>> > - vpxorq t0, x2, x2; \
>> > - vpxorq t1, x3, x1; \
>> > - vmovdqu64 t3, x3;
>> > + vmovdqa64 x0, t0; \
>> > + vmovdqa64 x3, t3; \
>> > + vpternlogq $0x96, x2, x1, x0; \
>> > + vpternlogq $0x96, x2, x1, x3; \
>> > + vpternlogq $0x96, t0, t3, x2; \
>> > + vpternlogq $0x96, t0, t3, x1;
>> >
>> > #define aria_diff_word(x0, x1, x2, x3, \
>> > x4, x5, x6, x7, \
>>
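For reference, the old and new aria_diff_m sequences above can be modeled on plain integers to confirm they compute the same outputs; this is just a sketch, with Python `^` standing in for the per-lane vpxorq / vpternlogq ($0x96 = 3-way XOR) operations, and the helper names are mine:

```python
import random

def diff_m_old(x0, x1, x2, x3):
    # Original sequence: eight vpxorq plus one vmovdqu64.
    t0 = x3 ^ x0
    t1 = x0 ^ x1
    t2 = x1 ^ x2
    t3 = x2 ^ x3
    x0 = x0 ^ t2          # x0 = x0 ^ x1 ^ x2
    t3 = t3 ^ x1          # t3 = x1 ^ x2 ^ x3
    x2 = x2 ^ t0          # x2 = x0 ^ x2 ^ x3
    x1 = x3 ^ t1          # x1 = x0 ^ x1 ^ x3 (x3 still original here)
    x3 = t3
    return x0, x1, x2, x3

def diff_m_new(x0, x1, x2, x3):
    # Patched sequence: two vmovdqa64 saves plus four 3-way XORs.
    t0, t3 = x0, x3
    x0 ^= x2 ^ x1
    x3 ^= x2 ^ x1
    x2 ^= t0 ^ t3
    x1 ^= t0 ^ t3
    return x0, x1, x2, x3

for _ in range(1000):
    xs = [random.getrandbits(64) for _ in range(4)]
    assert diff_m_old(*xs) == diff_m_new(*xs)
print("old and new aria_diff_m sequences agree")
```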
>> Thank you so much!
>> Taehee Yoo
>>
>