From hello at knugi.com  Mon Dec 8 17:17:42 2025
From: hello at knugi.com (Knugi)
Date: Mon, 8 Dec 2025 17:17:42 +0100 (CET)
Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS
References: 
Message-ID: 

* src/fips.c: Use __declspec(thread) for MinGW32 thread-local context.

Signed-off-by: Knugi
---
 src/fips.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/fips.c b/src/fips.c
index d1aff8a5..d1fa3e03 100644
--- a/src/fips.c
+++ b/src/fips.c
@@ -75,9 +75,15 @@ struct gcry_thread_context {
 };

 #ifdef HAVE_GCC_STORAGE_CLASS__THREAD
+#ifdef __MINGW32__
+static __declspec(thread) struct gcry_thread_context the_tc = {
+  0, GCRY_FIPS_FLAG_REJECT_DEFAULT
+};
+#else
 static __thread struct gcry_thread_context the_tc = {
   0, GCRY_FIPS_FLAG_REJECT_DEFAULT
 };
+#endif
 #else
 #error libgcrypt requires thread-local storage to support FIPS mode
 #endif
--
2.52.0

-------------- next part --------------
A non-text attachment was scrubbed...
Name: DCO
Type: application/octet-stream
Size: 1270 bytes
Desc: not available
URL: 

From wk at gnupg.org  Mon Dec 8 18:52:47 2025
From: wk at gnupg.org (Werner Koch)
Date: Mon, 08 Dec 2025 18:52:47 +0100
Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS
In-Reply-To: (Knugi via Gcrypt-devel's message of "Mon, 8 Dec 2025 17:17:42 +0100 (CET)")
References: 
Message-ID: <87345kc13k.fsf@jacob.g10code.de>

On Mon, 8 Dec 2025 17:17, Knugi said:

> * src/fips.c: Use __declspec(thread) for MinGW32 thread-local context.

Why do you need this?  We are building Windows versions using mingw just
fine with the __thread attribute.


Shalom-Salam,

   Werner

--
The pioneers of a warless world are the youth that
refuse military service.  - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: 

From hello at knugi.com  Tue Dec 9 05:59:16 2025
From: hello at knugi.com (Knugi)
Date: Tue, 9 Dec 2025 05:59:16 +0100 (CET)
Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS
In-Reply-To: <87345kc13k.fsf@jacob.g10code.de>
References: <87345kc13k.fsf@jacob.g10code.de>
Message-ID: 

Hi Werner,

The previous configuration also compiled successfully on my side, but
caused incorrect runtime behavior, because it seems Windows failed to
properly initialize the thread-local variables defined using __thread.
More specifically, with __thread, libaacs could not be properly
initialized.

Thanks,
Knugi

Dec 9, 2025, 01:49 by wk at gnupg.org:

> On Mon, 8 Dec 2025 17:17, Knugi said:
>
>> * src/fips.c: Use __declspec(thread) for MinGW32 thread-local context.
>>
>
> Why do you need this?  We are building Windows versions using mingw just
> fine with the __thread attribute.
>
>
> Shalom-Salam,
>
> Werner
>
> --
> The pioneers of a warless world are the youth that
> refuse military service.  - A. Einstein
>

From wk at gnupg.org  Tue Dec 9 15:08:05 2025
From: wk at gnupg.org (Werner Koch)
Date: Tue, 09 Dec 2025 15:08:05 +0100
Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS
In-Reply-To: (Knugi's message of "Tue, 9 Dec 2025 05:59:16 +0100 (CET)")
References: <87345kc13k.fsf@jacob.g10code.de>
Message-ID: <874ipzpx2y.fsf@jacob.g10code.de>

On Tue, 9 Dec 2025 05:59, Knugi said:

> the thread-local variables defined using __thread. More specifically, with
> __thread, libaacs could not be properly initialized.

How did you notice that libgpg-error TLS was also not properly
initialized?
Did you notice this comment in gpg_err_init:

# ifdef DLL_EXPORT
  /* We always have a constructor and thus this function is called
     automatically.  Due to the way the C init code of mingw works,
     the constructors are called before our DllMain function is
     called.  The problem with that is that the TLS has not been setup
     and w32-gettext.c requires TLS.  To solve this we do nothing here
     but call the actual init code from our DllMain.  */

and later in DllMain:

    case DLL_PROCESS_ATTACH:
      tls_index = TlsAlloc ();
      if (tls_index == TLS_OUT_OF_INDEXES)
        return FALSE;
#ifndef _GPG_ERR_HAVE_CONSTRUCTOR
      /* If we have no constructors (e.g. MSC) we call it here.  */
      _gpg_w32__init_gettext_module ();
      _gpgrt_w32__init_utils ();
#endif
      /* fallthru.  */

?


Shalom-Salam,

   Werner

--
The pioneers of a warless world are the youth that
refuse military service.  - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: 

From wk at gnupg.org  Tue Dec 9 15:17:55 2025
From: wk at gnupg.org (Werner Koch)
Date: Tue, 09 Dec 2025 15:17:55 +0100
Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS
In-Reply-To: <874ipzpx2y.fsf@jacob.g10code.de> (Werner Koch via Gcrypt-devel's message of "Tue, 09 Dec 2025 15:08:05 +0100")
References: <87345kc13k.fsf@jacob.g10code.de> <874ipzpx2y.fsf@jacob.g10code.de>
Message-ID: <87zf7roi24.fsf@jacob.g10code.de>

On Tue, 9 Dec 2025 15:08, Werner Koch said:

> How did you notice that libgpg-error TLS was also not properly
> initialized?

Sorry, I was in libgpg-error hacking mode and thus mixed this up with
the new code in Libgcrypt.  Anyway, if there is a problem we need to
analyze the problem to see where both attribution methods differ.

If we use thread-local-storage (TLS) in Libgcrypt we may want to
delegate this to libgpg-error which is anyway a dependency.


Salam-Shalom,

   Werner

--
The pioneers of a warless world are the youth that
refuse military service.  - A. Einstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL: 

From gniibe at fsij.org  Fri Dec 12 03:56:59 2025
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Fri, 12 Dec 2025 11:56:59 +0900
Subject: [PATCH] w32: Fix use of GetProcAddress.
Message-ID: <028926d3edb52ac2ea6bfabf92748c323265a1d7.1765508194.git.gniibe@fsij.org>

* src/hwfeatures.c (_gcry_get_sysconfdir): Add a type cast.

--
GnuPG-bug-id: 7968
Signed-off-by: NIIBE Yutaka
---
 src/hwfeatures.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-w32-Fix-use-of-GetProcAddress.patch
Type: text/x-patch
Size: 566 bytes
Desc: not available
URL: 

From gniibe at fsij.org  Fri Dec 12 06:06:27 2025
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Fri, 12 Dec 2025 14:06:27 +0900
Subject: [PATCH] mpi:ec: Fix for use of ec_mulm_lli in _gcry_mpi_ec_get_affine.
Message-ID: <92bbe34514ee180c074b882d8459cdf6b873ba0c.1765515944.git.gniibe@fsij.org>

* mpi/ec.c (_gcry_mpi_ec_get_affine): Resize X and Y.
Add GCRYECC_FLAG_LEAST_LEAK flag.

--
Fixes-commit: aa089ec89badcd74817e5008c66036b1e28674f5
Reported-by: Kr0emer
Signed-off-by: NIIBE Yutaka
---
 mpi/ec.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-mpi-ec-Fix-for-use-of-ec_mulm_lli-in-_gcry_mpi_ec_ge.patch
Type: text/x-patch
Size: 1482 bytes
Desc: not available
URL: 

From hello at knugi.com  Sat Dec 13 15:33:12 2025
From: hello at knugi.com (Knugi)
Date: Sat, 13 Dec 2025 15:33:12 +0100 (CET)
Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS
In-Reply-To: <87zf7roi24.fsf@jacob.g10code.de>
References: <87345kc13k.fsf@jacob.g10code.de> <874ipzpx2y.fsf@jacob.g10code.de> <87zf7roi24.fsf@jacob.g10code.de>
Message-ID: 

I have delved deeper into the issue and found that it is caused by an
implicit runtime dependency when using __thread under MinGW.

The GNU implementation of __thread on Windows (or MinGW32) seems to
dynamically link to libwinpthread.dll and libgcc_s_seh-1.dll.  Since these
dependencies are not guaranteed to be present on standard Windows systems,
the OS loader fails to load the main library.  This is further confirmed
by the fact that static linking or manually deploying these runtime DLLs
resolves the issue.

The proposed change, which uses conditional compilation to replace
__thread with __declspec(thread) when __MINGW32__ is defined, provides
a viable solution. This approach is effective because __declspec(thread)
is a Windows-native API that leverages the OS's built-in TLS mechanism,
thereby eliminating the dependency on the aforementioned DLLs and ensuring
the library loads correctly on all Windows targets.

Thanks,
Knugi

Dec 9, 2025, 22:13 by wk at gnupg.org:

> On Tue, 9 Dec 2025 15:08, Werner Koch said:
>
>> How did you notice that libgpg-error TLS was also not properly
>> initialized?
>>
>
> Sorry, I was in libgpg-error hacking mode and thus mixed this up with
> the new code in Libgcrypt.  Anyway, if there is a problem we need to
> analyze the problem to see where both attribution methods differ.
>
> If we use thread-local-storage (TLS) in Libgcrypt we may want to
> delegate this to libgpg-error which is anyway a dependency.
>
>
> Salam-Shalom,
>
> Werner
>
> --
> The pioneers of a warless world are the youth that
> refuse military service.  - A. Einstein
>

From martin at martin.st  Sat Dec 13 19:13:05 2025
From: martin at martin.st (Martin Storsjö)
Date: Sat, 13 Dec 2025 20:13:05 +0200 (EET)
Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS
In-Reply-To: 
References: <87345kc13k.fsf@jacob.g10code.de> <874ipzpx2y.fsf@jacob.g10code.de> <87zf7roi24.fsf@jacob.g10code.de>
Message-ID: <3d283f7c-b8b5-14fb-e992-222a89436ed2@martin.st>

On Sat, 13 Dec 2025, Knugi via Gcrypt-devel wrote:

> The proposed change, which uses conditional compilation to replace
> __thread with __declspec(thread) when __MINGW32__ is defined, provides
> a viable solution. This approach is effective because __declspec(thread)
> is a Windows-native API that leverages the OS's built-in TLS mechanism,
> thereby eliminating the dependency on the aforementioned DLLs and ensuring
> the library loads correctly on all Windows targets.

Sorry, but I don't think this part is true.

The compiler uses its choice of mechanism for thread local variables,
regardless of the syntax for specifying that it is thread local.  (Either
"__thread", or "thread_local", etc.)  With GCC, this is emutls (which does
depend on libgcc, as you've noticed).  Clang in mingw mode defaults to
Windows native TLS, but by passing -femulated-tls, one can make it use
the same mechanism as GCC.
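
As a minimal sketch of that point (file and variable names made up for
illustration, assuming a stock mingw-w64 toolchain), the declaration
below is accepted identically by both compilers; only the compiler and
its flags decide whether it is lowered to emutls calls or to native TLS
slots:

/* tls-demo.c - illustrative only.  GCC targeting mingw lowers the
 * access below to __emutls_get_address() calls (hence the libgcc
 * dependency); Clang targeting mingw uses native TLS by default and
 * emutls only with -femulated-tls.  The keyword itself does not
 * select the mechanism. */
__thread int counter;

int
bump (void)
{
  return ++counter;
}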
Using "__declspec(thread)" in mingw mode doesn't do anything at all; with GCC it produces the following warning: tls.c:2:1: warning: ?thread? attribute directive ignored [-Wattributes] 2 | int __declspec(thread) var1; | ^~~ With Clang it produces the following warning: tls.c:2:16: warning: unknown attribute 'thread' ignored [-Wunknown-attributes] 2 | int __declspec(thread) var1; | ^~~~~~ To see this for yourselves, see https://gcc.godbolt.org/z/K3snhT1rc. With the upcoming GCC 16, it should be possible to use the Windows native TLS mechanism with GCC as well - but this is a choice that is made when GCC is configured and built; the syntax used for declaring the variable as thread local does not affect it. // Martin From gniibe at fsij.org Tue Dec 16 02:49:52 2025 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 16 Dec 2025 10:49:52 +0900 Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS In-Reply-To: <3d283f7c-b8b5-14fb-e992-222a89436ed2@martin.st> References: <87345kc13k.fsf@jacob.g10code.de> <874ipzpx2y.fsf@jacob.g10code.de> <87zf7roi24.fsf@jacob.g10code.de> <3d283f7c-b8b5-14fb-e992-222a89436ed2@martin.st> Message-ID: <871pkvi4an.fsf@haruna.fsij.org> Hello, Martin Storsj? wrote: > With the upcoming GCC 16, it should be possible to use the Windows native > TLS mechanism with GCC as well - but this is a choice that is made when > GCC is configured and built; the syntax used for declaring the variable as > thread local does not affect it. Thanks for the information. I didn't know that GCC 16 will come with possible Windows native TLS mechanism. The issue here is how TLS (by C language) works on Windows, and there are multiple ways. IIUC, by changing the code to "__declspec(thread)", it becomes non-thread-local (equivalent to remove "__thread"), hence, the problem of linkage at runtime looks gone superficially. As long as an application is not multi-threaded or it doesn't care about FIPS certification, it's OK. It might be a workaround in some situations. I think that it is the decision of a user who builds libgcrypt with whatever compiler with whatever options, until compilers with TLS stabilized. In the current situation (where thread local strage implementations differ among compilrs+compiler-options), a user should be careful about consequences like dependency to runtime libraries. If really needed, I'm open to introduce new code for Windows, using Windows native TLS mechanism in fips.c. I believe that the situation is better for POSIX systems. -- From gniibe at fsij.org Tue Dec 16 03:20:03 2025 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 16 Dec 2025 11:20:03 +0900 Subject: [PATCH] w32: Use __declspec(thread) for FIPS thread context TLS In-Reply-To: References: <87345kc13k.fsf@jacob.g10code.de> <874ipzpx2y.fsf@jacob.g10code.de> <87zf7roi24.fsf@jacob.g10code.de> Message-ID: <87y0n3gobw.fsf@haruna.fsij.org> Hello, Knugi wrote: > The GNU implementation of __thread on Windows (or MinGW32) seems to > dynamically link to libwinpthread.dll and libgcc_s_seh-1.dll. In my environment of Debian, we have two executable variants with two different threading models; POSIX threading model: x86_64-w64-mingw32-gcc-posix and Window threading model: x86_64-w64-mingw32-gcc-win32. With x86_64-w64-mingw32-gcc-win32, I found no use of libwinpthread.dll. I guess that you are using POSIX threading model, right? Please try with Window threading model. 
-- From jussi.kivilinna at iki.fi Sun Dec 21 11:58:48 2025 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 21 Dec 2025 12:58:48 +0200 Subject: [PATCH 3/4] aria-x86_64: fixes for CFI markings In-Reply-To: <20251221105849.384773-1-jussi.kivilinna@iki.fi> References: <20251221105849.384773-1-jussi.kivilinna@iki.fi> Message-ID: <20251221105849.384773-3-jussi.kivilinna@iki.fi> * cipher/aria-aesni-avx-amd64.S: Add missing CFI stack adjustments after pushq/popq. * cipher/aria-aesni-avx2-amd64.S: Likewise. * cipher/aria-gfni-avx512-amd64.S: Likewise. -- Signed-off-by: Jussi Kivilinna --- cipher/aria-aesni-avx-amd64.S | 10 ++++++++++ cipher/aria-aesni-avx2-amd64.S | 8 ++++++++ cipher/aria-gfni-avx512-amd64.S | 2 ++ 3 files changed, 20 insertions(+) diff --git a/cipher/aria-aesni-avx-amd64.S b/cipher/aria-aesni-avx-amd64.S index 2a88c1e7..29597488 100644 --- a/cipher/aria-aesni-avx-amd64.S +++ b/cipher/aria-aesni-avx-amd64.S @@ -1057,6 +1057,7 @@ _gcry_aria_aesni_avx_ecb_crypt_blk1_16: .Lecb_less_than_16: pushq %r8; + CFI_ADJUST_CFA_OFFSET(8); inpack_1_15_pre(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %rdx, %r8d); @@ -1064,6 +1065,7 @@ _gcry_aria_aesni_avx_ecb_crypt_blk1_16: call __aria_aesni_avx_crypt_16way; popq %rax; + CFI_ADJUST_CFA_OFFSET(-8); write_output_1_15(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %r11, %eax); @@ -1156,6 +1158,7 @@ __aria_aesni_avx_ctr_gen_keystream_16way: .Lctr_byteadd_full_ctr_carry: addb $16, 15(%r8); pushq %rcx; + CFI_ADJUST_CFA_OFFSET(8); movl $14, %ecx; 1: adcb $0, (%r8, %rcx); @@ -1163,6 +1166,7 @@ __aria_aesni_avx_ctr_gen_keystream_16way: loop 1b; 2: popq %rcx; + CFI_ADJUST_CFA_OFFSET(-8); jmp .Lctr_byteadd_xmm; .align 8 .Lctr_byteadd: @@ -1215,6 +1219,7 @@ _gcry_aria_aesni_avx_ctr_crypt_blk16: call __aria_aesni_avx_ctr_gen_keystream_16way; pushq %rsi; + CFI_ADJUST_CFA_OFFSET(8); movq %rdx, %r11; movq %rcx, %rsi; /* use stack for temporary store */ movq %rcx, %rdx; @@ -1223,6 +1228,7 @@ _gcry_aria_aesni_avx_ctr_crypt_blk16: call __aria_aesni_avx_crypt_16way; popq %rsi; + CFI_ADJUST_CFA_OFFSET(-8); vpxor (0 * 16)(%r11), %xmm1, %xmm1; vpxor (1 * 16)(%r11), %xmm0, %xmm0; vpxor (2 * 16)(%r11), %xmm3, %xmm3; @@ -1358,6 +1364,7 @@ _gcry_aria_gfni_avx_ecb_crypt_blk1_16: .Lecb_less_than_16_gfni: pushq %r8; + CFI_ADJUST_CFA_OFFSET(8); inpack_1_15_pre(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %rdx, %r8d); @@ -1365,6 +1372,7 @@ _gcry_aria_gfni_avx_ecb_crypt_blk1_16: call __aria_gfni_avx_crypt_16way; popq %rax; + CFI_ADJUST_CFA_OFFSET(-8); write_output_1_15(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %r11, %eax); @@ -1399,6 +1407,7 @@ _gcry_aria_gfni_avx_ctr_crypt_blk16: call __aria_aesni_avx_ctr_gen_keystream_16way pushq %rsi; + CFI_ADJUST_CFA_OFFSET(8); movq %rdx, %r11; movq %rcx, %rsi; /* use stack for temporary store */ movq %rcx, %rdx; @@ -1407,6 +1416,7 @@ _gcry_aria_gfni_avx_ctr_crypt_blk16: call __aria_gfni_avx_crypt_16way; popq %rsi; + CFI_ADJUST_CFA_OFFSET(-8); vpxor (0 * 16)(%r11), %xmm1, %xmm1; vpxor (1 * 16)(%r11), %xmm0, %xmm0; vpxor (2 * 16)(%r11), %xmm3, %xmm3; diff --git a/cipher/aria-aesni-avx2-amd64.S b/cipher/aria-aesni-avx2-amd64.S index d33fa54b..08bf5a0e 100644 --- a/cipher/aria-aesni-avx2-amd64.S +++ b/cipher/aria-aesni-avx2-amd64.S @@ -1395,6 
+1395,7 @@ __aria_aesni_avx2_ctr_gen_keystream_32way: .Lctr_byteadd_full_ctr_carry: addb $32, 15(%r8); pushq %rcx; + CFI_ADJUST_CFA_OFFSET(8); movl $14, %ecx; 1: adcb $0, (%r8, %rcx); @@ -1402,6 +1403,7 @@ __aria_aesni_avx2_ctr_gen_keystream_32way: loop 1b; 2: popq %rcx; + CFI_ADJUST_CFA_OFFSET(-8); jmp .Lctr_byteadd_ymm; .align 8 .Lctr_byteadd: @@ -1457,6 +1459,7 @@ _gcry_aria_aesni_avx2_ctr_crypt_blk32: call __aria_aesni_avx2_ctr_gen_keystream_32way; pushq %rsi; + CFI_ADJUST_CFA_OFFSET(8); movq %rdx, %r11; movq %rcx, %rsi; /* use stack for temporary store */ movq %rcx, %rdx; @@ -1465,6 +1468,7 @@ _gcry_aria_aesni_avx2_ctr_crypt_blk32: call __aria_aesni_avx2_crypt_32way; popq %rsi; + CFI_ADJUST_CFA_OFFSET(-8); vpxor (0 * 32)(%r11), %ymm1, %ymm1; vpxor (1 * 32)(%r11), %ymm0, %ymm0; vpxor (2 * 32)(%r11), %ymm3, %ymm3; @@ -1622,6 +1626,7 @@ _gcry_aria_vaes_avx2_ctr_crypt_blk32: call __aria_aesni_avx2_ctr_gen_keystream_32way; pushq %rsi; + CFI_ADJUST_CFA_OFFSET(8); movq %rdx, %r11; movq %rcx, %rsi; /* use stack for temporary store */ movq %rcx, %rdx; @@ -1630,6 +1635,7 @@ _gcry_aria_vaes_avx2_ctr_crypt_blk32: call __aria_vaes_avx2_crypt_32way; popq %rsi; + CFI_ADJUST_CFA_OFFSET(-8); vpxor (0 * 32)(%r11), %ymm1, %ymm1; vpxor (1 * 32)(%r11), %ymm0, %ymm0; vpxor (2 * 32)(%r11), %ymm3, %ymm3; @@ -1788,6 +1794,7 @@ _gcry_aria_gfni_avx2_ctr_crypt_blk32: call __aria_aesni_avx2_ctr_gen_keystream_32way; pushq %rsi; + CFI_ADJUST_CFA_OFFSET(8); movq %rdx, %r11; movq %rcx, %rsi; /* use stack for temporary store */ movq %rcx, %rdx; @@ -1796,6 +1803,7 @@ _gcry_aria_gfni_avx2_ctr_crypt_blk32: call __aria_gfni_avx2_crypt_32way; popq %rsi; + CFI_ADJUST_CFA_OFFSET(-8); vpxor (0 * 32)(%r11), %ymm1, %ymm1; vpxor (1 * 32)(%r11), %ymm0, %ymm0; vpxor (2 * 32)(%r11), %ymm3, %ymm3; diff --git a/cipher/aria-gfni-avx512-amd64.S b/cipher/aria-gfni-avx512-amd64.S index 0eaa2de8..a497b395 100644 --- a/cipher/aria-gfni-avx512-amd64.S +++ b/cipher/aria-gfni-avx512-amd64.S @@ -969,6 +969,7 @@ _gcry_aria_gfni_avx512_ctr_crypt_blk64: call __aria_gfni_avx512_ctr_gen_keystream_64way pushq %rsi; + CFI_ADJUST_CFA_OFFSET(8); movq %rdx, %r11; movq %rcx, %rsi; movq %rcx, %rdx; @@ -977,6 +978,7 @@ _gcry_aria_gfni_avx512_ctr_crypt_blk64: call __aria_gfni_avx512_crypt_64way; popq %rsi; + CFI_ADJUST_CFA_OFFSET(-8); vpxorq (0 * 64)(%r11), %zmm3, %zmm3; vpxorq (1 * 64)(%r11), %zmm2, %zmm2; vpxorq (2 * 64)(%r11), %zmm1, %zmm1; -- 2.51.0 From jussi.kivilinna at iki.fi Sun Dec 21 11:58:46 2025 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 21 Dec 2025 12:58:46 +0200 Subject: [PATCH 1/4] camellia-aesni-avx: optimize camellia_f used for key setup Message-ID: <20251221105849.384773-1-jussi.kivilinna@iki.fi> * cipher/camellia-aesni-avx-amd64.S (filter_8bit_3op): New. (filter_8bit): Refactor. (transpose_8x8b): Remove. (camellia_f, camellia_f_core): Refactor. (.Lsbox4_input_mask): Remove. (__camellia_avx_setup128, __camellia_avx_setup256): Adjust for new 'camellia_f'. 
-- Signed-off-by: Jussi Kivilinna --- cipher/camellia-aesni-avx-amd64.S | 183 ++++++++++++------------------ 1 file changed, 73 insertions(+), 110 deletions(-) diff --git a/cipher/camellia-aesni-avx-amd64.S b/cipher/camellia-aesni-avx-amd64.S index 76e62ea8..e21a8468 100644 --- a/cipher/camellia-aesni-avx-amd64.S +++ b/cipher/camellia-aesni-avx-amd64.S @@ -1,6 +1,6 @@ /* camellia-avx-aesni-amd64.S - AES-NI/AVX implementation of Camellia cipher * - * Copyright (C) 2013-2015,2020,2023 Jussi Kivilinna + * Copyright (C) 2013-2015,2020,2023,2025 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -39,14 +39,17 @@ /********************************************************************** helper macros **********************************************************************/ -#define filter_8bit(x, lo_t, hi_t, mask4bit, tmp0) \ - vpand x, mask4bit, tmp0; \ - vpandn x, mask4bit, x; \ - vpsrld $4, x, x; \ +#define filter_8bit_3op(out, in, lo_t, hi_t, mask4bit, tmp0) \ + vpand in, mask4bit, tmp0; \ + vpandn in, mask4bit, out; \ + vpsrld $4, out, out; \ \ vpshufb tmp0, lo_t, tmp0; \ - vpshufb x, hi_t, x; \ - vpxor tmp0, x, x; + vpshufb out, hi_t, out; \ + vpxor tmp0, out, out; + +#define filter_8bit(x, lo_t, hi_t, mask4bit, tmp0) \ + filter_8bit_3op(x, x, lo_t, hi_t, mask4bit, tmp0); /********************************************************************** 16-way camellia @@ -450,65 +453,6 @@ vmovdqu st1, b1; \ /* does not adjust output bytes inside vectors */ -#define transpose_8x8b(a, b, c, d, e, f, g, h, t0, t1, t2, t3, t4) \ - vpunpcklbw a, b, t0; \ - vpunpckhbw a, b, b; \ - \ - vpunpcklbw c, d, t1; \ - vpunpckhbw c, d, d; \ - \ - vpunpcklbw e, f, t2; \ - vpunpckhbw e, f, f; \ - \ - vpunpcklbw g, h, t3; \ - vpunpckhbw g, h, h; \ - \ - vpunpcklwd t0, t1, g; \ - vpunpckhwd t0, t1, t0; \ - \ - vpunpcklwd b, d, t1; \ - vpunpckhwd b, d, e; \ - \ - vpunpcklwd t2, t3, c; \ - vpunpckhwd t2, t3, t2; \ - \ - vpunpcklwd f, h, t3; \ - vpunpckhwd f, h, b; \ - \ - vpunpcklwd e, b, t4; \ - vpunpckhwd e, b, b; \ - \ - vpunpcklwd t1, t3, e; \ - vpunpckhwd t1, t3, f; \ - \ - vmovdqa .Ltranspose_8x8_shuf rRIP, t3; \ - \ - vpunpcklwd g, c, d; \ - vpunpckhwd g, c, c; \ - \ - vpunpcklwd t0, t2, t1; \ - vpunpckhwd t0, t2, h; \ - \ - vpunpckhqdq b, h, a; \ - vpshufb t3, a, a; \ - vpunpcklqdq b, h, b; \ - vpshufb t3, b, b; \ - \ - vpunpckhqdq e, d, g; \ - vpshufb t3, g, g; \ - vpunpcklqdq e, d, h; \ - vpshufb t3, h, h; \ - \ - vpunpckhqdq f, c, e; \ - vpshufb t3, e, e; \ - vpunpcklqdq f, c, f; \ - vpshufb t3, f, f; \ - \ - vpunpckhqdq t4, t1, c; \ - vpshufb t3, c, c; \ - vpunpcklqdq t4, t1, d; \ - vpshufb t3, d, d; - /* load blocks to registers and apply pre-whitening */ #define inpack16_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \ y6, y7, rio, key) \ @@ -1830,63 +1774,86 @@ _gcry_camellia_aesni_avx_ocb_auth: CFI_ENDPROC(); ELF(.size _gcry_camellia_aesni_avx_ocb_auth,.-_gcry_camellia_aesni_avx_ocb_auth;) -/* - * IN: - * ab: 64-bit AB state - * cd: 64-bit CD state - */ -#define camellia_f(ab, x, t0, t1, t2, t3, t4, inv_shift_row, sbox4mask, \ - _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, key) \ - vmovq key, t0; \ - vpxor x, x, t3; \ - \ - vpxor ab, t0, x; \ +/* Camellia F function, AVX+AESNI version */ +#define camellia_f_core(ab, x, t0, t1, t2, t3, t4, \ + inv_shift_row_n_s2n3_shuffle, \ + _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, \ + sp1mask, sp2mask, sp3mask, sp4mask, fn_out_xor, \ + out_xor_dst) \ \ /* \ * S-function with AES subbytes \ */ \ \ - /* input rotation for sbox4 (<<< 1) */ \ - vpand x, 
sbox4mask, t0; \ - vpandn x, sbox4mask, x; \ - vpaddw t0, t0, t1; \ - vpsrlw $7, t0, t0; \ - vpor t0, t1, t0; \ - vpand sbox4mask, t0, t0; \ - vpor t0, x, x; \ + vmovdqa .Lpre_tf_lo_s4(%rip), t0; \ + vmovdqa .Lpre_tf_hi_s4(%rip), t1; \ + vpxor t3, t3, t3; \ + \ + /* prefilter sboxes s1,s2,s3 */ \ + filter_8bit_3op(t4, ab, pre_s1lo_mask, pre_s1hi_mask, _0f0f0f0fmask, t2); \ \ - vmovdqa .Lpost_tf_lo_s1 rRIP, t0; \ - vmovdqa .Lpost_tf_hi_s1 rRIP, t1; \ + /* prefilter sbox s4 */ \ + filter_8bit_3op(x, ab, t0, t1, _0f0f0f0fmask, t2); \ \ - /* prefilter sboxes */ \ - filter_8bit(x, pre_s1lo_mask, pre_s1hi_mask, _0f0f0f0fmask, t2); \ + vmovdqa .Lpost_tf_lo_s1(%rip), t0; \ + vmovdqa .Lpost_tf_hi_s1(%rip), t1; \ \ - /* AES subbytes + AES shift rows + AES inv shift rows */ \ + /* AES subbytes + AES shift rows */ \ + vaesenclast t3, t4, t4; \ vaesenclast t3, x, x; \ \ - /* postfilter sboxes */ \ + /* postfilter sboxes s1,s2,s3 */ \ + filter_8bit(t4, t0, t1, _0f0f0f0fmask, t2); \ + \ + /* postfilter sboxes s4 */ \ filter_8bit(x, t0, t1, _0f0f0f0fmask, t2); \ \ + /* Unpack 8-bit fields in lower 64-bits of XMM to */ \ + /* 16-bit fields in full 128-bit XMM. This is to allow faster */ \ + /* byte rotation for s2&s3 as SSE/AVX lacks native byte */ \ + /* shift/rotation instructions. */ \ + vpshufb inv_shift_row_n_s2n3_shuffle, t4, t1; \ + \ + vpshufb sp1mask, t4, t4; \ + vpshufb sp4mask, x, x; \ + \ /* output rotation for sbox2 (<<< 1) */ \ /* output rotation for sbox3 (>>> 1) */ \ - vpshufb inv_shift_row, x, t1; \ - vpshufb .Lsp0044440444044404mask rRIP, x, t4; \ - vpshufb .Lsp1110111010011110mask rRIP, x, x; \ vpaddb t1, t1, t2; \ vpsrlw $7, t1, t0; \ vpsllw $7, t1, t3; \ vpor t0, t2, t0; \ vpsrlw $1, t1, t1; \ - vpshufb .Lsp0222022222000222mask rRIP, t0, t0; \ + vpshufb sp2mask, t0, t0; \ vpor t1, t3, t1; \ \ vpxor x, t4, t4; \ - vpshufb .Lsp3033303303303033mask rRIP, t1, t1; \ + vpshufb sp3mask, t1, t1; \ vpxor t4, t0, t0; \ vpxor t1, t0, t0; \ vpsrldq $8, t0, x; \ + fn_out_xor(t0, x, out_xor_dst); + +#define camellia_f_xor_x(t0, x, _) \ vpxor t0, x, x; +/* + * IN: + * ab: 64-bit AB state + * cd: 64-bit CD state + */ +#define camellia_f(ab, x, t0, t1, t2, t3, t4, inv_shift_row, \ + _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, key) \ + vmovq key, t0; \ + vpxor ab, t0, x; \ + camellia_f_core(x, x, t0, t1, t2, t3, t4, inv_shift_row, \ + _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, \ + .Lsp1110111010011110mask rRIP, \ + .Lsp0222022222000222mask rRIP, \ + .Lsp3033303303303033mask rRIP, \ + .Lsp0044440444044404mask rRIP, \ + camellia_f_xor_x, _); + #define vec_rol128(in, out, nrol, t0) \ vpshufd $0x4e, in, out; \ vpsllq $(nrol), in, t0; \ @@ -1920,8 +1887,6 @@ _camellia_aesni_avx_keysetup_data: .Lsp3033303303303033mask: .long 0x04ff0404, 0x04ff0404; .long 0xff0a0aff, 0x0aff0a0a; -.Lsbox4_input_mask: - .byte 0x00, 0xff, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00; .Lsigma1: .long 0x3BCC908B, 0xA09E667F; .Lsigma2: @@ -1953,7 +1918,6 @@ __camellia_avx_setup128: vpshufb .Lbswap128_mask rRIP, KL128, KL128; vmovdqa .Linv_shift_row_and_unpcklbw rRIP, %xmm11; - vmovq .Lsbox4_input_mask rRIP, %xmm12; vbroadcastss .L0f0f0f0f rRIP, %xmm13; vmovdqa .Lpre_tf_lo_s1 rRIP, %xmm14; vmovdqa .Lpre_tf_hi_s1 rRIP, %xmm15; @@ -1968,18 +1932,18 @@ __camellia_avx_setup128: camellia_f(%xmm2, %xmm4, %xmm1, %xmm5, %xmm6, %xmm7, %xmm8, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma1 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma1 rRIP); vpxor %xmm4, %xmm3, %xmm3; camellia_f(%xmm3, %xmm2, %xmm1, %xmm5, %xmm6, %xmm7, %xmm8, - %xmm11, 
%xmm12, %xmm13, %xmm14, %xmm15, .Lsigma2 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma2 rRIP); camellia_f(%xmm2, %xmm3, %xmm1, %xmm5, %xmm6, %xmm7, %xmm8, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma3 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma3 rRIP); vpxor %xmm4, %xmm3, %xmm3; camellia_f(%xmm3, %xmm4, %xmm1, %xmm5, %xmm6, %xmm7, %xmm8, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma4 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma4 rRIP); vpslldq $8, %xmm3, %xmm3; vpxor %xmm4, %xmm2, %xmm2; @@ -2303,7 +2267,6 @@ __camellia_avx_setup256: vpshufb .Lbswap128_mask rRIP, KR128, KR128; vmovdqa .Linv_shift_row_and_unpcklbw rRIP, %xmm11; - vmovq .Lsbox4_input_mask rRIP, %xmm12; vbroadcastss .L0f0f0f0f rRIP, %xmm13; vmovdqa .Lpre_tf_lo_s1 rRIP, %xmm14; vmovdqa .Lpre_tf_hi_s1 rRIP, %xmm15; @@ -2319,20 +2282,20 @@ __camellia_avx_setup256: camellia_f(%xmm2, %xmm4, %xmm5, %xmm7, %xmm8, %xmm9, %xmm10, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma1 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma1 rRIP); vpxor %xmm4, %xmm3, %xmm3; camellia_f(%xmm3, %xmm2, %xmm5, %xmm7, %xmm8, %xmm9, %xmm10, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma2 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma2 rRIP); vpxor %xmm6, %xmm2, %xmm2; camellia_f(%xmm2, %xmm3, %xmm5, %xmm7, %xmm8, %xmm9, %xmm10, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma3 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma3 rRIP); vpxor %xmm4, %xmm3, %xmm3; vpxor KR128, %xmm3, %xmm3; camellia_f(%xmm3, %xmm4, %xmm5, %xmm7, %xmm8, %xmm9, %xmm10, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma4 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma4 rRIP); vpslldq $8, %xmm3, %xmm3; vpxor %xmm4, %xmm2, %xmm2; @@ -2350,12 +2313,12 @@ __camellia_avx_setup256: camellia_f(%xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma5 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma5 rRIP); vpxor %xmm5, %xmm3, %xmm3; camellia_f(%xmm3, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, - %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, .Lsigma6 rRIP); + %xmm11, %xmm13, %xmm14, %xmm15, .Lsigma6 rRIP); vpslldq $8, %xmm3, %xmm3; vpxor %xmm5, %xmm4, %xmm4; vpsrldq $8, %xmm3, %xmm3; -- 2.51.0 From jussi.kivilinna at iki.fi Sun Dec 21 11:58:49 2025 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 21 Dec 2025 12:58:49 +0200 Subject: [PATCH 4/4] camellia-gfni-avx512: add 1-block constant-time implementation In-Reply-To: <20251221105849.384773-1-jussi.kivilinna@iki.fi> References: <20251221105849.384773-1-jussi.kivilinna@iki.fi> Message-ID: <20251221105849.384773-4-jussi.kivilinna@iki.fi> * cipher/camellia-gfni-avx512-amd64.S (_gcry_camellia_gfni_avx512_enc_blk1) (_gcry_camellia_gfni_avx512_dec_blk1): New. * cipher/camellia-glue.c [USE_GFNI_AVX512] (_gcry_camellia_gfni_avx512_enc_blk1) (_gcry_camellia_gfni_avx512_dec_blk1): New prototypes. (camellia_decrypt, camellia_encrypt) [USE_GFNI_AVX512]: Use GFNI/AVX512 1-block implementation if supported by CPU. 
-- Benchmark on Intel (tigerlake): Before: CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 5.57 ns/B 171.3 MiB/s 22.77 c/B 4090 CFB enc | 5.57 ns/B 171.2 MiB/s 22.79 c/B 4090 After (~27% faster): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 4.36 ns/B 218.9 MiB/s 17.82 c/B 4090 CFB enc | 4.35 ns/B 219.1 MiB/s 17.80 c/B 4090 Benchmark on AMD Ryzen 9 9950X3D (zen5): Before: CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 3.15 ns/B 302.8 MiB/s 18.10 c/B 5747 CFB enc | 3.18 ns/B 300.0 MiB/s 18.27 c/B 5748 After (~13% slower): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 3.58 ns/B 266.7 MiB/s 20.55 c/B 5746?5 CFB enc | 3.58 ns/B 266.7 MiB/s 20.55 c/B 5748 Signed-off-by: Jussi Kivilinna --- cipher/camellia-gfni-avx512-amd64.S | 236 +++++++++++++++++++++++++++- cipher/camellia-glue.c | 26 +++ 2 files changed, 261 insertions(+), 1 deletion(-) diff --git a/cipher/camellia-gfni-avx512-amd64.S b/cipher/camellia-gfni-avx512-amd64.S index 643eed3e..22ae43d9 100644 --- a/cipher/camellia-gfni-avx512-amd64.S +++ b/cipher/camellia-gfni-avx512-amd64.S @@ -1,6 +1,6 @@ /* camellia-gfni-avx512-amd64.S - GFNI/AVX512 implementation of Camellia * - * Copyright (C) 2022-2023 Jussi Kivilinna + * Copyright (C) 2022-2023,2025 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -692,6 +692,21 @@ ELF(.type _gcry_camellia_gfni_avx512__constants, at object;) .Lbige_addb_16: .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16 +.align 16 +/* Shuffling constants for AVX512+GFNI 1-way variant. */ +.Lsp1mask_swap32_gfni: + .byte 0xff, 0x04, 0x04, 0x04, 0xff, 0x04, 0x04, 0x04 + .byte 0xff, 0x03, 0x03, 0x03, 0x03, 0xff, 0xff, 0x03 +.Lsp2mask_swap32_gfni: + .byte 0x07, 0x07, 0x07, 0xff, 0x07, 0x07, 0x07, 0xff + .byte 0x02, 0x02, 0x02, 0xff, 0xff, 0xff, 0x02, 0x02 +.Lsp3mask_swap32_gfni: + .byte 0x06, 0x06, 0xff, 0x06, 0x06, 0x06, 0xff, 0x06 + .byte 0x01, 0x01, 0xff, 0x01, 0xff, 0x01, 0x01, 0xff +.Lsp4mask_swap32_gfni: + .byte 0x00, 0xff, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff + .byte 0x05, 0xff, 0x05, 0x05, 0x05, 0xff, 0x05, 0x05 + ELF(.size _gcry_camellia_gfni_avx512__constants,.-_gcry_camellia_gfni_avx512__constants;) .text @@ -1630,5 +1645,224 @@ _gcry_camellia_gfni_avx512_dec_blk64: CFI_ENDPROC(); ELF(.size _gcry_camellia_gfni_avx512_dec_blk64,.-_gcry_camellia_gfni_avx512_dec_blk64;) +/********************************************************************** + 1-block non-parallel camellia + **********************************************************************/ + +/* Camellia F function, AVX512+GFNI version */ +#define camellia_f_gfni(ab, cd, x1, x2, x3, x4, \ + pre_filter_bitmatrix_s123, pre_filter_bitmatrix_s4, \ + post_filter_bitmatrix_s14, post_filter_bitmatrix_s2, \ + post_filter_bitmatrix_s3, sp1mask, sp2mask, sp3mask, \ + sp4mask) \ + /* camellia sboxes s1, s2, s3, s4 */ \ + vgf2p8affineqb $(pre_filter_constant_s1234), \ + pre_filter_bitmatrix_s4, ab, x4; \ + vgf2p8affineinvqb $(post_filter_constant_s14), \ + post_filter_bitmatrix_s14, x4, x4; \ + vgf2p8affineqb $(pre_filter_constant_s1234), \ + pre_filter_bitmatrix_s123, ab, x1; \ + vgf2p8affineinvqb $(post_filter_constant_s2), \ + post_filter_bitmatrix_s2, x1, x2; \ + vgf2p8affineinvqb $(post_filter_constant_s3), \ + post_filter_bitmatrix_s3, x1, x3; \ + vgf2p8affineinvqb $(post_filter_constant_s14), \ + post_filter_bitmatrix_s14, x1, x1; \ + \ + /* permutation */ \ + vpshufb sp4mask, x4, x4; \ + vpshufb sp2mask, x2, x2; \ + vpshufb sp3mask, x3, x3; \ + vpshufb 
sp1mask, x1, x1; \ + vpxor x4, x2, x2; \ + vpternlogd $0x96, x3, x1, x2; \ + vpsrldq $8, x2, x1; \ + \ + /* output xor */ \ + vpternlogd $0x96, x2, x1, cd; + +#define preload_camellia_f_consts() \ + vmovq .Lpre_filter_bitmatrix_s123 rRIP, %xmm14; \ + vmovq .Lpre_filter_bitmatrix_s4 rRIP, %xmm13; \ + vmovq .Lpost_filter_bitmatrix_s14 rRIP, %xmm12; \ + vmovq .Lpost_filter_bitmatrix_s2 rRIP, %xmm11; \ + vmovq .Lpost_filter_bitmatrix_s3 rRIP, %xmm10; \ + vmovdqa .Lsp1mask_swap32_gfni rRIP, %xmm9; \ + vmovdqa .Lsp2mask_swap32_gfni rRIP, %xmm8; \ + vmovdqa .Lsp3mask_swap32_gfni rRIP, %xmm7; \ + vmovdqa .Lsp4mask_swap32_gfni rRIP, %xmm6; + +#define do_camellia_f(ab, cd, t0, t1, t2, t3) \ + camellia_f_gfni(ab, cd, t0, t1, t2, t3, \ + %xmm14, %xmm13, %xmm12, %xmm11, %xmm10, \ + %xmm9, %xmm8, %xmm7, %xmm6); + +#define preload_camellia_key_consts() \ + kxnorb %k1, %k1, %k1; \ + vmovdqa .Lpack_bswap rRIP, %xmm15; \ + kshiftrb $7, %k1, %k1; + +#define add_roundkey_blk1(cd, key) \ + vpxorq key, cd, cd{%k1}{z}; + +#define do_fls_blk1(ll, lr, rl, rr, t0, t1, kll, klr, krl, krr) \ + vpternlogd $0x1E, krr, rr, rl{%k1}{z}; \ + vpandd kll, ll, t0{%k1}{z}; \ + vpandd krl, rl, t1{%k1}{z}; \ + vprold $1, t0, t0; \ + vprold $1, t1, t1; \ + vpxor t0, lr, lr; \ + vpxor t1, rr, rr; \ + vpternlogd $0x1E, klr, lr, ll{%k1}{z}; + +#define clear_regs_blk1() \ + kxorq %k1, %k1, %k1; \ + vzeroall; + +#define roundsm_blk1(ab, cd, t0, t1, t2, t3, key) \ + add_roundkey_blk1(cd, key); \ + do_camellia_f(ab, cd, t0, t1, t2, t3); + +#define two_roundsm_blk1(ab, cd, t0, t1, t2, t3, i, dir, ctx) \ + roundsm_blk1(ab, cd, t0, t1, t2, t3, \ + (key_table + (i) * 8)(ctx)); \ + roundsm_blk1(cd, ab, t0, t1, t2, t3, \ + (key_table + ((i) + (dir)) * 8)(ctx)); + +#define enc_rounds_blk1(ab, cd, t0, t1, t2, t3, i, ctx) \ + two_roundsm_blk1(ab, cd, t0, t1, t2, t3, (i) + 2, 1, ctx); \ + two_roundsm_blk1(ab, cd, t0, t1, t2, t3, (i) + 4, 1, ctx); \ + two_roundsm_blk1(ab, cd, t0, t1, t2, t3, (i) + 6, 1, ctx); + +#define dec_rounds_blk1(ab, cd, t0, t1, t2, t3, i, ctx) \ + two_roundsm_blk1(ab, cd, t0, t1, t2, t3, (i) + 7, -1, ctx); \ + two_roundsm_blk1(ab, cd, t0, t1, t2, t3, (i) + 5, -1, ctx); \ + two_roundsm_blk1(ab, cd, t0, t1, t2, t3, (i) + 3, -1, ctx); + +#define fls_blk1(ll_lr, rl_rr, tmp_lr, tmp_rr, t0, t1, kll, klr, krl, krr) \ + vpsrlq $32, rl_rr, tmp_rr; \ + vpsrlq $32, ll_lr, tmp_lr; \ + do_fls_blk1(ll_lr, tmp_lr, rl_rr, tmp_rr, t0, t1, kll, klr, krl, krr); \ + vpunpckldq tmp_lr, ll_lr, ll_lr; \ + vpunpckldq tmp_rr, rl_rr, rl_rr; + +#define inpack_blk1(ab, cd, src, key) \ + vmovq 0(src), ab; \ + vmovq 8(src), cd; \ + vpshufb %xmm15, ab, ab; \ + vpshufb %xmm15, cd, cd; \ + add_roundkey_blk1(ab, key); + +#define outunpack_blk1(ab, cd, dst, key) \ + add_roundkey_blk1(cd, key); \ + vpshufb %xmm15, ab, ab; \ + vpshufb %xmm15, cd, cd; \ + vmovq ab, 8(dst); \ + vmovq cd, 0(dst); + +.align 16 +.globl _gcry_camellia_gfni_avx512_enc_blk1 +ELF(.type _gcry_camellia_gfni_avx512_enc_blk1, at function;) + +_gcry_camellia_gfni_avx512_enc_blk1: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 bytes) + * %rdx: src (16 bytes) + */ + CFI_STARTPROC(); + spec_stop_avx512; + + preload_camellia_key_consts(); + + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + + inpack_blk1(%xmm0, %xmm1, %rdx, (key_table)(CTX)); + + preload_camellia_f_consts(); + + leaq (-8 * 8)(CTX, %r8, 8), %r8; + +.align 16 +.Lenc_loop_blk1: + enc_rounds_blk1(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, 0, CTX); + + cmpq %r8, CTX; + je 
.Lenc_done_blk1; + leaq (8 * 8)(CTX), CTX; + + fls_blk1(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, + ((key_table) + 0)(CTX), + ((key_table) + 4)(CTX), + ((key_table) + 8)(CTX), + ((key_table) + 12)(CTX)); + + jmp .Lenc_loop_blk1; + +.align 16 +.Lenc_done_blk1: + outunpack_blk1(%xmm0, %xmm1, %rsi, ((key_table) + 8 * 8)(CTX)); + + clear_regs_blk1(); + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_camellia_gfni_avx512_enc_blk1,.-_gcry_camellia_gfni_avx512_enc_blk1;) + +.align 16 +.globl _gcry_camellia_gfni_avx512_dec_blk1 +ELF(.type _gcry_camellia_gfni_avx512_dec_blk1, at function;) + +_gcry_camellia_gfni_avx512_dec_blk1: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 bytes) + * %rdx: src (16 bytes) + */ + CFI_STARTPROC(); + spec_stop_avx512; + + preload_camellia_key_consts(); + + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + + inpack_blk1(%xmm0, %xmm1, %rdx, (key_table)(CTX, %r8, 8)); + + preload_camellia_f_consts(); + + leaq (-8 * 8)(CTX, %r8, 8), %rax; + +.align 16 +.Ldec_loop_blk1: + dec_rounds_blk1(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, 0, %rax); + + cmpq CTX, %rax; + je .Ldec_done_blk1; + + fls_blk1(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, + ((key_table) + 8)(%rax), + ((key_table) + 12)(%rax), + ((key_table) + 0)(%rax), + ((key_table) + 4)(%rax)); + + leaq (-8 * 8)(%rax), %rax; + jmp .Ldec_loop_blk1; + +.align 16 +.Ldec_done_blk1: + outunpack_blk1(%xmm0, %xmm1, %rsi, (key_table)(CTX)); + + clear_regs_blk1(); + + ret_spec_stop; + CFI_ENDPROC(); +ELF(.size _gcry_camellia_gfni_avx512_dec_blk1,.-_gcry_camellia_gfni_avx512_dec_blk1;) + #endif /* defined(ENABLE_GFNI_SUPPORT) && defined(ENABLE_AVX512_SUPPORT) */ #endif /* __x86_64 */ diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index 5051a305..78ff22b9 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -427,6 +427,16 @@ extern void _gcry_camellia_gfni_avx512_dec_blk64(const CAMELLIA_context *ctx, const unsigned char *in) ASM_FUNC_ABI; +extern void _gcry_camellia_gfni_avx512_enc_blk1(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in) + ASM_FUNC_ABI; + +extern void _gcry_camellia_gfni_avx512_dec_blk1(const CAMELLIA_context *ctx, + unsigned char *out, + const unsigned char *in) + ASM_FUNC_ABI; + /* Stack not used by AVX512 implementation. 
*/ static const int avx512_burn_stack_depth = 0; #endif @@ -715,6 +725,14 @@ camellia_encrypt(void *c, byte *outbuf, const byte *inbuf) { CAMELLIA_context *ctx=c; +#ifdef USE_GFNI_AVX512 + if (ctx->use_gfni_avx512) + { + _gcry_camellia_gfni_avx512_enc_blk1(ctx, outbuf, inbuf); + return 0; + } +#endif + Camellia_EncryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); #define CAMELLIA_encrypt_stack_burn_size \ @@ -732,6 +750,14 @@ camellia_decrypt(void *c, byte *outbuf, const byte *inbuf) { CAMELLIA_context *ctx=c; +#ifdef USE_GFNI_AVX512 + if (ctx->use_gfni_avx512) + { + _gcry_camellia_gfni_avx512_dec_blk1(ctx, outbuf, inbuf); + return 0; + } +#endif + Camellia_DecryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); #define CAMELLIA_decrypt_stack_burn_size \ -- 2.51.0 From jussi.kivilinna at iki.fi Sun Dec 21 11:58:47 2025 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 21 Dec 2025 12:58:47 +0200 Subject: [PATCH 2/4] camellia-simd128: optimize round key loading and key setup In-Reply-To: <20251221105849.384773-1-jussi.kivilinna@iki.fi> References: <20251221105849.384773-1-jussi.kivilinna@iki.fi> Message-ID: <20251221105849.384773-2-jussi.kivilinna@iki.fi> * cipher/camellia-simd128.h (if_vprolb128, vprolb128) (vmovd128_amemld, vmovq128_amemld, vmovq128_memld) (memory_barrier_with_vec, filter_8bit_3op): New. (LE64_LO32, LE64_HI32): Remove. (roundsm16, fls16, inpack16_pre, outunpack16): Use 'vmovd128_amemld' and 'vmovq128_amemld' for loading round keys. (camellia_f): Optimize/Rewrite and split core to ... (camellia_f_core): ... this. (camellia_f_xor_x): New. (sp0044440444044404mask, sp1110111010011110mask) (sp0222022222000222mask, sp3033303303303033mask): Adjust constants for optimized/rewritten 'camellia_f'. (camellia_setup128, camellia_setup256): Adjust for optimized 'camellia_f'; Use 'vmovq128_amemld' for loading round keys. (FUNC_KEY_SETUP): Use 'vmovq128_amemld' instead of 'vmovq128'. -- Signed-off-by: Jussi Kivilinna --- cipher/camellia-simd128.h | 319 +++++++++++++++++++++++--------------- 1 file changed, 198 insertions(+), 121 deletions(-) diff --git a/cipher/camellia-simd128.h b/cipher/camellia-simd128.h index c39823ac..d0f6ea32 100644 --- a/cipher/camellia-simd128.h +++ b/cipher/camellia-simd128.h @@ -1,5 +1,5 @@ /* camellia-simd128.h - Camellia cipher SIMD128 intrinsics implementation - * Copyright (C) 2023 Jussi Kivilinna + * Copyright (C) 2023,2025 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -97,6 +97,12 @@ asm_sbox_be(uint8x16_t b) #define vpsrl_byte_128(s, a, o) vpsrlb128(s, a, o) #define vpsll_byte_128(s, a, o) vpsllb128(s, a, o) +#define if_vprolb128(...) __VA_ARGS__ +#define if_not_vprolb128(...) 
/*_*/ +#define vprolb128(s, a, o, tmp) ({ vpsllb128((s), a, tmp); \ + vpsrlb128((8-(s)), a, o); \ + vpxor128(tmp, o, o); }) + #define vpaddb128(a, b, o) (o = (__m128i)vec_add((uint8x16_t)b, (uint8x16_t)a)) #define vpcmpgtb128(a, b, o) (o = (__m128i)vec_cmpgt((int8x16_t)b, (int8x16_t)a)) @@ -120,6 +126,13 @@ asm_sbox_be(uint8x16_t b) #define vmovq128(a, o) ({ uint64x2_t __tmp = { (a), 0 }; \ o = (__m128i)(__tmp); }) +#define vmovd128_amemld(z, a, o) ({ \ + const uint32_t *__tmp_ptr = (const void *)(a); \ + uint32x4_t __tmp = { __tmp_ptr[z], 0, 0, 0 }; \ + o = (__m128i)(__tmp); }) +#define vmovq128_amemld(a, o) ({ uint64x2_t __tmp = { *(const uint64_t *)(a), 0 }; \ + o = (__m128i)(__tmp); }) + #define vmovdqa128_memld(a, o) (o = *(const __m128i *)(a)) #define vmovdqa128_memst(a, o) (*(__m128i *)(o) = (a)) #define vpshufb128_amemld(m, a, o) vpshufb128(*(const __m128i *)(m), a, o) @@ -127,13 +140,15 @@ asm_sbox_be(uint8x16_t b) /* Following operations may have unaligned memory input */ #define vmovdqu128_memld(a, o) (o = (__m128i)vec_xl(0, (const uint8_t *)(a))) #define vpxor128_memld(a, b, o) vpxor128(b, (__m128i)vec_xl(0, (const uint8_t *)(a)), o) +#define vmovq128_memld(a, o) ({ uint64x2_t __tmp = { *(const uint64_unaligned_t *)(a), 0 }; \ + o = (__m128i)(__tmp); }) /* Following operations may have unaligned memory output */ #define vmovdqu128_memst(a, o) vec_xst((uint8x16_t)(a), 0, (uint8_t *)(o)) #define vmovq128_memst(a, o) (((uint64_unaligned_t *)(o))[0] = ((__m128i)(a))[0]) /* PowerPC AES encrypt last round => ShiftRows + SubBytes + XOR round key */ -static const uint8x16_t shift_row = +static const uint8x16_t shift_row __attribute__((unused)) = { 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11 }; #define vaesenclast128(a, b, o) \ ({ uint64x2_t __tmp = (__m128i)vec_sbox_be((uint8x16_t)(b)); \ @@ -152,6 +167,8 @@ static const uint8x16_t shift_row = #define if_aes_subbytes(...) __VA_ARGS__ #define if_not_aes_subbytes(...) /*_*/ +#define memory_barrier_with_vec(a) __asm__("" : "+wa"(a) :: "memory") + #endif /* __powerpc__ */ #ifdef __ARM_NEON @@ -189,6 +206,13 @@ static const uint8x16_t shift_row = #define vpsrl_byte_128(s, a, o) vpsrlb128(s, a, o) #define vpsll_byte_128(s, a, o) vpsllb128(s, a, o) +#define if_vprolb128(...) __VA_ARGS__ +#define if_not_vprolb128(...) 
/*_*/ +#define vprolb128(s, a, o, t) ({ t = (__m128i)vshlq_n_u8((uint8x16_t)a, s); \ + o = (__m128i)vsriq_n_u8((uint8x16_t)t, \ + (uint8x16_t)a, \ + 8-(s)); }) + #define vpaddb128(a, b, o) (o = (__m128i)vaddq_u8((uint8x16_t)b, (uint8x16_t)a)) #define vpcmpgtb128(a, b, o) (o = (__m128i)vcgtq_s8((int8x16_t)b, (int8x16_t)a)) @@ -210,6 +234,13 @@ static const uint8x16_t shift_row = #define vmovd128(a, o) ({ uint32x4_t __tmp = { a, 0, 0, 0 }; o = (__m128i)__tmp; }) #define vmovq128(a, o) ({ uint64x2_t __tmp = { a, 0 }; o = (__m128i)__tmp; }) +#define vmovd128_amemld(z, a, o) ({ \ + const uint32_t *__tmp_ptr = (const void *)(a); \ + uint32x4_t __tmp = { __tmp_ptr[z], 0, 0, 0 }; \ + o = (__m128i)(__tmp); }) +#define vmovq128_amemld(a, o) ({ uint64x1_t __tmp = vld1_u64((const uint64_t *)(a)); \ + o = (__m128i)vcombine_u64(__tmp, vcreate_u64(0)); }) + #define vmovdqa128_memld(a, o) (o = (*(const __m128i *)(a))) #define vmovdqa128_memst(a, o) (*(__m128i *)(o) = (a)) #define vpshufb128_amemld(m, a, o) vpshufb128(*(const __m128i *)(m), a, o) @@ -217,6 +248,8 @@ static const uint8x16_t shift_row = /* Following operations may have unaligned memory input */ #define vmovdqu128_memld(a, o) (o = (__m128i)vld1q_u8((const uint8_t *)(a))) #define vpxor128_memld(a, b, o) vpxor128(b, (__m128i)vld1q_u8((const uint8_t *)(a)), o) +#define vmovq128_memld(a, o) ({ uint8x8_t __tmp = vld1_u8((const uint8_t *)(a)); \ + o = (__m128i)vcombine_u8(__tmp, vcreate_u8(0)); }) /* Following operations may have unaligned memory output */ #define vmovdqu128_memst(a, o) vst1q_u8((uint8_t *)(o), (uint8x16_t)a) @@ -232,6 +265,8 @@ static const uint8x16_t shift_row = #define if_aes_subbytes(...) /*_*/ #define if_not_aes_subbytes(...) __VA_ARGS__ +#define memory_barrier_with_vec(a) __asm__("" : "+w"(a) :: "memory") + #endif /* __ARM_NEON */ #if defined(__x86_64__) || defined(__i386__) @@ -260,6 +295,9 @@ static const uint8x16_t shift_row = #define vpsrl_byte_128(s, a, o) vpsrld128(s, a, o) #define vpsll_byte_128(s, a, o) vpslld128(s, a, o) +#define if_vprolb128(...) /*_*/ +#define if_not_vprolb128(...) __VA_ARGS__ + #define vpaddb128(a, b, o) (o = _mm_add_epi8(b, a)) #define vpcmpgtb128(a, b, o) (o = _mm_cmpgt_epi8(b, a)) @@ -281,6 +319,11 @@ static const uint8x16_t shift_row = #define vmovd128(a, o) (o = _mm_set_epi32(0, 0, 0, a)) #define vmovq128(a, o) (o = _mm_set_epi64x(0, a)) +#define vmovd128_amemld(z, a, o) ({ \ + const uint32_t *__tmp_ptr = (const void *)(a); \ + o = (__m128i)_mm_loadu_si32(__tmp_ptr + (z)); }) +#define vmovq128_amemld(a, o) (o = (__m128i)_mm_loadu_si64(a)) + #define vmovdqa128_memld(a, o) (o = (*(const __m128i *)(a))) #define vmovdqa128_memst(a, o) (*(__m128i *)(o) = (a)) #define vpshufb128_amemld(m, a, o) vpshufb128(*(const __m128i *)(m), a, o) @@ -289,6 +332,7 @@ static const uint8x16_t shift_row = #define vmovdqu128_memld(a, o) (o = _mm_loadu_si128((const __m128i *)(a))) #define vpxor128_memld(a, b, o) \ vpxor128(b, _mm_loadu_si128((const __m128i *)(a)), o) +#define vmovq128_memld(a, o) vmovq128_amemld(a, o) /* Following operations may have unaligned memory output */ #define vmovdqu128_memst(a, o) _mm_storeu_si128((__m128i *)(o), a) @@ -305,7 +349,6 @@ static const uint8x16_t shift_row = #define if_not_aes_subbytes(...) 
__VA_ARGS__ #define memory_barrier_with_vec(a) __asm__("" : "+x"(a) :: "memory") -#define clear_vec_regs() ((void)0) #endif /* defined(__x86_64__) || defined(__i386__) */ @@ -322,6 +365,10 @@ static const uint8x16_t shift_row = vpshufb128(x, hi_t, x); \ vpxor128(tmp0, x, x); +#define filter_8bit_3op(out, in, lo_t, hi_t, mask4bit, tmp0) \ + vmovdqa128(in, out); \ + filter_8bit(out, lo_t, hi_t, mask4bit, tmp0); + #define transpose_4x4(x0, x1, x2, x3, t1, t2) \ vpunpckhdq128(x1, x0, t2); \ vpunpckldq128(x1, x0, x0); \ @@ -462,7 +509,7 @@ static const uint8x16_t shift_row = filter_8bit(x2, t2, t3, t7, t6); \ filter_8bit(x5, t2, t3, t7, t6); \ \ - vmovq128((key), t0); \ + vmovq128_amemld(&(key), t0); \ \ /* postfilter sbox 2 */ \ filter_8bit(x1, t4, t5, t7, t2); \ @@ -582,9 +629,6 @@ static const uint8x16_t shift_row = two_roundsm16(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \ y6, y7, mem_ab, mem_cd, (i) + 3, -1, dummy_store); -#define LE64_LO32(x) ((x) & 0xffffffffU) -#define LE64_HI32(x) ((x >> 32) & 0xffffffffU) - /* * IN: * v0..3: byte-sliced 32-bit integers @@ -633,7 +677,7 @@ static const uint8x16_t shift_row = * lr ^= rol32(t0, 1); \ */ \ load_zero(tt0); \ - vmovd128(LE64_LO32(*(kl)), t0); \ + vmovd128_amemld(0, kl, t0); \ vpshufb128(tt0, t0, t3); \ vpshufb128(bcast[1], t0, t2); \ vpshufb128(bcast[2], t0, t1); \ @@ -661,7 +705,7 @@ static const uint8x16_t shift_row = * rl ^= t2; \ */ \ \ - vmovd128(LE64_HI32(*(kr)), t0); \ + vmovd128_amemld(1, kr, t0); \ vpshufb128(tt0, t0, t3); \ vpshufb128(bcast[1], t0, t2); \ vpshufb128(bcast[2], t0, t1); \ @@ -686,7 +730,7 @@ static const uint8x16_t shift_row = * t2 &= rl; \ * rr ^= rol32(t2, 1); \ */ \ - vmovd128(LE64_LO32(*(kr)), t0); \ + vmovd128_amemld(0, kr, t0); \ vpshufb128(tt0, t0, t3); \ vpshufb128(bcast[1], t0, t2); \ vpshufb128(bcast[2], t0, t1); \ @@ -714,7 +758,7 @@ static const uint8x16_t shift_row = * ll ^= t0; \ */ \ \ - vmovd128(LE64_HI32(*(kl)), t0); \ + vmovd128_amemld(1, kl, t0); \ vpshufb128(tt0, t0, t3); \ vpshufb128(bcast[1], t0, t2); \ vpshufb128(bcast[2], t0, t1); \ @@ -786,7 +830,7 @@ static const uint8x16_t shift_row = /* load blocks to registers and apply pre-whitening */ #define inpack16_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \ y6, y7, rio, key) \ - vmovq128((key), x0); \ + vmovq128_amemld(&(key), x0); \ vpshufb128(pack_bswap_stack, x0, x0); \ \ vpxor128_memld((rio) + 0 * 16, x0, y7); \ @@ -837,7 +881,7 @@ static const uint8x16_t shift_row = \ vmovdqa128(x0, stack_tmp0); \ \ - vmovq128((key), x0); \ + vmovq128_amemld(&(key), x0); \ vpshufb128(pack_bswap_stack, x0, x0); \ \ vpxor128(x0, y7, y7); \ @@ -1200,64 +1244,92 @@ FUNC_DEC_BLK16(const void *key_table, void *vout, const void *vin, /********* Key setup **********************************************************/ -/* - * Camellia F-function, 1-way SIMD/AESNI. - * - * IN: - * ab: 64-bit AB state - * cd: 64-bit CD state - */ -#define camellia_f(ab, x, t0, t1, t2, t3, t4, inv_shift_row, sbox4mask, \ - _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, key) \ - vmovq128((key), t0); \ - load_zero(t3); \ - \ - vpxor128(ab, t0, x); \ - \ +/* Camellia F-function, 1-way SIMD128. 
*/ +#define camellia_f_core(ab, x, t0, t1, t2, t3, t4, inv_shift_row_n_s2n3_shuffle, \ + _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, \ + sp1mask, sp2mask, sp3mask, sp4mask, fn_out_xor, \ + out_xor_dst) \ /* \ * S-function with AES subbytes \ */ \ \ - /* input rotation for sbox4 (<<< 1) */ \ - vpand128(x, sbox4mask, t0); \ - vpandn128(x, sbox4mask, x); \ - vpaddb128(t0, t0, t1); \ - vpsrl_byte_128(7, t0, t0); \ - vpor128(t0, t1, t0); \ - vpand128(sbox4mask, t0, t0); \ - vpor128(t0, x, x); \ + vmovdqa128_memld(&pre_tf_lo_s4, t0); \ + vmovdqa128_memld(&pre_tf_hi_s4, t1); \ + if_not_aes_subbytes(load_zero(t3)); \ + \ + /* prefilter sboxes s1,s2,s3 */ \ + filter_8bit_3op(t4, ab, pre_s1lo_mask, pre_s1hi_mask, _0f0f0f0fmask, t2); \ + \ + /* prefilter sbox s4 */ \ + filter_8bit_3op(x, ab, t0, t1, _0f0f0f0fmask, t2); \ \ vmovdqa128_memld(&post_tf_lo_s1, t0); \ vmovdqa128_memld(&post_tf_hi_s1, t1); \ \ - /* prefilter sboxes */ \ - filter_8bit(x, pre_s1lo_mask, pre_s1hi_mask, _0f0f0f0fmask, t2); \ + if_not_aes_subbytes(/* AES subbytes + AES shift rows */); \ + if_not_aes_subbytes(aes_subbytes_and_shuf_and_xor(t3, t4, t4)); \ + if_not_aes_subbytes(aes_subbytes_and_shuf_and_xor(t3, x, x)); \ + \ + if_aes_subbytes(/* AES subbytes */); \ + if_aes_subbytes(aes_subbytes(t4, t4)); \ + if_aes_subbytes(aes_subbytes(x, x)); \ \ - /* AES subbytes + AES shift rows + AES inv shift rows */ \ - aes_subbytes_and_shuf_and_xor(t3, x, x); \ + /* postfilter sboxes s1,s2,s3 */ \ + filter_8bit(t4, t0, t1, _0f0f0f0fmask, t2); \ \ - /* postfilter sboxes */ \ + /* postfilter sbox s4 */ \ filter_8bit(x, t0, t1, _0f0f0f0fmask, t2); \ \ /* output rotation for sbox2 (<<< 1) */ \ /* output rotation for sbox3 (>>> 1) */ \ - aes_inv_shuf(inv_shift_row, x, t1); \ - vpshufb128_amemld(&sp0044440444044404mask, x, t4); \ - vpshufb128_amemld(&sp1110111010011110mask, x, x); \ - vpaddb128(t1, t1, t2); \ - vpsrl_byte_128(7, t1, t0); \ - vpsll_byte_128(7, t1, t3); \ - vpor128(t0, t2, t0); \ - vpsrl_byte_128(1, t1, t1); \ - vpshufb128_amemld(&sp0222022222000222mask, t0, t0); \ - vpor128(t1, t3, t1); \ + /* permutation */ \ + if_vprolb128(vpshufb128(sp2mask, t4, t0)); \ + if_vprolb128(vpshufb128(sp3mask, t4, t1)); \ + if_vprolb128(vpshufb128(sp1mask, t4, t4)); \ + if_vprolb128(vpshufb128(sp4mask, x, x)); \ + if_vprolb128(vprolb128(1, t0, t0, t2)); \ + if_vprolb128(vprolb128(7, t1, t1, t3)); \ + if_not_vprolb128(aes_inv_shuf(inv_shift_row_n_s2n3_shuffle, t4, t1)); \ + if_not_vprolb128(vpshufb128(sp1mask, t4, t4)); \ + if_not_vprolb128(vpshufb128(sp4mask, x, x)); \ + if_not_vprolb128(vpaddb128(t1, t1, t2)); \ + if_not_vprolb128(vpsrl_byte_128(7, t1, t0)); \ + if_not_vprolb128(vpsll_byte_128(7, t1, t3)); \ + if_not_vprolb128(vpor128(t0, t2, t0)); \ + if_not_vprolb128(vpsrl_byte_128(1, t1, t1)); \ + if_not_vprolb128(vpshufb128(sp2mask, t0, t0)); \ + if_not_vprolb128(vpor128(t1, t3, t1)); \ + if_not_vprolb128(vpshufb128(sp3mask, t1, t1)); \ \ vpxor128(x, t4, t4); \ - vpshufb128_amemld(&sp3033303303303033mask, t1, t1); \ vpxor128(t4, t0, t0); \ vpxor128(t1, t0, t0); \ vpsrldq128(8, t0, x); \ - vpxor128(t0, x, x); \ + fn_out_xor(t0, x, out_xor_dst); + +#define camellia_f_xor_x(t0, x, _) \ + vpxor128(t0, x, x); + +/* + * IN: + * ab: 64-bit AB state + * cd: 64-bit CD state + */ +#define camellia_f(ab, x, t0, t1, t2, t3, t4, inv_shift_row, \ + _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, key) \ + ({ \ + __m128i sp1mask, sp2mask, sp3mask, sp4mask; \ + vmovq128_amemld(&(key), t0); \ + vmovdqa128_memld(&sp1110111010011110mask, sp1mask); \ + 
vmovdqa128_memld(&sp0222022222000222mask, sp2mask); \ + vmovdqa128_memld(&sp3033303303303033mask, sp3mask); \ + vmovdqa128_memld(&sp0044440444044404mask, sp4mask); \ + vpxor128(ab, t0, x); \ + camellia_f_core(x, x, t0, t1, t2, t3, t4, inv_shift_row, \ + _0f0f0f0fmask, pre_s1lo_mask, pre_s1hi_mask, \ + sp1mask, sp2mask, sp3mask, sp4mask, \ + camellia_f_xor_x, _); \ + }) #define vec_rol128(in, out, nrol, t0) \ vpshufd128_0x4e(in, out); \ @@ -1292,24 +1364,31 @@ FUNC_DEC_BLK16(const void *key_table, void *vout, const void *vin, static const __m128i bswap128_mask = M128I_BYTE(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0); -static const __m128i inv_shift_row_and_unpcklbw = - M128I_BYTE(0x00, 0xff, 0x0d, 0xff, 0x0a, 0xff, 0x07, 0xff, - 0x04, 0xff, 0x01, 0xff, 0x0e, 0xff, 0x0b, 0xff); +if_not_vprolb128( + static const __m128i inv_shift_row_and_unpcklbw = + M128I_BYTE(0x00, 0xff, 0x0d, 0xff, 0x0a, 0xff, 0x07, 0xff, + 0x04, 0xff, 0x01, 0xff, 0x0e, 0xff, 0x0b, 0xff); +) static const __m128i sp0044440444044404mask = - M128I_U32(0xffff0404, 0x0404ff04, 0x0d0dff0d, 0x0d0dff0d); + if_aes_subbytes(M128I_U32(0xffff0404, 0x0404ff04, 0x0101ff01, 0x0101ff01)) + if_not_aes_subbytes(M128I_U32(0xffff0404, 0x0404ff04, 0x0d0dff0d, 0x0d0dff0d)); static const __m128i sp1110111010011110mask = - M128I_U32(0x000000ff, 0x000000ff, 0x0bffff0b, 0x0b0b0bff); + if_aes_subbytes(M128I_U32(0x000000ff, 0x000000ff, 0x07ffff07, 0x070707ff)) + if_not_aes_subbytes(M128I_U32(0x000000ff, 0x000000ff, 0x0bffff0b, 0x0b0b0bff)); static const __m128i sp0222022222000222mask = - M128I_U32(0xff060606, 0xff060606, 0x0c0cffff, 0xff0c0c0c); + if_aes_subbytes(if_vprolb128(M128I_U32(0xff030303, 0xff030303, 0x0606ffff, 0xff060606))) + if_aes_subbytes(if_not_vprolb128(M128I_U32(0xff0e0e0e, 0xff0e0e0e, 0x0c0cffff, 0xff0c0c0c))) + if_not_aes_subbytes(if_vprolb128(M128I_U32(0xff070707, 0xff070707, 0x0e0effff, 0xff0e0e0e))) + if_not_aes_subbytes(if_not_vprolb128(M128I_U32(0xff060606, 0xff060606, 0x0c0cffff, 0xff0c0c0c))); static const __m128i sp3033303303303033mask = - M128I_U32(0x04ff0404, 0x04ff0404, 0xff0a0aff, 0x0aff0a0a); - -static const u64 sbox4_input_mask = - U64_BYTE(0x00, 0xff, 0x00, 0x00, 0xff, 0x00, 0x00, 0x00); + if_aes_subbytes(if_vprolb128(M128I_U32(0x02ff0202, 0x02ff0202, 0xff0505ff, 0x05ff0505))) + if_aes_subbytes(if_not_vprolb128(M128I_U32(0x04ff0404, 0x04ff0404, 0xff0202ff, 0x0202ff02))) + if_not_aes_subbytes(if_vprolb128(M128I_U32(0x0aff0a0a, 0x0aff0a0a, 0xff0101ff, 0x01ff0101))) + if_not_aes_subbytes(if_not_vprolb128(M128I_U32(0x04ff0404, 0x04ff0404, 0xff0a0aff, 0x0aff0a0a))); static const u64 sigma1 = U64_U32(0x3BCC908B, 0xA09E667F); @@ -1353,8 +1432,7 @@ camellia_setup128(void *key_table, __m128i x0) vpshufb128_amemld(&bswap128_mask, KL128, KL128); - vmovdqa128_memld(&inv_shift_row_and_unpcklbw, x11); - vmovq128(sbox4_input_mask, x12); + if_not_vprolb128(vmovdqa128_memld(&inv_shift_row_and_unpcklbw, x11)); vmovdqa128_memld(&mask_0f, x13); vmovdqa128_memld(&pre_tf_lo_s1, x14); vmovdqa128_memld(&pre_tf_hi_s1, x15); @@ -1369,18 +1447,18 @@ camellia_setup128(void *key_table, __m128i x0) camellia_f(x2, x4, x1, x5, x6, x7, x8, - x11, x12, x13, x14, x15, sigma1); + x11, x13, x14, x15, sigma1); vpxor128(x4, x3, x3); camellia_f(x3, x2, x1, x5, x6, x7, x8, - x11, x12, x13, x14, x15, sigma2); + x11, x13, x14, x15, sigma2); camellia_f(x2, x3, x1, x5, x6, x7, x8, - x11, x12, x13, x14, x15, sigma3); + x11, x13, x14, x15, sigma3); vpxor128(x4, x3, x3); camellia_f(x3, x4, x1, x5, x6, x7, x8, - x11, x12, x13, x14, x15, sigma4); + x11, 
x13, x14, x15, sigma4); vpslldq128(8, x3, x3); vpxor128(x4, x2, x2); @@ -1581,10 +1659,10 @@ camellia_setup128(void *key_table, __m128i x0) vmovq128_memst(x4, cmll_sub(5, ctx)); vmovq128_memst(x5, cmll_sub(6, ctx)); - vmovq128(*cmll_sub(7, ctx), x7); - vmovq128(*cmll_sub(8, ctx), x8); - vmovq128(*cmll_sub(9, ctx), x9); - vmovq128(*cmll_sub(10, ctx), x10); + vmovq128_amemld(cmll_sub(7, ctx), x7); + vmovq128_amemld(cmll_sub(8, ctx), x8); + vmovq128_amemld(cmll_sub(9, ctx), x9); + vmovq128_amemld(cmll_sub(10, ctx), x10); /* tl = subl(10) ^ (subr(10) & ~subr(8)); */ vpandn128(x10, x8, x15); vpsrldq128(4, x15, x15); @@ -1601,11 +1679,11 @@ camellia_setup128(void *key_table, __m128i x0) vpxor128(x0, x6, x6); vmovq128_memst(x6, cmll_sub(7, ctx)); - vmovq128(*cmll_sub(11, ctx), x11); - vmovq128(*cmll_sub(12, ctx), x12); - vmovq128(*cmll_sub(13, ctx), x13); - vmovq128(*cmll_sub(14, ctx), x14); - vmovq128(*cmll_sub(15, ctx), x15); + vmovq128_amemld(cmll_sub(11, ctx), x11); + vmovq128_amemld(cmll_sub(12, ctx), x12); + vmovq128_amemld(cmll_sub(13, ctx), x13); + vmovq128_amemld(cmll_sub(14, ctx), x14); + vmovq128_amemld(cmll_sub(15, ctx), x15); /* tl = subl(7) ^ (subr(7) & ~subr(9)); */ vpandn128(x7, x9, x1); vpsrldq128(4, x1, x1); @@ -1630,11 +1708,11 @@ camellia_setup128(void *key_table, __m128i x0) vmovq128_memst(x12, cmll_sub(13, ctx)); vmovq128_memst(x13, cmll_sub(14, ctx)); - vmovq128(*cmll_sub(16, ctx), x6); - vmovq128(*cmll_sub(17, ctx), x7); - vmovq128(*cmll_sub(18, ctx), x8); - vmovq128(*cmll_sub(19, ctx), x9); - vmovq128(*cmll_sub(20, ctx), x10); + vmovq128_amemld(cmll_sub(16, ctx), x6); + vmovq128_amemld(cmll_sub(17, ctx), x7); + vmovq128_amemld(cmll_sub(18, ctx), x8); + vmovq128_amemld(cmll_sub(19, ctx), x9); + vmovq128_amemld(cmll_sub(20, ctx), x10); /* tl = subl(18) ^ (subr(18) & ~subr(16)); */ vpandn128(x8, x6, x1); vpsrldq128(4, x1, x1); @@ -1664,10 +1742,10 @@ camellia_setup128(void *key_table, __m128i x0) vpsrldq128(8, x1, x1); vpxor128(x1, x0, x0); - vmovq128(*cmll_sub(21, ctx), x1); - vmovq128(*cmll_sub(22, ctx), x2); - vmovq128(*cmll_sub(23, ctx), x3); - vmovq128(*cmll_sub(24, ctx), x4); + vmovq128_amemld(cmll_sub(21, ctx), x1); + vmovq128_amemld(cmll_sub(22, ctx), x2); + vmovq128_amemld(cmll_sub(23, ctx), x3); + vmovq128_amemld(cmll_sub(24, ctx), x4); vpxor128(x9, x0, x0); vpxor128(x10, x8, x8); @@ -1720,8 +1798,7 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) vpshufb128_amemld(&bswap128_mask, KL128, KL128); vpshufb128_amemld(&bswap128_mask, KR128, KR128); - vmovdqa128_memld(&inv_shift_row_and_unpcklbw, x11); - vmovq128(*&sbox4_input_mask, x12); + if_not_vprolb128(vmovdqa128_memld(&inv_shift_row_and_unpcklbw, x11)); vmovdqa128_memld(&mask_0f, x13); vmovdqa128_memld(&pre_tf_lo_s1, x14); vmovdqa128_memld(&pre_tf_hi_s1, x15); @@ -1737,20 +1814,20 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) camellia_f(x2, x4, x5, x7, x8, x9, x10, - x11, x12, x13, x14, x15, sigma1); + x11, x13, x14, x15, sigma1); vpxor128(x4, x3, x3); camellia_f(x3, x2, x5, x7, x8, x9, x10, - x11, x12, x13, x14, x15, sigma2); + x11, x13, x14, x15, sigma2); vpxor128(x6, x2, x2); camellia_f(x2, x3, x5, x7, x8, x9, x10, - x11, x12, x13, x14, x15, sigma3); + x11, x13, x14, x15, sigma3); vpxor128(x4, x3, x3); vpxor128(KR128, x3, x3); camellia_f(x3, x4, x5, x7, x8, x9, x10, - x11, x12, x13, x14, x15, sigma4); + x11, x13, x14, x15, sigma4); vpslldq128(8, x3, x3); vpxor128(x4, x2, x2); @@ -1768,12 +1845,12 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) camellia_f(x4, x5, x6, x7, 
x8, x9, x10, - x11, x12, x13, x14, x15, sigma5); + x11, x13, x14, x15, sigma5); vpxor128(x5, x3, x3); camellia_f(x3, x5, x6, x7, x8, x9, x10, - x11, x12, x13, x14, x15, sigma6); + x11, x13, x14, x15, sigma6); vpslldq128(8, x3, x3); vpxor128(x5, x4, x4); vpsrldq128(8, x3, x3); @@ -2031,10 +2108,10 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) vmovq128_memst(x4, cmll_sub(5, ctx)); vmovq128_memst(x5, cmll_sub(6, ctx)); - vmovq128(*cmll_sub(7, ctx), x7); - vmovq128(*cmll_sub(8, ctx), x8); - vmovq128(*cmll_sub(9, ctx), x9); - vmovq128(*cmll_sub(10, ctx), x10); + vmovq128_amemld(cmll_sub(7, ctx), x7); + vmovq128_amemld(cmll_sub(8, ctx), x8); + vmovq128_amemld(cmll_sub(9, ctx), x9); + vmovq128_amemld(cmll_sub(10, ctx), x10); /* tl = subl(10) ^ (subr(10) & ~subr(8)); */ vpandn128(x10, x8, x15); vpsrldq128(4, x15, x15); @@ -2051,11 +2128,11 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) vpxor128(x0, x6, x6); vmovq128_memst(x6, cmll_sub(7, ctx)); - vmovq128(*cmll_sub(11, ctx), x11); - vmovq128(*cmll_sub(12, ctx), x12); - vmovq128(*cmll_sub(13, ctx), x13); - vmovq128(*cmll_sub(14, ctx), x14); - vmovq128(*cmll_sub(15, ctx), x15); + vmovq128_amemld(cmll_sub(11, ctx), x11); + vmovq128_amemld(cmll_sub(12, ctx), x12); + vmovq128_amemld(cmll_sub(13, ctx), x13); + vmovq128_amemld(cmll_sub(14, ctx), x14); + vmovq128_amemld(cmll_sub(15, ctx), x15); /* tl = subl(7) ^ (subr(7) & ~subr(9)); */ vpandn128(x7, x9, x1); vpsrldq128(4, x1, x1); @@ -2080,11 +2157,11 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) vmovq128_memst(x12, cmll_sub(13, ctx)); vmovq128_memst(x13, cmll_sub(14, ctx)); - vmovq128(*cmll_sub(16, ctx), x6); - vmovq128(*cmll_sub(17, ctx), x7); - vmovq128(*cmll_sub(18, ctx), x8); - vmovq128(*cmll_sub(19, ctx), x9); - vmovq128(*cmll_sub(20, ctx), x10); + vmovq128_amemld(cmll_sub(16, ctx), x6); + vmovq128_amemld(cmll_sub(17, ctx), x7); + vmovq128_amemld(cmll_sub(18, ctx), x8); + vmovq128_amemld(cmll_sub(19, ctx), x9); + vmovq128_amemld(cmll_sub(20, ctx), x10); /* tl = subl(18) ^ (subr(18) & ~subr(16)); */ vpandn128(x8, x6, x1); vpsrldq128(4, x1, x1); @@ -2114,10 +2191,10 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) vpsrldq128(8, x1, x1); vpxor128(x1, x0, x0); - vmovq128(*cmll_sub(21, ctx), x1); - vmovq128(*cmll_sub(22, ctx), x2); - vmovq128(*cmll_sub(23, ctx), x3); - vmovq128(*cmll_sub(24, ctx), x4); + vmovq128_amemld(cmll_sub(21, ctx), x1); + vmovq128_amemld(cmll_sub(22, ctx), x2); + vmovq128_amemld(cmll_sub(23, ctx), x3); + vmovq128_amemld(cmll_sub(24, ctx), x4); vpxor128(x9, x0, x0); vpxor128(x10, x8, x8); @@ -2131,14 +2208,14 @@ camellia_setup256(void *key_table, __m128i x0, __m128i x1) vmovq128_memst(x10, cmll_sub(21, ctx)); vmovq128_memst(x1, cmll_sub(22, ctx)); - vmovq128(*cmll_sub(25, ctx), x5); - vmovq128(*cmll_sub(26, ctx), x6); - vmovq128(*cmll_sub(27, ctx), x7); - vmovq128(*cmll_sub(28, ctx), x8); - vmovq128(*cmll_sub(29, ctx), x9); - vmovq128(*cmll_sub(30, ctx), x10); - vmovq128(*cmll_sub(31, ctx), x11); - vmovq128(*cmll_sub(32, ctx), x12); + vmovq128_amemld(cmll_sub(25, ctx), x5); + vmovq128_amemld(cmll_sub(26, ctx), x6); + vmovq128_amemld(cmll_sub(27, ctx), x7); + vmovq128_amemld(cmll_sub(28, ctx), x8); + vmovq128_amemld(cmll_sub(29, ctx), x9); + vmovq128_amemld(cmll_sub(30, ctx), x10); + vmovq128_amemld(cmll_sub(31, ctx), x11); + vmovq128_amemld(cmll_sub(32, ctx), x12); /* tl = subl(26) ^ (subr(26) & ~subr(24)); */ vpandn128(x6, x4, x15); @@ -2223,7 +2300,7 @@ FUNC_KEY_SETUP(void *key_table, const void *vkey, unsigned int 
keylen)
     case 24:
       vmovdqu128_memld(key, x0);
-      vmovq128(*(uint64_unaligned_t *)(key + 16), x1);
+      vmovq128_amemld((uint64_unaligned_t *)(key + 16), x1);
       x2[0] = -1;
       x2[1] = -1;
--
2.51.0

From jussi.kivilinna at iki.fi  Sun Dec 21 17:21:40 2025
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 21 Dec 2025 18:21:40 +0200
Subject: [PATCH] blake2: avoid AVX/AVX2/AVX512 when CPU has high vector inst latency
Message-ID: <20251221162140.1170976-1-jussi.kivilinna@iki.fi>

* cipher/blake2.c (blake2b_init_ctx, blake2s_init_ctx): Disable
AVX/AVX2/AVX512 implementation if integer vector latency is higher than 1.
* src/g10lib.h (_gcry_get_hwf_int_vector_latency): New.
* src/hwf-common.h (_gcry_hwf_detect_x86): Add 'int_vector_latency'.
* src/hwf-x86.c (detect_x86_gnuc): Detect Zen5 and add 'int_vector_latency'.
(_gcry_hwf_detect_x86): Add 'int_vector_latency'.
* src/hwfeatures.c (hwf_int_vector_latency)
(_gcry_get_hwf_int_vector_latency): New.
--

The Blake2s/Blake2b AVX/AVX2/AVX512 implementations are slower than the
generic C implementation when the CPU's integer vector latency is higher
than 1 (for example, AMD Zen5 has an int-vector latency of 2). Therefore
add integer vector latency detection for x86 CPUs and use the generic C
implementation for Blake2 when the latency is greater than 1.

Generic C with AMD Zen5:
             |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512 |     0.473 ns/B      2016 MiB/s      2.72 c/B      5750
 BLAKE2S_256 |     0.798 ns/B      1195 MiB/s      4.59 c/B      5750

AVX512 with AMD Zen5:
             |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512 |     0.923 ns/B      1033 MiB/s      5.31 c/B      5750
 BLAKE2S_256 |      1.42 ns/B     672.4 MiB/s      8.15 c/B      5749

Signed-off-by: Jussi Kivilinna
---
 cipher/blake2.c  | 12 ++++++++----
 src/g10lib.h     |  1 +
 src/hwf-common.h |  2 +-
 src/hwf-x86.c    | 38 ++++++++++++++++++++++++++++++++++----
 src/hwfeatures.c | 15 ++++++++++++++-
 5 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/cipher/blake2.c b/cipher/blake2.c
index 1a04fbd8..2fce448d 100644
--- a/cipher/blake2.c
+++ b/cipher/blake2.c
@@ -484,17 +484,19 @@ static gcry_err_code_t blake2b_init_ctx(void *ctx, unsigned int flags,
 {
   BLAKE2B_CONTEXT *c = ctx;
   unsigned int features = _gcry_get_hw_features ();
+  unsigned int int_vec_lat = _gcry_get_hwf_int_vector_latency ();
 
+  (void)int_vec_lat;
   (void)features;
   (void)flags;
 
   memset (c, 0, sizeof (*c));
 
 #ifdef USE_AVX2
-  c->use_avx2 = !!(features & HWF_INTEL_AVX2);
+  c->use_avx2 = !!(features & HWF_INTEL_AVX2) && (int_vec_lat <= 1);
 #endif
 #ifdef USE_AVX512
-  c->use_avx512 = !!(features & HWF_INTEL_AVX512);
+  c->use_avx512 = !!(features & HWF_INTEL_AVX512) && (int_vec_lat <= 1);
 #endif
 
   c->outlen = dbits / 8;
@@ -821,17 +823,19 @@ static gcry_err_code_t blake2s_init_ctx(void *ctx, unsigned int flags,
 {
   BLAKE2S_CONTEXT *c = ctx;
   unsigned int features = _gcry_get_hw_features ();
+  unsigned int int_vec_lat = _gcry_get_hwf_int_vector_latency ();
 
+  (void)int_vec_lat;
   (void)features;
   (void)flags;
 
   memset (c, 0, sizeof (*c));
 
 #ifdef USE_AVX
-  c->use_avx = !!(features & HWF_INTEL_AVX);
+  c->use_avx = !!(features & HWF_INTEL_AVX) && (int_vec_lat <= 1);
 #endif
 #ifdef USE_AVX512
-  c->use_avx512 = !!(features & HWF_INTEL_AVX512);
+  c->use_avx512 = !!(features & HWF_INTEL_AVX512) && (int_vec_lat <= 1);
 #endif
 
   c->outlen = dbits / 8;
diff --git a/src/g10lib.h b/src/g10lib.h
index bb735e77..c229d717 100644
--- a/src/g10lib.h
+++ b/src/g10lib.h
@@ -292,6 +292,7 @@ gpg_err_code_t _gcry_disable_hw_feature (const char *name);
 void _gcry_detect_hw_features (void);
 unsigned int _gcry_get_hw_features (void);
 const char *_gcry_enum_hw_features (int idx, unsigned int *r_feature);
+int _gcry_get_hwf_int_vector_latency (void);
 
 const char *_gcry_get_sysconfdir (void);
 
diff --git a/src/hwf-common.h b/src/hwf-common.h
index 749ff040..ef9ffdf9 100644
--- a/src/hwf-common.h
+++ b/src/hwf-common.h
@@ -20,7 +20,7 @@
 #ifndef HWF_COMMON_H
 #define HWF_COMMON_H
 
-unsigned int _gcry_hwf_detect_x86 (void);
+unsigned int _gcry_hwf_detect_x86 (int *int_vector_latency);
 unsigned int _gcry_hwf_detect_arm (void);
 unsigned int _gcry_hwf_detect_ppc (void);
 unsigned int _gcry_hwf_detect_s390x (void);
diff --git a/src/hwf-x86.c b/src/hwf-x86.c
index 54af1c83..f056641c 100644
--- a/src/hwf-x86.c
+++ b/src/hwf-x86.c
@@ -184,7 +184,9 @@ get_xgetbv(void)
 #ifdef HAS_X86_CPUID
 
 static unsigned int
-detect_x86_gnuc (void)
+detect_x86_gnuc (
+  int *int_vector_latency
+)
 {
   union
   {
@@ -198,10 +200,15 @@ detect_x86_gnuc (void)
   unsigned int fms, family, model;
   unsigned int result = 0;
   unsigned int is_amd_cpu = 0;
+  unsigned int has_avx512bmm = 0;
+  unsigned int has_sse3 = 0;
 
   (void)os_supports_avx_avx2_registers;
   (void)os_supports_avx512_registers;
 
+  /* Assume integer vector latency of 1 by default. */
+  *int_vector_latency = 1;
+
   if (!is_cpuid_available())
     return 0;
 
@@ -320,7 +327,8 @@ detect_x86_gnuc (void)
    * too high max_cpuid_level, so don't check level 7 if processor does not
    * support SSE3 (as cpuid:7 contains only features for newer processors).
    * Source: http://www.sandpile.org/x86/cpuid.htm */
-  if (max_cpuid_level >= 7 && (features & 0x00000001))
+  has_sse3 = !!(features & 0x00000001);
+  if (max_cpuid_level >= 7 && has_sse3)
     {
       /* Get CPUID:7 contains further Intel feature flags. */
       get_cpuid(7, NULL, &features, &features2, NULL);
@@ -385,6 +393,16 @@ detect_x86_gnuc (void)
         result |= HWF_INTEL_GFNI;
     }
 
+  /* Check additional feature flags. */
+  if (max_cpuid_level >= 0x21 && has_sse3)
+    {
+      get_cpuid(0x21, &features, NULL, NULL, NULL);
+      if (features & (1 << 23))
+        {
+          has_avx512bmm = 1;
+        }
+    }
+
   if ((result & HWF_INTEL_CPU) && family == 6)
     {
       /* These Intel Core processor models have SHLD/SHRD instruction that
@@ -413,6 +431,12 @@ detect_x86_gnuc (void)
         }
     }
 
+  if (is_amd_cpu && (family == 0x1a) && !has_avx512bmm)
+    {
+      /* Zen5 has integer vector instruction latency of 2. */
+      *int_vector_latency = 2;
+    }
+
 #ifdef ENABLE_FORCE_SOFT_HWFEATURES
   /* Soft HW features mark functionality that is available on all systems
    * but not feasible to use because of slow HW implementation. */
@@ -428,6 +452,9 @@ detect_x86_gnuc (void)
    * only for those Intel processors that benefit from the SHLD
    * instruction. Enabled here unconditionally as requested. */
   result |= HWF_INTEL_FAST_SHLD;
+
+  /* Assume that integer vector instructions have minimum latency. */
+  *int_vector_latency = 0;
 #endif
 
   return result;
@@ -436,11 +463,14 @@ detect_x86_gnuc (void)
 
 
 unsigned int
-_gcry_hwf_detect_x86 (void)
+_gcry_hwf_detect_x86 (
+  int *int_vector_latency
+)
 {
 #if defined (HAS_X86_CPUID)
-  return detect_x86_gnuc ();
+  return detect_x86_gnuc (int_vector_latency);
 #else
+  *int_vector_latency = 0;
   return 0;
 #endif
 }
diff --git a/src/hwfeatures.c b/src/hwfeatures.c
index 1b107e63..1c3b8034 100644
--- a/src/hwfeatures.c
+++ b/src/hwfeatures.c
@@ -135,6 +135,9 @@ static unsigned int disabled_hw_features;
    available.  */
 static unsigned int hw_features;
 
+/* Latency for simple integer vector instructions. */
+static int hwf_int_vector_latency;
+
 
 
 static const char *
@@ -204,6 +207,14 @@ _gcry_get_hw_features (void)
 }
 
 
+/* Return latency for integer vector instructions.  */
+int
+_gcry_get_hwf_int_vector_latency (void)
+{
+  return hwf_int_vector_latency;
+}
+
+
 /* Enumerate all features.  The caller is expected to start with an
    IDX of 0 and then increment IDX until NULL is returned.  */
 const char *
@@ -283,9 +294,11 @@ _gcry_detect_hw_features (void)
 
   parse_hwf_deny_file ();
 
+  hwf_int_vector_latency = -1;
+
 #if defined (HAVE_CPU_ARCH_X86)
   {
-    hw_features = _gcry_hwf_detect_x86 ();
+    hw_features = _gcry_hwf_detect_x86 (&hwf_int_vector_latency);
  }
 #elif defined (HAVE_CPU_ARCH_ARM)
  {
--
2.51.0
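The same gating pattern generalizes to any libgcrypt module that wants to skip
a SIMD path on cores with slow integer vector units. The sketch below is
illustrative only and not part of the patch; it assumes the internal helpers
declared in the diff above (_gcry_get_hw_features, _gcry_get_hwf_int_vector_latency
and HWF_INTEL_AVX2 from src/g10lib.h), while EXAMPLE_CONTEXT and example_init_ctx
are hypothetical names used purely for illustration.

/* Illustrative sketch (not part of the patch): gate an AVX2 code path on
   both the HWF feature bit and the integer vector latency hint, mirroring
   the blake2b_init_ctx/blake2s_init_ctx change above.  Builds only inside
   the libgcrypt tree, where g10lib.h is available.  */
#include "g10lib.h"  /* _gcry_get_hw_features, _gcry_get_hwf_int_vector_latency */

typedef struct
{
  int use_avx2;  /* non-zero if the AVX2 path may be used */
} EXAMPLE_CONTEXT;  /* hypothetical context, for illustration only */

static void
example_init_ctx (EXAMPLE_CONTEXT *c)
{
  unsigned int features = _gcry_get_hw_features ();
  int int_vec_lat = _gcry_get_hwf_int_vector_latency ();

  /* Latency values seen in the patch: -1 (not set, non-x86), 0 or 1 (fast),
     2 (AMD Zen5).  The "<= 1" check therefore keeps AVX2 enabled unless the
     CPU is known to have slow integer vector instructions.  */
  c->use_avx2 = !!(features & HWF_INTEL_AVX2) && (int_vec_lat <= 1);
}

Doing the check once at context-initialization time, as the patch does, avoids
re-querying the hardware feature state on every update call.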