[PATCH] rijndael: add VAES/AVX512 accelerated implementation
Jussi Kivilinna
jussi.kivilinna at iki.fi
Thu Jan 1 16:34:39 CET 2026
* cipher/Makefile.am: Add 'rijndael-vaes-avx512-amd64.S'.
* cipher/rijndael-internal.h (USE_VAES_AVX512): New.
(RIJNDAEL_context_s) [USE_VAES_AVX512]: Add 'use_vaes_avx512'.
* cipher/rijndael-vaes-avx2-amd64.S
(_gcry_vaes_avx2_ocb_crypt_amd64): Minor optimization for aligned
blk8 OCB path.
* cipher/rijndael-vaes-avx512-amd64.S: New.
* cipher/rijndael-vaes.c [USE_VAES_AVX512]
(_gcry_vaes_avx512_cbc_dec_amd64, _gcry_vaes_avx512_cfb_dec_amd64)
(_gcry_vaes_avx512_ctr_enc_amd64)
(_gcry_vaes_avx512_ctr32le_enc_amd64)
(_gcry_vaes_avx512_ocb_aligned_crypt_amd64)
(_gcry_vaes_avx512_xts_crypt_amd64)
(_gcry_vaes_avx512_ecb_crypt_amd64): New.
(_gcry_aes_vaes_ecb_crypt, _gcry_aes_vaes_cbc_dec)
(_gcry_aes_vaes_cfb_dec, _gcry_aes_vaes_ctr_enc)
(_gcry_aes_vaes_ctr32le_enc, _gcry_aes_vaes_ocb_crypt)
(_gcry_aes_vaes_ocb_auth, _gcry_aes_vaes_xts_crypt)
[USE_VAES_AVX512]: Add AVX512 code paths.
* cipher/rijndael.c (do_setkey) [USE_VAES_AVX512]: Add setup for
'ctx->use_vaes_avx512'.
* configure.ac: Add 'rijndael-vaes-avx512-amd64.lo'.
--
Commit adds VAES/AVX512 acceleration for AES. The new implementation
is about 2x faster (for parallel modes, such as OCB) compared to the
VAES/AVX2 implementation on AMD zen5. On AMD zen4 and Intel
tigerlake, VAES/AVX512 is about the same speed as VAES/AVX2 since
the hardware supports only 256-bit wide processing for AES
instructions.
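
For reference, the expected wiring of the new code path is sketched
below. This is an illustrative sketch only, not part of the diff; the
HWF_INTEL_AVX512 / HWF_INTEL_VAES_VPCLMUL flag names and the exact
field and parameter names are assumptions modelled on the existing
VAES/AVX2 glue in rijndael-vaes.c.

  #ifdef USE_VAES_AVX512
    /* do_setkey(): enable the zmm-based path only when the CPU provides
       both AVX512 and the VAES/VPCLMUL instructions. */
    ctx->use_vaes_avx512 = (hwfeatures & HWF_INTEL_AVX512)
                           && (hwfeatures & HWF_INTEL_VAES_VPCLMUL);
  #endif

  /* Bulk glue in rijndael-vaes.c, e.g. CTR: the AVX512 routine consumes
     chunks of 16 or more blocks and internally falls through to the AVX2
     routine for the remaining blocks, so the C side only picks the entry
     point. */
  void
  _gcry_aes_vaes_ctr_enc (void *context, unsigned char *ctr,
                          void *outbuf, const void *inbuf, size_t nblocks)
  {
    RIJNDAEL_context *ctx = context;

  #ifdef USE_VAES_AVX512
    if (ctx->use_vaes_avx512)
      {
        _gcry_vaes_avx512_ctr_enc_amd64 (ctx->keyschenc32, ctr, outbuf,
                                         inbuf, nblocks, ctx->rounds);
        return;
      }
  #endif

    _gcry_vaes_avx2_ctr_enc_amd64 (ctx->keyschenc32, ctr, outbuf, inbuf,
                                   nblocks, ctx->rounds);
  }
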
Benchmark on AMD Ryzen 9 9950X3D (zen5):
Before (VAES/AVX2):
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.029 ns/B 32722 MiB/s 0.162 c/B 5566±1
ECB dec | 0.029 ns/B 32824 MiB/s 0.162 c/B 5563
CBC enc | 0.449 ns/B 2123 MiB/s 2.50 c/B 5563
CBC dec | 0.029 ns/B 32735 MiB/s 0.162 c/B 5566
CFB enc | 0.449 ns/B 2122 MiB/s 2.50 c/B 5565
CFB dec | 0.029 ns/B 32752 MiB/s 0.162 c/B 5565
CTR enc | 0.030 ns/B 31694 MiB/s 0.167 c/B 5565
CTR dec | 0.030 ns/B 31727 MiB/s 0.167 c/B 5568
XTS enc | 0.033 ns/B 28776 MiB/s 0.184 c/B 5560
XTS dec | 0.033 ns/B 28517 MiB/s 0.186 c/B 5551±4
GCM enc | 0.074 ns/B 12841 MiB/s 0.413 c/B 5565
GCM dec | 0.075 ns/B 12658 MiB/s 0.419 c/B 5566
GCM auth | 0.045 ns/B 21322 MiB/s 0.249 c/B 5566
OCB enc | 0.030 ns/B 32298 MiB/s 0.164 c/B 5543±4
OCB dec | 0.029 ns/B 32476 MiB/s 0.163 c/B 5545±6
OCB auth | 0.029 ns/B 32961 MiB/s 0.161 c/B 5561±2
After (VAES/AVX512):
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.015 ns/B 62011 MiB/s 0.085 c/B 5553±5
ECB dec | 0.015 ns/B 63315 MiB/s 0.084 c/B 5552±3
CBC enc | 0.449 ns/B 2122 MiB/s 2.50 c/B 5565
CBC dec | 0.015 ns/B 63800 MiB/s 0.083 c/B 5557±4
CFB enc | 0.449 ns/B 2122 MiB/s 2.50 c/B 5562
CFB dec | 0.015 ns/B 62510 MiB/s 0.085 c/B 5557±1
CTR enc | 0.016 ns/B 60975 MiB/s 0.087 c/B 5564
CTR dec | 0.016 ns/B 60737 MiB/s 0.087 c/B 5556±2
XTS enc | 0.018 ns/B 53861 MiB/s 0.098 c/B 5561±1
XTS dec | 0.018 ns/B 53604 MiB/s 0.099 c/B 5549±3
GCM enc | 0.037 ns/B 25806 MiB/s 0.206 c/B 5561±3
GCM dec | 0.038 ns/B 25223 MiB/s 0.210 c/B 5555±5
GCM auth | 0.021 ns/B 44365 MiB/s 0.120 c/B 5562
OCB enc | 0.016 ns/B 61035 MiB/s 0.087 c/B 5545±6
OCB dec | 0.015 ns/B 62190 MiB/s 0.085 c/B 5544±5
OCB auth | 0.015 ns/B 63886 MiB/s 0.083 c/B 5543±7
Benchmark on AMD Ryzen 9 7900X (zen4):
Before (VAES/AVX2):
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.028 ns/B 33759 MiB/s 0.160 c/B 5676
ECB dec | 0.028 ns/B 33560 MiB/s 0.161 c/B 5676
CBC enc | 0.441 ns/B 2165 MiB/s 2.50 c/B 5676
CBC dec | 0.029 ns/B 32766 MiB/s 0.165 c/B 5677±2
CFB enc | 0.440 ns/B 2165 MiB/s 2.50 c/B 5676
CFB dec | 0.029 ns/B 33053 MiB/s 0.164 c/B 5686±4
CTR enc | 0.029 ns/B 32420 MiB/s 0.167 c/B 5677±1
CTR dec | 0.029 ns/B 32531 MiB/s 0.167 c/B 5690±5
XTS enc | 0.038 ns/B 25081 MiB/s 0.215 c/B 5650
XTS dec | 0.038 ns/B 25020 MiB/s 0.217 c/B 5704±6
GCM enc | 0.067 ns/B 14170 MiB/s 0.370 c/B 5500
GCM dec | 0.067 ns/B 14205 MiB/s 0.369 c/B 5500
GCM auth | 0.038 ns/B 25110 MiB/s 0.209 c/B 5500
OCB enc | 0.030 ns/B 31579 MiB/s 0.172 c/B 5708±20
OCB dec | 0.030 ns/B 31613 MiB/s 0.173 c/B 5722±5
OCB auth | 0.029 ns/B 32535 MiB/s 0.167 c/B 5688±1
After (VAES/AVX512):
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.028 ns/B 33551 MiB/s 0.161 c/B 5676
ECB dec | 0.029 ns/B 33346 MiB/s 0.162 c/B 5675
CBC enc | 0.440 ns/B 2166 MiB/s 2.50 c/B 5675
CBC dec | 0.029 ns/B 33308 MiB/s 0.163 c/B 5685±3
CFB enc | 0.440 ns/B 2165 MiB/s 2.50 c/B 5675
CFB dec | 0.029 ns/B 33254 MiB/s 0.163 c/B 5671±1
CTR enc | 0.029 ns/B 33367 MiB/s 0.163 c/B 5686
CTR dec | 0.029 ns/B 33447 MiB/s 0.162 c/B 5687
XTS enc | 0.034 ns/B 27705 MiB/s 0.195 c/B 5673±1
XTS dec | 0.035 ns/B 27429 MiB/s 0.197 c/B 5677
GCM enc | 0.057 ns/B 16625 MiB/s 0.324 c/B 5652
GCM dec | 0.059 ns/B 16094 MiB/s 0.326 c/B 5510
GCM auth | 0.030 ns/B 31982 MiB/s 0.164 c/B 5500
OCB enc | 0.030 ns/B 31630 MiB/s 0.166 c/B 5500
OCB dec | 0.030 ns/B 32214 MiB/s 0.163 c/B 5500
OCB auth | 0.029 ns/B 33413 MiB/s 0.157 c/B 5500
Benchmark on Intel Core i3-1115G4I (tigerlake):
Before (VAES/AVX2):
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.038 ns/B 25068 MiB/s 0.156 c/B 4090
ECB dec | 0.038 ns/B 25157 MiB/s 0.155 c/B 4090
CBC enc | 0.459 ns/B 2080 MiB/s 1.88 c/B 4090
CBC dec | 0.038 ns/B 25091 MiB/s 0.155 c/B 4090
CFB enc | 0.458 ns/B 2081 MiB/s 1.87 c/B 4090
CFB dec | 0.038 ns/B 25176 MiB/s 0.155 c/B 4090
CTR enc | 0.039 ns/B 24466 MiB/s 0.159 c/B 4090
CTR dec | 0.039 ns/B 24428 MiB/s 0.160 c/B 4090
XTS enc | 0.057 ns/B 16760 MiB/s 0.233 c/B 4090
XTS dec | 0.056 ns/B 16952 MiB/s 0.230 c/B 4090
GCM enc | 0.102 ns/B 9344 MiB/s 0.417 c/B 4090
GCM dec | 0.102 ns/B 9312 MiB/s 0.419 c/B 4090
GCM auth | 0.063 ns/B 15243 MiB/s 0.256 c/B 4090
OCB enc | 0.042 ns/B 22451 MiB/s 0.174 c/B 4090
OCB dec | 0.042 ns/B 22613 MiB/s 0.172 c/B 4090
OCB auth | 0.040 ns/B 23770 MiB/s 0.164 c/B 4090
After (VAES/AVX512):
AES | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
ECB enc | 0.040 ns/B 24094 MiB/s 0.162 c/B 4097±3
ECB dec | 0.040 ns/B 24052 MiB/s 0.162 c/B 4097±3
CBC enc | 0.458 ns/B 2080 MiB/s 1.88 c/B 4090
CBC dec | 0.039 ns/B 24385 MiB/s 0.160 c/B 4097±3
CFB enc | 0.458 ns/B 2080 MiB/s 1.87 c/B 4090
CFB dec | 0.039 ns/B 24403 MiB/s 0.160 c/B 4097±3
CTR enc | 0.040 ns/B 24119 MiB/s 0.162 c/B 4097±3
CTR dec | 0.040 ns/B 24095 MiB/s 0.162 c/B 4097±3
XTS enc | 0.048 ns/B 19891 MiB/s 0.196 c/B 4097±3
XTS dec | 0.048 ns/B 20077 MiB/s 0.195 c/B 4097±3
GCM enc | 0.084 ns/B 11417 MiB/s 0.342 c/B 4097±3
GCM dec | 0.084 ns/B 11373 MiB/s 0.344 c/B 4097±3
GCM auth | 0.045 ns/B 21402 MiB/s 0.183 c/B 4097±3
OCB enc | 0.040 ns/B 23946 MiB/s 0.163 c/B 4097±3
OCB dec | 0.040 ns/B 23760 MiB/s 0.164 c/B 4097±4
OCB auth | 0.041 ns/B 23083 MiB/s 0.169 c/B 4097±4
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
cipher/Makefile.am | 1 +
cipher/rijndael-internal.h | 11 +-
cipher/rijndael-vaes-avx2-amd64.S | 7 +-
cipher/rijndael-vaes-avx512-amd64.S | 2471 +++++++++++++++++++++++++++
cipher/rijndael-vaes.c | 180 +-
cipher/rijndael.c | 5 +
configure.ac | 1 +
7 files changed, 2668 insertions(+), 8 deletions(-)
create mode 100644 cipher/rijndael-vaes-avx512-amd64.S
diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index bbcd518a..11bb19d7 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \
rijndael-amd64.S rijndael-arm.S \
rijndael-ssse3-amd64.c rijndael-ssse3-amd64-asm.S \
rijndael-vaes.c rijndael-vaes-avx2-amd64.S \
+ rijndael-vaes-avx512-amd64.S \
rijndael-vaes-i386.c rijndael-vaes-avx2-i386.S \
rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S \
rijndael-armv8-aarch64-ce.S rijndael-aarch64.S \
diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h
index 15084a69..bb8f97a0 100644
--- a/cipher/rijndael-internal.h
+++ b/cipher/rijndael-internal.h
@@ -89,7 +89,7 @@
# endif
#endif /* ENABLE_AESNI_SUPPORT */
-/* USE_VAES inidicates whether to compile with AMD64 VAES code. */
+/* USE_VAES indicates whether to compile with AMD64 VAES/AVX2 code. */
#undef USE_VAES
#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
@@ -99,6 +99,12 @@
# define USE_VAES 1
#endif
+/* USE_VAES_AVX512 indicates whether to compile with AMD64 VAES/AVX512 code. */
+#undef USE_VAES_AVX512
+#if defined(USE_VAES) && defined(ENABLE_AVX512_SUPPORT)
+# define USE_VAES_AVX512 1
+#endif
+
/* USE_VAES_I386 inidicates whether to compile with i386 VAES code. */
#undef USE_VAES_I386
#if (defined(HAVE_COMPATIBLE_GCC_I386_PLATFORM_AS) || \
@@ -210,6 +216,9 @@ typedef struct RIJNDAEL_context_s
unsigned int use_avx:1; /* AVX shall be used by AES-NI implementation. */
unsigned int use_avx2:1; /* AVX2 shall be used by AES-NI implementation. */
#endif /*USE_AESNI*/
+#ifdef USE_VAES_AVX512
+ unsigned int use_vaes_avx512:1; /* AVX512 shall be used by VAES implementation. */
+#endif /*USE_VAES_AVX512*/
#ifdef USE_S390X_CRYPTO
byte km_func;
byte km_func_xts;
diff --git a/cipher/rijndael-vaes-avx2-amd64.S b/cipher/rijndael-vaes-avx2-amd64.S
index 51ccf932..07e6f1ca 100644
--- a/cipher/rijndael-vaes-avx2-amd64.S
+++ b/cipher/rijndael-vaes-avx2-amd64.S
@@ -2370,16 +2370,11 @@ _gcry_vaes_avx2_ocb_crypt_amd64:
leaq -8(%r8), %r8;
leal 8(%esi), %esi;
- tzcntl %esi, %eax;
- shll $4, %eax;
vpxor (0 * 16)(%rsp), %ymm15, %ymm5;
vpxor (2 * 16)(%rsp), %ymm15, %ymm6;
vpxor (4 * 16)(%rsp), %ymm15, %ymm7;
-
- vpxor (2 * 16)(%r14), %xmm15, %xmm13; /* offset ^ first key ^ L[2] */
- vpxor (%r14, %rax), %xmm13, %xmm14; /* offset ^ first key ^ L[2] ^ L[ntz{nblk+8}] */
- vinserti128 $1, %xmm14, %ymm13, %ymm14;
+ vpxor (6 * 16)(%rsp), %ymm15, %ymm14;
cmpl $1, %r15d;
jb .Locb_aligned_blk8_dec;
diff --git a/cipher/rijndael-vaes-avx512-amd64.S b/cipher/rijndael-vaes-avx512-amd64.S
new file mode 100644
index 00000000..b7dba5e3
--- /dev/null
+++ b/cipher/rijndael-vaes-avx512-amd64.S
@@ -0,0 +1,2471 @@
+/* VAES/AVX512 AMD64 accelerated AES for Libgcrypt
+ * Copyright (C) 2026 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#if defined(__x86_64__)
+#include <config.h>
+#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+ defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
+ defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) && \
+ defined(ENABLE_AVX512_SUPPORT) && \
+ defined(HAVE_GCC_INLINE_ASM_VAES_VPCLMUL)
+
+#include "asm-common-amd64.h"
+
+.text
+
+/**********************************************************************
+ helper macros
+ **********************************************************************/
+#define no(...) /*_*/
+#define yes(...) __VA_ARGS__
+
+#define AES_OP8(op, key, b0, b1, b2, b3, b4, b5, b6, b7) \
+ op key, b0, b0; \
+ op key, b1, b1; \
+ op key, b2, b2; \
+ op key, b3, b3; \
+ op key, b4, b4; \
+ op key, b5, b5; \
+ op key, b6, b6; \
+ op key, b7, b7;
+
+#define VAESENC8(key, b0, b1, b2, b3, b4, b5, b6, b7) \
+ AES_OP8(vaesenc, key, b0, b1, b2, b3, b4, b5, b6, b7)
+
+#define VAESDEC8(key, b0, b1, b2, b3, b4, b5, b6, b7) \
+ AES_OP8(vaesdec, key, b0, b1, b2, b3, b4, b5, b6, b7)
+
+#define XOR8(key, b0, b1, b2, b3, b4, b5, b6, b7) \
+ AES_OP8(vpxord, key, b0, b1, b2, b3, b4, b5, b6, b7)
+
+#define AES_OP4(op, key, b0, b1, b2, b3) \
+ op key, b0, b0; \
+ op key, b1, b1; \
+ op key, b2, b2; \
+ op key, b3, b3;
+
+#define VAESENC4(key, b0, b1, b2, b3) \
+ AES_OP4(vaesenc, key, b0, b1, b2, b3)
+
+#define VAESDEC4(key, b0, b1, b2, b3) \
+ AES_OP4(vaesdec, key, b0, b1, b2, b3)
+
+#define XOR4(key, b0, b1, b2, b3) \
+ AES_OP4(vpxord, key, b0, b1, b2, b3)
+
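+/* Note: applied to 512-bit registers, one AES_OP8 expansion processes
+ * 8 x 4 = 32 AES blocks in parallel and AES_OP4 covers 4 x 4 = 16 blocks. */
+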
+/**********************************************************************
+ CBC-mode decryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_cbc_dec_amd64,@function)
+.globl _gcry_vaes_avx512_cbc_dec_amd64
+.align 16
+_gcry_vaes_avx512_cbc_dec_amd64:
+ /* input:
+ * %rdi: round keys
+ * %rsi: iv
+ * %rdx: dst
+ * %rcx: src
+ * %r8: nblocks
+ * %r9: nrounds
+ */
+ CFI_STARTPROC();
+
+ cmpq $16, %r8;
+ jb .Lcbc_dec_skip_avx512;
+
+ spec_stop_avx512;
+
+ /* Load IV. */
+ vmovdqu (%rsi), %xmm15;
+
+ /* Load first and last key. */
+ leal (, %r9d, 4), %eax;
+ vbroadcasti32x4 (%rdi), %zmm30;
+ vbroadcasti32x4 (%rdi, %rax, 4), %zmm31;
+
+ /* Process 32 blocks per loop. */
+.align 16
+.Lcbc_dec_blk32:
+ cmpq $32, %r8;
+ jb .Lcbc_dec_blk16;
+
+ leaq -32(%r8), %r8;
+
+ /* Load input and xor first key. Update IV. */
+ vmovdqu32 (0 * 16)(%rcx), %zmm0;
+ vshufi32x4 $0b10010011, %zmm0, %zmm0, %zmm9;
+ vmovdqu32 (4 * 16)(%rcx), %zmm1;
+ vmovdqu32 (8 * 16)(%rcx), %zmm2;
+ vmovdqu32 (12 * 16)(%rcx), %zmm3;
+ vmovdqu32 (16 * 16)(%rcx), %zmm4;
+ vmovdqu32 (20 * 16)(%rcx), %zmm5;
+ vmovdqu32 (24 * 16)(%rcx), %zmm6;
+ vmovdqu32 (28 * 16)(%rcx), %zmm7;
+ vinserti32x4 $0, %xmm15, %zmm9, %zmm9;
+ vpxord %zmm30, %zmm0, %zmm0;
+ vpxord %zmm30, %zmm1, %zmm1;
+ vpxord %zmm30, %zmm2, %zmm2;
+ vpxord %zmm30, %zmm3, %zmm3;
+ vpxord %zmm30, %zmm4, %zmm4;
+ vpxord %zmm30, %zmm5, %zmm5;
+ vpxord %zmm30, %zmm6, %zmm6;
+ vpxord %zmm30, %zmm7, %zmm7;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm8;
+ vpxord %zmm31, %zmm9, %zmm9;
+ vpxord (3 * 16)(%rcx), %zmm31, %zmm10;
+ vpxord (7 * 16)(%rcx), %zmm31, %zmm11;
+ vpxord (11 * 16)(%rcx), %zmm31, %zmm12;
+ vpxord (15 * 16)(%rcx), %zmm31, %zmm13;
+ vpxord (19 * 16)(%rcx), %zmm31, %zmm14;
+ vpxord (23 * 16)(%rcx), %zmm31, %zmm16;
+ vpxord (27 * 16)(%rcx), %zmm31, %zmm17;
+ vmovdqu (31 * 16)(%rcx), %xmm15;
+ leaq (32 * 16)(%rcx), %rcx;
+
+ /* AES rounds */
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Lcbc_dec_blk32_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Lcbc_dec_blk32_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lcbc_dec_blk32_last:
+ vaesdeclast %zmm9, %zmm0, %zmm0;
+ vaesdeclast %zmm10, %zmm1, %zmm1;
+ vaesdeclast %zmm11, %zmm2, %zmm2;
+ vaesdeclast %zmm12, %zmm3, %zmm3;
+ vaesdeclast %zmm13, %zmm4, %zmm4;
+ vaesdeclast %zmm14, %zmm5, %zmm5;
+ vaesdeclast %zmm16, %zmm6, %zmm6;
+ vaesdeclast %zmm17, %zmm7, %zmm7;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ vmovdqu32 %zmm4, (16 * 16)(%rdx);
+ vmovdqu32 %zmm5, (20 * 16)(%rdx);
+ vmovdqu32 %zmm6, (24 * 16)(%rdx);
+ vmovdqu32 %zmm7, (28 * 16)(%rdx);
+ leaq (32 * 16)(%rdx), %rdx;
+
+ jmp .Lcbc_dec_blk32;
+
+ /* Process 16 blocks per loop. */
+.align 16
+.Lcbc_dec_blk16:
+ cmpq $16, %r8;
+ jb .Lcbc_dec_tail;
+
+ leaq -16(%r8), %r8;
+
+ /* Load input and xor first key. Update IV. */
+ vmovdqu32 (0 * 16)(%rcx), %zmm0;
+ vshufi32x4 $0b10010011, %zmm0, %zmm0, %zmm9;
+ vmovdqu32 (4 * 16)(%rcx), %zmm1;
+ vmovdqu32 (8 * 16)(%rcx), %zmm2;
+ vmovdqu32 (12 * 16)(%rcx), %zmm3;
+ vinserti32x4 $0, %xmm15, %zmm9, %zmm9;
+ vpxord %zmm30, %zmm0, %zmm0;
+ vpxord %zmm30, %zmm1, %zmm1;
+ vpxord %zmm30, %zmm2, %zmm2;
+ vpxord %zmm30, %zmm3, %zmm3;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm8;
+ vpxord %zmm31, %zmm9, %zmm9;
+ vpxord (3 * 16)(%rcx), %zmm31, %zmm10;
+ vpxord (7 * 16)(%rcx), %zmm31, %zmm11;
+ vpxord (11 * 16)(%rcx), %zmm31, %zmm12;
+ vmovdqu (15 * 16)(%rcx), %xmm15;
+ leaq (16 * 16)(%rcx), %rcx;
+
+ /* AES rounds */
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lcbc_dec_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lcbc_dec_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm8;
+ VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lcbc_dec_blk16_last:
+ vaesdeclast %zmm9, %zmm0, %zmm0;
+ vaesdeclast %zmm10, %zmm1, %zmm1;
+ vaesdeclast %zmm11, %zmm2, %zmm2;
+ vaesdeclast %zmm12, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+.align 16
+.Lcbc_dec_tail:
+ /* Store IV. */
+ vmovdqu %xmm15, (%rsi);
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm16, %ymm16, %ymm16;
+ vpxord %ymm17, %ymm17, %ymm17;
+ vpxord %ymm30, %ymm30, %ymm30;
+ vpxord %ymm31, %ymm31, %ymm31;
+ vzeroall;
+
+.align 16
+.Lcbc_dec_skip_avx512:
+ /* Handle trailing blocks with AVX2 implementation. */
+ cmpq $0, %r8;
+ ja _gcry_vaes_avx2_cbc_dec_amd64;
+
+ ret_spec_stop
+ CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_cbc_dec_amd64,.-_gcry_vaes_avx512_cbc_dec_amd64)
+
+/**********************************************************************
+ CFB-mode decryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_cfb_dec_amd64,@function)
+.globl _gcry_vaes_avx512_cfb_dec_amd64
+.align 16
+_gcry_vaes_avx512_cfb_dec_amd64:
+ /* input:
+ * %rdi: round keys
+ * %rsi: iv
+ * %rdx: dst
+ * %rcx: src
+ * %r8: nblocks
+ * %r9: nrounds
+ */
+ CFI_STARTPROC();
+
+ cmpq $16, %r8;
+ jb .Lcfb_dec_skip_avx512;
+
+ spec_stop_avx512;
+
+ /* Load IV. */
+ vmovdqu (%rsi), %xmm15;
+
+ /* Load first and last key. */
+ leal (, %r9d, 4), %eax;
+ vbroadcasti32x4 (%rdi), %zmm30;
+ vbroadcasti32x4 (%rdi, %rax, 4), %zmm31;
+
+ /* Process 32 blocks per loop. */
+.align 16
+.Lcfb_dec_blk32:
+ cmpq $32, %r8;
+ jb .Lcfb_dec_blk16;
+
+ leaq -32(%r8), %r8;
+
+ /* Load input and xor first key. Update IV. */
+ vmovdqu32 (0 * 16)(%rcx), %zmm9;
+ vshufi32x4 $0b10010011, %zmm9, %zmm9, %zmm0;
+ vmovdqu32 (3 * 16)(%rcx), %zmm1;
+ vinserti32x4 $0, %xmm15, %zmm0, %zmm0;
+ vmovdqu32 (7 * 16)(%rcx), %zmm2;
+ vmovdqu32 (11 * 16)(%rcx), %zmm3;
+ vmovdqu32 (15 * 16)(%rcx), %zmm4;
+ vmovdqu32 (19 * 16)(%rcx), %zmm5;
+ vmovdqu32 (23 * 16)(%rcx), %zmm6;
+ vmovdqu32 (27 * 16)(%rcx), %zmm7;
+ vmovdqu (31 * 16)(%rcx), %xmm15;
+ vpxord %zmm30, %zmm0, %zmm0;
+ vpxord %zmm30, %zmm1, %zmm1;
+ vpxord %zmm30, %zmm2, %zmm2;
+ vpxord %zmm30, %zmm3, %zmm3;
+ vpxord %zmm30, %zmm4, %zmm4;
+ vpxord %zmm30, %zmm5, %zmm5;
+ vpxord %zmm30, %zmm6, %zmm6;
+ vpxord %zmm30, %zmm7, %zmm7;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm8;
+ vpxord %zmm31, %zmm9, %zmm9;
+ vpxord (4 * 16)(%rcx), %zmm31, %zmm10;
+ vpxord (8 * 16)(%rcx), %zmm31, %zmm11;
+ vpxord (12 * 16)(%rcx), %zmm31, %zmm12;
+ vpxord (16 * 16)(%rcx), %zmm31, %zmm13;
+ vpxord (20 * 16)(%rcx), %zmm31, %zmm14;
+ vpxord (24 * 16)(%rcx), %zmm31, %zmm16;
+ vpxord (28 * 16)(%rcx), %zmm31, %zmm17;
+ leaq (32 * 16)(%rcx), %rcx;
+
+ /* AES rounds */
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Lcfb_dec_blk32_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Lcfb_dec_blk32_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lcfb_dec_blk32_last:
+ vaesenclast %zmm9, %zmm0, %zmm0;
+ vaesenclast %zmm10, %zmm1, %zmm1;
+ vaesenclast %zmm11, %zmm2, %zmm2;
+ vaesenclast %zmm12, %zmm3, %zmm3;
+ vaesenclast %zmm13, %zmm4, %zmm4;
+ vaesenclast %zmm14, %zmm5, %zmm5;
+ vaesenclast %zmm16, %zmm6, %zmm6;
+ vaesenclast %zmm17, %zmm7, %zmm7;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ vmovdqu32 %zmm4, (16 * 16)(%rdx);
+ vmovdqu32 %zmm5, (20 * 16)(%rdx);
+ vmovdqu32 %zmm6, (24 * 16)(%rdx);
+ vmovdqu32 %zmm7, (28 * 16)(%rdx);
+ leaq (32 * 16)(%rdx), %rdx;
+
+ jmp .Lcfb_dec_blk32;
+
+ /* Handle trailing 16 blocks. */
+.align 16
+.Lcfb_dec_blk16:
+ cmpq $16, %r8;
+ jb .Lcfb_dec_tail;
+
+ leaq -16(%r8), %r8;
+
+ /* Load input and xor first key. Update IV. */
+ vmovdqu32 (0 * 16)(%rcx), %zmm10;
+ vshufi32x4 $0b10010011, %zmm10, %zmm10, %zmm0;
+ vmovdqu32 (3 * 16)(%rcx), %zmm1;
+ vinserti32x4 $0, %xmm15, %zmm0, %zmm0;
+ vmovdqu32 (7 * 16)(%rcx), %zmm2;
+ vmovdqu32 (11 * 16)(%rcx), %zmm3;
+ vmovdqu (15 * 16)(%rcx), %xmm15;
+ vpxord %zmm30, %zmm0, %zmm0;
+ vpxord %zmm30, %zmm1, %zmm1;
+ vpxord %zmm30, %zmm2, %zmm2;
+ vpxord %zmm30, %zmm3, %zmm3;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+ vpxord %zmm31, %zmm10, %zmm10;
+ vpxord (4 * 16)(%rcx), %zmm31, %zmm11;
+ vpxord (8 * 16)(%rcx), %zmm31, %zmm12;
+ vpxord (12 * 16)(%rcx), %zmm31, %zmm13;
+ leaq (16 * 16)(%rcx), %rcx;
+
+ /* AES rounds */
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lcfb_dec_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lcfb_dec_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lcfb_dec_blk16_last:
+ vaesenclast %zmm10, %zmm0, %zmm0;
+ vaesenclast %zmm11, %zmm1, %zmm1;
+ vaesenclast %zmm12, %zmm2, %zmm2;
+ vaesenclast %zmm13, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+.align 16
+.Lcfb_dec_tail:
+ /* Store IV. */
+ vmovdqu %xmm15, (%rsi);
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm16, %ymm16, %ymm16;
+ vpxord %ymm17, %ymm17, %ymm17;
+ vpxord %ymm30, %ymm30, %ymm30;
+ vpxord %ymm31, %ymm31, %ymm31;
+ vzeroall;
+
+.align 16
+.Lcfb_dec_skip_avx512:
+ /* Handle trailing blocks with AVX2 implementation. */
+ cmpq $0, %r8;
+ ja _gcry_vaes_avx2_cfb_dec_amd64;
+
+ ret_spec_stop
+ CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_cfb_dec_amd64,.-_gcry_vaes_avx512_cfb_dec_amd64)
+
+/**********************************************************************
+ CTR-mode encryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_ctr_enc_amd64,@function)
+.globl _gcry_vaes_avx512_ctr_enc_amd64
+.align 16
+_gcry_vaes_avx512_ctr_enc_amd64:
+ /* input:
+ * %rdi: round keys
+ * %rsi: counter
+ * %rdx: dst
+ * %rcx: src
+ * %r8: nblocks
+ * %r9: nrounds
+ */
+ CFI_STARTPROC();
+
+ cmpq $16, %r8;
+ jb .Lctr_enc_skip_avx512;
+
+ spec_stop_avx512;
+
+ movq 8(%rsi), %r10;
+ movq 0(%rsi), %r11;
+ bswapq %r10;
+ bswapq %r11;
+
+ vmovdqa32 .Lbige_addb_0 rRIP, %zmm20;
+ vmovdqa32 .Lbige_addb_4 rRIP, %zmm21;
+ vmovdqa32 .Lbige_addb_8 rRIP, %zmm22;
+ vmovdqa32 .Lbige_addb_12 rRIP, %zmm23;
+
+ /* Load first and last key. */
+ leal (, %r9d, 4), %eax;
+ vbroadcasti32x4 (%rdi), %zmm30;
+ vbroadcasti32x4 (%rdi, %rax, 4), %zmm31;
+
+ cmpq $32, %r8;
+ jb .Lctr_enc_blk16;
+
+ vmovdqa32 .Lbige_addb_16 rRIP, %zmm24;
+ vmovdqa32 .Lbige_addb_20 rRIP, %zmm25;
+ vmovdqa32 .Lbige_addb_24 rRIP, %zmm26;
+ vmovdqa32 .Lbige_addb_28 rRIP, %zmm27;
+
+#define add_le128(out, in, lo_counter, hi_counter1) \
+ vpaddq lo_counter, in, out; \
+ vpcmpuq $1, lo_counter, out, %k1; \
+ kaddb %k1, %k1, %k1; \
+ vpaddq hi_counter1, out, out{%k1};
+
+#define handle_ctr_128bit_add(nblks) \
+ addq $(nblks), %r10; \
+ adcq $0, %r11; \
+ bswapq %r10; \
+ bswapq %r11; \
+ movq %r10, 8(%rsi); \
+ movq %r11, 0(%rsi); \
+ bswapq %r10; \
+ bswapq %r11;
+
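+/* Note: add_le128() performs a 128-bit little-endian add within each
+ * 128-bit lane: vpcmpuq flags the positions where the low-qword addition
+ * wrapped, kaddb %k1, %k1, %k1 shifts each flag onto the corresponding
+ * high qword, and the masked vpaddq then adds hi_counter1 there to
+ * propagate the carry.  handle_ctr_128bit_add() adds nblks to the counter
+ * kept in %r10:%r11 and stores the updated big-endian value to (%rsi). */
+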
+ /* Process 32 blocks per loop. */
+.align 16
+.Lctr_enc_blk32:
+ leaq -32(%r8), %r8;
+
+ vbroadcasti32x4 (%rsi), %zmm7;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm8;
+
+ /* detect if carry handling is needed */
+ addb $32, 15(%rsi);
+ jc .Lctr_enc_blk32_handle_carry;
+
+ leaq 32(%r10), %r10;
+
+ .Lctr_enc_blk32_byte_bige_add:
+ /* Increment counters. */
+ vpaddb %zmm20, %zmm7, %zmm0;
+ vpaddb %zmm21, %zmm7, %zmm1;
+ vpaddb %zmm22, %zmm7, %zmm2;
+ vpaddb %zmm23, %zmm7, %zmm3;
+ vpaddb %zmm24, %zmm7, %zmm4;
+ vpaddb %zmm25, %zmm7, %zmm5;
+ vpaddb %zmm26, %zmm7, %zmm6;
+ vpaddb %zmm27, %zmm7, %zmm7;
+
+ .Lctr_enc_blk32_rounds:
+ /* AES rounds */
+ XOR8(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Lctr_enc_blk32_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Lctr_enc_blk32_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lctr_enc_blk32_last:
+ vpxord (0 * 16)(%rcx), %zmm31, %zmm9; /* Xor src to last round key. */
+ vpxord (4 * 16)(%rcx), %zmm31, %zmm10;
+ vpxord (8 * 16)(%rcx), %zmm31, %zmm11;
+ vpxord (12 * 16)(%rcx), %zmm31, %zmm12;
+ vpxord (16 * 16)(%rcx), %zmm31, %zmm13;
+ vpxord (20 * 16)(%rcx), %zmm31, %zmm14;
+ vpxord (24 * 16)(%rcx), %zmm31, %zmm15;
+ vpxord (28 * 16)(%rcx), %zmm31, %zmm8;
+ leaq (32 * 16)(%rcx), %rcx;
+ vaesenclast %zmm9, %zmm0, %zmm0;
+ vaesenclast %zmm10, %zmm1, %zmm1;
+ vaesenclast %zmm11, %zmm2, %zmm2;
+ vaesenclast %zmm12, %zmm3, %zmm3;
+ vaesenclast %zmm13, %zmm4, %zmm4;
+ vaesenclast %zmm14, %zmm5, %zmm5;
+ vaesenclast %zmm15, %zmm6, %zmm6;
+ vaesenclast %zmm8, %zmm7, %zmm7;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ vmovdqu32 %zmm4, (16 * 16)(%rdx);
+ vmovdqu32 %zmm5, (20 * 16)(%rdx);
+ vmovdqu32 %zmm6, (24 * 16)(%rdx);
+ vmovdqu32 %zmm7, (28 * 16)(%rdx);
+ leaq (32 * 16)(%rdx), %rdx;
+
+ cmpq $32, %r8;
+ jnb .Lctr_enc_blk32;
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm24, %ymm24, %ymm24;
+ vpxord %ymm25, %ymm25, %ymm25;
+ vpxord %ymm26, %ymm26, %ymm26;
+ vpxord %ymm27, %ymm27, %ymm27;
+
+ jmp .Lctr_enc_blk16;
+
+ .align 16
+ .Lctr_enc_blk32_handle_only_ctr_carry:
+ handle_ctr_128bit_add(32);
+ jmp .Lctr_enc_blk32_byte_bige_add;
+
+ .align 16
+ .Lctr_enc_blk32_handle_carry:
+ jz .Lctr_enc_blk32_handle_only_ctr_carry;
+ /* Increment counters (handle carry). */
+ vbroadcasti32x4 .Lbswap128_mask rRIP, %zmm15;
+ vpmovzxbq .Lcounter0_1_2_3_lo_bq rRIP, %zmm10;
+ vpmovzxbq .Lcounter1_1_1_1_hi_bq rRIP, %zmm13;
+ vpshufb %zmm15, %zmm7, %zmm7; /* be => le */
+ vpmovzxbq .Lcounter4_4_4_4_lo_bq rRIP, %zmm11;
+ vpmovzxbq .Lcounter8_8_8_8_lo_bq rRIP, %zmm12;
+ handle_ctr_128bit_add(32);
+ add_le128(%zmm0, %zmm7, %zmm10, %zmm13); /* +0:+1:+2:+3 */
+ add_le128(%zmm1, %zmm0, %zmm11, %zmm13); /* +4:+5:+6:+7 */
+ add_le128(%zmm2, %zmm0, %zmm12, %zmm13); /* +8:... */
+ vpshufb %zmm15, %zmm0, %zmm0; /* le => be */
+ add_le128(%zmm3, %zmm1, %zmm12, %zmm13); /* +12:... */
+ vpshufb %zmm15, %zmm1, %zmm1; /* le => be */
+ add_le128(%zmm4, %zmm2, %zmm12, %zmm13); /* +16:... */
+ vpshufb %zmm15, %zmm2, %zmm2; /* le => be */
+ add_le128(%zmm5, %zmm3, %zmm12, %zmm13); /* +20:... */
+ vpshufb %zmm15, %zmm3, %zmm3; /* le => be */
+ add_le128(%zmm6, %zmm4, %zmm12, %zmm13); /* +24:... */
+ vpshufb %zmm15, %zmm4, %zmm4; /* le => be */
+ add_le128(%zmm7, %zmm5, %zmm12, %zmm13); /* +28:... */
+ vpshufb %zmm15, %zmm5, %zmm5; /* le => be */
+ vpshufb %zmm15, %zmm6, %zmm6; /* le => be */
+ vpshufb %zmm15, %zmm7, %zmm7; /* le => be */
+
+ jmp .Lctr_enc_blk32_rounds;
+
+ /* Handle trailing 16 blocks. */
+.align 16
+.Lctr_enc_blk16:
+ cmpq $16, %r8;
+ jb .Lctr_enc_tail;
+
+ leaq -16(%r8), %r8;
+
+ vbroadcasti32x4 (%rsi), %zmm3;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+
+ /* detect if carry handling is needed */
+ addb $16, 15(%rsi);
+ jc .Lctr_enc_blk16_handle_carry;
+
+ leaq 16(%r10), %r10;
+
+ .Lctr_enc_blk16_byte_bige_add:
+ /* Increment counters. */
+ vpaddb %zmm20, %zmm3, %zmm0;
+ vpaddb %zmm21, %zmm3, %zmm1;
+ vpaddb %zmm22, %zmm3, %zmm2;
+ vpaddb %zmm23, %zmm3, %zmm3;
+
+ .Lctr_enc_blk16_rounds:
+ /* AES rounds */
+ XOR4(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3);
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lctr_enc_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lctr_enc_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lctr_enc_blk16_last:
+ vpxord (0 * 16)(%rcx), %zmm31, %zmm5; /* Xor src to last round key. */
+ vpxord (4 * 16)(%rcx), %zmm31, %zmm6;
+ vpxord (8 * 16)(%rcx), %zmm31, %zmm7;
+ vpxord (12 * 16)(%rcx), %zmm31, %zmm4;
+ leaq (16 * 16)(%rcx), %rcx;
+ vaesenclast %zmm5, %zmm0, %zmm0;
+ vaesenclast %zmm6, %zmm1, %zmm1;
+ vaesenclast %zmm7, %zmm2, %zmm2;
+ vaesenclast %zmm4, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+ jmp .Lctr_enc_tail;
+
+ .align 16
+ .Lctr_enc_blk16_handle_only_ctr_carry:
+ handle_ctr_128bit_add(16);
+ jmp .Lctr_enc_blk16_byte_bige_add;
+
+ .align 16
+ .Lctr_enc_blk16_handle_carry:
+ jz .Lctr_enc_blk16_handle_only_ctr_carry;
+ /* Increment counters (handle carry). */
+ vbroadcasti32x4 .Lbswap128_mask rRIP, %zmm15;
+ vpmovzxbq .Lcounter0_1_2_3_lo_bq rRIP, %zmm10;
+ vpmovzxbq .Lcounter1_1_1_1_hi_bq rRIP, %zmm13;
+ vpshufb %zmm15, %zmm3, %zmm3; /* be => le */
+ vpmovzxbq .Lcounter4_4_4_4_lo_bq rRIP, %zmm11;
+ vpmovzxbq .Lcounter8_8_8_8_lo_bq rRIP, %zmm12;
+ handle_ctr_128bit_add(16);
+ add_le128(%zmm0, %zmm3, %zmm10, %zmm13); /* +0:+1:+2:+3 */
+ add_le128(%zmm1, %zmm0, %zmm11, %zmm13); /* +4:+5:+6:+7 */
+ add_le128(%zmm2, %zmm0, %zmm12, %zmm13); /* +8:... */
+ vpshufb %zmm15, %zmm0, %zmm0; /* le => be */
+ add_le128(%zmm3, %zmm1, %zmm12, %zmm13); /* +12:... */
+ vpshufb %zmm15, %zmm1, %zmm1; /* le => be */
+ vpshufb %zmm15, %zmm2, %zmm2; /* le => be */
+ vpshufb %zmm15, %zmm3, %zmm3; /* le => be */
+
+ jmp .Lctr_enc_blk16_rounds;
+
+.align 16
+.Lctr_enc_tail:
+ xorl %r10d, %r10d;
+ xorl %r11d, %r11d;
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm20, %ymm20, %ymm20;
+ vpxord %ymm21, %ymm21, %ymm21;
+ vpxord %ymm22, %ymm22, %ymm22;
+ vpxord %ymm23, %ymm23, %ymm23;
+ vpxord %ymm30, %ymm30, %ymm30;
+ vpxord %ymm31, %ymm31, %ymm31;
+ kxorq %k1, %k1, %k1;
+ vzeroall;
+
+.align 16
+.Lctr_enc_skip_avx512:
+ /* Handle trailing blocks with AVX2 implementation. */
+ cmpq $0, %r8;
+ ja _gcry_vaes_avx2_ctr_enc_amd64;
+
+ ret_spec_stop
+ CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_ctr_enc_amd64,.-_gcry_vaes_avx512_ctr_enc_amd64)
+
+/**********************************************************************
+ Little-endian 32-bit CTR-mode encryption (GCM-SIV)
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_ctr32le_enc_amd64,@function)
+.globl _gcry_vaes_avx512_ctr32le_enc_amd64
+.align 16
+_gcry_vaes_avx512_ctr32le_enc_amd64:
+ /* input:
+ * %rdi: round keys
+ * %rsi: counter
+ * %rdx: dst
+ * %rcx: src
+ * %r8: nblocks
+ * %r9: nrounds
+ */
+ CFI_STARTPROC();
+
+ cmpq $16, %r8;
+ jb .Lctr32le_enc_skip_avx512;
+
+ spec_stop_avx512;
+
+ /* Load counter. */
+ vbroadcasti32x4 (%rsi), %zmm15;
+
+ vpmovzxbq .Lcounter0_1_2_3_lo_bq rRIP, %zmm20;
+ vpmovzxbq .Lcounter4_5_6_7_lo_bq rRIP, %zmm21;
+ vpmovzxbq .Lcounter8_9_10_11_lo_bq rRIP, %zmm22;
+ vpmovzxbq .Lcounter12_13_14_15_lo_bq rRIP, %zmm23;
+
+ /* Load first and last key. */
+ leal (, %r9d, 4), %eax;
+ vbroadcasti32x4 (%rdi), %zmm30;
+ vbroadcasti32x4 (%rdi, %rax, 4), %zmm31;
+
+ cmpq $32, %r8;
+ jb .Lctr32le_enc_blk16;
+
+ vpmovzxbq .Lcounter16_17_18_19_lo_bq rRIP, %zmm24;
+ vpmovzxbq .Lcounter20_21_22_23_lo_bq rRIP, %zmm25;
+ vpmovzxbq .Lcounter24_25_26_27_lo_bq rRIP, %zmm26;
+ vpmovzxbq .Lcounter28_29_30_31_lo_bq rRIP, %zmm27;
+
+ /* Process 32 blocks per loop. */
+.align 16
+.Lctr32le_enc_blk32:
+ leaq -32(%r8), %r8;
+
+ /* Increment counters. */
+ vpmovzxbq .Lcounter32_32_32_32_lo_bq rRIP, %zmm9;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm8;
+ vpaddd %zmm20, %zmm15, %zmm0;
+ vpaddd %zmm21, %zmm15, %zmm1;
+ vpaddd %zmm22, %zmm15, %zmm2;
+ vpaddd %zmm23, %zmm15, %zmm3;
+ vpaddd %zmm24, %zmm15, %zmm4;
+ vpaddd %zmm25, %zmm15, %zmm5;
+ vpaddd %zmm26, %zmm15, %zmm6;
+ vpaddd %zmm27, %zmm15, %zmm7;
+
+ vpaddd %zmm9, %zmm15, %zmm15;
+
+ /* AES rounds */
+ XOR8(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Lctr32le_enc_blk32_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Lctr32le_enc_blk32_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lctr32le_enc_blk32_last:
+ vpxord (0 * 16)(%rcx), %zmm31, %zmm9; /* Xor src to last round key. */
+ vpxord (4 * 16)(%rcx), %zmm31, %zmm10;
+ vpxord (8 * 16)(%rcx), %zmm31, %zmm11;
+ vpxord (12 * 16)(%rcx), %zmm31, %zmm12;
+ vpxord (16 * 16)(%rcx), %zmm31, %zmm13;
+ vpxord (20 * 16)(%rcx), %zmm31, %zmm14;
+ vpxord (24 * 16)(%rcx), %zmm31, %zmm16;
+ vpxord (28 * 16)(%rcx), %zmm31, %zmm8;
+ leaq (32 * 16)(%rcx), %rcx;
+ vaesenclast %zmm9, %zmm0, %zmm0;
+ vaesenclast %zmm10, %zmm1, %zmm1;
+ vaesenclast %zmm11, %zmm2, %zmm2;
+ vaesenclast %zmm12, %zmm3, %zmm3;
+ vaesenclast %zmm13, %zmm4, %zmm4;
+ vaesenclast %zmm14, %zmm5, %zmm5;
+ vaesenclast %zmm16, %zmm6, %zmm6;
+ vaesenclast %zmm8, %zmm7, %zmm7;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ vmovdqu32 %zmm4, (16 * 16)(%rdx);
+ vmovdqu32 %zmm5, (20 * 16)(%rdx);
+ vmovdqu32 %zmm6, (24 * 16)(%rdx);
+ vmovdqu32 %zmm7, (28 * 16)(%rdx);
+ leaq (32 * 16)(%rdx), %rdx;
+
+ cmpq $32, %r8;
+ jnb .Lctr32le_enc_blk32;
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm16, %ymm16, %ymm16;
+ vpxord %ymm24, %ymm24, %ymm24;
+ vpxord %ymm25, %ymm25, %ymm25;
+ vpxord %ymm26, %ymm26, %ymm26;
+ vpxord %ymm27, %ymm27, %ymm27;
+
+ /* Handle trailing 16 blocks. */
+.align 16
+.Lctr32le_enc_blk16:
+ cmpq $16, %r8;
+ jb .Lctr32le_enc_tail;
+
+ leaq -16(%r8), %r8;
+
+ /* Increment counters. */
+ vpmovzxbq .Lcounter16_16_16_16_lo_bq rRIP, %zmm5;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+ vpaddd %zmm20, %zmm15, %zmm0;
+ vpaddd %zmm21, %zmm15, %zmm1;
+ vpaddd %zmm22, %zmm15, %zmm2;
+ vpaddd %zmm23, %zmm15, %zmm3;
+
+ vpaddd %zmm5, %zmm15, %zmm15;
+
+ /* AES rounds */
+ XOR4(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3);
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lctr32le_enc_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lctr32le_enc_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lctr32le_enc_blk16_last:
+ vpxord (0 * 16)(%rcx), %zmm31, %zmm5; /* Xor src to last round key. */
+ vpxord (4 * 16)(%rcx), %zmm31, %zmm6;
+ vpxord (8 * 16)(%rcx), %zmm31, %zmm7;
+ vpxord (12 * 16)(%rcx), %zmm31, %zmm4;
+ leaq (16 * 16)(%rcx), %rcx;
+ vaesenclast %zmm5, %zmm0, %zmm0;
+ vaesenclast %zmm6, %zmm1, %zmm1;
+ vaesenclast %zmm7, %zmm2, %zmm2;
+ vaesenclast %zmm4, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+.align 16
+.Lctr32le_enc_tail:
+ /* Store IV. */
+ vmovdqu %xmm15, (%rsi);
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm20, %ymm20, %ymm20;
+ vpxord %ymm21, %ymm21, %ymm21;
+ vpxord %ymm22, %ymm22, %ymm22;
+ vpxord %ymm23, %ymm23, %ymm23;
+ vpxord %ymm30, %ymm30, %ymm30;
+ vpxord %ymm31, %ymm31, %ymm31;
+ vzeroall;
+
+.align 16
+.Lctr32le_enc_skip_avx512:
+ /* Handle trailing blocks with AVX2 implementation. */
+ cmpq $0, %r8;
+ ja _gcry_vaes_avx2_ctr32le_enc_amd64;
+
+ ret_spec_stop
+ CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_ctr32le_enc_amd64,.-_gcry_vaes_avx512_ctr32le_enc_amd64)
+
+/**********************************************************************
+ OCB-mode encryption/decryption/authentication
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_ocb_aligned_crypt_amd64,@function)
+.globl _gcry_vaes_avx512_ocb_aligned_crypt_amd64
+.align 16
+_gcry_vaes_avx512_ocb_aligned_crypt_amd64:
+ /* input:
+ * %rdi: round keys
+ * %esi: nblk
+ * %rdx: dst
+ * %rcx: src
+ * %r8: nblocks
+ * %r9: nrounds
+ * 0(%rsp): offset
+ * 8(%rsp): checksum
+ * 16(%rsp): L-array
+ * 24(%rsp): decrypt/encrypt/auth
+ */
+ CFI_STARTPROC();
+
+ cmpq $32, %r8;
+ jb .Locb_skip_avx512;
+
+ spec_stop_avx512;
+
+ pushq %r12;
+ CFI_PUSH(%r12);
+ pushq %r13;
+ CFI_PUSH(%r13);
+ pushq %r14;
+ CFI_PUSH(%r14);
+ pushq %rbx;
+ CFI_PUSH(%rbx);
+
+#define OFFSET_PTR_Q 0+5*8(%rsp)
+#define CHECKSUM_PTR_Q 8+5*8(%rsp)
+#define L_ARRAY_PTR_L 16+5*8(%rsp)
+#define OPER_MODE_L 24+5*8(%rsp)
+
+ movq OFFSET_PTR_Q, %r13; /* offset ptr. */
+ movq L_ARRAY_PTR_L, %r14; /* L-array ptr. */
+ movl OPER_MODE_L, %ebx; /* decrypt/encrypt/auth-mode. */
+ movq CHECKSUM_PTR_Q, %r12; /* checksum ptr. */
+
+ leal (, %r9d, 4), %eax;
+ vmovdqu (%r13), %xmm15; /* Load offset. */
+ vmovdqa (0 * 16)(%rdi), %xmm0; /* first key */
+ vpxor (%rdi, %rax, 4), %xmm0, %xmm0; /* first key ^ last key */
+ vpxor (0 * 16)(%rdi), %xmm15, %xmm15; /* offset ^ first key */
+ vshufi32x4 $0, %zmm0, %zmm0, %zmm30;
+ vpxord %ymm29, %ymm29, %ymm29;
+
+ vshufi32x4 $0, %zmm15, %zmm15, %zmm15;
+
+ /* Prepare L-array optimization.
+ * Since nblk is aligned to 16, offsets will have following
+ * construction:
+ * - block1 = ntz{0} = offset ^ L[0]
+ * - block2 = ntz{1} = offset ^ L[0] ^ L[1]
+ * - block3 = ntz{0} = offset ^ L[1]
+ * - block4 = ntz{2} = offset ^ L[1] ^ L[2]
+ * => zmm20
+ *
+ * - block5 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[2]
+ * - block6 = ntz{1} = offset ^ L[0] ^ L[2]
+ * - block7 = ntz{0} = offset ^ L[2]
+ * - block8 = ntz{3} = offset ^ L[2] ^ L[3]
+ * => zmm21
+ *
+ * - block9 = ntz{0} = offset ^ L[0] ^ L[2] ^ L[3]
+ * - block10 = ntz{1} = offset ^ L[0] ^ L[1] ^ L[2] ^ L[3]
+ * - block11 = ntz{0} = offset ^ L[1] ^ L[2] ^ L[3]
+ * - block12 = ntz{2} = offset ^ L[1] ^ L[3]
+ * => zmm22
+ *
+ * - block13 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[3]
+ * - block14 = ntz{1} = offset ^ L[0] ^ L[3]
+ * - block15 = ntz{0} = offset ^ L[3]
+ * - block16 = ntz{4} = offset ^ L[3] ^ L[4]
+ * => zmm23
+ */
+ vmovdqu (0 * 16)(%r14), %xmm0; /* L[0] */
+ vmovdqu (1 * 16)(%r14), %xmm1; /* L[1] */
+ vmovdqu (2 * 16)(%r14), %xmm2; /* L[2] */
+ vmovdqu (3 * 16)(%r14), %xmm3; /* L[3] */
+ vmovdqu32 (4 * 16)(%r14), %xmm16; /* L[4] */
+ vpxor %xmm0, %xmm1, %xmm4; /* L[0] ^ L[1] */
+ vpxor %xmm0, %xmm2, %xmm5; /* L[0] ^ L[2] */
+ vpxor %xmm0, %xmm3, %xmm6; /* L[0] ^ L[3] */
+ vpxor %xmm1, %xmm2, %xmm7; /* L[1] ^ L[2] */
+ vpxor %xmm1, %xmm3, %xmm8; /* L[1] ^ L[3] */
+ vpxor %xmm2, %xmm3, %xmm9; /* L[2] ^ L[3] */
+ vpxord %xmm16, %xmm3, %xmm17; /* L[3] ^ L[4] */
+ vpxor %xmm4, %xmm2, %xmm10; /* L[0] ^ L[1] ^ L[2] */
+ vpxor %xmm5, %xmm3, %xmm11; /* L[0] ^ L[2] ^ L[3] */
+ vpxor %xmm7, %xmm3, %xmm12; /* L[1] ^ L[2] ^ L[3] */
+ vpxor %xmm0, %xmm8, %xmm13; /* L[0] ^ L[1] ^ L[3] */
+ vpxor %xmm4, %xmm9, %xmm14; /* L[0] ^ L[1] ^ L[2] ^ L[3] */
+ vinserti128 $1, %xmm4, %ymm0, %ymm0;
+ vinserti128 $1, %xmm7, %ymm1, %ymm1;
+ vinserti32x8 $1, %ymm1, %zmm0, %zmm20;
+ vinserti128 $1, %xmm5, %ymm10, %ymm10;
+ vinserti128 $1, %xmm9, %ymm2, %ymm2;
+ vinserti32x8 $1, %ymm2, %zmm10, %zmm21;
+ vinserti128 $1, %xmm14, %ymm11, %ymm11;
+ vinserti128 $1, %xmm8, %ymm12, %ymm12;
+ vinserti32x8 $1, %ymm12, %zmm11, %zmm22;
+ vinserti128 $1, %xmm6, %ymm13, %ymm13;
+ vinserti32x4 $1, %xmm17, %ymm3, %ymm23;
+ vinserti32x8 $1, %ymm23, %zmm13, %zmm23;
+
+ /*
+ * - block17 = ntz{0} = offset ^ L[0] ^ L[3] ^ L[4]
+ * - block18 = ntz{1} = offset ^ L[0] ^ L[1] ^ L[3] ^ L[4]
+ * - block19 = ntz{0} = offset ^ L[1] ^ L[3] ^ L[4]
+ * - block20 = ntz{2} = offset ^ L[1] ^ L[2] ^ L[3] ^ L[4]
+ * => zmm24
+ *
+ * - block21 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[2] ^ L[3] ^ L[4]
+ * - block22 = ntz{1} = offset ^ L[0] ^ L[2] ^ L[3] ^ L[4]
+ * - block23 = ntz{0} = offset ^ L[2] ^ L[3] ^ L[4]
+ * - block24 = ntz{3} = offset ^ L[2] ^ L[4]
+ * => zmm25
+ *
+ * - block25 = ntz{0} = offset ^ L[0] ^ L[2] ^ L[4]
+ * - block26 = ntz{1} = offset ^ L[0] ^ L[1] ^ L[2] ^ L[4]
+ * - block27 = ntz{0} = offset ^ L[1] ^ L[2] ^ L[4]
+ * - block28 = ntz{2} = offset ^ L[1] ^ L[4]
+ * => zmm26
+ *
+ * - block29 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[4]
+ * - block30 = ntz{1} = offset ^ L[0] ^ L[4]
+ * - block31 = ntz{0} = offset ^ L[4]
+ * - block32 = 0 (later filled with ntz{x} = offset ^ L[4] ^ L[ntz{x}])
+ * => zmm16
+ */
+ vpxord %xmm16, %xmm0, %xmm0; /* L[0] ^ L[4] */
+ vpxord %xmm16, %xmm1, %xmm1; /* L[1] ^ L[4] */
+ vpxord %xmm16, %xmm2, %xmm2; /* L[2] ^ L[4] */
+ vpxord %xmm16, %xmm4, %xmm4; /* L[0] ^ L[1] ^ L[4] */
+ vpxord %xmm16, %xmm5, %xmm5; /* L[0] ^ L[2] ^ L[4] */
+ vpxord %xmm16, %xmm6, %xmm6; /* L[0] ^ L[3] ^ L[4] */
+ vpxord %xmm16, %xmm7, %xmm7; /* L[1] ^ L[2] ^ L[4] */
+ vpxord %xmm16, %xmm8, %xmm8; /* L[1] ^ L[3] ^ L[4] */
+ vpxord %xmm16, %xmm9, %xmm9; /* L[2] ^ L[3] ^ L[4] */
+ vpxord %xmm16, %xmm10, %xmm10; /* L[0] ^ L[1] ^ L[2] ^ L[4] */
+ vpxord %xmm16, %xmm11, %xmm11; /* L[0] ^ L[2] ^ L[3] ^ L[4] */
+ vpxord %xmm16, %xmm12, %xmm12; /* L[1] ^ L[2] ^ L[3] ^ L[4] */
+ vpxord %xmm16, %xmm13, %xmm13; /* L[0] ^ L[1] ^ L[3] ^ L[4] */
+ vpxord %xmm16, %xmm14, %xmm14; /* L[0] ^ L[1] ^ L[2] ^ L[3] ^ L[4] */
+ vinserti128 $1, %xmm13, %ymm6, %ymm6;
+ vinserti32x4 $1, %xmm12, %ymm8, %ymm24;
+ vinserti32x8 $1, %ymm24, %zmm6, %zmm24;
+ vinserti128 $1, %xmm11, %ymm14, %ymm14;
+ vinserti32x4 $1, %xmm2, %ymm9, %ymm25;
+ vinserti32x8 $1, %ymm25, %zmm14, %zmm25;
+ vinserti128 $1, %xmm10, %ymm5, %ymm5;
+ vinserti32x4 $1, %xmm1, %ymm7, %ymm26;
+ vinserti32x8 $1, %ymm26, %zmm5, %zmm26;
+ vinserti128 $1, %xmm0, %ymm4, %ymm4;
+ vinserti32x8 $1, %ymm16, %zmm4, %zmm16;
+
+ /* Aligned: Process 32 blocks per loop. */
+.align 16
+.Locb_aligned_blk32:
+ cmpq $32, %r8;
+ jb .Locb_aligned_blk16;
+
+ leaq -32(%r8), %r8;
+
+ leal 32(%esi), %esi;
+ tzcntl %esi, %eax;
+ shll $4, %eax;
+
+ vpxord %zmm20, %zmm15, %zmm8;
+ vpxord %zmm21, %zmm15, %zmm9;
+ vpxord %zmm22, %zmm15, %zmm10;
+ vpxord %zmm23, %zmm15, %zmm11;
+ vpxord %zmm24, %zmm15, %zmm12;
+ vpxord %zmm25, %zmm15, %zmm27;
+ vpxord %zmm26, %zmm15, %zmm28;
+
+ vmovdqa (4 * 16)(%r14), %xmm14;
+ vpxor (%r14, %rax), %xmm14, %xmm14; /* L[4] ^ L[ntz{nblk+16}] */
+ vinserti32x4 $3, %xmm14, %zmm16, %zmm14;
+
+ vpxord %zmm14, %zmm15, %zmm14;
+
+ cmpl $1, %ebx;
+ jb .Locb_aligned_blk32_dec;
+ ja .Locb_aligned_blk32_auth;
+ vmovdqu32 (0 * 16)(%rcx), %zmm17;
+ vpxord %zmm17, %zmm8, %zmm0;
+ vmovdqu32 (4 * 16)(%rcx), %zmm18;
+ vpxord %zmm18, %zmm9, %zmm1;
+ vmovdqu32 (8 * 16)(%rcx), %zmm19;
+ vpxord %zmm19, %zmm10, %zmm2;
+ vmovdqu32 (12 * 16)(%rcx), %zmm31;
+ vpxord %zmm31, %zmm11, %zmm3;
+
+ vpternlogd $0x96, %zmm17, %zmm18, %zmm19;
+
+ vmovdqu32 (16 * 16)(%rcx), %zmm17;
+ vpxord %zmm17, %zmm12, %zmm4;
+ vmovdqu32 (20 * 16)(%rcx), %zmm18;
+ vpxord %zmm18, %zmm27, %zmm5;
+
+ vpternlogd $0x96, %zmm31, %zmm17, %zmm18;
+
+ vmovdqu32 (24 * 16)(%rcx), %zmm31;
+ vpxord %zmm31, %zmm28, %zmm6;
+ vmovdqu32 (28 * 16)(%rcx), %zmm17;
+ vpxord %zmm17, %zmm14, %zmm7;
+ leaq (32 * 16)(%rcx), %rcx;
+
+ vpternlogd $0x96, %zmm31, %zmm17, %zmm19;
+ vpternlogd $0x96, %zmm18, %zmm19, %zmm29;
+
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm13;
+
+ vpxord %zmm8, %zmm30, %zmm8;
+ vpxord %zmm9, %zmm30, %zmm9;
+ vpxord %zmm10, %zmm30, %zmm10;
+ vpxord %zmm11, %zmm30, %zmm11;
+ vpxord %zmm12, %zmm30, %zmm12;
+ vpxord %zmm27, %zmm30, %zmm27;
+ vpxord %zmm28, %zmm30, %zmm28;
+ vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15;
+ vpxord %zmm14, %zmm30, %zmm14;
+
+ /* AES rounds */
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Locb_aligned_blk32_enc_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Locb_aligned_blk32_enc_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+
+ /* Last round and output handling. */
+ .align 16
+ .Locb_aligned_blk32_enc_last:
+ vaesenclast %zmm8, %zmm0, %zmm0;
+ vaesenclast %zmm9, %zmm1, %zmm1;
+ vaesenclast %zmm10, %zmm2, %zmm2;
+ vaesenclast %zmm11, %zmm3, %zmm3;
+ vaesenclast %zmm12, %zmm4, %zmm4;
+ vaesenclast %zmm27, %zmm5, %zmm5;
+ vaesenclast %zmm28, %zmm6, %zmm6;
+ vaesenclast %zmm14, %zmm7, %zmm7;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ vmovdqu32 %zmm4, (16 * 16)(%rdx);
+ vmovdqu32 %zmm5, (20 * 16)(%rdx);
+ vmovdqu32 %zmm6, (24 * 16)(%rdx);
+ vmovdqu32 %zmm7, (28 * 16)(%rdx);
+ leaq (32 * 16)(%rdx), %rdx;
+
+ jmp .Locb_aligned_blk32;
+
+ .align 16
+ .Locb_aligned_blk32_auth:
+ vpxord (0 * 16)(%rcx), %zmm8, %zmm0;
+ vpxord (4 * 16)(%rcx), %zmm9, %zmm1;
+ vpxord (8 * 16)(%rcx), %zmm10, %zmm2;
+ vpxord (12 * 16)(%rcx), %zmm11, %zmm3;
+ vpxord (16 * 16)(%rcx), %zmm12, %zmm4;
+ vpxord (20 * 16)(%rcx), %zmm27, %zmm5;
+ vpxord (24 * 16)(%rcx), %zmm28, %zmm6;
+ vpxord (28 * 16)(%rcx), %zmm14, %zmm7;
+ leaq (32 * 16)(%rcx), %rcx;
+
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm13;
+
+ vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15;
+
+ /* AES rounds */
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm13;
+ cmpl $12, %r9d;
+ jb .Locb_aligned_blk32_auth_last;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm13;
+ jz .Locb_aligned_blk32_auth_last;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (14 * 16)(%rdi), %zmm13;
+
+ /* Last round and output handling. */
+ .align 16
+ .Locb_aligned_blk32_auth_last:
+ vaesenclast %zmm13, %zmm0, %zmm0;
+ vaesenclast %zmm13, %zmm1, %zmm1;
+ vaesenclast %zmm13, %zmm2, %zmm2;
+ vaesenclast %zmm13, %zmm3, %zmm3;
+ vaesenclast %zmm13, %zmm4, %zmm4;
+ vaesenclast %zmm13, %zmm5, %zmm5;
+ vaesenclast %zmm13, %zmm6, %zmm6;
+ vaesenclast %zmm13, %zmm7, %zmm7;
+
+ vpternlogd $0x96, %zmm0, %zmm1, %zmm2;
+ vpternlogd $0x96, %zmm3, %zmm4, %zmm5;
+ vpternlogd $0x96, %zmm6, %zmm7, %zmm29;
+ vpternlogd $0x96, %zmm2, %zmm5, %zmm29;
+
+ jmp .Locb_aligned_blk32;
+
+ .align 16
+ .Locb_aligned_blk32_dec:
+ vpxord (0 * 16)(%rcx), %zmm8, %zmm0;
+ vpxord (4 * 16)(%rcx), %zmm9, %zmm1;
+ vpxord (8 * 16)(%rcx), %zmm10, %zmm2;
+ vpxord (12 * 16)(%rcx), %zmm11, %zmm3;
+ vpxord (16 * 16)(%rcx), %zmm12, %zmm4;
+ vpxord (20 * 16)(%rcx), %zmm27, %zmm5;
+ vpxord (24 * 16)(%rcx), %zmm28, %zmm6;
+ vpxord (28 * 16)(%rcx), %zmm14, %zmm7;
+ leaq (32 * 16)(%rcx), %rcx;
+
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm13;
+
+ vpxord %zmm8, %zmm30, %zmm8;
+ vpxord %zmm9, %zmm30, %zmm9;
+ vpxord %zmm10, %zmm30, %zmm10;
+ vpxord %zmm11, %zmm30, %zmm11;
+ vpxord %zmm12, %zmm30, %zmm12;
+ vpxord %zmm27, %zmm30, %zmm27;
+ vpxord %zmm28, %zmm30, %zmm28;
+ vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15;
+ vpxord %zmm14, %zmm30, %zmm14;
+
+ /* AES rounds */
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Locb_aligned_blk32_dec_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Locb_aligned_blk32_dec_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm13;
+ VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+
+ /* Last round and output handling. */
+ .align 16
+ .Locb_aligned_blk32_dec_last:
+ vaesdeclast %zmm8, %zmm0, %zmm0;
+ vaesdeclast %zmm9, %zmm1, %zmm1;
+ vaesdeclast %zmm10, %zmm2, %zmm2;
+ vaesdeclast %zmm11, %zmm3, %zmm3;
+ vaesdeclast %zmm12, %zmm4, %zmm4;
+ vaesdeclast %zmm27, %zmm5, %zmm5;
+ vaesdeclast %zmm28, %zmm6, %zmm6;
+ vaesdeclast %zmm14, %zmm7, %zmm7;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ vmovdqu32 %zmm4, (16 * 16)(%rdx);
+ vmovdqu32 %zmm5, (20 * 16)(%rdx);
+ vmovdqu32 %zmm6, (24 * 16)(%rdx);
+ vmovdqu32 %zmm7, (28 * 16)(%rdx);
+ leaq (32 * 16)(%rdx), %rdx;
+
+ vpternlogd $0x96, %zmm0, %zmm1, %zmm2;
+ vpternlogd $0x96, %zmm3, %zmm4, %zmm5;
+ vpternlogd $0x96, %zmm6, %zmm7, %zmm29;
+ vpternlogd $0x96, %zmm2, %zmm5, %zmm29;
+
+ jmp .Locb_aligned_blk32;
+
+ /* Aligned: Process trailing 16 blocks. */
+.align 16
+.Locb_aligned_blk16:
+ cmpq $16, %r8;
+ jb .Locb_aligned_done;
+
+ leaq -16(%r8), %r8;
+
+ leal 16(%esi), %esi;
+
+ vpxord %zmm20, %zmm15, %zmm8;
+ vpxord %zmm21, %zmm15, %zmm9;
+ vpxord %zmm22, %zmm15, %zmm10;
+ vpxord %zmm23, %zmm15, %zmm14;
+
+ cmpl $1, %ebx;
+ jb .Locb_aligned_blk16_dec;
+ ja .Locb_aligned_blk16_auth;
+ vmovdqu32 (0 * 16)(%rcx), %zmm17;
+ vpxord %zmm17, %zmm8, %zmm0;
+ vmovdqu32 (4 * 16)(%rcx), %zmm18;
+ vpxord %zmm18, %zmm9, %zmm1;
+ vmovdqu32 (8 * 16)(%rcx), %zmm19;
+ vpxord %zmm19, %zmm10, %zmm2;
+ vmovdqu32 (12 * 16)(%rcx), %zmm31;
+ vpxord %zmm31, %zmm14, %zmm3;
+ leaq (16 * 16)(%rcx), %rcx;
+
+ vpternlogd $0x96, %zmm17, %zmm18, %zmm19;
+ vpternlogd $0x96, %zmm31, %zmm19, %zmm29;
+
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm13;
+
+ vpxord %zmm8, %zmm30, %zmm8;
+ vpxord %zmm9, %zmm30, %zmm9;
+ vpxord %zmm10, %zmm30, %zmm10;
+ vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15;
+ vpxord %zmm14, %zmm30, %zmm14;
+
+ /* AES rounds */
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Locb_aligned_blk16_enc_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Locb_aligned_blk16_enc_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm13;
+ VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Locb_aligned_blk16_enc_last:
+ vaesenclast %zmm8, %zmm0, %zmm0;
+ vaesenclast %zmm9, %zmm1, %zmm1;
+ vaesenclast %zmm10, %zmm2, %zmm2;
+ vaesenclast %zmm14, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+ jmp .Locb_aligned_done;
+
+ .align 16
+ .Locb_aligned_blk16_auth:
+ vpxord (0 * 16)(%rcx), %zmm8, %zmm0;
+ vpxord (4 * 16)(%rcx), %zmm9, %zmm1;
+ vpxord (8 * 16)(%rcx), %zmm10, %zmm2;
+ vpxord (12 * 16)(%rcx), %zmm14, %zmm3;
+ leaq (16 * 16)(%rcx), %rcx;
+
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm13;
+
+ vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15;
+
+ /* AES rounds */
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm13;
+ cmpl $12, %r9d;
+ jb .Locb_aligned_blk16_auth_last;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm13;
+ jz .Locb_aligned_blk16_auth_last;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm13;
+ VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (14 * 16)(%rdi), %zmm13;
+
+ /* Last round and output handling. */
+ .align 16
+ .Locb_aligned_blk16_auth_last:
+ vaesenclast %zmm13, %zmm0, %zmm0;
+ vaesenclast %zmm13, %zmm1, %zmm1;
+ vaesenclast %zmm13, %zmm2, %zmm2;
+ vaesenclast %zmm13, %zmm3, %zmm3;
+
+ vpternlogd $0x96, %zmm0, %zmm1, %zmm2;
+ vpternlogd $0x96, %zmm3, %zmm2, %zmm29;
+
+ jmp .Locb_aligned_done;
+
+ .align 16
+ .Locb_aligned_blk16_dec:
+ vpxord (0 * 16)(%rcx), %zmm8, %zmm0;
+ vpxord (4 * 16)(%rcx), %zmm9, %zmm1;
+ vpxord (8 * 16)(%rcx), %zmm10, %zmm2;
+ vpxord (12 * 16)(%rcx), %zmm14, %zmm3;
+ leaq (16 * 16)(%rcx), %rcx;
+
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm13;
+
+ vpxord %zmm8, %zmm30, %zmm8;
+ vpxord %zmm9, %zmm30, %zmm9;
+ vpxord %zmm10, %zmm30, %zmm10;
+ vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15;
+ vpxord %zmm14, %zmm30, %zmm14;
+
+ /* AES rounds */
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Locb_aligned_blk16_dec_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Locb_aligned_blk16_dec_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm13;
+ VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Locb_aligned_blk16_dec_last:
+ vaesdeclast %zmm8, %zmm0, %zmm0;
+ vaesdeclast %zmm9, %zmm1, %zmm1;
+ vaesdeclast %zmm10, %zmm2, %zmm2;
+ vaesdeclast %zmm14, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+ vpternlogd $0x96, %zmm0, %zmm1, %zmm2;
+ vpternlogd $0x96, %zmm3, %zmm2, %zmm29;
+
+.align 16
+.Locb_aligned_done:
+	vpxor (0 * 16)(%rdi), %xmm15, %xmm15; /* offset ^ first key ^ first key = offset */
+
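+	/* Fold the 512-bit checksum accumulator in %zmm29 down to 128 bits
+	 * and xor it into the checksum stored at (%r12). */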
+ vextracti32x8 $1, %zmm29, %ymm0;
+ vpxord %ymm29, %ymm0, %ymm0;
+ vextracti128 $1, %ymm0, %xmm1;
+ vpternlogd $0x96, (%r12), %xmm1, %xmm0;
+ vmovdqu %xmm0, (%r12);
+
+ vmovdqu %xmm15, (%r13); /* Store offset. */
+
+ popq %rbx;
+ CFI_POP(%rbx);
+ popq %r14;
+ CFI_POP(%r14);
+ popq %r13;
+ CFI_POP(%r13);
+ popq %r12;
+ CFI_POP(%r12);
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm16, %ymm16, %ymm16;
+ vpxord %ymm17, %ymm17, %ymm17;
+ vpxord %ymm18, %ymm18, %ymm18;
+ vpxord %ymm19, %ymm19, %ymm19;
+ vpxord %ymm20, %ymm20, %ymm20;
+ vpxord %ymm21, %ymm21, %ymm21;
+ vpxord %ymm22, %ymm22, %ymm22;
+ vpxord %ymm23, %ymm23, %ymm23;
+ vzeroall;
+ vpxord %ymm24, %ymm24, %ymm24;
+ vpxord %ymm25, %ymm25, %ymm25;
+ vpxord %ymm26, %ymm26, %ymm26;
+ vpxord %ymm27, %ymm27, %ymm27;
+ vpxord %ymm28, %ymm28, %ymm28;
+ vpxord %ymm29, %ymm29, %ymm29;
+ vpxord %ymm30, %ymm30, %ymm30;
+ vpxord %ymm31, %ymm31, %ymm31;
+
+.align 16
+.Locb_skip_avx512:
+ /* Handle trailing blocks with AVX2 implementation. */
+ cmpq $0, %r8;
+ ja _gcry_vaes_avx2_ocb_crypt_amd64;
+
+ xorl %eax, %eax;
+ ret_spec_stop
+
+#undef STACK_REGS_POS
+#undef STACK_ALLOC
+
+ CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_ocb_aligned_crypt_amd64,
+ .-_gcry_vaes_avx512_ocb_aligned_crypt_amd64)
+
+/**********************************************************************
+ XTS-mode encryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_xts_crypt_amd64,@function)
+.globl _gcry_vaes_avx512_xts_crypt_amd64
+.align 16
+_gcry_vaes_avx512_xts_crypt_amd64:
+ /* input:
+ * %rdi: round keys
+ * %rsi: tweak
+ * %rdx: dst
+ * %rcx: src
+ * %r8: nblocks
+ * %r9: nrounds
+ * 8(%rsp): encrypt
+ */
+ CFI_STARTPROC();
+
+ cmpq $16, %r8;
+ jb .Lxts_crypt_skip_avx512;
+
+ spec_stop_avx512;
+
+ /* Load first and last key. */
+ leal (, %r9d, 4), %eax;
+ vbroadcasti32x4 (%rdi), %zmm30;
+ vbroadcasti32x4 (%rdi, %rax, 4), %zmm31;
+
+ movl 8(%rsp), %eax;
+
+ vpmovzxbd .Lxts_gfmul_clmul_bd rRIP, %zmm20;
+ vbroadcasti32x4 .Lxts_high_bit_shuf rRIP, %zmm21;
+
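+/* Multiply each 128-bit tweak lane by x^(shift) in GF(2^128): the bits
+ * shifted out of the top are reduced back in with the XTS polynomial
+ * constant 0x87 (gfmul_clmul) via a carry-less multiply on the
+ * pre-shuffled high words (hi_tweak). */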
+#define tweak_clmul(shift, out, tweak, hi_tweak, gfmul_clmul, tmp1, tmp2) \
+ vpsrld $(32-(shift)), hi_tweak, tmp2; \
+ vpsllq $(shift), tweak, out; \
+ vpclmulqdq $0, gfmul_clmul, tmp2, tmp1; \
+ vpunpckhqdq tmp2, tmp1, tmp1; \
+ vpxord tmp1, out, out;
+
+ /* Prepare tweak. */
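+	/* Expand the input tweak into four consecutive tweaks in %zmm15 by
+	 * doubling: tweak*x yields the second lane, then the pair times x^2
+	 * yields lanes three and four. */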
+ vmovdqu (%rsi), %xmm15;
+ vpshufb %xmm21, %xmm15, %xmm13;
+ tweak_clmul(1, %xmm11, %xmm15, %xmm13, %xmm20, %xmm0, %xmm1);
+ vinserti128 $1, %xmm11, %ymm15, %ymm15; /* tweak:tweak1 */
+ vpshufb %ymm21, %ymm15, %ymm13;
+ tweak_clmul(2, %ymm11, %ymm15, %ymm13, %ymm20, %ymm0, %ymm1);
+ vinserti32x8 $1, %ymm11, %zmm15, %zmm15; /* tweak:tweak1:tweak2:tweak3 */
+ vpshufb %zmm21, %zmm15, %zmm13;
+
+ cmpq $16, %r8;
+ jb .Lxts_crypt_done;
+
+ /* Process 16 blocks per loop. */
+ leaq -16(%r8), %r8;
+
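+	/* Generate tweaks for 16 blocks: %zmm5..%zmm8 hold the tweaks for
+	 * blocks 0-3, 4-7, 8-11 and 12-15; %zmm15 is advanced by x^16 for
+	 * the next iteration. */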
+ vmovdqa32 %zmm15, %zmm5;
+ tweak_clmul(4, %zmm6, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1);
+ tweak_clmul(8, %zmm7, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1);
+ tweak_clmul(12, %zmm8, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1);
+ tweak_clmul(16, %zmm15, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1);
+ vpshufb %zmm21, %zmm15, %zmm13;
+
+ vmovdqu32 (0 * 16)(%rcx), %zmm0;
+ vmovdqu32 (4 * 16)(%rcx), %zmm1;
+ vmovdqu32 (8 * 16)(%rcx), %zmm2;
+ vmovdqu32 (12 * 16)(%rcx), %zmm3;
+ leaq (16 * 16)(%rcx), %rcx;
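+	/* Three-way xor (0x96): input block ^ first round key ^ per-block tweak. */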
+ vpternlogd $0x96, %zmm30, %zmm5, %zmm0;
+ vpternlogd $0x96, %zmm30, %zmm6, %zmm1;
+ vpternlogd $0x96, %zmm30, %zmm7, %zmm2;
+ vpternlogd $0x96, %zmm30, %zmm8, %zmm3;
+
+.align 16
+.Lxts_crypt_blk16_loop:
+ cmpq $16, %r8;
+ jb .Lxts_crypt_blk16_tail;
+ leaq -16(%r8), %r8;
+
+ testl %eax, %eax;
+ jz .Lxts_dec_blk16;
+ /* AES rounds */
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vmovdqa32 %zmm15, %zmm9;
+ tweak_clmul(4, %zmm10, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ tweak_clmul(8, %zmm11, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lxts_enc_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lxts_enc_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lxts_enc_blk16_last:
+ vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. */
+ vpxord %zmm31, %zmm6, %zmm6;
+ vpxord %zmm31, %zmm7, %zmm7;
+ vpxord %zmm31, %zmm8, %zmm4;
+ tweak_clmul(12, %zmm8, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vaesenclast %zmm5, %zmm0, %zmm16;
+ vaesenclast %zmm6, %zmm1, %zmm17;
+ vaesenclast %zmm7, %zmm2, %zmm18;
+ vaesenclast %zmm4, %zmm3, %zmm19;
+ tweak_clmul(16, %zmm15, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vpshufb %zmm21, %zmm15, %zmm13;
+
+ vmovdqu32 (0 * 16)(%rcx), %zmm0;
+ vmovdqu32 (4 * 16)(%rcx), %zmm1;
+ vmovdqu32 (8 * 16)(%rcx), %zmm2;
+ vmovdqu32 (12 * 16)(%rcx), %zmm3;
+ leaq (16 * 16)(%rcx), %rcx;
+
+ vmovdqu32 %zmm16, (0 * 16)(%rdx);
+ vmovdqu32 %zmm17, (4 * 16)(%rdx);
+ vmovdqu32 %zmm18, (8 * 16)(%rdx);
+ vmovdqu32 %zmm19, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+ vpternlogd $0x96, %zmm30, %zmm9, %zmm0;
+ vpternlogd $0x96, %zmm30, %zmm10, %zmm1;
+ vpternlogd $0x96, %zmm30, %zmm11, %zmm2;
+ vpternlogd $0x96, %zmm30, %zmm8, %zmm3;
+
+ vmovdqa32 %zmm9, %zmm5;
+ vmovdqa32 %zmm10, %zmm6;
+ vmovdqa32 %zmm11, %zmm7;
+
+ jmp .Lxts_crypt_blk16_loop;
+
+ .align 16
+ .Lxts_dec_blk16:
+ /* AES rounds */
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vmovdqa32 %zmm15, %zmm9;
+ tweak_clmul(4, %zmm10, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ tweak_clmul(8, %zmm11, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lxts_dec_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lxts_dec_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lxts_dec_blk16_last:
+ vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. */
+ vpxord %zmm31, %zmm6, %zmm6;
+ vpxord %zmm31, %zmm7, %zmm7;
+ vpxord %zmm31, %zmm8, %zmm4;
+ tweak_clmul(12, %zmm8, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vaesdeclast %zmm5, %zmm0, %zmm16;
+ vaesdeclast %zmm6, %zmm1, %zmm17;
+ vaesdeclast %zmm7, %zmm2, %zmm18;
+ vaesdeclast %zmm4, %zmm3, %zmm19;
+ tweak_clmul(16, %zmm15, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14);
+ vpshufb %zmm21, %zmm15, %zmm13;
+
+ vmovdqu32 (0 * 16)(%rcx), %zmm0;
+ vmovdqu32 (4 * 16)(%rcx), %zmm1;
+ vmovdqu32 (8 * 16)(%rcx), %zmm2;
+ vmovdqu32 (12 * 16)(%rcx), %zmm3;
+ leaq (16 * 16)(%rcx), %rcx;
+
+ vmovdqu32 %zmm16, (0 * 16)(%rdx);
+ vmovdqu32 %zmm17, (4 * 16)(%rdx);
+ vmovdqu32 %zmm18, (8 * 16)(%rdx);
+ vmovdqu32 %zmm19, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+ vpternlogd $0x96, %zmm30, %zmm9, %zmm0;
+ vpternlogd $0x96, %zmm30, %zmm10, %zmm1;
+ vpternlogd $0x96, %zmm30, %zmm11, %zmm2;
+ vpternlogd $0x96, %zmm30, %zmm8, %zmm3;
+
+ vmovdqa32 %zmm9, %zmm5;
+ vmovdqa32 %zmm10, %zmm6;
+ vmovdqa32 %zmm11, %zmm7;
+
+ jmp .Lxts_crypt_blk16_loop;
+
+ .align 16
+ .Lxts_crypt_blk16_tail:
+ testl %eax, %eax;
+ jz .Lxts_dec_tail_blk16;
+ /* AES rounds */
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lxts_enc_blk16_tail_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lxts_enc_blk16_tail_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lxts_enc_blk16_tail_last:
+ vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. */
+ vpxord %zmm31, %zmm6, %zmm6;
+ vpxord %zmm31, %zmm7, %zmm7;
+ vpxord %zmm31, %zmm8, %zmm4;
+ vaesenclast %zmm5, %zmm0, %zmm0;
+ vaesenclast %zmm6, %zmm1, %zmm1;
+ vaesenclast %zmm7, %zmm2, %zmm2;
+ vaesenclast %zmm4, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+ jmp .Lxts_crypt_done;
+
+ .align 16
+ .Lxts_dec_tail_blk16:
+ /* AES rounds */
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lxts_dec_blk16_tail_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lxts_dec_blk16_tail_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+
+ /* Last round and output handling. */
+ .align 16
+ .Lxts_dec_blk16_tail_last:
+ vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. */
+ vpxord %zmm31, %zmm6, %zmm6;
+ vpxord %zmm31, %zmm7, %zmm7;
+ vpxord %zmm31, %zmm8, %zmm4;
+ vaesdeclast %zmm5, %zmm0, %zmm0;
+ vaesdeclast %zmm6, %zmm1, %zmm1;
+ vaesdeclast %zmm7, %zmm2, %zmm2;
+ vaesdeclast %zmm4, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+.align 16
+.Lxts_crypt_done:
+	/* Store updated tweak. */
+ vmovdqu %xmm15, (%rsi);
+
+ /* Clear used AVX512 registers. */
+ vpxord %ymm16, %ymm16, %ymm16;
+ vpxord %ymm17, %ymm17, %ymm17;
+ vpxord %ymm18, %ymm18, %ymm18;
+ vpxord %ymm19, %ymm19, %ymm19;
+ vpxord %ymm20, %ymm20, %ymm20;
+ vpxord %ymm21, %ymm21, %ymm21;
+ vpxord %ymm30, %ymm30, %ymm30;
+ vpxord %ymm31, %ymm31, %ymm31;
+ vzeroall;
+
+.align 16
+.Lxts_crypt_skip_avx512:
+ /* Handle trailing blocks with AVX2 implementation. */
+ cmpq $0, %r8;
+ ja _gcry_vaes_avx2_xts_crypt_amd64;
+
+ ret_spec_stop
+ CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_xts_crypt_amd64,.-_gcry_vaes_avx512_xts_crypt_amd64)
+
+/**********************************************************************
+ ECB-mode encryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_ecb_crypt_amd64,@function)
+.globl _gcry_vaes_avx512_ecb_crypt_amd64
+.align 16
+_gcry_vaes_avx512_ecb_crypt_amd64:
+ /* input:
+ * %rdi: round keys
+ * %esi: encrypt
+ * %rdx: dst
+ * %rcx: src
+ * %r8: nblocks
+ * %r9: nrounds
+ */
+ CFI_STARTPROC();
+
+ cmpq $16, %r8;
+ jb .Lecb_crypt_skip_avx512;
+
+ spec_stop_avx512;
+
+ leal (, %r9d, 4), %eax;
+ vbroadcasti32x4 (%rdi), %zmm14; /* first key */
+ vbroadcasti32x4 (%rdi, %rax, 4), %zmm15; /* last key */
+
+ /* Process 32 blocks per loop. */
+.align 16
+.Lecb_blk32:
+ cmpq $32, %r8;
+ jb .Lecb_blk16;
+
+ leaq -32(%r8), %r8;
+
+ /* Load input and xor first key. */
+ vpxord (0 * 16)(%rcx), %zmm14, %zmm0;
+ vpxord (4 * 16)(%rcx), %zmm14, %zmm1;
+ vpxord (8 * 16)(%rcx), %zmm14, %zmm2;
+ vpxord (12 * 16)(%rcx), %zmm14, %zmm3;
+ vpxord (16 * 16)(%rcx), %zmm14, %zmm4;
+ vpxord (20 * 16)(%rcx), %zmm14, %zmm5;
+ vpxord (24 * 16)(%rcx), %zmm14, %zmm6;
+ vpxord (28 * 16)(%rcx), %zmm14, %zmm7;
+ leaq (32 * 16)(%rcx), %rcx;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm8;
+
+ testl %esi, %esi;
+ jz .Lecb_dec_blk32;
+ /* AES rounds */
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Lecb_enc_blk32_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Lecb_enc_blk32_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm8;
+ VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ .align 16
+ .Lecb_enc_blk32_last:
+ vaesenclast %zmm15, %zmm0, %zmm0;
+ vaesenclast %zmm15, %zmm1, %zmm1;
+ vaesenclast %zmm15, %zmm2, %zmm2;
+ vaesenclast %zmm15, %zmm3, %zmm3;
+ vaesenclast %zmm15, %zmm4, %zmm4;
+ vaesenclast %zmm15, %zmm5, %zmm5;
+ vaesenclast %zmm15, %zmm6, %zmm6;
+ vaesenclast %zmm15, %zmm7, %zmm7;
+ jmp .Lecb_blk32_end;
+
+ .align 16
+ .Lecb_dec_blk32:
+ /* AES rounds */
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ cmpl $12, %r9d;
+ jb .Lecb_dec_blk32_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ jz .Lecb_dec_blk32_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm8;
+ VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7);
+ .align 16
+ .Lecb_dec_blk32_last:
+ vaesdeclast %zmm15, %zmm0, %zmm0;
+ vaesdeclast %zmm15, %zmm1, %zmm1;
+ vaesdeclast %zmm15, %zmm2, %zmm2;
+ vaesdeclast %zmm15, %zmm3, %zmm3;
+ vaesdeclast %zmm15, %zmm4, %zmm4;
+ vaesdeclast %zmm15, %zmm5, %zmm5;
+ vaesdeclast %zmm15, %zmm6, %zmm6;
+ vaesdeclast %zmm15, %zmm7, %zmm7;
+
+ .align 16
+ .Lecb_blk32_end:
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ vmovdqu32 %zmm4, (16 * 16)(%rdx);
+ vmovdqu32 %zmm5, (20 * 16)(%rdx);
+ vmovdqu32 %zmm6, (24 * 16)(%rdx);
+ vmovdqu32 %zmm7, (28 * 16)(%rdx);
+ leaq (32 * 16)(%rdx), %rdx;
+
+ jmp .Lecb_blk32;
+
+ /* Handle trailing 16 blocks. */
+.align 16
+.Lecb_blk16:
+ cmpq $16, %r8;
+	jb .Lecb_crypt_tail;
+
+ leaq -16(%r8), %r8;
+
+ /* Load input and xor first key. */
+ vpxord (0 * 16)(%rcx), %zmm14, %zmm0;
+ vpxord (4 * 16)(%rcx), %zmm14, %zmm1;
+ vpxord (8 * 16)(%rcx), %zmm14, %zmm2;
+ vpxord (12 * 16)(%rcx), %zmm14, %zmm3;
+ leaq (16 * 16)(%rcx), %rcx;
+ vbroadcasti32x4 (1 * 16)(%rdi), %zmm4;
+
+ testl %esi, %esi;
+ jz .Lecb_dec_blk16;
+ /* AES rounds */
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lecb_enc_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lecb_enc_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ .align 16
+ .Lecb_enc_blk16_last:
+ vaesenclast %zmm15, %zmm0, %zmm0;
+ vaesenclast %zmm15, %zmm1, %zmm1;
+ vaesenclast %zmm15, %zmm2, %zmm2;
+ vaesenclast %zmm15, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+ jmp .Lecb_crypt_tail;
+
+ .align 16
+ .Lecb_dec_blk16:
+ /* AES rounds */
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (2 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (3 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (4 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (5 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (6 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (7 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (8 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (9 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ cmpl $12, %r9d;
+ jb .Lecb_dec_blk16_last;
+ vbroadcasti32x4 (10 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (11 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ jz .Lecb_dec_blk16_last;
+ vbroadcasti32x4 (12 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ vbroadcasti32x4 (13 * 16)(%rdi), %zmm4;
+ VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3);
+ .align 16
+ .Lecb_dec_blk16_last:
+ vaesdeclast %zmm15, %zmm0, %zmm0;
+ vaesdeclast %zmm15, %zmm1, %zmm1;
+ vaesdeclast %zmm15, %zmm2, %zmm2;
+ vaesdeclast %zmm15, %zmm3, %zmm3;
+ vmovdqu32 %zmm0, (0 * 16)(%rdx);
+ vmovdqu32 %zmm1, (4 * 16)(%rdx);
+ vmovdqu32 %zmm2, (8 * 16)(%rdx);
+ vmovdqu32 %zmm3, (12 * 16)(%rdx);
+ leaq (16 * 16)(%rdx), %rdx;
+
+.align 16
+.Lecb_crypt_tail:
+ /* Clear used AVX512 registers. */
+ vzeroall;
+
+.align 16
+.Lecb_crypt_skip_avx512:
+ /* Handle trailing blocks with AVX2 implementation. */
+ cmpq $0, %r8;
+ ja _gcry_vaes_avx2_ecb_crypt_amd64;
+
+ ret_spec_stop
+ CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_ecb_crypt_amd64,.-_gcry_vaes_avx512_ecb_crypt_amd64)
+
+/**********************************************************************
+ constants
+ **********************************************************************/
+SECTION_RODATA
+
+.align 64
+ELF(.type _gcry_vaes_avx512_consts,@object)
+_gcry_vaes_avx512_consts:
+
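+/* 16-byte vectors with the value N in the last byte, for adding 0..31 to a
+ * big-endian counter block. */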
+.Lbige_addb_0:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+.Lbige_addb_1:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
+.Lbige_addb_2:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2
+.Lbige_addb_3:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3
+.Lbige_addb_4:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4
+.Lbige_addb_5:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5
+.Lbige_addb_6:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6
+.Lbige_addb_7:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7
+.Lbige_addb_8:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8
+.Lbige_addb_9:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9
+.Lbige_addb_10:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10
+.Lbige_addb_11:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11
+.Lbige_addb_12:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12
+.Lbige_addb_13:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13
+.Lbige_addb_14:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14
+.Lbige_addb_15:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15
+.Lbige_addb_16:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16
+.Lbige_addb_17:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17
+.Lbige_addb_18:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18
+.Lbige_addb_19:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19
+.Lbige_addb_20:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20
+.Lbige_addb_21:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21
+.Lbige_addb_22:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22
+.Lbige_addb_23:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23
+.Lbige_addb_24:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24
+.Lbige_addb_25:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25
+.Lbige_addb_26:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 26
+.Lbige_addb_27:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27
+.Lbige_addb_28:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 28
+.Lbige_addb_29:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 29
+.Lbige_addb_30:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30
+.Lbige_addb_31:
+ .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 31
+
+.Lbswap128_mask:
+ .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+
+.Lxts_high_bit_shuf:
+ .byte -1, -1, -1, -1, 12, 13, 14, 15, 4, 5, 6, 7, -1, -1, -1, -1
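+/* GF(2^128) reduction constant 0x87, expanded with vpmovzxbd and used by
+ * tweak_clmul. */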
+.Lxts_gfmul_clmul_bd:
+ .byte 0x00, 0x87, 0x00, 0x00
+ .byte 0x00, 0x87, 0x00, 0x00
+ .byte 0x00, 0x87, 0x00, 0x00
+ .byte 0x00, 0x87, 0x00, 0x00
+
+.Lcounter0_1_2_3_lo_bq:
+ .byte 0, 0, 1, 0, 2, 0, 3, 0
+.Lcounter4_5_6_7_lo_bq:
+ .byte 4, 0, 5, 0, 6, 0, 7, 0
+.Lcounter8_9_10_11_lo_bq:
+ .byte 8, 0, 9, 0, 10, 0, 11, 0
+.Lcounter12_13_14_15_lo_bq:
+ .byte 12, 0, 13, 0, 14, 0, 15, 0
+.Lcounter16_17_18_19_lo_bq:
+ .byte 16, 0, 17, 0, 18, 0, 19, 0
+.Lcounter20_21_22_23_lo_bq:
+ .byte 20, 0, 21, 0, 22, 0, 23, 0
+.Lcounter24_25_26_27_lo_bq:
+ .byte 24, 0, 25, 0, 26, 0, 27, 0
+.Lcounter28_29_30_31_lo_bq:
+ .byte 28, 0, 29, 0, 30, 0, 31, 0
+.Lcounter4_4_4_4_lo_bq:
+ .byte 4, 0, 4, 0, 4, 0, 4, 0
+.Lcounter8_8_8_8_lo_bq:
+ .byte 8, 0, 8, 0, 8, 0, 8, 0
+.Lcounter16_16_16_16_lo_bq:
+ .byte 16, 0, 16, 0, 16, 0, 16, 0
+.Lcounter32_32_32_32_lo_bq:
+ .byte 32, 0, 32, 0, 32, 0, 32, 0
+.Lcounter1_1_1_1_hi_bq:
+ .byte 0, 1, 0, 1, 0, 1, 0, 1
+
+ELF(.size _gcry_vaes_avx512_consts,.-_gcry_vaes_avx512_consts)
+
+#endif /* HAVE_GCC_INLINE_ASM_VAES */
+#endif /* __x86_64__ */
diff --git a/cipher/rijndael-vaes.c b/cipher/rijndael-vaes.c
index 478904d0..81650e77 100644
--- a/cipher/rijndael-vaes.c
+++ b/cipher/rijndael-vaes.c
@@ -1,5 +1,5 @@
/* VAES/AVX2 AMD64 accelerated AES for Libgcrypt
- * Copyright (C) 2021 Jussi Kivilinna <jussi.kivilinna@iki.fi>
+ * Copyright (C) 2021,2026 Jussi Kivilinna <jussi.kivilinna@iki.fi>
*
* This file is part of Libgcrypt.
*
@@ -99,6 +99,66 @@ extern void _gcry_vaes_avx2_ecb_crypt_amd64 (const void *keysched,
unsigned int nrounds) ASM_FUNC_ABI;
+#ifdef USE_VAES_AVX512
+extern void _gcry_vaes_avx512_cbc_dec_amd64 (const void *keysched,
+ unsigned char *iv,
+ void *outbuf_arg,
+ const void *inbuf_arg,
+ size_t nblocks,
+ unsigned int nrounds) ASM_FUNC_ABI;
+
+extern void _gcry_vaes_avx512_cfb_dec_amd64 (const void *keysched,
+ unsigned char *iv,
+ void *outbuf_arg,
+ const void *inbuf_arg,
+ size_t nblocks,
+ unsigned int nrounds) ASM_FUNC_ABI;
+
+extern void _gcry_vaes_avx512_ctr_enc_amd64 (const void *keysched,
+ unsigned char *ctr,
+ void *outbuf_arg,
+ const void *inbuf_arg,
+ size_t nblocks,
+ unsigned int nrounds) ASM_FUNC_ABI;
+
+extern void _gcry_vaes_avx512_ctr32le_enc_amd64 (const void *keysched,
+ unsigned char *ctr,
+ void *outbuf_arg,
+ const void *inbuf_arg,
+ size_t nblocks,
+ unsigned int nrounds)
+ ASM_FUNC_ABI;
+
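+/* 'enc_dec_auth': 0 = decrypt, 1 = encrypt, 2 = authenticate only. */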
+extern size_t
+_gcry_vaes_avx512_ocb_aligned_crypt_amd64 (const void *keysched,
+ unsigned int blkn,
+ void *outbuf_arg,
+ const void *inbuf_arg,
+ size_t nblocks,
+ unsigned int nrounds,
+ unsigned char *offset,
+ unsigned char *checksum,
+ unsigned char *L_table,
+ int enc_dec_auth) ASM_FUNC_ABI;
+
+extern void _gcry_vaes_avx512_xts_crypt_amd64 (const void *keysched,
+ unsigned char *tweak,
+ void *outbuf_arg,
+ const void *inbuf_arg,
+ size_t nblocks,
+ unsigned int nrounds,
+ int encrypt) ASM_FUNC_ABI;
+
+extern void _gcry_vaes_avx512_ecb_crypt_amd64 (const void *keysched,
+ int encrypt,
+ void *outbuf_arg,
+ const void *inbuf_arg,
+ size_t nblocks,
+ unsigned int nrounds)
+ ASM_FUNC_ABI;
+#endif
+
+
void
_gcry_aes_vaes_ecb_crypt (void *context, void *outbuf,
const void *inbuf, size_t nblocks,
@@ -114,6 +174,15 @@ _gcry_aes_vaes_ecb_crypt (void *context, void *outbuf,
ctx->decryption_prepared = 1;
}
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512)
+ {
+ _gcry_vaes_avx512_ecb_crypt_amd64 (keysched, encrypt, outbuf, inbuf,
+ nblocks, nrounds);
+ return;
+ }
+#endif
+
_gcry_vaes_avx2_ecb_crypt_amd64 (keysched, encrypt, outbuf, inbuf,
nblocks, nrounds);
}
@@ -133,6 +202,15 @@ _gcry_aes_vaes_cbc_dec (void *context, unsigned char *iv,
ctx->decryption_prepared = 1;
}
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512)
+ {
+ _gcry_vaes_avx512_cbc_dec_amd64 (keysched, iv, outbuf, inbuf,
+ nblocks, nrounds);
+ return;
+ }
+#endif
+
_gcry_vaes_avx2_cbc_dec_amd64 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
}
@@ -145,6 +223,15 @@ _gcry_aes_vaes_cfb_dec (void *context, unsigned char *iv,
const void *keysched = ctx->keyschenc32;
unsigned int nrounds = ctx->rounds;
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512)
+ {
+ _gcry_vaes_avx512_cfb_dec_amd64 (keysched, iv, outbuf, inbuf,
+ nblocks, nrounds);
+ return;
+ }
+#endif
+
_gcry_vaes_avx2_cfb_dec_amd64 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
}
@@ -157,6 +244,15 @@ _gcry_aes_vaes_ctr_enc (void *context, unsigned char *iv,
const void *keysched = ctx->keyschenc32;
unsigned int nrounds = ctx->rounds;
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512)
+ {
+ _gcry_vaes_avx512_ctr_enc_amd64 (keysched, iv, outbuf, inbuf,
+ nblocks, nrounds);
+ return;
+ }
+#endif
+
_gcry_vaes_avx2_ctr_enc_amd64 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
}
@@ -169,6 +265,15 @@ _gcry_aes_vaes_ctr32le_enc (void *context, unsigned char *iv,
const void *keysched = ctx->keyschenc32;
unsigned int nrounds = ctx->rounds;
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512)
+ {
+ _gcry_vaes_avx512_ctr32le_enc_amd64 (keysched, iv, outbuf, inbuf,
+ nblocks, nrounds);
+ return;
+ }
+#endif
+
_gcry_vaes_avx2_ctr32le_enc_amd64 (keysched, iv, outbuf, inbuf, nblocks,
nrounds);
}
@@ -191,6 +296,40 @@ _gcry_aes_vaes_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
ctx->decryption_prepared = 1;
}
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512 && nblocks >= 32)
+ {
+ /* Get number of blocks needed to align blkn to 32 for the L-array optimization. */
+ unsigned int num_to_align = (-blkn) & 31;
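+ /* For example, blkn == 5 gives num_to_align == 27; handling those blocks
+ * with the AVX2 path below leaves blkn at a multiple of 32 for the
+ * AVX512 path. */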
+ if (nblocks - num_to_align >= 32)
+ {
+ if (num_to_align)
+ {
+ _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn,
+ outbuf, inbuf, num_to_align,
+ nrounds, c->u_iv.iv,
+ c->u_ctr.ctr, c->u_mode.ocb.L[0],
+ encrypt);
+ blkn += num_to_align;
+ outbuf += num_to_align * BLOCKSIZE;
+ inbuf += num_to_align * BLOCKSIZE;
+ nblocks -= num_to_align;
+ }
+
+ c->u_mode.ocb.data_nblocks = blkn + nblocks;
+
+ return _gcry_vaes_avx512_ocb_aligned_crypt_amd64 (keysched,
+ (unsigned int)blkn,
+ outbuf, inbuf,
+ nblocks,
+ nrounds, c->u_iv.iv,
+ c->u_ctr.ctr,
+ c->u_mode.ocb.L[0],
+ encrypt);
+ }
+ }
+#endif
+
c->u_mode.ocb.data_nblocks = blkn + nblocks;
return _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn, outbuf,
@@ -209,6 +348,36 @@ _gcry_aes_vaes_ocb_auth (gcry_cipher_hd_t c, const void *inbuf_arg,
unsigned int nrounds = ctx->rounds;
u64 blkn = c->u_mode.ocb.aad_nblocks;
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512 && nblocks >= 32)
+ {
+ /* Get number of blocks needed to align blkn to 32 for the L-array optimization. */
+ unsigned int num_to_align = (-blkn) & 31;
+ if (nblocks - num_to_align >= 32)
+ {
+ if (num_to_align)
+ {
+ _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn,
+ NULL, inbuf, num_to_align,
+ nrounds,
+ c->u_mode.ocb.aad_offset,
+ c->u_mode.ocb.aad_sum,
+ c->u_mode.ocb.L[0], 2);
+ blkn += num_to_align;
+ inbuf += num_to_align * BLOCKSIZE;
+ nblocks -= num_to_align;
+ }
+
+ c->u_mode.ocb.aad_nblocks = blkn + nblocks;
+
+ return _gcry_vaes_avx512_ocb_aligned_crypt_amd64 (
+ keysched, (unsigned int)blkn, NULL, inbuf,
+ nblocks, nrounds, c->u_mode.ocb.aad_offset,
+ c->u_mode.ocb.aad_sum, c->u_mode.ocb.L[0], 2);
+ }
+ }
+#endif
+
c->u_mode.ocb.aad_nblocks = blkn + nblocks;
return _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn, NULL,
@@ -233,6 +402,15 @@ _gcry_aes_vaes_xts_crypt (void *context, unsigned char *tweak,
ctx->decryption_prepared = 1;
}
+#ifdef USE_VAES_AVX512
+ if (ctx->use_vaes_avx512)
+ {
+ _gcry_vaes_avx512_xts_crypt_amd64 (keysched, tweak, outbuf, inbuf,
+ nblocks, nrounds, encrypt);
+ return;
+ }
+#endif
+
_gcry_vaes_avx2_xts_crypt_amd64 (keysched, tweak, outbuf, inbuf, nblocks,
nrounds, encrypt);
}
diff --git a/cipher/rijndael.c b/cipher/rijndael.c
index 910073d2..f3daf35a 100644
--- a/cipher/rijndael.c
+++ b/cipher/rijndael.c
@@ -46,6 +46,7 @@
#include "g10lib.h"
#include "cipher.h"
#include "bufhelp.h"
+#include "hwf-common.h"
#include "rijndael-internal.h"
#include "./cipher-internal.h"
@@ -726,6 +727,10 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen,
if ((hwfeatures & HWF_INTEL_VAES_VPCLMUL) &&
(hwfeatures & HWF_INTEL_AVX2))
{
+#ifdef USE_VAES_AVX512
+ ctx->use_vaes_avx512 = !!(hwfeatures & HWF_INTEL_AVX512);
+#endif
+
/* Setup VAES bulk encryption routines. */
bulk_ops->cfb_dec = _gcry_aes_vaes_cfb_dec;
bulk_ops->cbc_dec = _gcry_aes_vaes_cbc_dec;
diff --git a/configure.ac b/configure.ac
index da6f1970..319ff539 100644
--- a/configure.ac
+++ b/configure.ac
@@ -3464,6 +3464,7 @@ if test "$found" = "1" ; then
# Build with the VAES/AVX2 implementation
GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes.lo"
GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes-avx2-amd64.lo"
+ GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes-avx512-amd64.lo"
;;
arm*-*-*)
# Build with the assembly implementation
--
2.51.0