From jussi.kivilinna at iki.fi Thu Jan 1 16:34:39 2026
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Thu, 1 Jan 2026 17:34:39 +0200
Subject: [PATCH] rijndael: add VAES/AVX512 accelerated implementation
Message-ID: <20260101153439.2739073-1-jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add 'rijndael-vaes-avx512-amd64.S'.
* cipher/rijndael-internal.h (USE_VAES_AVX512): New.
(RIJNDAEL_context_s) [USE_VAES_AVX512]: Add 'use_vaes_avx512'.
* cipher/rijndael-vaes-avx2-amd64.S
(_gcry_vaes_avx2_ocb_crypt_amd64): Minor optimization for aligned
blk8 OCB path.
* cipher/rijndael-vaes-avx512-amd64.S: New.
* cipher/rijndael-vaes.c [USE_VAES_AVX512]
(_gcry_vaes_avx512_cbc_dec_amd64, _gcry_vaes_avx512_cfb_dec_amd64)
(_gcry_vaes_avx512_ctr_enc_amd64)
(_gcry_vaes_avx512_ctr32le_enc_amd64)
(_gcry_vaes_avx512_ocb_aligned_crypt_amd64)
(_gcry_vaes_avx512_xts_crypt_amd64)
(_gcry_vaes_avx512_ecb_crypt_amd64): New.
(_gcry_aes_vaes_ecb_crypt, _gcry_aes_vaes_cbc_dec)
(_gcry_aes_vaes_cfb_dec, _gcry_aes_vaes_ctr_enc)
(_gcry_aes_vaes_ctr32le_enc, _gcry_aes_vaes_ocb_crypt)
(_gcry_aes_vaes_ocb_auth, _gcry_aes_vaes_xts_crypt) [USE_VAES_AVX512]: Add
AVX512 code paths.
* cipher/rijndael.c (do_setkey) [USE_VAES_AVX512]: Add setup for
'ctx->use_vaes_avx512'.
* configure.ac: Add 'rijndael-vaes-avx512-amd64.lo'.
--

This commit adds VAES/AVX512 acceleration for AES. The new implementation
is about 2x faster (for parallel modes, such as OCB) than the VAES/AVX2
implementation on AMD zen5. On AMD zen4 and Intel tigerlake, VAES/AVX512
runs at about the same speed as VAES/AVX2, since the hardware supports only
256-bit wide processing for AES instructions.

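To illustrate what "parallel modes" means here, the following is a minimal C sketch of CBC with a toy block function. It is not code from this patch: `toy_encrypt_block`, `toy_decrypt_block`, `cbc_encrypt`, `cbc_decrypt` and `cbc_roundtrip_ok` are hypothetical names, and the toy transform is only an invertible stand-in for AES with no security value. The point is the data dependency: CBC encryption needs the previous ciphertext block before it can start the next one, while CBC decryption can run the block function on every ciphertext block independently, which is what wide vector code can exploit. This is consistent with the benchmark tables, where CBC enc and CFB enc stay at the same speed before and after.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16 /* AES block size in bytes */

/* Hypothetical stand-in for a block cipher.  This is NOT AES and offers no
 * security; it only needs to be invertible for the illustration. */
static void toy_encrypt_block(uint8_t key, const uint8_t *in, uint8_t *out)
{
  for (size_t i = 0; i < BLK; i++)
    out[i] = (uint8_t)((in[i] ^ key) + i);
}

static void toy_decrypt_block(uint8_t key, const uint8_t *in, uint8_t *out)
{
  for (size_t i = 0; i < BLK; i++)
    out[i] = (uint8_t)((uint8_t)(in[i] - i) ^ key);
}

/* CBC encryption: block n cannot start before block n-1 has finished,
 * because its input is P[n] ^ C[n-1].  The loop is inherently serial. */
static void cbc_encrypt(uint8_t key, uint8_t *iv, uint8_t *dst,
                        const uint8_t *src, size_t nblocks)
{
  uint8_t tmp[BLK];

  for (size_t n = 0; n < nblocks; n++)
    {
      for (size_t i = 0; i < BLK; i++)
        tmp[i] = (uint8_t)(src[n * BLK + i] ^ iv[i]);
      toy_encrypt_block(key, tmp, iv); /* iv now holds C[n] */
      memcpy(dst + n * BLK, iv, BLK);
    }
}

/* CBC decryption: P[n] = D(C[n]) ^ C[n-1].  Every C[n] is already known,
 * so the iterations are independent and could run side by side. */
static void cbc_decrypt(uint8_t key, uint8_t *iv, uint8_t *dst,
                        const uint8_t *src, size_t nblocks)
{
  assert(dst != src); /* this sketch handles out-of-place buffers only */
  if (nblocks == 0)
    return;

  for (size_t n = 0; n < nblocks; n++)
    {
      const uint8_t *prev = n ? src + (n - 1) * BLK : iv;

      toy_decrypt_block(key, src + n * BLK, dst + n * BLK);
      for (size_t i = 0; i < BLK; i++)
        dst[n * BLK + i] ^= prev[i];
    }
  memcpy(iv, src + (nblocks - 1) * BLK, BLK); /* next IV is the last C[n] */
}

/* Encrypt four blocks, decrypt them again, and compare. */
static int cbc_roundtrip_ok(void)
{
  uint8_t iv_enc[BLK], iv_dec[BLK];
  uint8_t pt[4 * BLK], ct[4 * BLK], out[4 * BLK];

  for (size_t i = 0; i < BLK; i++)
    iv_enc[i] = iv_dec[i] = (uint8_t)(0xA0 + i);
  for (size_t i = 0; i < sizeof pt; i++)
    pt[i] = (uint8_t)(i * 7);

  cbc_encrypt(0x5c, iv_enc, ct, pt, 4);
  cbc_decrypt(0x5c, iv_dec, out, ct, 4);

  return memcmp(pt, out, sizeof pt) == 0
         && memcmp(iv_enc, iv_dec, BLK) == 0;
}
```

Calling `cbc_roundtrip_ok()` from a test harness checks that decryption restores the plaintext and that both directions leave the same chained IV behind. A vectorized implementation would replace the decryption loop body with several block operations issued at once; the serial encryption loop has no such option.
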
Benchmark on AMD Ryzen 9 9950X3D (zen5):

Before (VAES/AVX2):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.029 ns/B     32722 MiB/s     0.162 c/B      5566±1
        ECB dec |     0.029 ns/B     32824 MiB/s     0.162 c/B      5563
        CBC enc |     0.449 ns/B      2123 MiB/s      2.50 c/B      5563
        CBC dec |     0.029 ns/B     32735 MiB/s     0.162 c/B      5566
        CFB enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5565
        CFB dec |     0.029 ns/B     32752 MiB/s     0.162 c/B      5565
        CTR enc |     0.030 ns/B     31694 MiB/s     0.167 c/B      5565
        CTR dec |     0.030 ns/B     31727 MiB/s     0.167 c/B      5568
        XTS enc |     0.033 ns/B     28776 MiB/s     0.184 c/B      5560
        XTS dec |     0.033 ns/B     28517 MiB/s     0.186 c/B      5551±4
        GCM enc |     0.074 ns/B     12841 MiB/s     0.413 c/B      5565
        GCM dec |     0.075 ns/B     12658 MiB/s     0.419 c/B      5566
       GCM auth |     0.045 ns/B     21322 MiB/s     0.249 c/B      5566
        OCB enc |     0.030 ns/B     32298 MiB/s     0.164 c/B      5543±4
        OCB dec |     0.029 ns/B     32476 MiB/s     0.163 c/B      5545±6
       OCB auth |     0.029 ns/B     32961 MiB/s     0.161 c/B      5561±2

After (VAES/AVX512):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.015 ns/B     62011 MiB/s     0.085 c/B      5553±5
        ECB dec |     0.015 ns/B     63315 MiB/s     0.084 c/B      5552±3
        CBC enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5565
        CBC dec |     0.015 ns/B     63800 MiB/s     0.083 c/B      5557±4
        CFB enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5562
        CFB dec |     0.015 ns/B     62510 MiB/s     0.085 c/B      5557±1
        CTR enc |     0.016 ns/B     60975 MiB/s     0.087 c/B      5564
        CTR dec |     0.016 ns/B     60737 MiB/s     0.087 c/B      5556±2
        XTS enc |     0.018 ns/B     53861 MiB/s     0.098 c/B      5561±1
        XTS dec |     0.018 ns/B     53604 MiB/s     0.099 c/B      5549±3
        GCM enc |     0.037 ns/B     25806 MiB/s     0.206 c/B      5561±3
        GCM dec |     0.038 ns/B     25223 MiB/s     0.210 c/B      5555±5
       GCM auth |     0.021 ns/B     44365 MiB/s     0.120 c/B      5562
        OCB enc |     0.016 ns/B     61035 MiB/s     0.087 c/B      5545±6
        OCB dec |     0.015 ns/B     62190 MiB/s     0.085 c/B      5544±5
       OCB auth |     0.015 ns/B     63886 MiB/s     0.083 c/B      5543±7

Benchmark on AMD Ryzen 9 7900X (zen4):

Before (VAES/AVX2):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.028 ns/B     33759 MiB/s     0.160 c/B      5676
        ECB dec |     0.028 ns/B     33560 MiB/s     0.161 c/B      5676
        CBC enc |     0.441 ns/B      2165 MiB/s      2.50 c/B      5676
        CBC dec |     0.029 ns/B     32766 MiB/s     0.165 c/B      5677±2
        CFB enc |     0.440 ns/B      2165 MiB/s      2.50 c/B      5676
        CFB dec |     0.029 ns/B     33053 MiB/s     0.164 c/B      5686±4
        CTR enc |     0.029 ns/B     32420 MiB/s     0.167 c/B      5677±1
        CTR dec |     0.029 ns/B     32531 MiB/s     0.167 c/B      5690±5
        XTS enc |     0.038 ns/B     25081 MiB/s     0.215 c/B      5650
        XTS dec |     0.038 ns/B     25020 MiB/s     0.217 c/B      5704±6
        GCM enc |     0.067 ns/B     14170 MiB/s     0.370 c/B      5500
        GCM dec |     0.067 ns/B     14205 MiB/s     0.369 c/B      5500
       GCM auth |     0.038 ns/B     25110 MiB/s     0.209 c/B      5500
        OCB enc |     0.030 ns/B     31579 MiB/s     0.172 c/B      5708±20
        OCB dec |     0.030 ns/B     31613 MiB/s     0.173 c/B      5722±5
       OCB auth |     0.029 ns/B     32535 MiB/s     0.167 c/B      5688±1

After (VAES/AVX512):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.028 ns/B     33551 MiB/s     0.161 c/B      5676
        ECB dec |     0.029 ns/B     33346 MiB/s     0.162 c/B      5675
        CBC enc |     0.440 ns/B      2166 MiB/s      2.50 c/B      5675
        CBC dec |     0.029 ns/B     33308 MiB/s     0.163 c/B      5685±3
        CFB enc |     0.440 ns/B      2165 MiB/s      2.50 c/B      5675
        CFB dec |     0.029 ns/B     33254 MiB/s     0.163 c/B      5671±1
        CTR enc |     0.029 ns/B     33367 MiB/s     0.163 c/B      5686
        CTR dec |     0.029 ns/B     33447 MiB/s     0.162 c/B      5687
        XTS enc |     0.034 ns/B     27705 MiB/s     0.195 c/B      5673±1
        XTS dec |     0.035 ns/B     27429 MiB/s     0.197 c/B      5677
        GCM enc |     0.057 ns/B     16625 MiB/s     0.324 c/B      5652
        GCM dec |     0.059 ns/B     16094 MiB/s     0.326 c/B      5510
       GCM auth |     0.030 ns/B     31982 MiB/s     0.164 c/B      5500
        OCB enc |     0.030 ns/B     31630 MiB/s     0.166 c/B      5500
        OCB dec |     0.030 ns/B     32214 MiB/s     0.163 c/B      5500
       OCB auth |     0.029 ns/B     33413 MiB/s     0.157 c/B      5500

Benchmark on Intel Core i3-1115G4I (tigerlake):

Before (VAES/AVX2):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.038 ns/B     25068 MiB/s     0.156 c/B      4090
        ECB dec |     0.038 ns/B     25157 MiB/s     0.155 c/B      4090
        CBC enc |     0.459 ns/B      2080 MiB/s      1.88 c/B      4090
        CBC dec |     0.038 ns/B     25091 MiB/s     0.155 c/B      4090
        CFB enc |     0.458 ns/B      2081 MiB/s      1.87 c/B      4090
        CFB dec |     0.038 ns/B     25176 MiB/s     0.155 c/B      4090
        CTR enc |     0.039 ns/B     24466 MiB/s     0.159 c/B      4090
        CTR dec |     0.039 ns/B     24428 MiB/s     0.160 c/B      4090
        XTS enc |     0.057 ns/B     16760 MiB/s     0.233 c/B      4090
        XTS dec |     0.056 ns/B     16952 MiB/s     0.230 c/B      4090
        GCM enc |     0.102 ns/B      9344 MiB/s     0.417 c/B      4090
        GCM dec |     0.102 ns/B      9312 MiB/s     0.419 c/B      4090
       GCM auth |     0.063 ns/B     15243 MiB/s     0.256 c/B      4090
        OCB enc |     0.042 ns/B     22451 MiB/s     0.174 c/B      4090
        OCB dec |     0.042 ns/B     22613 MiB/s     0.172 c/B      4090
       OCB auth |     0.040 ns/B     23770 MiB/s     0.164 c/B      4090

After (VAES/AVX512):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.040 ns/B     24094 MiB/s     0.162 c/B      4097±3
        ECB dec |     0.040 ns/B     24052 MiB/s     0.162 c/B      4097±3
        CBC enc |     0.458 ns/B      2080 MiB/s      1.88 c/B      4090
        CBC dec |     0.039 ns/B     24385 MiB/s     0.160 c/B      4097±3
        CFB enc |     0.458 ns/B      2080 MiB/s      1.87 c/B      4090
        CFB dec |     0.039 ns/B     24403 MiB/s     0.160 c/B      4097±3
        CTR enc |     0.040 ns/B     24119 MiB/s     0.162 c/B      4097±3
        CTR dec |     0.040 ns/B     24095 MiB/s     0.162 c/B      4097±3
        XTS enc |     0.048 ns/B     19891 MiB/s     0.196 c/B      4097±3
        XTS dec |     0.048 ns/B     20077 MiB/s     0.195 c/B      4097±3
        GCM enc |     0.084 ns/B     11417 MiB/s     0.342 c/B      4097±3
        GCM dec |     0.084 ns/B     11373 MiB/s     0.344 c/B      4097±3
       GCM auth |     0.045 ns/B     21402 MiB/s     0.183 c/B      4097±3
        OCB enc |     0.040 ns/B     23946 MiB/s     0.163 c/B      4097±3
        OCB dec |     0.040 ns/B     23760 MiB/s     0.164 c/B      4097±4
       OCB auth |     0.041 ns/B     23083 MiB/s     0.169 c/B      4097±4

Signed-off-by: Jussi Kivilinna
---
 cipher/Makefile.am                  |    1 +
 cipher/rijndael-internal.h          |   11 +-
 cipher/rijndael-vaes-avx2-amd64.S   |    7 +-
 cipher/rijndael-vaes-avx512-amd64.S | 2471 +++++++++++++++++++++++++++
 cipher/rijndael-vaes.c              |  180 +-
 cipher/rijndael.c                   |    5 +
 configure.ac                        |    1 +
 7 files changed, 2668 insertions(+), 8 deletions(-)
 create mode 100644 cipher/rijndael-vaes-avx512-amd64.S

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index bbcd518a..11bb19d7 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -117,6 +117,7 @@ EXTRA_libcipher_la_SOURCES = \
   rijndael-amd64.S rijndael-arm.S \
   rijndael-ssse3-amd64.c rijndael-ssse3-amd64-asm.S \
   rijndael-vaes.c rijndael-vaes-avx2-amd64.S \
+  rijndael-vaes-avx512-amd64.S \
   rijndael-vaes-i386.c rijndael-vaes-avx2-i386.S \
   rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S \
   rijndael-armv8-aarch64-ce.S rijndael-aarch64.S \
diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h
index 15084a69..bb8f97a0 100644
--- a/cipher/rijndael-internal.h
+++ b/cipher/rijndael-internal.h
@@ -89,7 +89,7 @@
 # endif
 #endif /* ENABLE_AESNI_SUPPORT */
-/* USE_VAES inidicates whether to compile with AMD64 VAES code. */
+/* USE_VAES indicates whether to compile with AMD64 VAES/AVX2 code. */
 #undef USE_VAES
 #if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
      defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
@@ -99,6 +99,12 @@
 # define USE_VAES 1
 #endif
+/* USE_VAES_AVX512 indicates whether to compile with AMD64 VAES/AVX512 code. */
+#undef USE_VAES_AVX512
+#if defined(USE_VAES) && defined(ENABLE_AVX512_SUPPORT)
+# define USE_VAES_AVX512 1
+#endif
+
 /* USE_VAES_I386 inidicates whether to compile with i386 VAES code. */
 #undef USE_VAES_I386
 #if (defined(HAVE_COMPATIBLE_GCC_I386_PLATFORM_AS) || \
@@ -210,6 +216,9 @@ typedef struct RIJNDAEL_context_s
   unsigned int use_avx:1; /* AVX shall be used by AES-NI implementation. */
   unsigned int use_avx2:1; /* AVX2 shall be used by AES-NI implementation. */
 #endif /*USE_AESNI*/
+#ifdef USE_VAES_AVX512
+  unsigned int use_vaes_avx512:1; /* AVX512 shall be used by VAES implementation.
*/
+#endif /*USE_VAES_AVX512*/
 #ifdef USE_S390X_CRYPTO
   byte km_func;
   byte km_func_xts;
diff --git a/cipher/rijndael-vaes-avx2-amd64.S b/cipher/rijndael-vaes-avx2-amd64.S
index 51ccf932..07e6f1ca 100644
--- a/cipher/rijndael-vaes-avx2-amd64.S
+++ b/cipher/rijndael-vaes-avx2-amd64.S
@@ -2370,16 +2370,11 @@ _gcry_vaes_avx2_ocb_crypt_amd64:
   leaq -8(%r8), %r8;
   leal 8(%esi), %esi;
-  tzcntl %esi, %eax;
-  shll $4, %eax;
   vpxor (0 * 16)(%rsp), %ymm15, %ymm5;
   vpxor (2 * 16)(%rsp), %ymm15, %ymm6;
   vpxor (4 * 16)(%rsp), %ymm15, %ymm7;
-
-  vpxor (2 * 16)(%r14), %xmm15, %xmm13; /* offset ^ first key ^ L[2] */
-  vpxor (%r14, %rax), %xmm13, %xmm14; /* offset ^ first key ^ L[2] ^ L[ntz{nblk+8}] */
-  vinserti128 $1, %xmm14, %ymm13, %ymm14;
+  vpxor (6 * 16)(%rsp), %ymm15, %ymm14;
   cmpl $1, %r15d;
   jb .Locb_aligned_blk8_dec;
diff --git a/cipher/rijndael-vaes-avx512-amd64.S b/cipher/rijndael-vaes-avx512-amd64.S
new file mode 100644
index 00000000..b7dba5e3
--- /dev/null
+++ b/cipher/rijndael-vaes-avx512-amd64.S
@@ -0,0 +1,2471 @@
+/* VAES/AVX512 AMD64 accelerated AES for Libgcrypt
+ * Copyright (C) 2026 Jussi Kivilinna
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see .
+ */
+
+#if defined(__x86_64__)
+#include
+#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
+    defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) && \
+    defined(ENABLE_AVX512_SUPPORT) && \
+    defined(HAVE_GCC_INLINE_ASM_VAES_VPCLMUL)
+
+#include "asm-common-amd64.h"
+
+.text
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+#define no(...) /*_*/
+#define yes(...) __VA_ARGS__
+
+#define AES_OP8(op, key, b0, b1, b2, b3, b4, b5, b6, b7) \
+  op key, b0, b0; \
+  op key, b1, b1; \
+  op key, b2, b2; \
+  op key, b3, b3; \
+  op key, b4, b4; \
+  op key, b5, b5; \
+  op key, b6, b6; \
+  op key, b7, b7;
+
+#define VAESENC8(key, b0, b1, b2, b3, b4, b5, b6, b7) \
+  AES_OP8(vaesenc, key, b0, b1, b2, b3, b4, b5, b6, b7)
+
+#define VAESDEC8(key, b0, b1, b2, b3, b4, b5, b6, b7) \
+  AES_OP8(vaesdec, key, b0, b1, b2, b3, b4, b5, b6, b7)
+
+#define XOR8(key, b0, b1, b2, b3, b4, b5, b6, b7) \
+  AES_OP8(vpxord, key, b0, b1, b2, b3, b4, b5, b6, b7)
+
+#define AES_OP4(op, key, b0, b1, b2, b3) \
+  op key, b0, b0; \
+  op key, b1, b1; \
+  op key, b2, b2; \
+  op key, b3, b3;
+
+#define VAESENC4(key, b0, b1, b2, b3) \
+  AES_OP4(vaesenc, key, b0, b1, b2, b3)
+
+#define VAESDEC4(key, b0, b1, b2, b3) \
+  AES_OP4(vaesdec, key, b0, b1, b2, b3)
+
+#define XOR4(key, b0, b1, b2, b3) \
+  AES_OP4(vpxord, key, b0, b1, b2, b3)
+
+/**********************************************************************
+  CBC-mode decryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_cbc_dec_amd64,@function)
+.globl _gcry_vaes_avx512_cbc_dec_amd64
+.align 16
+_gcry_vaes_avx512_cbc_dec_amd64:
+  /* input:
+   * %rdi: round keys
+   * %rsi: iv
+   * %rdx: dst
+   * %rcx: src
+   * %r8: nblocks
+   * %r9: nrounds
+   */
+  CFI_STARTPROC();
+
+  cmpq $16, %r8;
+  jb .Lcbc_dec_skip_avx512;
+
+
spec_stop_avx512; + + /* Load IV. */ + vmovdqu (%rsi), %xmm15; + + /* Load first and last key. */ + leal (, %r9d, 4), %eax; + vbroadcasti32x4 (%rdi), %zmm30; + vbroadcasti32x4 (%rdi, %rax, 4), %zmm31; + + /* Process 32 blocks per loop. */ +.align 16 +.Lcbc_dec_blk32: + cmpq $32, %r8; + jb .Lcbc_dec_blk16; + + leaq -32(%r8), %r8; + + /* Load input and xor first key. Update IV. */ + vmovdqu32 (0 * 16)(%rcx), %zmm0; + vshufi32x4 $0b10010011, %zmm0, %zmm0, %zmm9; + vmovdqu32 (4 * 16)(%rcx), %zmm1; + vmovdqu32 (8 * 16)(%rcx), %zmm2; + vmovdqu32 (12 * 16)(%rcx), %zmm3; + vmovdqu32 (16 * 16)(%rcx), %zmm4; + vmovdqu32 (20 * 16)(%rcx), %zmm5; + vmovdqu32 (24 * 16)(%rcx), %zmm6; + vmovdqu32 (28 * 16)(%rcx), %zmm7; + vinserti32x4 $0, %xmm15, %zmm9, %zmm9; + vpxord %zmm30, %zmm0, %zmm0; + vpxord %zmm30, %zmm1, %zmm1; + vpxord %zmm30, %zmm2, %zmm2; + vpxord %zmm30, %zmm3, %zmm3; + vpxord %zmm30, %zmm4, %zmm4; + vpxord %zmm30, %zmm5, %zmm5; + vpxord %zmm30, %zmm6, %zmm6; + vpxord %zmm30, %zmm7, %zmm7; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm8; + vpxord %zmm31, %zmm9, %zmm9; + vpxord (3 * 16)(%rcx), %zmm31, %zmm10; + vpxord (7 * 16)(%rcx), %zmm31, %zmm11; + vpxord (11 * 16)(%rcx), %zmm31, %zmm12; + vpxord (15 * 16)(%rcx), %zmm31, %zmm13; + vpxord (19 * 16)(%rcx), %zmm31, %zmm14; + vpxord (23 * 16)(%rcx), %zmm31, %zmm16; + vpxord (27 * 16)(%rcx), %zmm31, %zmm17; + vmovdqu (31 * 16)(%rcx), %xmm15; + leaq (32 * 16)(%rcx), %rcx; + + /* AES rounds */ + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + 
vbroadcasti32x4 (6 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Lcbc_dec_blk32_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Lcbc_dec_blk32_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + + /* Last round and output handling. */ + .align 16 + .Lcbc_dec_blk32_last: + vaesdeclast %zmm9, %zmm0, %zmm0; + vaesdeclast %zmm10, %zmm1, %zmm1; + vaesdeclast %zmm11, %zmm2, %zmm2; + vaesdeclast %zmm12, %zmm3, %zmm3; + vaesdeclast %zmm13, %zmm4, %zmm4; + vaesdeclast %zmm14, %zmm5, %zmm5; + vaesdeclast %zmm16, %zmm6, %zmm6; + vaesdeclast %zmm17, %zmm7, %zmm7; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + vmovdqu32 %zmm4, (16 * 16)(%rdx); + vmovdqu32 %zmm5, (20 * 16)(%rdx); + vmovdqu32 %zmm6, (24 * 16)(%rdx); + vmovdqu32 %zmm7, (28 * 16)(%rdx); + leaq (32 * 16)(%rdx), %rdx; + + jmp .Lcbc_dec_blk32; + + /* Process 16 blocks per loop. */ +.align 16 +.Lcbc_dec_blk16: + cmpq $16, %r8; + jb .Lcbc_dec_tail; + + leaq -16(%r8), %r8; + + /* Load input and xor first key. Update IV. 
*/ + vmovdqu32 (0 * 16)(%rcx), %zmm0; + vshufi32x4 $0b10010011, %zmm0, %zmm0, %zmm9; + vmovdqu32 (4 * 16)(%rcx), %zmm1; + vmovdqu32 (8 * 16)(%rcx), %zmm2; + vmovdqu32 (12 * 16)(%rcx), %zmm3; + vinserti32x4 $0, %xmm15, %zmm9, %zmm9; + vpxord %zmm30, %zmm0, %zmm0; + vpxord %zmm30, %zmm1, %zmm1; + vpxord %zmm30, %zmm2, %zmm2; + vpxord %zmm30, %zmm3, %zmm3; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm8; + vpxord %zmm31, %zmm9, %zmm9; + vpxord (3 * 16)(%rcx), %zmm31, %zmm10; + vpxord (7 * 16)(%rcx), %zmm31, %zmm11; + vpxord (11 * 16)(%rcx), %zmm31, %zmm12; + vmovdqu (15 * 16)(%rcx), %xmm15; + leaq (16 * 16)(%rcx), %rcx; + + /* AES rounds */ + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lcbc_dec_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lcbc_dec_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm8; + VAESDEC4(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. 
*/
+  .align 16
+  .Lcbc_dec_blk16_last:
+  vaesdeclast %zmm9, %zmm0, %zmm0;
+  vaesdeclast %zmm10, %zmm1, %zmm1;
+  vaesdeclast %zmm11, %zmm2, %zmm2;
+  vaesdeclast %zmm12, %zmm3, %zmm3;
+  vmovdqu32 %zmm0, (0 * 16)(%rdx);
+  vmovdqu32 %zmm1, (4 * 16)(%rdx);
+  vmovdqu32 %zmm2, (8 * 16)(%rdx);
+  vmovdqu32 %zmm3, (12 * 16)(%rdx);
+  leaq (16 * 16)(%rdx), %rdx;
+
+.align 16
+.Lcbc_dec_tail:
+  /* Store IV. */
+  vmovdqu %xmm15, (%rsi);
+
+  /* Clear used AVX512 registers. */
+  vpxord %ymm16, %ymm16, %ymm16;
+  vpxord %ymm17, %ymm17, %ymm17;
+  vpxord %ymm30, %ymm30, %ymm30;
+  vpxord %ymm31, %ymm31, %ymm31;
+  vzeroall;
+
+.align 16
+.Lcbc_dec_skip_avx512:
+  /* Handle trailing blocks with AVX2 implementation. */
+  cmpq $0, %r8;
+  ja _gcry_vaes_avx2_cbc_dec_amd64;
+
+  ret_spec_stop
+  CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_cbc_dec_amd64,.-_gcry_vaes_avx512_cbc_dec_amd64)
+
+/**********************************************************************
+  CFB-mode decryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_cfb_dec_amd64,@function)
+.globl _gcry_vaes_avx512_cfb_dec_amd64
+.align 16
+_gcry_vaes_avx512_cfb_dec_amd64:
+  /* input:
+   * %rdi: round keys
+   * %rsi: iv
+   * %rdx: dst
+   * %rcx: src
+   * %r8: nblocks
+   * %r9: nrounds
+   */
+  CFI_STARTPROC();
+
+  cmpq $16, %r8;
+  jb .Lcfb_dec_skip_avx512;
+
+  spec_stop_avx512;
+
+  /* Load IV. */
+  vmovdqu (%rsi), %xmm15;
+
+  /* Load first and last key. */
+  leal (, %r9d, 4), %eax;
+  vbroadcasti32x4 (%rdi), %zmm30;
+  vbroadcasti32x4 (%rdi, %rax, 4), %zmm31;
+
+  /* Process 32 blocks per loop. */
+.align 16
+.Lcfb_dec_blk32:
+  cmpq $32, %r8;
+  jb .Lcfb_dec_blk16;
+
+  leaq -32(%r8), %r8;
+
+  /* Load input and xor first key. Update IV.
*/ + vmovdqu32 (0 * 16)(%rcx), %zmm9; + vshufi32x4 $0b10010011, %zmm9, %zmm9, %zmm0; + vmovdqu32 (3 * 16)(%rcx), %zmm1; + vinserti32x4 $0, %xmm15, %zmm0, %zmm0; + vmovdqu32 (7 * 16)(%rcx), %zmm2; + vmovdqu32 (11 * 16)(%rcx), %zmm3; + vmovdqu32 (15 * 16)(%rcx), %zmm4; + vmovdqu32 (19 * 16)(%rcx), %zmm5; + vmovdqu32 (23 * 16)(%rcx), %zmm6; + vmovdqu32 (27 * 16)(%rcx), %zmm7; + vmovdqu (31 * 16)(%rcx), %xmm15; + vpxord %zmm30, %zmm0, %zmm0; + vpxord %zmm30, %zmm1, %zmm1; + vpxord %zmm30, %zmm2, %zmm2; + vpxord %zmm30, %zmm3, %zmm3; + vpxord %zmm30, %zmm4, %zmm4; + vpxord %zmm30, %zmm5, %zmm5; + vpxord %zmm30, %zmm6, %zmm6; + vpxord %zmm30, %zmm7, %zmm7; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm8; + vpxord %zmm31, %zmm9, %zmm9; + vpxord (4 * 16)(%rcx), %zmm31, %zmm10; + vpxord (8 * 16)(%rcx), %zmm31, %zmm11; + vpxord (12 * 16)(%rcx), %zmm31, %zmm12; + vpxord (16 * 16)(%rcx), %zmm31, %zmm13; + vpxord (20 * 16)(%rcx), %zmm31, %zmm14; + vpxord (24 * 16)(%rcx), %zmm31, %zmm16; + vpxord (28 * 16)(%rcx), %zmm31, %zmm17; + leaq (32 * 16)(%rcx), %rcx; + + /* AES rounds */ + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), 
%zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Lcfb_dec_blk32_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Lcfb_dec_blk32_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + + /* Last round and output handling. */ + .align 16 + .Lcfb_dec_blk32_last: + vaesenclast %zmm9, %zmm0, %zmm0; + vaesenclast %zmm10, %zmm1, %zmm1; + vaesenclast %zmm11, %zmm2, %zmm2; + vaesenclast %zmm12, %zmm3, %zmm3; + vaesenclast %zmm13, %zmm4, %zmm4; + vaesenclast %zmm14, %zmm5, %zmm5; + vaesenclast %zmm16, %zmm6, %zmm6; + vaesenclast %zmm17, %zmm7, %zmm7; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + vmovdqu32 %zmm4, (16 * 16)(%rdx); + vmovdqu32 %zmm5, (20 * 16)(%rdx); + vmovdqu32 %zmm6, (24 * 16)(%rdx); + vmovdqu32 %zmm7, (28 * 16)(%rdx); + leaq (32 * 16)(%rdx), %rdx; + + jmp .Lcfb_dec_blk32; + + /* Handle trailing 16 blocks. */ +.align 16 +.Lcfb_dec_blk16: + cmpq $16, %r8; + jb .Lcfb_dec_tail; + + leaq -16(%r8), %r8; + + /* Load input and xor first key. Update IV. 
*/ + vmovdqu32 (0 * 16)(%rcx), %zmm10; + vshufi32x4 $0b10010011, %zmm10, %zmm10, %zmm0; + vmovdqu32 (3 * 16)(%rcx), %zmm1; + vinserti32x4 $0, %xmm15, %zmm0, %zmm0; + vmovdqu32 (7 * 16)(%rcx), %zmm2; + vmovdqu32 (11 * 16)(%rcx), %zmm3; + vmovdqu (15 * 16)(%rcx), %xmm15; + vpxord %zmm30, %zmm0, %zmm0; + vpxord %zmm30, %zmm1, %zmm1; + vpxord %zmm30, %zmm2, %zmm2; + vpxord %zmm30, %zmm3, %zmm3; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + vpxord %zmm31, %zmm10, %zmm10; + vpxord (4 * 16)(%rcx), %zmm31, %zmm11; + vpxord (8 * 16)(%rcx), %zmm31, %zmm12; + vpxord (12 * 16)(%rcx), %zmm31, %zmm13; + leaq (16 * 16)(%rcx), %rcx; + + /* AES rounds */ + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lcfb_dec_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lcfb_dec_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. 
*/
+  .align 16
+  .Lcfb_dec_blk16_last:
+  vaesenclast %zmm10, %zmm0, %zmm0;
+  vaesenclast %zmm11, %zmm1, %zmm1;
+  vaesenclast %zmm12, %zmm2, %zmm2;
+  vaesenclast %zmm13, %zmm3, %zmm3;
+  vmovdqu32 %zmm0, (0 * 16)(%rdx);
+  vmovdqu32 %zmm1, (4 * 16)(%rdx);
+  vmovdqu32 %zmm2, (8 * 16)(%rdx);
+  vmovdqu32 %zmm3, (12 * 16)(%rdx);
+  leaq (16 * 16)(%rdx), %rdx;
+
+.align 16
+.Lcfb_dec_tail:
+  /* Store IV. */
+  vmovdqu %xmm15, (%rsi);
+
+  /* Clear used AVX512 registers. */
+  vpxord %ymm16, %ymm16, %ymm16;
+  vpxord %ymm17, %ymm17, %ymm17;
+  vpxord %ymm30, %ymm30, %ymm30;
+  vpxord %ymm31, %ymm31, %ymm31;
+  vzeroall;
+
+.align 16
+.Lcfb_dec_skip_avx512:
+  /* Handle trailing blocks with AVX2 implementation. */
+  cmpq $0, %r8;
+  ja _gcry_vaes_avx2_cfb_dec_amd64;
+
+  ret_spec_stop
+  CFI_ENDPROC();
+ELF(.size _gcry_vaes_avx512_cfb_dec_amd64,.-_gcry_vaes_avx512_cfb_dec_amd64)
+
+/**********************************************************************
+  CTR-mode encryption
+ **********************************************************************/
+ELF(.type _gcry_vaes_avx512_ctr_enc_amd64,@function)
+.globl _gcry_vaes_avx512_ctr_enc_amd64
+.align 16
+_gcry_vaes_avx512_ctr_enc_amd64:
+  /* input:
+   * %rdi: round keys
+   * %rsi: counter
+   * %rdx: dst
+   * %rcx: src
+   * %r8: nblocks
+   * %r9: nrounds
+   */
+  CFI_STARTPROC();
+
+  cmpq $16, %r8;
+  jb .Lctr_enc_skip_avx512;
+
+  spec_stop_avx512;
+
+  movq 8(%rsi), %r10;
+  movq 0(%rsi), %r11;
+  bswapq %r10;
+  bswapq %r11;
+
+  vmovdqa32 .Lbige_addb_0 rRIP, %zmm20;
+  vmovdqa32 .Lbige_addb_4 rRIP, %zmm21;
+  vmovdqa32 .Lbige_addb_8 rRIP, %zmm22;
+  vmovdqa32 .Lbige_addb_12 rRIP, %zmm23;
+
+  /* Load first and last key.
*/ + leal (, %r9d, 4), %eax; + vbroadcasti32x4 (%rdi), %zmm30; + vbroadcasti32x4 (%rdi, %rax, 4), %zmm31; + + cmpq $32, %r8; + jb .Lctr_enc_blk16; + + vmovdqa32 .Lbige_addb_16 rRIP, %zmm24; + vmovdqa32 .Lbige_addb_20 rRIP, %zmm25; + vmovdqa32 .Lbige_addb_24 rRIP, %zmm26; + vmovdqa32 .Lbige_addb_28 rRIP, %zmm27; + +#define add_le128(out, in, lo_counter, hi_counter1) \ + vpaddq lo_counter, in, out; \ + vpcmpuq $1, lo_counter, out, %k1; \ + kaddb %k1, %k1, %k1; \ + vpaddq hi_counter1, out, out{%k1}; + +#define handle_ctr_128bit_add(nblks) \ + addq $(nblks), %r10; \ + adcq $0, %r11; \ + bswapq %r10; \ + bswapq %r11; \ + movq %r10, 8(%rsi); \ + movq %r11, 0(%rsi); \ + bswapq %r10; \ + bswapq %r11; + + /* Process 32 blocks per loop. */ +.align 16 +.Lctr_enc_blk32: + leaq -32(%r8), %r8; + + vbroadcasti32x4 (%rsi), %zmm7; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm8; + + /* detect if carry handling is needed */ + addb $32, 15(%rsi); + jc .Lctr_enc_blk32_handle_carry; + + leaq 32(%r10), %r10; + + .Lctr_enc_blk32_byte_bige_add: + /* Increment counters. 
*/ + vpaddb %zmm20, %zmm7, %zmm0; + vpaddb %zmm21, %zmm7, %zmm1; + vpaddb %zmm22, %zmm7, %zmm2; + vpaddb %zmm23, %zmm7, %zmm3; + vpaddb %zmm24, %zmm7, %zmm4; + vpaddb %zmm25, %zmm7, %zmm5; + vpaddb %zmm26, %zmm7, %zmm6; + vpaddb %zmm27, %zmm7, %zmm7; + + .Lctr_enc_blk32_rounds: + /* AES rounds */ + XOR8(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Lctr_enc_blk32_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Lctr_enc_blk32_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + + /* Last round and output handling. 
*/ + .align 16 + .Lctr_enc_blk32_last: + vpxord (0 * 16)(%rcx), %zmm31, %zmm9; /* Xor src to last round key. */ + vpxord (4 * 16)(%rcx), %zmm31, %zmm10; + vpxord (8 * 16)(%rcx), %zmm31, %zmm11; + vpxord (12 * 16)(%rcx), %zmm31, %zmm12; + vpxord (16 * 16)(%rcx), %zmm31, %zmm13; + vpxord (20 * 16)(%rcx), %zmm31, %zmm14; + vpxord (24 * 16)(%rcx), %zmm31, %zmm15; + vpxord (28 * 16)(%rcx), %zmm31, %zmm8; + leaq (32 * 16)(%rcx), %rcx; + vaesenclast %zmm9, %zmm0, %zmm0; + vaesenclast %zmm10, %zmm1, %zmm1; + vaesenclast %zmm11, %zmm2, %zmm2; + vaesenclast %zmm12, %zmm3, %zmm3; + vaesenclast %zmm13, %zmm4, %zmm4; + vaesenclast %zmm14, %zmm5, %zmm5; + vaesenclast %zmm15, %zmm6, %zmm6; + vaesenclast %zmm8, %zmm7, %zmm7; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + vmovdqu32 %zmm4, (16 * 16)(%rdx); + vmovdqu32 %zmm5, (20 * 16)(%rdx); + vmovdqu32 %zmm6, (24 * 16)(%rdx); + vmovdqu32 %zmm7, (28 * 16)(%rdx); + leaq (32 * 16)(%rdx), %rdx; + + cmpq $32, %r8; + jnb .Lctr_enc_blk32; + + /* Clear used AVX512 registers. */ + vpxord %ymm24, %ymm24, %ymm24; + vpxord %ymm25, %ymm25, %ymm25; + vpxord %ymm26, %ymm26, %ymm26; + vpxord %ymm27, %ymm27, %ymm27; + + jmp .Lctr_enc_blk16; + + .align 16 + .Lctr_enc_blk32_handle_only_ctr_carry: + handle_ctr_128bit_add(32); + jmp .Lctr_enc_blk32_byte_bige_add; + + .align 16 + .Lctr_enc_blk32_handle_carry: + jz .Lctr_enc_blk32_handle_only_ctr_carry; + /* Increment counters (handle carry). 
*/ + vbroadcasti32x4 .Lbswap128_mask rRIP, %zmm15; + vpmovzxbq .Lcounter0_1_2_3_lo_bq rRIP, %zmm10; + vpmovzxbq .Lcounter1_1_1_1_hi_bq rRIP, %zmm13; + vpshufb %zmm15, %zmm7, %zmm7; /* be => le */ + vpmovzxbq .Lcounter4_4_4_4_lo_bq rRIP, %zmm11; + vpmovzxbq .Lcounter8_8_8_8_lo_bq rRIP, %zmm12; + handle_ctr_128bit_add(32); + add_le128(%zmm0, %zmm7, %zmm10, %zmm13); /* +0:+1:+2:+3 */ + add_le128(%zmm1, %zmm0, %zmm11, %zmm13); /* +4:+5:+6:+7 */ + add_le128(%zmm2, %zmm0, %zmm12, %zmm13); /* +8:... */ + vpshufb %zmm15, %zmm0, %zmm0; /* le => be */ + add_le128(%zmm3, %zmm1, %zmm12, %zmm13); /* +12:... */ + vpshufb %zmm15, %zmm1, %zmm1; /* le => be */ + add_le128(%zmm4, %zmm2, %zmm12, %zmm13); /* +16:... */ + vpshufb %zmm15, %zmm2, %zmm2; /* le => be */ + add_le128(%zmm5, %zmm3, %zmm12, %zmm13); /* +20:... */ + vpshufb %zmm15, %zmm3, %zmm3; /* le => be */ + add_le128(%zmm6, %zmm4, %zmm12, %zmm13); /* +24:... */ + vpshufb %zmm15, %zmm4, %zmm4; /* le => be */ + add_le128(%zmm7, %zmm5, %zmm12, %zmm13); /* +28:... */ + vpshufb %zmm15, %zmm5, %zmm5; /* le => be */ + vpshufb %zmm15, %zmm6, %zmm6; /* le => be */ + vpshufb %zmm15, %zmm7, %zmm7; /* le => be */ + + jmp .Lctr_enc_blk32_rounds; + + /* Handle trailing 16 blocks. */ +.align 16 +.Lctr_enc_blk16: + cmpq $16, %r8; + jb .Lctr_enc_tail; + + leaq -16(%r8), %r8; + + vbroadcasti32x4 (%rsi), %zmm3; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + + /* detect if carry handling is needed */ + addb $16, 15(%rsi); + jc .Lctr_enc_blk16_handle_carry; + + leaq 16(%r10), %r10; + + .Lctr_enc_blk16_byte_bige_add: + /* Increment counters. 
*/ + vpaddb %zmm20, %zmm3, %zmm0; + vpaddb %zmm21, %zmm3, %zmm1; + vpaddb %zmm22, %zmm3, %zmm2; + vpaddb %zmm23, %zmm3, %zmm3; + + .Lctr_enc_blk16_rounds: + /* AES rounds */ + XOR4(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3); + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lctr_enc_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lctr_enc_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. */ + .align 16 + .Lctr_enc_blk16_last: + vpxord (0 * 16)(%rcx), %zmm31, %zmm5; /* Xor src to last round key. 
*/ + vpxord (4 * 16)(%rcx), %zmm31, %zmm6; + vpxord (8 * 16)(%rcx), %zmm31, %zmm7; + vpxord (12 * 16)(%rcx), %zmm31, %zmm4; + leaq (16 * 16)(%rcx), %rcx; + vaesenclast %zmm5, %zmm0, %zmm0; + vaesenclast %zmm6, %zmm1, %zmm1; + vaesenclast %zmm7, %zmm2, %zmm2; + vaesenclast %zmm4, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + + jmp .Lctr_enc_tail; + + .align 16 + .Lctr_enc_blk16_handle_only_ctr_carry: + handle_ctr_128bit_add(16); + jmp .Lctr_enc_blk16_byte_bige_add; + + .align 16 + .Lctr_enc_blk16_handle_carry: + jz .Lctr_enc_blk16_handle_only_ctr_carry; + /* Increment counters (handle carry). */ + vbroadcasti32x4 .Lbswap128_mask rRIP, %zmm15; + vpmovzxbq .Lcounter0_1_2_3_lo_bq rRIP, %zmm10; + vpmovzxbq .Lcounter1_1_1_1_hi_bq rRIP, %zmm13; + vpshufb %zmm15, %zmm3, %zmm3; /* be => le */ + vpmovzxbq .Lcounter4_4_4_4_lo_bq rRIP, %zmm11; + vpmovzxbq .Lcounter8_8_8_8_lo_bq rRIP, %zmm12; + handle_ctr_128bit_add(16); + add_le128(%zmm0, %zmm3, %zmm10, %zmm13); /* +0:+1:+2:+3 */ + add_le128(%zmm1, %zmm0, %zmm11, %zmm13); /* +4:+5:+6:+7 */ + add_le128(%zmm2, %zmm0, %zmm12, %zmm13); /* +8:... */ + vpshufb %zmm15, %zmm0, %zmm0; /* le => be */ + add_le128(%zmm3, %zmm1, %zmm12, %zmm13); /* +12:... */ + vpshufb %zmm15, %zmm1, %zmm1; /* le => be */ + vpshufb %zmm15, %zmm2, %zmm2; /* le => be */ + vpshufb %zmm15, %zmm3, %zmm3; /* le => be */ + + jmp .Lctr_enc_blk16_rounds; + +.align 16 +.Lctr_enc_tail: + xorl %r10d, %r10d; + xorl %r11d, %r11d; + + /* Clear used AVX512 registers. */ + vpxord %ymm20, %ymm20, %ymm20; + vpxord %ymm21, %ymm21, %ymm21; + vpxord %ymm22, %ymm22, %ymm22; + vpxord %ymm23, %ymm23, %ymm23; + vpxord %ymm30, %ymm30, %ymm30; + vpxord %ymm31, %ymm31, %ymm31; + kxorq %k1, %k1, %k1; + vzeroall; + +.align 16 +.Lctr_enc_skip_avx512: + /* Handle trailing blocks with AVX2 implementation. 
*/ + cmpq $0, %r8; + ja _gcry_vaes_avx2_ctr_enc_amd64; + + ret_spec_stop + CFI_ENDPROC(); +ELF(.size _gcry_vaes_avx512_ctr_enc_amd64,.-_gcry_vaes_avx512_ctr_enc_amd64) + +/********************************************************************** + Little-endian 32-bit CTR-mode encryption (GCM-SIV) + **********************************************************************/ +ELF(.type _gcry_vaes_avx512_ctr32le_enc_amd64,@function) +.globl _gcry_vaes_avx512_ctr32le_enc_amd64 +.align 16 +_gcry_vaes_avx512_ctr32le_enc_amd64: + /* input: + * %rdi: round keys + * %rsi: counter + * %rdx: dst + * %rcx: src + * %r8: nblocks + * %r9: nrounds + */ + CFI_STARTPROC(); + + cmpq $16, %r8; + jb .Lctr32le_enc_skip_avx512; + + spec_stop_avx512; + + /* Load counter. */ + vbroadcasti32x4 (%rsi), %zmm15; + + vpmovzxbq .Lcounter0_1_2_3_lo_bq rRIP, %zmm20; + vpmovzxbq .Lcounter4_5_6_7_lo_bq rRIP, %zmm21; + vpmovzxbq .Lcounter8_9_10_11_lo_bq rRIP, %zmm22; + vpmovzxbq .Lcounter12_13_14_15_lo_bq rRIP, %zmm23; + + /* Load first and last key. */ + leal (, %r9d, 4), %eax; + vbroadcasti32x4 (%rdi), %zmm30; + vbroadcasti32x4 (%rdi, %rax, 4), %zmm31; + + cmpq $32, %r8; + jb .Lctr32le_enc_blk16; + + vpmovzxbq .Lcounter16_17_18_19_lo_bq rRIP, %zmm24; + vpmovzxbq .Lcounter20_21_22_23_lo_bq rRIP, %zmm25; + vpmovzxbq .Lcounter24_25_26_27_lo_bq rRIP, %zmm26; + vpmovzxbq .Lcounter28_29_30_31_lo_bq rRIP, %zmm27; + + /* Process 32 blocks per loop. */ +.align 16 +.Lctr32le_enc_blk32: + leaq -32(%r8), %r8; + + /* Increment counters. 
*/ + vpmovzxbq .Lcounter32_32_32_32_lo_bq rRIP, %zmm9; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm8; + vpaddd %zmm20, %zmm15, %zmm0; + vpaddd %zmm21, %zmm15, %zmm1; + vpaddd %zmm22, %zmm15, %zmm2; + vpaddd %zmm23, %zmm15, %zmm3; + vpaddd %zmm24, %zmm15, %zmm4; + vpaddd %zmm25, %zmm15, %zmm5; + vpaddd %zmm26, %zmm15, %zmm6; + vpaddd %zmm27, %zmm15, %zmm7; + + vpaddd %zmm9, %zmm15, %zmm15; + + /* AES rounds */ + XOR8(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Lctr32le_enc_blk32_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Lctr32le_enc_blk32_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, 
%zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + + /* Last round and output handling. */ + .align 16 + .Lctr32le_enc_blk32_last: + vpxord (0 * 16)(%rcx), %zmm31, %zmm9; /* Xor src to last round key. */ + vpxord (4 * 16)(%rcx), %zmm31, %zmm10; + vpxord (8 * 16)(%rcx), %zmm31, %zmm11; + vpxord (12 * 16)(%rcx), %zmm31, %zmm12; + vpxord (16 * 16)(%rcx), %zmm31, %zmm13; + vpxord (20 * 16)(%rcx), %zmm31, %zmm14; + vpxord (24 * 16)(%rcx), %zmm31, %zmm16; + vpxord (28 * 16)(%rcx), %zmm31, %zmm8; + leaq (32 * 16)(%rcx), %rcx; + vaesenclast %zmm9, %zmm0, %zmm0; + vaesenclast %zmm10, %zmm1, %zmm1; + vaesenclast %zmm11, %zmm2, %zmm2; + vaesenclast %zmm12, %zmm3, %zmm3; + vaesenclast %zmm13, %zmm4, %zmm4; + vaesenclast %zmm14, %zmm5, %zmm5; + vaesenclast %zmm16, %zmm6, %zmm6; + vaesenclast %zmm8, %zmm7, %zmm7; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + vmovdqu32 %zmm4, (16 * 16)(%rdx); + vmovdqu32 %zmm5, (20 * 16)(%rdx); + vmovdqu32 %zmm6, (24 * 16)(%rdx); + vmovdqu32 %zmm7, (28 * 16)(%rdx); + leaq (32 * 16)(%rdx), %rdx; + + cmpq $32, %r8; + jnb .Lctr32le_enc_blk32; + + /* Clear used AVX512 registers. */ + vpxord %ymm16, %ymm16, %ymm16; + vpxord %ymm24, %ymm24, %ymm24; + vpxord %ymm25, %ymm25, %ymm25; + vpxord %ymm26, %ymm26, %ymm26; + vpxord %ymm27, %ymm27, %ymm27; + + /* Handle trailing 16 blocks. */ +.align 16 +.Lctr32le_enc_blk16: + cmpq $16, %r8; + jb .Lctr32le_enc_tail; + + leaq -16(%r8), %r8; + + /* Increment counters. 
*/ + vpmovzxbq .Lcounter16_16_16_16_lo_bq rRIP, %zmm5; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + vpaddd %zmm20, %zmm15, %zmm0; + vpaddd %zmm21, %zmm15, %zmm1; + vpaddd %zmm22, %zmm15, %zmm2; + vpaddd %zmm23, %zmm15, %zmm3; + + vpaddd %zmm5, %zmm15, %zmm15; + + /* AES rounds */ + XOR4(%zmm30, %zmm0, %zmm1, %zmm2, %zmm3); + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lctr32le_enc_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lctr32le_enc_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. */ + .align 16 + .Lctr32le_enc_blk16_last: + vpxord (0 * 16)(%rcx), %zmm31, %zmm5; /* Xor src to last round key. 
*/ + vpxord (4 * 16)(%rcx), %zmm31, %zmm6; + vpxord (8 * 16)(%rcx), %zmm31, %zmm7; + vpxord (12 * 16)(%rcx), %zmm31, %zmm4; + leaq (16 * 16)(%rcx), %rcx; + vaesenclast %zmm5, %zmm0, %zmm0; + vaesenclast %zmm6, %zmm1, %zmm1; + vaesenclast %zmm7, %zmm2, %zmm2; + vaesenclast %zmm4, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + +.align 16 +.Lctr32le_enc_tail: + /* Store IV. */ + vmovdqu %xmm15, (%rsi); + + /* Clear used AVX512 registers. */ + vpxord %ymm20, %ymm20, %ymm20; + vpxord %ymm21, %ymm21, %ymm21; + vpxord %ymm22, %ymm22, %ymm22; + vpxord %ymm23, %ymm23, %ymm23; + vpxord %ymm30, %ymm30, %ymm30; + vpxord %ymm31, %ymm31, %ymm31; + vzeroall; + +.align 16 +.Lctr32le_enc_skip_avx512: + /* Handle trailing blocks with AVX2 implementation. */ + cmpq $0, %r8; + ja _gcry_vaes_avx2_ctr32le_enc_amd64; + + ret_spec_stop + CFI_ENDPROC(); +ELF(.size _gcry_vaes_avx512_ctr32le_enc_amd64,.-_gcry_vaes_avx512_ctr32le_enc_amd64) + +/********************************************************************** + OCB-mode encryption/decryption/authentication + **********************************************************************/ +ELF(.type _gcry_vaes_avx512_ocb_aligned_crypt_amd64,@function) +.globl _gcry_vaes_avx512_ocb_aligned_crypt_amd64 +.align 16 +_gcry_vaes_avx512_ocb_aligned_crypt_amd64: + /* input: + * %rdi: round keys + * %esi: nblk + * %rdx: dst + * %rcx: src + * %r8: nblocks + * %r9: nrounds + * 0(%rsp): offset + * 8(%rsp): checksum + * 16(%rsp): L-array + * 24(%rsp): decrypt/encrypt/auth + */ + CFI_STARTPROC(); + + cmpq $32, %r8; + jb .Locb_skip_avx512; + + spec_stop_avx512; + + pushq %r12; + CFI_PUSH(%r12); + pushq %r13; + CFI_PUSH(%r13); + pushq %r14; + CFI_PUSH(%r14); + pushq %rbx; + CFI_PUSH(%rbx); + +#define OFFSET_PTR_Q 0+5*8(%rsp) +#define CHECKSUM_PTR_Q 8+5*8(%rsp) +#define L_ARRAY_PTR_L 16+5*8(%rsp) +#define OPER_MODE_L 
24+5*8(%rsp) + + movq OFFSET_PTR_Q, %r13; /* offset ptr. */ + movq L_ARRAY_PTR_L, %r14; /* L-array ptr. */ + movl OPER_MODE_L, %ebx; /* decrypt/encrypt/auth-mode. */ + movq CHECKSUM_PTR_Q, %r12; /* checksum ptr. */ + + leal (, %r9d, 4), %eax; + vmovdqu (%r13), %xmm15; /* Load offset. */ + vmovdqa (0 * 16)(%rdi), %xmm0; /* first key */ + vpxor (%rdi, %rax, 4), %xmm0, %xmm0; /* first key ^ last key */ + vpxor (0 * 16)(%rdi), %xmm15, %xmm15; /* offset ^ first key */ + vshufi32x4 $0, %zmm0, %zmm0, %zmm30; + vpxord %ymm29, %ymm29, %ymm29; + + vshufi32x4 $0, %zmm15, %zmm15, %zmm15; + + /* Prepare L-array optimization. + * Since nblk is aligned to 16, offsets will have following + * construction: + * - block1 = ntz{0} = offset ^ L[0] + * - block2 = ntz{1} = offset ^ L[0] ^ L[1] + * - block3 = ntz{0} = offset ^ L[1] + * - block4 = ntz{2} = offset ^ L[1] ^ L[2] + * => zmm20 + * + * - block5 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[2] + * - block6 = ntz{1} = offset ^ L[0] ^ L[2] + * - block7 = ntz{0} = offset ^ L[2] + * - block8 = ntz{3} = offset ^ L[2] ^ L[3] + * => zmm21 + * + * - block9 = ntz{0} = offset ^ L[0] ^ L[2] ^ L[3] + * - block10 = ntz{1} = offset ^ L[0] ^ L[1] ^ L[2] ^ L[3] + * - block11 = ntz{0} = offset ^ L[1] ^ L[2] ^ L[3] + * - block12 = ntz{2} = offset ^ L[1] ^ L[3] + * => zmm22 + * + * - block13 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[3] + * - block14 = ntz{1} = offset ^ L[0] ^ L[3] + * - block15 = ntz{0} = offset ^ L[3] + * - block16 = ntz{4} = offset ^ L[3] ^ L[4] + * => zmm23 + */ + vmovdqu (0 * 16)(%r14), %xmm0; /* L[0] */ + vmovdqu (1 * 16)(%r14), %xmm1; /* L[1] */ + vmovdqu (2 * 16)(%r14), %xmm2; /* L[2] */ + vmovdqu (3 * 16)(%r14), %xmm3; /* L[3] */ + vmovdqu32 (4 * 16)(%r14), %xmm16; /* L[4] */ + vpxor %xmm0, %xmm1, %xmm4; /* L[0] ^ L[1] */ + vpxor %xmm0, %xmm2, %xmm5; /* L[0] ^ L[2] */ + vpxor %xmm0, %xmm3, %xmm6; /* L[0] ^ L[3] */ + vpxor %xmm1, %xmm2, %xmm7; /* L[1] ^ L[2] */ + vpxor %xmm1, %xmm3, %xmm8; /* L[1] ^ L[3] */ + vpxor %xmm2, %xmm3, %xmm9; /* 
L[2] ^ L[3] */ + vpxord %xmm16, %xmm3, %xmm17; /* L[3] ^ L[4] */ + vpxor %xmm4, %xmm2, %xmm10; /* L[0] ^ L[1] ^ L[2] */ + vpxor %xmm5, %xmm3, %xmm11; /* L[0] ^ L[2] ^ L[3] */ + vpxor %xmm7, %xmm3, %xmm12; /* L[1] ^ L[2] ^ L[3] */ + vpxor %xmm0, %xmm8, %xmm13; /* L[0] ^ L[1] ^ L[3] */ + vpxor %xmm4, %xmm9, %xmm14; /* L[0] ^ L[1] ^ L[2] ^ L[3] */ + vinserti128 $1, %xmm4, %ymm0, %ymm0; + vinserti128 $1, %xmm7, %ymm1, %ymm1; + vinserti32x8 $1, %ymm1, %zmm0, %zmm20; + vinserti128 $1, %xmm5, %ymm10, %ymm10; + vinserti128 $1, %xmm9, %ymm2, %ymm2; + vinserti32x8 $1, %ymm2, %zmm10, %zmm21; + vinserti128 $1, %xmm14, %ymm11, %ymm11; + vinserti128 $1, %xmm8, %ymm12, %ymm12; + vinserti32x8 $1, %ymm12, %zmm11, %zmm22; + vinserti128 $1, %xmm6, %ymm13, %ymm13; + vinserti32x4 $1, %xmm17, %ymm3, %ymm23; + vinserti32x8 $1, %ymm23, %zmm13, %zmm23; + + /* + * - block17 = ntz{0} = offset ^ L[0] ^ L[3] ^ L[4] + * - block18 = ntz{1} = offset ^ L[0] ^ L[1] ^ L[3] ^ L[4] + * - block19 = ntz{0} = offset ^ L[1] ^ L[3] ^ L[4] + * - block20 = ntz{2} = offset ^ L[1] ^ L[2] ^ L[3] ^ L[4] + * => zmm24 + * + * - block21 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[2] ^ L[3] ^ L[4] + * - block22 = ntz{1} = offset ^ L[0] ^ L[2] ^ L[3] ^ L[4] + * - block23 = ntz{0} = offset ^ L[2] ^ L[3] ^ L[4] + * - block24 = ntz{3} = offset ^ L[2] ^ L[4] + * => zmm25 + * + * - block25 = ntz{0} = offset ^ L[0] ^ L[2] ^ L[4] + * - block26 = ntz{1} = offset ^ L[0] ^ L[1] ^ L[2] ^ L[4] + * - block27 = ntz{0} = offset ^ L[1] ^ L[2] ^ L[4] + * - block28 = ntz{2} = offset ^ L[1] ^ L[4] + * => zmm26 + * + * - block29 = ntz{0} = offset ^ L[0] ^ L[1] ^ L[4] + * - block30 = ntz{1} = offset ^ L[0] ^ L[4] + * - block31 = ntz{0} = offset ^ L[4] + * - block32 = 0 (later filled with ntz{x} = offset ^ L[4] ^ L[ntz{x}]) + * => zmm16 + */ + vpxord %xmm16, %xmm0, %xmm0; /* L[0] ^ L[4] */ + vpxord %xmm16, %xmm1, %xmm1; /* L[1] ^ L[4] */ + vpxord %xmm16, %xmm2, %xmm2; /* L[2] ^ L[4] */ + vpxord %xmm16, %xmm4, %xmm4; /* L[0] ^ L[1] ^ L[4] */ + 
vpxord %xmm16, %xmm5, %xmm5; /* L[0] ^ L[2] ^ L[4] */ + vpxord %xmm16, %xmm6, %xmm6; /* L[0] ^ L[3] ^ L[4] */ + vpxord %xmm16, %xmm7, %xmm7; /* L[1] ^ L[2] ^ L[4] */ + vpxord %xmm16, %xmm8, %xmm8; /* L[1] ^ L[3] ^ L[4] */ + vpxord %xmm16, %xmm9, %xmm9; /* L[2] ^ L[3] ^ L[4] */ + vpxord %xmm16, %xmm10, %xmm10; /* L[0] ^ L[1] ^ L[2] ^ L[4] */ + vpxord %xmm16, %xmm11, %xmm11; /* L[0] ^ L[2] ^ L[3] ^ L[4] */ + vpxord %xmm16, %xmm12, %xmm12; /* L[1] ^ L[2] ^ L[3] ^ L[4] */ + vpxord %xmm16, %xmm13, %xmm13; /* L[0] ^ L[1] ^ L[3] ^ L[4] */ + vpxord %xmm16, %xmm14, %xmm14; /* L[0] ^ L[1] ^ L[2] ^ L[3] ^ L[4] */ + vinserti128 $1, %xmm13, %ymm6, %ymm6; + vinserti32x4 $1, %xmm12, %ymm8, %ymm24; + vinserti32x8 $1, %ymm24, %zmm6, %zmm24; + vinserti128 $1, %xmm11, %ymm14, %ymm14; + vinserti32x4 $1, %xmm2, %ymm9, %ymm25; + vinserti32x8 $1, %ymm25, %zmm14, %zmm25; + vinserti128 $1, %xmm10, %ymm5, %ymm5; + vinserti32x4 $1, %xmm1, %ymm7, %ymm26; + vinserti32x8 $1, %ymm26, %zmm5, %zmm26; + vinserti128 $1, %xmm0, %ymm4, %ymm4; + vinserti32x8 $1, %ymm16, %zmm4, %zmm16; + + /* Aligned: Process 32 blocks per loop. 
*/ +.align 16 +.Locb_aligned_blk32: + cmpq $32, %r8; + jb .Locb_aligned_blk16; + + leaq -32(%r8), %r8; + + leal 32(%esi), %esi; + tzcntl %esi, %eax; + shll $4, %eax; + + vpxord %zmm20, %zmm15, %zmm8; + vpxord %zmm21, %zmm15, %zmm9; + vpxord %zmm22, %zmm15, %zmm10; + vpxord %zmm23, %zmm15, %zmm11; + vpxord %zmm24, %zmm15, %zmm12; + vpxord %zmm25, %zmm15, %zmm27; + vpxord %zmm26, %zmm15, %zmm28; + + vmovdqa (4 * 16)(%r14), %xmm14; + vpxor (%r14, %rax), %xmm14, %xmm14; /* L[4] ^ L[ntz{nblk+32}] */ + vinserti32x4 $3, %xmm14, %zmm16, %zmm14; + + vpxord %zmm14, %zmm15, %zmm14; + + cmpl $1, %ebx; + jb .Locb_aligned_blk32_dec; + ja .Locb_aligned_blk32_auth; + vmovdqu32 (0 * 16)(%rcx), %zmm17; + vpxord %zmm17, %zmm8, %zmm0; + vmovdqu32 (4 * 16)(%rcx), %zmm18; + vpxord %zmm18, %zmm9, %zmm1; + vmovdqu32 (8 * 16)(%rcx), %zmm19; + vpxord %zmm19, %zmm10, %zmm2; + vmovdqu32 (12 * 16)(%rcx), %zmm31; + vpxord %zmm31, %zmm11, %zmm3; + + vpternlogd $0x96, %zmm17, %zmm18, %zmm19; + + vmovdqu32 (16 * 16)(%rcx), %zmm17; + vpxord %zmm17, %zmm12, %zmm4; + vmovdqu32 (20 * 16)(%rcx), %zmm18; + vpxord %zmm18, %zmm27, %zmm5; + + vpternlogd $0x96, %zmm31, %zmm17, %zmm18; + + vmovdqu32 (24 * 16)(%rcx), %zmm31; + vpxord %zmm31, %zmm28, %zmm6; + vmovdqu32 (28 * 16)(%rcx), %zmm17; + vpxord %zmm17, %zmm14, %zmm7; + leaq (32 * 16)(%rcx), %rcx; + + vpternlogd $0x96, %zmm31, %zmm17, %zmm19; + vpternlogd $0x96, %zmm18, %zmm19, %zmm29; + + vbroadcasti32x4 (1 * 16)(%rdi), %zmm13; + + vpxord %zmm8, %zmm30, %zmm8; + vpxord %zmm9, %zmm30, %zmm9; + vpxord %zmm10, %zmm30, %zmm10; + vpxord %zmm11, %zmm30, %zmm11; + vpxord %zmm12, %zmm30, %zmm12; + vpxord %zmm27, %zmm30, %zmm27; + vpxord %zmm28, %zmm30, %zmm28; + vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15; + vpxord %zmm14, %zmm30, %zmm14; + + /* AES rounds */ + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + 
vbroadcasti32x4 (3 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Locb_aligned_blk32_enc_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Locb_aligned_blk32_enc_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + + /* Last round and output handling. 
*/ + .align 16 + .Locb_aligned_blk32_enc_last: + vaesenclast %zmm8, %zmm0, %zmm0; + vaesenclast %zmm9, %zmm1, %zmm1; + vaesenclast %zmm10, %zmm2, %zmm2; + vaesenclast %zmm11, %zmm3, %zmm3; + vaesenclast %zmm12, %zmm4, %zmm4; + vaesenclast %zmm27, %zmm5, %zmm5; + vaesenclast %zmm28, %zmm6, %zmm6; + vaesenclast %zmm14, %zmm7, %zmm7; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + vmovdqu32 %zmm4, (16 * 16)(%rdx); + vmovdqu32 %zmm5, (20 * 16)(%rdx); + vmovdqu32 %zmm6, (24 * 16)(%rdx); + vmovdqu32 %zmm7, (28 * 16)(%rdx); + leaq (32 * 16)(%rdx), %rdx; + + jmp .Locb_aligned_blk32; + + .align 16 + .Locb_aligned_blk32_auth: + vpxord (0 * 16)(%rcx), %zmm8, %zmm0; + vpxord (4 * 16)(%rcx), %zmm9, %zmm1; + vpxord (8 * 16)(%rcx), %zmm10, %zmm2; + vpxord (12 * 16)(%rcx), %zmm11, %zmm3; + vpxord (16 * 16)(%rcx), %zmm12, %zmm4; + vpxord (20 * 16)(%rcx), %zmm27, %zmm5; + vpxord (24 * 16)(%rcx), %zmm28, %zmm6; + vpxord (28 * 16)(%rcx), %zmm14, %zmm7; + leaq (32 * 16)(%rcx), %rcx; + + vbroadcasti32x4 (1 * 16)(%rdi), %zmm13; + + vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15; + + /* AES rounds */ + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 
16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (10 * 16)(%rdi), %zmm13; + cmpl $12, %r9d; + jb .Locb_aligned_blk32_auth_last; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (12 * 16)(%rdi), %zmm13; + jz .Locb_aligned_blk32_auth_last; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm13; + VAESENC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (14 * 16)(%rdi), %zmm13; + + /* Last round and output handling. */ + .align 16 + .Locb_aligned_blk32_auth_last: + vaesenclast %zmm13, %zmm0, %zmm0; + vaesenclast %zmm13, %zmm1, %zmm1; + vaesenclast %zmm13, %zmm2, %zmm2; + vaesenclast %zmm13, %zmm3, %zmm3; + vaesenclast %zmm13, %zmm4, %zmm4; + vaesenclast %zmm13, %zmm5, %zmm5; + vaesenclast %zmm13, %zmm6, %zmm6; + vaesenclast %zmm13, %zmm7, %zmm7; + + vpternlogd $0x96, %zmm0, %zmm1, %zmm2; + vpternlogd $0x96, %zmm3, %zmm4, %zmm5; + vpternlogd $0x96, %zmm6, %zmm7, %zmm29; + vpternlogd $0x96, %zmm2, %zmm5, %zmm29; + + jmp .Locb_aligned_blk32; + + .align 16 + .Locb_aligned_blk32_dec: + vpxord (0 * 16)(%rcx), %zmm8, %zmm0; + vpxord (4 * 16)(%rcx), %zmm9, %zmm1; + vpxord (8 * 16)(%rcx), %zmm10, %zmm2; + vpxord (12 * 16)(%rcx), %zmm11, %zmm3; + vpxord (16 * 16)(%rcx), %zmm12, %zmm4; + vpxord (20 * 16)(%rcx), %zmm27, %zmm5; + vpxord (24 * 16)(%rcx), %zmm28, %zmm6; + vpxord (28 * 16)(%rcx), %zmm14, %zmm7; + leaq (32 * 16)(%rcx), %rcx; + + vbroadcasti32x4 (1 * 16)(%rdi), %zmm13; + + vpxord %zmm8, %zmm30, %zmm8; + vpxord %zmm9, %zmm30, %zmm9; + vpxord %zmm10, %zmm30, %zmm10; + vpxord %zmm11, %zmm30, %zmm11; + vpxord %zmm12, %zmm30, %zmm12; + 
vpxord %zmm27, %zmm30, %zmm27; + vpxord %zmm28, %zmm30, %zmm28; + vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15; + vpxord %zmm14, %zmm30, %zmm14; + + /* AES rounds */ + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Locb_aligned_blk32_dec_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Locb_aligned_blk32_dec_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm13; + VAESDEC8(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + + /* Last round and output handling. 
*/ + .align 16 + .Locb_aligned_blk32_dec_last: + vaesdeclast %zmm8, %zmm0, %zmm0; + vaesdeclast %zmm9, %zmm1, %zmm1; + vaesdeclast %zmm10, %zmm2, %zmm2; + vaesdeclast %zmm11, %zmm3, %zmm3; + vaesdeclast %zmm12, %zmm4, %zmm4; + vaesdeclast %zmm27, %zmm5, %zmm5; + vaesdeclast %zmm28, %zmm6, %zmm6; + vaesdeclast %zmm14, %zmm7, %zmm7; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + vmovdqu32 %zmm4, (16 * 16)(%rdx); + vmovdqu32 %zmm5, (20 * 16)(%rdx); + vmovdqu32 %zmm6, (24 * 16)(%rdx); + vmovdqu32 %zmm7, (28 * 16)(%rdx); + leaq (32 * 16)(%rdx), %rdx; + + vpternlogd $0x96, %zmm0, %zmm1, %zmm2; + vpternlogd $0x96, %zmm3, %zmm4, %zmm5; + vpternlogd $0x96, %zmm6, %zmm7, %zmm29; + vpternlogd $0x96, %zmm2, %zmm5, %zmm29; + + jmp .Locb_aligned_blk32; + + /* Aligned: Process trailing 16 blocks. */ +.align 16 +.Locb_aligned_blk16: + cmpq $16, %r8; + jb .Locb_aligned_done; + + leaq -16(%r8), %r8; + + leal 16(%esi), %esi; + + vpxord %zmm20, %zmm15, %zmm8; + vpxord %zmm21, %zmm15, %zmm9; + vpxord %zmm22, %zmm15, %zmm10; + vpxord %zmm23, %zmm15, %zmm14; + + cmpl $1, %ebx; + jb .Locb_aligned_blk16_dec; + ja .Locb_aligned_blk16_auth; + vmovdqu32 (0 * 16)(%rcx), %zmm17; + vpxord %zmm17, %zmm8, %zmm0; + vmovdqu32 (4 * 16)(%rcx), %zmm18; + vpxord %zmm18, %zmm9, %zmm1; + vmovdqu32 (8 * 16)(%rcx), %zmm19; + vpxord %zmm19, %zmm10, %zmm2; + vmovdqu32 (12 * 16)(%rcx), %zmm31; + vpxord %zmm31, %zmm14, %zmm3; + leaq (16 * 16)(%rcx), %rcx; + + vpternlogd $0x96, %zmm17, %zmm18, %zmm19; + vpternlogd $0x96, %zmm31, %zmm19, %zmm29; + + vbroadcasti32x4 (1 * 16)(%rdi), %zmm13; + + vpxord %zmm8, %zmm30, %zmm8; + vpxord %zmm9, %zmm30, %zmm9; + vpxord %zmm10, %zmm30, %zmm10; + vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15; + vpxord %zmm14, %zmm30, %zmm14; + + /* AES rounds */ + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, 
%zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Locb_aligned_blk16_enc_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Locb_aligned_blk16_enc_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. 
*/ + .align 16 + .Locb_aligned_blk16_enc_last: + vaesenclast %zmm8, %zmm0, %zmm0; + vaesenclast %zmm9, %zmm1, %zmm1; + vaesenclast %zmm10, %zmm2, %zmm2; + vaesenclast %zmm14, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + + jmp .Locb_aligned_done; + + .align 16 + .Locb_aligned_blk16_auth: + vpxord (0 * 16)(%rcx), %zmm8, %zmm0; + vpxord (4 * 16)(%rcx), %zmm9, %zmm1; + vpxord (8 * 16)(%rcx), %zmm10, %zmm2; + vpxord (12 * 16)(%rcx), %zmm14, %zmm3; + leaq (16 * 16)(%rcx), %rcx; + + vbroadcasti32x4 (1 * 16)(%rdi), %zmm13; + + vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15; + + /* AES rounds */ + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (10 * 16)(%rdi), %zmm13; + cmpl $12, %r9d; + jb .Locb_aligned_blk16_auth_last; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, 
%zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (12 * 16)(%rdi), %zmm13; + jz .Locb_aligned_blk16_auth_last; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm13; + VAESENC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (14 * 16)(%rdi), %zmm13; + + /* Last round and output handling. */ + .align 16 + .Locb_aligned_blk16_auth_last: + vaesenclast %zmm13, %zmm0, %zmm0; + vaesenclast %zmm13, %zmm1, %zmm1; + vaesenclast %zmm13, %zmm2, %zmm2; + vaesenclast %zmm13, %zmm3, %zmm3; + + vpternlogd $0x96, %zmm0, %zmm1, %zmm2; + vpternlogd $0x96, %zmm3, %zmm2, %zmm29; + + jmp .Locb_aligned_done; + + .align 16 + .Locb_aligned_blk16_dec: + vpxord (0 * 16)(%rcx), %zmm8, %zmm0; + vpxord (4 * 16)(%rcx), %zmm9, %zmm1; + vpxord (8 * 16)(%rcx), %zmm10, %zmm2; + vpxord (12 * 16)(%rcx), %zmm14, %zmm3; + leaq (16 * 16)(%rcx), %rcx; + + vbroadcasti32x4 (1 * 16)(%rdi), %zmm13; + + vpxord %zmm8, %zmm30, %zmm8; + vpxord %zmm9, %zmm30, %zmm9; + vpxord %zmm10, %zmm30, %zmm10; + vshufi32x4 $0b11111111, %zmm14, %zmm14, %zmm15; + vpxord %zmm14, %zmm30, %zmm14; + + /* AES rounds */ + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb 
.Locb_aligned_blk16_dec_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Locb_aligned_blk16_dec_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm13; + VAESDEC4(%zmm13, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. */ + .align 16 + .Locb_aligned_blk16_dec_last: + vaesdeclast %zmm8, %zmm0, %zmm0; + vaesdeclast %zmm9, %zmm1, %zmm1; + vaesdeclast %zmm10, %zmm2, %zmm2; + vaesdeclast %zmm14, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + + vpternlogd $0x96, %zmm0, %zmm1, %zmm2; + vpternlogd $0x96, %zmm3, %zmm2, %zmm29; + +.align 16 +.Locb_aligned_done: + vpxor (0 * 16)(%rdi), %xmm15, %xmm15; /* offset ^ first key ^ first key */ + + vextracti32x8 $1, %zmm29, %ymm0; + vpxord %ymm29, %ymm0, %ymm0; + vextracti128 $1, %ymm0, %xmm1; + vpternlogd $0x96, (%r12), %xmm1, %xmm0; + vmovdqu %xmm0, (%r12); + + vmovdqu %xmm15, (%r13); /* Store offset. */ + + popq %rbx; + CFI_POP(%rbx); + popq %r14; + CFI_POP(%r14); + popq %r13; + CFI_POP(%r13); + popq %r12; + CFI_POP(%r12); + + /* Clear used AVX512 registers. 
*/ + vpxord %ymm16, %ymm16, %ymm16; + vpxord %ymm17, %ymm17, %ymm17; + vpxord %ymm18, %ymm18, %ymm18; + vpxord %ymm19, %ymm19, %ymm19; + vpxord %ymm20, %ymm20, %ymm20; + vpxord %ymm21, %ymm21, %ymm21; + vpxord %ymm22, %ymm22, %ymm22; + vpxord %ymm23, %ymm23, %ymm23; + vzeroall; + vpxord %ymm24, %ymm24, %ymm24; + vpxord %ymm25, %ymm25, %ymm25; + vpxord %ymm26, %ymm26, %ymm26; + vpxord %ymm27, %ymm27, %ymm27; + vpxord %ymm28, %ymm28, %ymm28; + vpxord %ymm29, %ymm29, %ymm29; + vpxord %ymm30, %ymm30, %ymm30; + vpxord %ymm31, %ymm31, %ymm31; + +.align 16 +.Locb_skip_avx512: + /* Handle trailing blocks with AVX2 implementation. */ + cmpq $0, %r8; + ja _gcry_vaes_avx2_ocb_crypt_amd64; + + xorl %eax, %eax; + ret_spec_stop + +#undef STACK_REGS_POS +#undef STACK_ALLOC + + CFI_ENDPROC(); +ELF(.size _gcry_vaes_avx512_ocb_aligned_crypt_amd64, + .-_gcry_vaes_avx512_ocb_aligned_crypt_amd64) + +/********************************************************************** + XTS-mode encryption + **********************************************************************/ +ELF(.type _gcry_vaes_avx512_xts_crypt_amd64,@function) +.globl _gcry_vaes_avx512_xts_crypt_amd64 +.align 16 +_gcry_vaes_avx512_xts_crypt_amd64: + /* input: + * %rdi: round keys + * %rsi: tweak + * %rdx: dst + * %rcx: src + * %r8: nblocks + * %r9: nrounds + * 8(%rsp): encrypt + */ + CFI_STARTPROC(); + + cmpq $16, %r8; + jb .Lxts_crypt_skip_avx512; + + spec_stop_avx512; + + /* Load first and last key. */ + leal (, %r9d, 4), %eax; + vbroadcasti32x4 (%rdi), %zmm30; + vbroadcasti32x4 (%rdi, %rax, 4), %zmm31; + + movl 8(%rsp), %eax; + + vpmovzxbd .Lxts_gfmul_clmul_bd rRIP, %zmm20; + vbroadcasti32x4 .Lxts_high_bit_shuf rRIP, %zmm21; + +#define tweak_clmul(shift, out, tweak, hi_tweak, gfmul_clmul, tmp1, tmp2) \ + vpsrld $(32-(shift)), hi_tweak, tmp2; \ + vpsllq $(shift), tweak, out; \ + vpclmulqdq $0, gfmul_clmul, tmp2, tmp1; \ + vpunpckhqdq tmp2, tmp1, tmp1; \ + vpxord tmp1, out, out; + + /* Prepare tweak. 
*/ + vmovdqu (%rsi), %xmm15; + vpshufb %xmm21, %xmm15, %xmm13; + tweak_clmul(1, %xmm11, %xmm15, %xmm13, %xmm20, %xmm0, %xmm1); + vinserti128 $1, %xmm11, %ymm15, %ymm15; /* tweak:tweak1 */ + vpshufb %ymm21, %ymm15, %ymm13; + tweak_clmul(2, %ymm11, %ymm15, %ymm13, %ymm20, %ymm0, %ymm1); + vinserti32x8 $1, %ymm11, %zmm15, %zmm15; /* tweak:tweak1:tweak2:tweak3 */ + vpshufb %zmm21, %zmm15, %zmm13; + + cmpq $16, %r8; + jb .Lxts_crypt_done; + + /* Process 16 blocks per loop. */ + leaq -16(%r8), %r8; + + vmovdqa32 %zmm15, %zmm5; + tweak_clmul(4, %zmm6, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1); + tweak_clmul(8, %zmm7, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1); + tweak_clmul(12, %zmm8, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1); + tweak_clmul(16, %zmm15, %zmm15, %zmm13, %zmm20, %zmm0, %zmm1); + vpshufb %zmm21, %zmm15, %zmm13; + + vmovdqu32 (0 * 16)(%rcx), %zmm0; + vmovdqu32 (4 * 16)(%rcx), %zmm1; + vmovdqu32 (8 * 16)(%rcx), %zmm2; + vmovdqu32 (12 * 16)(%rcx), %zmm3; + leaq (16 * 16)(%rcx), %rcx; + vpternlogd $0x96, %zmm30, %zmm5, %zmm0; + vpternlogd $0x96, %zmm30, %zmm6, %zmm1; + vpternlogd $0x96, %zmm30, %zmm7, %zmm2; + vpternlogd $0x96, %zmm30, %zmm8, %zmm3; + +.align 16 +.Lxts_crypt_blk16_loop: + cmpq $16, %r8; + jb .Lxts_crypt_blk16_tail; + leaq -16(%r8), %r8; + + testl %eax, %eax; + jz .Lxts_dec_blk16; + /* AES rounds */ + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vmovdqa32 %zmm15, %zmm9; + tweak_clmul(4, %zmm10, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + tweak_clmul(8, %zmm11, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, 
%zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lxts_enc_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lxts_enc_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. */ + .align 16 + .Lxts_enc_blk16_last: + vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. */ + vpxord %zmm31, %zmm6, %zmm6; + vpxord %zmm31, %zmm7, %zmm7; + vpxord %zmm31, %zmm8, %zmm4; + tweak_clmul(12, %zmm8, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vaesenclast %zmm5, %zmm0, %zmm16; + vaesenclast %zmm6, %zmm1, %zmm17; + vaesenclast %zmm7, %zmm2, %zmm18; + vaesenclast %zmm4, %zmm3, %zmm19; + tweak_clmul(16, %zmm15, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vpshufb %zmm21, %zmm15, %zmm13; + + vmovdqu32 (0 * 16)(%rcx), %zmm0; + vmovdqu32 (4 * 16)(%rcx), %zmm1; + vmovdqu32 (8 * 16)(%rcx), %zmm2; + vmovdqu32 (12 * 16)(%rcx), %zmm3; + leaq (16 * 16)(%rcx), %rcx; + + vmovdqu32 %zmm16, (0 * 16)(%rdx); + vmovdqu32 %zmm17, (4 * 16)(%rdx); + vmovdqu32 %zmm18, (8 * 16)(%rdx); + vmovdqu32 %zmm19, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + + vpternlogd $0x96, %zmm30, %zmm9, %zmm0; + vpternlogd $0x96, %zmm30, %zmm10, %zmm1; + vpternlogd $0x96, %zmm30, %zmm11, %zmm2; + vpternlogd $0x96, %zmm30, %zmm8, %zmm3; + + vmovdqa32 %zmm9, %zmm5; + vmovdqa32 %zmm10, %zmm6; + vmovdqa32 %zmm11, %zmm7; + + jmp .Lxts_crypt_blk16_loop; + + .align 16 + .Lxts_dec_blk16: + /* AES rounds */ + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + 
VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vmovdqa32 %zmm15, %zmm9; + tweak_clmul(4, %zmm10, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + tweak_clmul(8, %zmm11, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lxts_dec_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lxts_dec_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. */ + .align 16 + .Lxts_dec_blk16_last: + vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. 
*/ + vpxord %zmm31, %zmm6, %zmm6; + vpxord %zmm31, %zmm7, %zmm7; + vpxord %zmm31, %zmm8, %zmm4; + tweak_clmul(12, %zmm8, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vaesdeclast %zmm5, %zmm0, %zmm16; + vaesdeclast %zmm6, %zmm1, %zmm17; + vaesdeclast %zmm7, %zmm2, %zmm18; + vaesdeclast %zmm4, %zmm3, %zmm19; + tweak_clmul(16, %zmm15, %zmm15, %zmm13, %zmm20, %zmm12, %zmm14); + vpshufb %zmm21, %zmm15, %zmm13; + + vmovdqu32 (0 * 16)(%rcx), %zmm0; + vmovdqu32 (4 * 16)(%rcx), %zmm1; + vmovdqu32 (8 * 16)(%rcx), %zmm2; + vmovdqu32 (12 * 16)(%rcx), %zmm3; + leaq (16 * 16)(%rcx), %rcx; + + vmovdqu32 %zmm16, (0 * 16)(%rdx); + vmovdqu32 %zmm17, (4 * 16)(%rdx); + vmovdqu32 %zmm18, (8 * 16)(%rdx); + vmovdqu32 %zmm19, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + + vpternlogd $0x96, %zmm30, %zmm9, %zmm0; + vpternlogd $0x96, %zmm30, %zmm10, %zmm1; + vpternlogd $0x96, %zmm30, %zmm11, %zmm2; + vpternlogd $0x96, %zmm30, %zmm8, %zmm3; + + vmovdqa32 %zmm9, %zmm5; + vmovdqa32 %zmm10, %zmm6; + vmovdqa32 %zmm11, %zmm7; + + jmp .Lxts_crypt_blk16_loop; + + .align 16 + .Lxts_crypt_blk16_tail: + testl %eax, %eax; + jz .Lxts_dec_tail_blk16; + /* AES rounds */ + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lxts_enc_blk16_tail_last; + 
vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lxts_enc_blk16_tail_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. */ + .align 16 + .Lxts_enc_blk16_tail_last: + vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. */ + vpxord %zmm31, %zmm6, %zmm6; + vpxord %zmm31, %zmm7, %zmm7; + vpxord %zmm31, %zmm8, %zmm4; + vaesenclast %zmm5, %zmm0, %zmm0; + vaesenclast %zmm6, %zmm1, %zmm1; + vaesenclast %zmm7, %zmm2, %zmm2; + vaesenclast %zmm4, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + + jmp .Lxts_crypt_done; + + .align 16 + .Lxts_dec_tail_blk16: + /* AES rounds */ + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lxts_dec_blk16_tail_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + 
VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lxts_dec_blk16_tail_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + + /* Last round and output handling. */ + .align 16 + .Lxts_dec_blk16_tail_last: + vpxord %zmm31, %zmm5, %zmm5; /* Xor tweak to last round key. */ + vpxord %zmm31, %zmm6, %zmm6; + vpxord %zmm31, %zmm7, %zmm7; + vpxord %zmm31, %zmm8, %zmm4; + vaesdeclast %zmm5, %zmm0, %zmm0; + vaesdeclast %zmm6, %zmm1, %zmm1; + vaesdeclast %zmm7, %zmm2, %zmm2; + vaesdeclast %zmm4, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + +.align 16 +.Lxts_crypt_done: + /* Store IV. */ + vmovdqu %xmm15, (%rsi); + + /* Clear used AVX512 registers. */ + vpxord %ymm16, %ymm16, %ymm16; + vpxord %ymm17, %ymm17, %ymm17; + vpxord %ymm18, %ymm18, %ymm18; + vpxord %ymm19, %ymm19, %ymm19; + vpxord %ymm20, %ymm20, %ymm20; + vpxord %ymm21, %ymm21, %ymm21; + vpxord %ymm30, %ymm30, %ymm30; + vpxord %ymm31, %ymm31, %ymm31; + vzeroall; + +.align 16 +.Lxts_crypt_skip_avx512: + /* Handle trailing blocks with AVX2 implementation. 
*/ + cmpq $0, %r8; + ja _gcry_vaes_avx2_xts_crypt_amd64; + + ret_spec_stop + CFI_ENDPROC(); +ELF(.size _gcry_vaes_avx512_xts_crypt_amd64,.-_gcry_vaes_avx512_xts_crypt_amd64) + +/********************************************************************** + ECB-mode encryption + **********************************************************************/ +ELF(.type _gcry_vaes_avx512_ecb_crypt_amd64,@function) +.globl _gcry_vaes_avx512_ecb_crypt_amd64 +.align 16 +_gcry_vaes_avx512_ecb_crypt_amd64: + /* input: + * %rdi: round keys + * %esi: encrypt + * %rdx: dst + * %rcx: src + * %r8: nblocks + * %r9: nrounds + */ + CFI_STARTPROC(); + + cmpq $16, %r8; + jb .Lecb_crypt_skip_avx512; + + spec_stop_avx512; + + leal (, %r9d, 4), %eax; + vbroadcasti32x4 (%rdi), %zmm14; /* first key */ + vbroadcasti32x4 (%rdi, %rax, 4), %zmm15; /* last key */ + + /* Process 32 blocks per loop. */ +.align 16 +.Lecb_blk32: + cmpq $32, %r8; + jb .Lecb_blk16; + + leaq -32(%r8), %r8; + + /* Load input and xor first key. */ + vpxord (0 * 16)(%rcx), %zmm14, %zmm0; + vpxord (4 * 16)(%rcx), %zmm14, %zmm1; + vpxord (8 * 16)(%rcx), %zmm14, %zmm2; + vpxord (12 * 16)(%rcx), %zmm14, %zmm3; + vpxord (16 * 16)(%rcx), %zmm14, %zmm4; + vpxord (20 * 16)(%rcx), %zmm14, %zmm5; + vpxord (24 * 16)(%rcx), %zmm14, %zmm6; + vpxord (28 * 16)(%rcx), %zmm14, %zmm7; + leaq (32 * 16)(%rcx), %rcx; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm8; + + testl %esi, %esi; + jz .Lecb_dec_blk32; + /* AES rounds */ + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + 
vbroadcasti32x4 (6 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Lecb_enc_blk32_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Lecb_enc_blk32_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm8; + VAESENC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + .align 16 + .Lecb_enc_blk32_last: + vaesenclast %zmm15, %zmm0, %zmm0; + vaesenclast %zmm15, %zmm1, %zmm1; + vaesenclast %zmm15, %zmm2, %zmm2; + vaesenclast %zmm15, %zmm3, %zmm3; + vaesenclast %zmm15, %zmm4, %zmm4; + vaesenclast %zmm15, %zmm5, %zmm5; + vaesenclast %zmm15, %zmm6, %zmm6; + vaesenclast %zmm15, %zmm7, %zmm7; + jmp .Lecb_blk32_end; + + .align 16 + .Lecb_dec_blk32: + /* AES rounds */ + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, 
%zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + cmpl $12, %r9d; + jb .Lecb_dec_blk32_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + jz .Lecb_dec_blk32_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm8; + VAESDEC8(%zmm8, %zmm0, %zmm1, %zmm2, %zmm3, %zmm4, %zmm5, %zmm6, %zmm7); + .align 16 + .Lecb_dec_blk32_last: + vaesdeclast %zmm15, %zmm0, %zmm0; + vaesdeclast %zmm15, %zmm1, %zmm1; + vaesdeclast %zmm15, %zmm2, %zmm2; + vaesdeclast %zmm15, %zmm3, %zmm3; + vaesdeclast %zmm15, %zmm4, %zmm4; + vaesdeclast %zmm15, %zmm5, %zmm5; + vaesdeclast %zmm15, %zmm6, %zmm6; + vaesdeclast %zmm15, %zmm7, %zmm7; + + .align 16 + .Lecb_blk32_end: + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + vmovdqu32 %zmm4, (16 * 16)(%rdx); + vmovdqu32 %zmm5, (20 * 16)(%rdx); + vmovdqu32 %zmm6, (24 * 16)(%rdx); + vmovdqu32 %zmm7, (28 * 16)(%rdx); + leaq (32 * 16)(%rdx), %rdx; + + jmp .Lecb_blk32; + + /* Handle trailing 16 blocks. */ +.align 16 +.Lecb_blk16: + cmpq $16, %r8; + jb .Lecb_crypt_tail; + + leaq -16(%r8), %r8; + + /* Load input and xor first key. 
*/ + vpxord (0 * 16)(%rcx), %zmm14, %zmm0; + vpxord (4 * 16)(%rcx), %zmm14, %zmm1; + vpxord (8 * 16)(%rcx), %zmm14, %zmm2; + vpxord (12 * 16)(%rcx), %zmm14, %zmm3; + leaq (16 * 16)(%rcx), %rcx; + vbroadcasti32x4 (1 * 16)(%rdi), %zmm4; + + testl %esi, %esi; + jz .Lecb_dec_blk16; + /* AES rounds */ + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lecb_enc_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lecb_enc_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESENC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + .align 16 + .Lecb_enc_blk16_last: + vaesenclast %zmm15, %zmm0, %zmm0; + vaesenclast %zmm15, %zmm1, %zmm1; + vaesenclast %zmm15, %zmm2, %zmm2; + vaesenclast %zmm15, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + jmp .Lecb_crypt_tail; + + .align 16 + .Lecb_dec_blk16: + /* AES rounds */ + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (2 * 16)(%rdi), %zmm4; + 
VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (3 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (4 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (5 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (6 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (7 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (8 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (9 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + cmpl $12, %r9d; + jb .Lecb_dec_blk16_last; + vbroadcasti32x4 (10 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (11 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + jz .Lecb_dec_blk16_last; + vbroadcasti32x4 (12 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + vbroadcasti32x4 (13 * 16)(%rdi), %zmm4; + VAESDEC4(%zmm4, %zmm0, %zmm1, %zmm2, %zmm3); + .align 16 + .Lecb_dec_blk16_last: + vaesdeclast %zmm15, %zmm0, %zmm0; + vaesdeclast %zmm15, %zmm1, %zmm1; + vaesdeclast %zmm15, %zmm2, %zmm2; + vaesdeclast %zmm15, %zmm3, %zmm3; + vmovdqu32 %zmm0, (0 * 16)(%rdx); + vmovdqu32 %zmm1, (4 * 16)(%rdx); + vmovdqu32 %zmm2, (8 * 16)(%rdx); + vmovdqu32 %zmm3, (12 * 16)(%rdx); + leaq (16 * 16)(%rdx), %rdx; + +.align 16 +.Lecb_crypt_tail: + /* Clear used AVX512 registers. */ + vzeroall; + +.align 16 +.Lecb_crypt_skip_avx512: + /* Handle trailing blocks with AVX2 implementation. 
*/ + cmpq $0, %r8; + ja _gcry_vaes_avx2_ecb_crypt_amd64; + + ret_spec_stop + CFI_ENDPROC(); +ELF(.size _gcry_vaes_avx512_ecb_crypt_amd64,.-_gcry_vaes_avx512_ecb_crypt_amd64) + +/********************************************************************** + constants + **********************************************************************/ +SECTION_RODATA + +.align 64 +ELF(.type _gcry_vaes_avx512_consts,@object) +_gcry_vaes_avx512_consts: + +.Lbige_addb_0: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 +.Lbige_addb_1: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 +.Lbige_addb_2: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2 +.Lbige_addb_3: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3 +.Lbige_addb_4: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4 +.Lbige_addb_5: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5 +.Lbige_addb_6: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6 +.Lbige_addb_7: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7 +.Lbige_addb_8: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8 +.Lbige_addb_9: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9 +.Lbige_addb_10: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10 +.Lbige_addb_11: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11 +.Lbige_addb_12: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12 +.Lbige_addb_13: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13 +.Lbige_addb_14: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14 +.Lbige_addb_15: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15 +.Lbige_addb_16: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16 +.Lbige_addb_17: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17 +.Lbige_addb_18: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18 +.Lbige_addb_19: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 19 +.Lbige_addb_20: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20 +.Lbige_addb_21: + .byte 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 21 +.Lbige_addb_22: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22 +.Lbige_addb_23: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23 +.Lbige_addb_24: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24 +.Lbige_addb_25: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25 +.Lbige_addb_26: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 26 +.Lbige_addb_27: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 27 +.Lbige_addb_28: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 28 +.Lbige_addb_29: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 29 +.Lbige_addb_30: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30 +.Lbige_addb_31: + .byte 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 31 + +.Lbswap128_mask: + .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 + +.Lxts_high_bit_shuf: + .byte -1, -1, -1, -1, 12, 13, 14, 15, 4, 5, 6, 7, -1, -1, -1, -1 +.Lxts_gfmul_clmul_bd: + .byte 0x00, 0x87, 0x00, 0x00 + .byte 0x00, 0x87, 0x00, 0x00 + .byte 0x00, 0x87, 0x00, 0x00 + .byte 0x00, 0x87, 0x00, 0x00 + +.Lcounter0_1_2_3_lo_bq: + .byte 0, 0, 1, 0, 2, 0, 3, 0 +.Lcounter4_5_6_7_lo_bq: + .byte 4, 0, 5, 0, 6, 0, 7, 0 +.Lcounter8_9_10_11_lo_bq: + .byte 8, 0, 9, 0, 10, 0, 11, 0 +.Lcounter12_13_14_15_lo_bq: + .byte 12, 0, 13, 0, 14, 0, 15, 0 +.Lcounter16_17_18_19_lo_bq: + .byte 16, 0, 17, 0, 18, 0, 19, 0 +.Lcounter20_21_22_23_lo_bq: + .byte 20, 0, 21, 0, 22, 0, 23, 0 +.Lcounter24_25_26_27_lo_bq: + .byte 24, 0, 25, 0, 26, 0, 27, 0 +.Lcounter28_29_30_31_lo_bq: + .byte 28, 0, 29, 0, 30, 0, 31, 0 +.Lcounter4_4_4_4_lo_bq: + .byte 4, 0, 4, 0, 4, 0, 4, 0 +.Lcounter8_8_8_8_lo_bq: + .byte 8, 0, 8, 0, 8, 0, 8, 0 +.Lcounter16_16_16_16_lo_bq: + .byte 16, 0, 16, 0, 16, 0, 16, 0 +.Lcounter32_32_32_32_lo_bq: + .byte 32, 0, 32, 0, 32, 0, 32, 0 +.Lcounter1_1_1_1_hi_bq: + .byte 0, 1, 0, 1, 0, 1, 0, 1 + +ELF(.size _gcry_vaes_avx512_consts,.-_gcry_vaes_avx512_consts) + +#endif /* HAVE_GCC_INLINE_ASM_VAES */ +#endif /* __x86_64__ */ 
diff --git a/cipher/rijndael-vaes.c b/cipher/rijndael-vaes.c index 478904d0..81650e77 100644 --- a/cipher/rijndael-vaes.c +++ b/cipher/rijndael-vaes.c @@ -1,5 +1,5 @@ /* VAES/AVX2 AMD64 accelerated AES for Libgcrypt - * Copyright (C) 2021 Jussi Kivilinna + * Copyright (C) 2021,2026 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -99,6 +99,66 @@ extern void _gcry_vaes_avx2_ecb_crypt_amd64 (const void *keysched, unsigned int nrounds) ASM_FUNC_ABI; +#ifdef USE_VAES_AVX512 +extern void _gcry_vaes_avx512_cbc_dec_amd64 (const void *keysched, + unsigned char *iv, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + unsigned int nrounds) ASM_FUNC_ABI; + +extern void _gcry_vaes_avx512_cfb_dec_amd64 (const void *keysched, + unsigned char *iv, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + unsigned int nrounds) ASM_FUNC_ABI; + +extern void _gcry_vaes_avx512_ctr_enc_amd64 (const void *keysched, + unsigned char *ctr, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + unsigned int nrounds) ASM_FUNC_ABI; + +extern void _gcry_vaes_avx512_ctr32le_enc_amd64 (const void *keysched, + unsigned char *ctr, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + unsigned int nrounds) + ASM_FUNC_ABI; + +extern size_t +_gcry_vaes_avx512_ocb_aligned_crypt_amd64 (const void *keysched, + unsigned int blkn, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + unsigned int nrounds, + unsigned char *offset, + unsigned char *checksum, + unsigned char *L_table, + int enc_dec_auth) ASM_FUNC_ABI; + +extern void _gcry_vaes_avx512_xts_crypt_amd64 (const void *keysched, + unsigned char *tweak, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + unsigned int nrounds, + int encrypt) ASM_FUNC_ABI; + +extern void _gcry_vaes_avx512_ecb_crypt_amd64 (const void *keysched, + int encrypt, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + unsigned int nrounds) + ASM_FUNC_ABI; +#endif + + void 
 _gcry_aes_vaes_ecb_crypt (void *context, void *outbuf,
                           const void *inbuf, size_t nblocks,
@@ -114,6 +174,15 @@ _gcry_aes_vaes_ecb_crypt (void *context, void *outbuf,
       ctx->decryption_prepared = 1;
     }
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512)
+    {
+      _gcry_vaes_avx512_ecb_crypt_amd64 (keysched, encrypt, outbuf, inbuf,
+                                         nblocks, nrounds);
+      return;
+    }
+#endif
+
   _gcry_vaes_avx2_ecb_crypt_amd64 (keysched, encrypt, outbuf, inbuf,
                                    nblocks, nrounds);
 }
@@ -133,6 +202,15 @@ _gcry_aes_vaes_cbc_dec (void *context, unsigned char *iv,
       ctx->decryption_prepared = 1;
     }
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512)
+    {
+      _gcry_vaes_avx512_cbc_dec_amd64 (keysched, iv, outbuf, inbuf,
+                                       nblocks, nrounds);
+      return;
+    }
+#endif
+
   _gcry_vaes_avx2_cbc_dec_amd64 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
 }
 
@@ -145,6 +223,15 @@ _gcry_aes_vaes_cfb_dec (void *context, unsigned char *iv,
   const void *keysched = ctx->keyschenc32;
   unsigned int nrounds = ctx->rounds;
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512)
+    {
+      _gcry_vaes_avx512_cfb_dec_amd64 (keysched, iv, outbuf, inbuf,
+                                       nblocks, nrounds);
+      return;
+    }
+#endif
+
   _gcry_vaes_avx2_cfb_dec_amd64 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
 }
 
@@ -157,6 +244,15 @@ _gcry_aes_vaes_ctr_enc (void *context, unsigned char *iv,
   const void *keysched = ctx->keyschenc32;
   unsigned int nrounds = ctx->rounds;
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512)
+    {
+      _gcry_vaes_avx512_ctr_enc_amd64 (keysched, iv, outbuf, inbuf,
+                                       nblocks, nrounds);
+      return;
+    }
+#endif
+
   _gcry_vaes_avx2_ctr_enc_amd64 (keysched, iv, outbuf, inbuf, nblocks, nrounds);
 }
 
@@ -169,6 +265,15 @@ _gcry_aes_vaes_ctr32le_enc (void *context, unsigned char *iv,
   const void *keysched = ctx->keyschenc32;
   unsigned int nrounds = ctx->rounds;
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512)
+    {
+      _gcry_vaes_avx512_ctr32le_enc_amd64 (keysched, iv, outbuf, inbuf,
+                                           nblocks, nrounds);
+      return;
+    }
+#endif
+
   _gcry_vaes_avx2_ctr32le_enc_amd64 (keysched, iv,
                                      outbuf, inbuf, nblocks, nrounds);
 }
@@ -191,6 +296,40 @@ _gcry_aes_vaes_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
       ctx->decryption_prepared = 1;
     }
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512 && nblocks >= 32)
+    {
+      /* Get number of blocks to align nblk to 32 for L-array optimization. */
+      unsigned int num_to_align = (-blkn) & 31;
+      if (nblocks - num_to_align >= 32)
+        {
+          if (num_to_align)
+            {
+              _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn,
+                                               outbuf, inbuf, num_to_align,
+                                               nrounds, c->u_iv.iv,
+                                               c->u_ctr.ctr, c->u_mode.ocb.L[0],
+                                               encrypt);
+              blkn += num_to_align;
+              outbuf += num_to_align * BLOCKSIZE;
+              inbuf += num_to_align * BLOCKSIZE;
+              nblocks -= num_to_align;
+            }
+
+          c->u_mode.ocb.data_nblocks = blkn + nblocks;
+
+          return _gcry_vaes_avx512_ocb_aligned_crypt_amd64 (keysched,
+                                                            (unsigned int)blkn,
+                                                            outbuf, inbuf,
+                                                            nblocks,
+                                                            nrounds, c->u_iv.iv,
+                                                            c->u_ctr.ctr,
+                                                            c->u_mode.ocb.L[0],
+                                                            encrypt);
+        }
+    }
+#endif
+
   c->u_mode.ocb.data_nblocks = blkn + nblocks;
 
   return _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn, outbuf,
@@ -209,6 +348,36 @@ _gcry_aes_vaes_ocb_auth (gcry_cipher_hd_t c, const void *inbuf_arg,
   unsigned int nrounds = ctx->rounds;
   u64 blkn = c->u_mode.ocb.aad_nblocks;
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512 && nblocks >= 32)
+    {
+      /* Get number of blocks to align nblk to 32 for L-array optimization. */
+      unsigned int num_to_align = (-blkn) & 31;
+      if (nblocks - num_to_align >= 32)
+        {
+          if (num_to_align)
+            {
+              _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn,
+                                               NULL, inbuf, num_to_align,
+                                               nrounds,
+                                               c->u_mode.ocb.aad_offset,
+                                               c->u_mode.ocb.aad_sum,
+                                               c->u_mode.ocb.L[0], 2);
+              blkn += num_to_align;
+              inbuf += num_to_align * BLOCKSIZE;
+              nblocks -= num_to_align;
+            }
+
+          c->u_mode.ocb.aad_nblocks = blkn + nblocks;
+
+          return _gcry_vaes_avx512_ocb_aligned_crypt_amd64 (
+                        keysched, (unsigned int)blkn, NULL, inbuf,
+                        nblocks, nrounds, c->u_mode.ocb.aad_offset,
+                        c->u_mode.ocb.aad_sum, c->u_mode.ocb.L[0], 2);
+        }
+    }
+#endif
+
   c->u_mode.ocb.aad_nblocks = blkn + nblocks;
 
   return _gcry_vaes_avx2_ocb_crypt_amd64 (keysched, (unsigned int)blkn, NULL,
@@ -233,6 +402,15 @@ _gcry_aes_vaes_xts_crypt (void *context, unsigned char *tweak,
       ctx->decryption_prepared = 1;
     }
 
+#ifdef USE_VAES_AVX512
+  if (ctx->use_vaes_avx512)
+    {
+      _gcry_vaes_avx512_xts_crypt_amd64 (keysched, tweak, outbuf, inbuf,
+                                         nblocks, nrounds, encrypt);
+      return;
+    }
+#endif
+
   _gcry_vaes_avx2_xts_crypt_amd64 (keysched, tweak, outbuf, inbuf, nblocks,
                                    nrounds, encrypt);
 }
diff --git a/cipher/rijndael.c b/cipher/rijndael.c
index 910073d2..f3daf35a 100644
--- a/cipher/rijndael.c
+++ b/cipher/rijndael.c
@@ -46,6 +46,7 @@
 #include "g10lib.h"
 #include "cipher.h"
 #include "bufhelp.h"
+#include "hwf-common.h"
 #include "rijndael-internal.h"
 #include "./cipher-internal.h"
 
@@ -726,6 +727,10 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen,
       if ((hwfeatures & HWF_INTEL_VAES_VPCLMUL) &&
           (hwfeatures & HWF_INTEL_AVX2))
         {
+#ifdef USE_VAES_AVX512
+          ctx->use_vaes_avx512 = !!(hwfeatures & HWF_INTEL_AVX512);
+#endif
+
           /* Setup VAES bulk encryption routines.
*/ bulk_ops->cfb_dec = _gcry_aes_vaes_cfb_dec; bulk_ops->cbc_dec = _gcry_aes_vaes_cbc_dec; diff --git a/configure.ac b/configure.ac index da6f1970..319ff539 100644 --- a/configure.ac +++ b/configure.ac @@ -3464,6 +3464,7 @@ if test "$found" = "1" ; then # Build with the VAES/AVX2 implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes.lo" GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes-avx2-amd64.lo" + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vaes-avx512-amd64.lo" ;; arm*-*-*) # Build with the assembly implementation -- 2.51.0 From sam at gentoo.org Wed Jan 14 23:42:19 2026 From: sam at gentoo.org (Sam James) Date: Wed, 14 Jan 2026 22:42:19 +0000 Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure In-Reply-To: References: Message-ID: <87zf6fakac.fsf@gentoo.org> Zachary Fogg via Gcrypt-devel writes: > **In-Reply-To:** > > Hi NIIBE Yutaka, > > Thank you for your response on October 22! I apologize for the delay - I am new to the list and didn't receive your email > until I checked the web archives today. Out of interest.. GnuPG's security policy is at https://gnupg.org/documentation/security.html. Is there a reason you don't seem to have followed that? > [...] -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 418 bytes Desc: not available URL: From zach.fogg at gmail.com Thu Jan 15 03:50:13 2026 From: zach.fogg at gmail.com (Zachary Fogg) Date: Wed, 14 Jan 2026 21:50:13 -0500 Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure In-Reply-To: <87zf6fakac.fsf@gentoo.org> References: <87zf6fakac.fsf@gentoo.org> Message-ID: i've never found a security bug before and am new to the field, just tinkering with my own code only. i just happened to be coding on my project and found the bug and thought i'd tell the developers. 
i didn't think to check for a security policy, i just wanted to confirm it's a bug and get it fixed so my project will work. i'll submit a bug through the security policy, thanks. On Wed, Jan 14, 2026 at 5:42?PM Sam James wrote: > Zachary Fogg via Gcrypt-devel writes: > > > **In-Reply-To:** > > > > Hi NIIBE Yutaka, > > > > Thank you for your response on October 22! I apologize for the delay - I > am new to the list and didn't receive your email > > until I checked the web archives today. > > Out of interest.. > > GnuPG's security policy is at > https://gnupg.org/documentation/security.html. Is there a reason you > don't seem to have followed that? > > > [...] > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sam at gentoo.org Thu Jan 15 04:59:18 2026 From: sam at gentoo.org (Sam James) Date: Thu, 15 Jan 2026 03:59:18 +0000 Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure In-Reply-To: References: <87zf6fakac.fsf@gentoo.org> Message-ID: <87cy3ba5m1.fsf@gentoo.org> Zachary Fogg writes: > i've never found a security bug before and am new to the field, just tinkering with my own code only. i just happened to > be coding on my project and found the bug and thought i'd tell the developers. i didn't think to check for a security > policy, i just wanted to confirm it's a bug and get it fixed so my project will work. i'll submit a bug through the > security policy, thanks. I don't think there's a point in doing it now. I was more curious as to why you didn't, because you called it a security bug in https://github.com/zfogg/ascii-chat/issues/92. You've already made the developers aware (and it's public), so if it is a security issue, there is no benefit to reporting it that way now. I was just noting it for future. > > On Wed, Jan 14, 2026 at 5:42?PM Sam James wrote: > > Zachary Fogg via Gcrypt-devel writes: > > > **In-Reply-To:** > > > > Hi NIIBE Yutaka, > > > > Thank you for your response on October 22! 
I apologize for the delay - I am new to the list and didn't receive your > email > > until I checked the web archives today. > > Out of interest.. > > GnuPG's security policy is at > https://gnupg.org/documentation/security.html. Is there a reason you > don't seem to have followed that? > > > [...] -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 418 bytes Desc: not available URL: From wk at gnupg.org Thu Jan 15 14:44:58 2026 From: wk at gnupg.org (Werner Koch) Date: Thu, 15 Jan 2026 14:44:58 +0100 Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure In-Reply-To: (Zachary Fogg via Gcrypt-devel's message of "Tue, 30 Dec 2025 02:30:00 -0500") References: Message-ID: <87pl7b7zxh.fsf@jacob.g10code.de> Hi! Just a short note on your bug report. You gave a lot of examples and a nicely formated report at https://github.com/zfogg/ascii-chat/issues/92 but I can't read everything of it. On Tue, 30 Dec 2025 02:30, Zachary Fogg said: > **In-Reply-To:** > Your response mentioned using `(flags eddsa)` during key generation, which > is good practice. However, I want to clarify that **my bug report concerns > signature verification, not key generation**. If you look at the way GnuPG uses Libgcrypt will find in gnupg/g10/pkglue.c:pk_verify this: if (openpgp_oid_is_ed25519 (pkey[0])) fmt = "(public-key(ecc(curve %s)(flags eddsa)(q%m)))"; else fmt = "(public-key(ecc(curve %s)(q%m)))"; and this for the data: if (openpgp_oid_is_ed25519 (pkey[0])) fmt = "(data(flags eddsa)(hash-algo sha512)(value %m))"; else fmt = "(data(value %m))"; and more complicated stuff for re-formatting the signature data. It is a bit unfortunate that we need to have these special cases but that's the drawback of a having a stable API and protocol. > 1. Can you confirm this is a genuine bug in libgcrypt's verification logic? No, at least not as I understand it. 
ed25519 signatures are working well and are in active use since GnuPG 2.1 from 2014. > 2. Should I open a formal bug in the dev.gnupg.org tracker? I don't see a bug ;-) > 3. Would a patch fixing the PUBKEY_FLAG_PREHASH handling be acceptable? I do not understand exactly what you propose. A more concise description would be helpful. But note that API stability is a primary goal. BTW on your website your wrote: I've created a working exploit that demonstrates the severity of this bug. The exploit proves that GPG agent creates EdDSA signatures that cannot be verified by standard libgcrypt verification code, even with the correct keys. The term "exploit" is used to describe an attack method which undermines the security of a system. What you describe is a claimed inconsistent API. That may or may not be the case; I don't see a security bug here, though. Salam-Shalom, Werner p.s. I had a brief look at your project: In src/main.c I notice // Set global FPS from command-line option if provided extern int g_max_fps; The declaration of an external variable inside a function is a not a good coding style. Put this at the top of the file or into a header. A few lines above: #ifndef NDEBUG // Initialize lock debugging system after logging is fully set up log_debug("Initializing lock debug system..."); Never ever use NDEBUG. This is an idea of the 70ies. This also disables the assert(3) functionality and if you do this you won't get an assertion failure at all in your production code - either you know the code is correct or you are not sure. Never remove an assert from production code. I have noticed a lot of documentation inside the code - that's good. -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... 
Name: openpgp-digital-signature.asc
Type: application/pgp-signature
Size: 284 bytes
Desc: not available
URL:

From zach.fogg at gmail.com Thu Jan 15 17:20:12 2026
From: zach.fogg at gmail.com (Zachary Fogg)
Date: Thu, 15 Jan 2026 11:20:12 -0500
Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure
In-Reply-To: <87pl7b7zxh.fsf@jacob.g10code.de>
References: <87pl7b7zxh.fsf@jacob.g10code.de>
Message-ID:

thanks for your time. you say you don't see the bug, but did you run my
test programs? they show the bug.

#include <gcrypt.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    gcry_error_t err;
    gcry_sexp_t keypair = NULL, privkey = NULL, pubkey = NULL;
    gcry_sexp_t s_sig = NULL, s_data = NULL;

    gcry_check_version(NULL);
    gcry_control(GCRYCTL_DISABLE_SECMEM, 0);
    gcry_control(GCRYCTL_INITIALIZATION_FINISHED, 0);

    // Generate Ed25519 keypair
    err = gcry_sexp_build(&keypair, NULL, "(genkey (ecc (curve Ed25519)))");
    gcry_pk_genkey(&keypair, keypair);
    privkey = gcry_sexp_find_token(keypair, "private-key", 0);
    pubkey = gcry_sexp_find_token(keypair, "public-key", 0);

    const char *msg = "Test message";
    size_t msg_len = strlen(msg);

    // TEST 1: GPG agent Format 2 (THE BUG)
    printf("Format 2: (data (flags eddsa) (hash-algo sha512) (value %%b))\n");
    err = gcry_sexp_build(&s_data, NULL,
                          "(data (flags eddsa) (hash-algo sha512) (value %b))",
                          msg_len, msg);
    err = gcry_pk_sign(&s_sig, s_data, privkey);
    printf(" Sign: %s\n", err ? "FAILED" : "OK");
    err = gcry_pk_verify(s_sig, s_data, pubkey);
    printf(" Verify: %s\n", err ? "FAILED" : "OK");
    gcry_sexp_release(s_sig);
    gcry_sexp_release(s_data);

    // TEST 2: Simple Format 3 (CONTROL - should work)
    printf("\nFormat 3: (data (value %%b))\n");
    err = gcry_sexp_build(&s_data, NULL, "(data (value %b))", msg_len, msg);
    err = gcry_pk_sign(&s_sig, s_data, privkey);
    printf(" Sign: %s\n", err ? "FAILED" : "OK");
    err = gcry_pk_verify(s_sig, s_data, pubkey);
    printf(" Verify: %s\n", err ? "FAILED" : "OK");

    // Cleanup
    gcry_sexp_release(s_sig);
    gcry_sexp_release(s_data);
    gcry_sexp_release(pubkey);
    gcry_sexp_release(privkey);
    gcry_sexp_release(keypair);
    return 0;
}

the full issue has more example programs and information why integrating
with libsodium fails in this way.
https://github.com/zfogg/ascii-chat/issues/92

i did however find a work around: have gpg do verification through
gpg-agent. so my ascii-chat program works now, but not via the
libsodium+gcrypt verification method that i originally planned.

On Thu, Jan 15, 2026 at 8:41 AM Werner Koch wrote:

> Hi!
>
> Just a short note on your bug report. You gave a lot of examples and a
> nicely formated report at https://github.com/zfogg/ascii-chat/issues/92
> but I can't read everything of it.
>
> On Tue, 30 Dec 2025 02:30, Zachary Fogg said:
> > **In-Reply-To:**
>
> > Your response mentioned using `(flags eddsa)` during key generation, which
> > is good practice. However, I want to clarify that **my bug report concerns
> > signature verification, not key generation**.
>
> If you look at the way GnuPG uses Libgcrypt will find in
> gnupg/g10/pkglue.c:pk_verify this:
>
>   if (openpgp_oid_is_ed25519 (pkey[0]))
>     fmt = "(public-key(ecc(curve %s)(flags eddsa)(q%m)))";
>   else
>     fmt = "(public-key(ecc(curve %s)(q%m)))";
>
> and this for the data:
>
>   if (openpgp_oid_is_ed25519 (pkey[0]))
>     fmt = "(data(flags eddsa)(hash-algo sha512)(value %m))";
>   else
>     fmt = "(data(value %m))";
>
> and more complicated stuff for re-formatting the signature data. It is
> a bit unfortunate that we need to have these special cases but that's
> the drawback of a having a stable API and protocol.
>
> > 1. Can you confirm this is a genuine bug in libgcrypt's verification logic?
>
> No, at least not as I understand it. ed25519 signatures are working
> well and are in active use since GnuPG 2.1 from 2014.
>
> > 2. Should I open a formal bug in the dev.gnupg.org tracker?
>
> I don't see a bug ;-)
>
> > 3.
Would a patch fixing the PUBKEY_FLAG_PREHASH handling be acceptable? > > I do not understand exactly what you propose. A more concise > description would be helpful. But note that API stability is a primary > goal. > > BTW on your website your wrote: > > I've created a working exploit that demonstrates the severity of this > bug. The exploit proves that GPG agent creates EdDSA signatures that > cannot be verified by standard libgcrypt verification code, even with > the correct keys. > > The term "exploit" is used to describe an attack method which undermines > the security of a system. What you describe is a claimed inconsistent > API. That may or may not be the case; I don't see a security bug here, > though. > > > > Salam-Shalom, > > Werner > > > p.s. > I had a brief look at your project: In src/main.c I notice > > // Set global FPS from command-line option if provided > extern int g_max_fps; > > The declaration of an external variable inside a function is a not a > good coding style. Put this at the top of the file or into a header. > A few lines above: > > #ifndef NDEBUG > // Initialize lock debugging system after logging is fully set up > log_debug("Initializing lock debug system..."); > > Never ever use NDEBUG. This is an idea of the 70ies. This also > disables the assert(3) functionality and if you do this you won't get an > assertion failure at all in your production code - either you know the > code is correct or you are not sure. Never remove an assert from > production code. > > I have noticed a lot of documentation inside the code - that's good. > > -- > The pioneers of a warless world are the youth that > refuse military service. - A. Einstein > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gniibe at fsij.org Fri Jan 16 06:45:56 2026 From: gniibe at fsij.org (NIIBE Yutaka) Date: Fri, 16 Jan 2026 14:45:56 +0900 Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure In-Reply-To: References: <87pl7b7zxh.fsf@jacob.g10code.de> Message-ID: <878qdydsa3.fsf@haruna.fsij.org> Hello, Zachary Fogg wrote: > thanks for your time. you say you don't see the bug, but did you run my > test programs? they show the bug. Your program is wrong. If you have time, please read: https://lists.gnupg.org/pipermail/gcrypt-devel/2025-October/005982.html I repeat. [...] > // Generate Ed25519 keypair > err = gcry_sexp_build(&keypair, NULL, "(genkey (ecc (curve Ed25519)))"); Here, the key should have the flags with eddsa (as I wrote last year). So, for the key generation, it should be like: err = gcry_sexp_build(&keypair, NULL, "(genkey (ecc (flags eddsa) (curve Ed25519)))"); Ed25519 in libgcrypt is a bit difficult to use. With its history in libgcrypt, it can be used with ECDSA (!= EdDSA) for some reason, and its semantics are not well defined other than the code itself. I never know the real use cases of ECDSA with the curve Ed25519 except examples within libgcrypt. When using for EdDSA, we need to have the eddsa flag in its key. When using a key with no eddsa flag for EdDSA, the behaviour is undefined. Specifically, when the key is generated with no eddsa flag, the public key is computed for non-EdDSA use case, it won't work well with EdDSA. It could be kind enough if libgcrypt rejected wrong use case, like the modification of below. 
==========================
diff --git a/cipher/ecc.c b/cipher/ecc.c
index 51364b64..74683bf4 100644
--- a/cipher/ecc.c
+++ b/cipher/ecc.c
@@ -1040,6 +1040,12 @@ ecc_sign (gcry_sexp_t *r_sig, gcry_sexp_t s_data, gcry_sexp_t keyparms)
   sig_s = mpi_new (0);
   if ((ctx.flags & PUBKEY_FLAG_EDDSA))
     {
+      if (!(flags & PUBKEY_FLAG_EDDSA) && ec->dialect != ECC_DIALECT_SAFECURVE)
+        {
+          rc = GPG_ERR_INV_FLAG;
+          goto leave;
+        }
+
       /* EdDSA requires the public key. */
       rc = _gcry_ecc_eddsa_sign (data, ec, sig_r, sig_s, &ctx);
       if (!rc)
@@ -1236,6 +1242,12 @@ ecc_verify (gcry_sexp_t s_sig, gcry_sexp_t s_data, gcry_sexp_t s_keyparms)
    */
   if ((sigflags & PUBKEY_FLAG_EDDSA))
     {
+      if (!(flags & PUBKEY_FLAG_EDDSA) && ec->dialect != ECC_DIALECT_SAFECURVE)
+        {
+          rc = GPG_ERR_INV_FLAG;
+          goto leave;
+        }
+
       rc = _gcry_ecc_eddsa_verify (data, ec, sig_r, sig_s, &ctx);
     }
   else if ((sigflags & PUBKEY_FLAG_GOST))
==========================

... but, I don't know if it's worth applying. SEXP is a lax format and the
use of SEXP in libgcrypt is not strict; only specific use cases of flags
and values make sense.
--

From wk at gnupg.org Fri Jan 16 10:07:39 2026
From: wk at gnupg.org (Werner Koch)
Date: Fri, 16 Jan 2026 10:07:39 +0100
Subject: EdDSA Verification Bug - Clarification on Format 2 Verification Failure
In-Reply-To: <878qdydsa3.fsf@haruna.fsij.org> (NIIBE Yutaka via Gcrypt-devel's message of "Fri, 16 Jan 2026 14:45:56 +0900")
References: <87pl7b7zxh.fsf@jacob.g10code.de> <878qdydsa3.fsf@haruna.fsij.org>
Message-ID: <87ms2e6i3o.fsf@jacob.g10code.de>

On Fri, 16 Jan 2026 14:45, NIIBE Yutaka said:

> semantics are not well defined other than the code itself. I never know
> the real use cases of ECDSA with the curve Ed25519 except examples

GNUnet uses Libgcrypt in some special ways. That is the reason why we
have support for E*C*DSA and E*d*DSA.

Salam-Shalom,

   Werner

--
The pioneers of a warless world are the youth that
refuse military service. - A.
Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 284 bytes Desc: not available URL: From gniibe at fsij.org Wed Jan 21 03:29:39 2026 From: gniibe at fsij.org (NIIBE Yutaka) Date: Wed, 21 Jan 2026 11:29:39 +0900 Subject: T7338: Make SHA1 non-FIPS and differentiate in the SLI In-Reply-To: <87ldtlechh.fsf@haruna.fsij.org> References: <8734gjfu2k.fsf@jacob.g10code.de> <875xldeqz4.fsf@jacob.g10code.de> <87ikpdqt7h.fsf@haruna.fsij.org> <87cyfeznz5.fsf@haruna.fsij.org> <87msehks8y.fsf@haruna.fsij.org> <877c5jyi0c.fsf@haruna.fsij.org> <87jz9eh3fm.fsf@haruna.fsij.org> <871pvl2s94.fsf@haruna.fsij.org> <87wmdd1d36.fsf@haruna.fsij.org> <87ldtlechh.fsf@haruna.fsij.org> Message-ID: <871pjjitpo.fsf@haruna.fsij.org> Hello, Replying the change of last year, NIIBE Yutaka (in 2025-03-04) wrote: > From: NIIBE Yutaka > Date: Tue, 4 Mar 2025 10:32:49 +0900 > Subject: [PATCH 6/6] fips,cipher: Do the computation when marking > non-compliant. This change introduced a regression (wrt disabled algo). Attached is a fix for this. Thanks to Pavel Kohout, Aisle Research. -- -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0001-fips-cipher-Fix-the-regression-with-disabled-public-.patch
Type: text/x-diff
Size: 4347 bytes
Desc: not available
URL:

From guidovranken at gmail.com Thu Jan 29 15:14:04 2026
From: guidovranken at gmail.com (Guido Vranken)
Date: Thu, 29 Jan 2026 15:14:04 +0100
Subject: libgcrypt 1.12.0: gcry_mpi_ec_curve_point corrupts point
Message-ID:

#include <gcrypt.h>
#include <stdio.h>

int main(void) {
    gcry_control(GCRYCTL_DISABLE_SECMEM, 0);
    gcry_control(GCRYCTL_INITIALIZATION_FINISHED, 0);

    gcry_ctx_t ctx;
    gcry_mpi_ec_new(&ctx, NULL, "sm2p256v1");

    // Point with small X (1 limb), Z != 1
    gcry_mpi_t X = gcry_mpi_set_ui(NULL, 9);
    gcry_mpi_t Y, Z, scalar;
    gcry_mpi_scan(&Y, GCRYMPI_FMT_HEX,
        "3EE15EF0050F0FD70857D63C72B31A7066E9D02AEECCCE8B00D27AC9AC7A673A",
        0, NULL);
    Z = gcry_mpi_set_ui(NULL, 27);
    scalar = gcry_mpi_set_ui(NULL, 27);

    gcry_mpi_point_t P = gcry_mpi_point_new(0);
    gcry_mpi_point_set(P, X, Y, Z);
    gcry_mpi_point_t R1 = gcry_mpi_point_new(0);
    gcry_mpi_ec_mul(R1, scalar, P, ctx);

    gcry_mpi_point_set(P, X, Y, Z);
    gcry_mpi_ec_curve_point(P, ctx); // BUG: corrupts P
    gcry_mpi_point_t R2 = gcry_mpi_point_new(0);
    gcry_mpi_ec_mul(R2, scalar, P, ctx);

    gcry_mpi_t r1x = gcry_mpi_new(0), r2x = gcry_mpi_new(0);
    gcry_mpi_ec_get_affine(r1x, NULL, R1, ctx);
    gcry_mpi_ec_get_affine(r2x, NULL, R2, ctx);

    unsigned char *buf;
    size_t len;

    printf("Test 1 (clean point): ");
    gcry_mpi_aprint(GCRYMPI_FMT_HEX, &buf, &len, r1x);
    printf("x = %s\n", buf);
    gcry_free(buf);

    printf("Test 2 (after gcry_mpi_ec_curve_point): ");
    gcry_mpi_aprint(GCRYMPI_FMT_HEX, &buf, &len, r2x);
    printf("x = %s\n", buf);
    gcry_free(buf);

    if (gcry_mpi_cmp(r1x, r2x) == 0)
        printf("PASS: results match\n");
    else
        printf("FAIL: results differ\n");

    return gcry_mpi_cmp(r1x, r2x) != 0;
}

Afaik this affects Weierstrass curves with points satisfying Z >= 2 and
limb count of X and/or Y != ctx->p->nlimbs. This is arguably a security
vulnerability.

Guido
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gniibe at fsij.org Fri Jan 30 08:08:59 2026
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Fri, 30 Jan 2026 16:08:59 +0900
Subject: libgcrypt 1.12.0: gcry_mpi_ec_curve_point corrupts point
In-Reply-To:
References:
Message-ID: <87wm0zfugk.fsf@haruna.fsij.org>

Hello,

Thank you for testing.

Guido Vranken wrote:
> Afaik this affects Weierstrass curves with points satisfying Z >= 2 and
> limb count of X and/or Y != ctx->p->nlimbs. This is arguably a security
> vulnerability.

I was not careful enough for the commit:

    92bbe34514ee180c074b882d8459cdf6b873ba0c

It changes the MPI of POINT for _gcry_mpi_ec_get_affine (thus,
gcry_mpi_ec_curve_point).

Here is the change to fix the regression.

==========================
diff --git a/mpi/ec.c b/mpi/ec.c
index d7bad4a6..b0b6f427 100644
--- a/mpi/ec.c
+++ b/mpi/ec.c
@@ -1220,18 +1220,20 @@ _gcry_mpi_ec_get_affine (gcry_mpi_t x, gcry_mpi_t y, mpi_point_t point,
 
         if (x)
           {
-            mpi_resize (point->x, ctx->p->nlimbs);
-            point->x->nlimbs = ctx->p->nlimbs;
-            ec_mulm_lli (x, point->x, z2, ctx);
+            mpi_set (x, point->x);
+            mpi_resize (x, ctx->p->nlimbs);
+            x->nlimbs = ctx->p->nlimbs;
+            ec_mulm_lli (x, x, z2, ctx);
           }
 
         if (y)
           {
-            mpi_resize (point->y, ctx->p->nlimbs);
-            point->y->nlimbs = ctx->p->nlimbs;
+            mpi_set (y, point->y);
+            mpi_resize (y, ctx->p->nlimbs);
+            y->nlimbs = ctx->p->nlimbs;
             z3 = mpi_new (0);
             ec_mulm_lli (z3, z2, z1, ctx); /* z3 = z^(-3) mod p */
-            ec_mulm_lli (y, point->y, z3, ctx);
+            ec_mulm_lli (y, y, z3, ctx);
             mpi_free (z3);
           }
 
--

From wk at gnupg.org Fri Jan 30 11:58:29 2026
From: wk at gnupg.org (Werner Koch)
Date: Fri, 30 Jan 2026 11:58:29 +0100
Subject: libgcrypt 1.12.0: gcry_mpi_ec_curve_point corrupts point
In-Reply-To: <87wm0zfugk.fsf@haruna.fsij.org> (NIIBE Yutaka via Gcrypt-devel's message of "Fri, 30 Jan 2026 16:08:59 +0900")
References: <87wm0zfugk.fsf@haruna.fsij.org>
Message-ID: <87o6mbwene.fsf@jacob.g10code.de>

On Fri, 30 Jan 2026 16:08, NIIBE Yutaka said:

> Here is the change to fix the regression.

Thanks.
Let's wait a few days before publishing 1.12.1 Salam-Shalom, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 284 bytes Desc: not available URL: From guidovranken at gmail.com Sat Jan 31 14:16:38 2026 From: guidovranken at gmail.com (Guido Vranken) Date: Sat, 31 Jan 2026 14:16:38 +0100 Subject: libgcrypt 1.8.12: STRIBOG carry overflow bug Message-ID: Fix is in da6cd4f but was not backported to 1.8. 1.8 is EOL but has "Extended Long Term Support contract available". Guido -------------- next part -------------- An HTML attachment was scrubbed... URL: