[PATCH] serpent: add x86/AVX512 implementation

Sun May 28 16:54:04 CEST 2023

* cipher/Makefile.am: Add `serpent-avx512-x86.c`; Add extra CFLAG
handling for `serpent-avx512-x86.o` and `serpent-avx512-x86.lo`.
* cipher/serpent-avx512-x86.c: New.
* cipher/serpent.c (USE_AVX512): New.
(serpent_context_t): Add `use_avx512`.
[USE_AVX512] (_gcry_serpent_avx512_cbc_dec)
(_gcry_serpent_avx512_cfb_dec, _gcry_serpent_avx512_ctr_enc)
(_gcry_serpent_avx512_ocb_crypt, _gcry_serpent_avx512_blk32): New.
(serpent_setkey_internal) [USE_AVX512]: Set `use_avx512` is
AVX512 HW available.
(_gcry_serpent_ctr_enc) [USE_AVX512]: New.
(_gcry_serpent_cbc_dec) [USE_AVX512]: New.
(_gcry_serpent_cfb_dec) [USE_AVX512]: New.
(_gcry_serpent_ocb_crypt) [USE_AVX512]: New.
(serpent_crypt_blk1_16): Rename to...
(serpent_crypt_blk1_32): ... this; Add AVX512 code-path; Adjust for
increase from max 16 blocks to max 32 blocks.
(serpent_encrypt_blk1_16): Rename to ...
(serpent_encrypt_blk1_32): ... this.
(serpent_decrypt_blk1_16): Rename to ...
(serpent_decrypt_blk1_32): ... this.
(_gcry_serpent_xts_crypt, _gcry_serpent_ecb_crypt): Increase bulk
block count from 16 to 32.
* configure.ac (gcry_cv_cc_x86_avx512_intrinsics)
(ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS): New.
(GCRYPT_ASM_CIPHERS): Add `serpent-avx512-x86.lo`.
--

Benchmark on AMD Ryzen 9 7900X:

Before:
Cipher:
 SERPENT128     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      1.52 ns/B     626.2 MiB/s      8.26 c/B      5425
        ECB dec |      1.48 ns/B     645.5 MiB/s      8.01 c/B      5425
        CBC enc |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
        CBC dec |     0.722 ns/B      1322 MiB/s      3.91 c/B      5425
        CFB enc |      5.88 ns/B     162.3 MiB/s     32.31 c/B      5500
        CFB dec |     0.735 ns/B      1297 MiB/s      3.99 c/B      5424
        OFB enc |      5.77 ns/B     165.3 MiB/s     31.72 c/B      5500
        OFB dec |      5.77 ns/B     165.4 MiB/s     31.72 c/B      5500
        CTR enc |     0.756 ns/B      1262 MiB/s      4.10 c/B      5425
        CTR dec |     0.776 ns/B      1228 MiB/s      4.21 c/B      5424
        XTS enc |      1.68 ns/B     568.3 MiB/s      9.10 c/B      5424
        XTS dec |      1.58 ns/B     604.2 MiB/s      8.56 c/B      5425
        CCM enc |      6.60 ns/B     144.5 MiB/s     36.30 c/B      5500
        CCM dec |      6.60 ns/B     144.5 MiB/s     36.30 c/B      5500
       CCM auth |      5.86 ns/B     162.6 MiB/s     32.25 c/B      5500
        EAX enc |      6.54 ns/B     145.8 MiB/s     35.98 c/B      5500
        EAX dec |      6.54 ns/B     145.8 MiB/s     35.98 c/B      5500
       EAX auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
        GCM enc |     0.787 ns/B      1212 MiB/s      4.27 c/B      5425
        GCM dec |     0.788 ns/B      1211 MiB/s      4.27 c/B      5425
       GCM auth |     0.038 ns/B     24932 MiB/s     0.210 c/B      5500
        OCB enc |     0.750 ns/B      1272 MiB/s      4.07 c/B      5424
        OCB dec |     0.743 ns/B      1284 MiB/s      4.03 c/B      5425
       OCB auth |     0.749 ns/B      1274 MiB/s      4.06 c/B      5425
        SIV enc |      6.54 ns/B     145.8 MiB/s     35.99 c/B      5500
        SIV dec |      6.55 ns/B     145.7 MiB/s     36.01 c/B      5500
       SIV auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
    GCM-SIV enc |      5.63 ns/B     169.4 MiB/s     30.97 c/B      5500
    GCM-SIV dec |      5.64 ns/B     169.2 MiB/s     31.00 c/B      5500
   GCM-SIV auth |     0.038 ns/B     25201 MiB/s     0.208 c/B      5500

After:
 SERPENT128     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.578 ns/B      1649 MiB/s      3.14 c/B      5425
        ECB dec |     0.505 ns/B      1889 MiB/s      2.74 c/B      5424
        CBC enc |      5.81 ns/B     164.1 MiB/s     31.96 c/B      5500
        CBC dec |     0.527 ns/B      1810 MiB/s      2.86 c/B      5424
        CFB enc |      5.88 ns/B     162.3 MiB/s     32.31 c/B      5500
        CFB dec |     0.471 ns/B      2026 MiB/s      2.55 c/B      5425
        OFB enc |      5.77 ns/B     165.3 MiB/s     31.72 c/B      5500
        OFB dec |      5.77 ns/B     165.3 MiB/s     31.73 c/B      5501
        CTR enc |     0.464 ns/B      2053 MiB/s      2.52 c/B      5425
        CTR dec |     0.464 ns/B      2057 MiB/s      2.51 c/B      5425
        XTS enc |     0.551 ns/B      1732 MiB/s      2.99 c/B      5424
        XTS dec |     0.527 ns/B      1809 MiB/s      2.86 c/B      5424
        CCM enc |      6.32 ns/B     150.8 MiB/s     34.78 c/B      5501
        CCM dec |      6.32 ns/B     150.9 MiB/s     34.77 c/B      5500
       CCM auth |      5.86 ns/B     162.6 MiB/s     32.25 c/B      5500
        EAX enc |      6.26 ns/B     152.2 MiB/s     34.46 c/B      5500
        EAX dec |      6.27 ns/B     152.2 MiB/s     34.46 c/B      5500
       EAX auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
        GCM enc |     0.497 ns/B      1917 MiB/s      2.70 c/B      5425
        GCM dec |     0.499 ns/B      1913 MiB/s      2.70 c/B      5425
       GCM auth |     0.031 ns/B     30709 MiB/s     0.171 c/B      5500
        OCB enc |     0.482 ns/B      1979 MiB/s      2.61 c/B      5424
        OCB dec |     0.475 ns/B      2007 MiB/s      2.58 c/B      5424
       OCB auth |     0.748 ns/B      1274 MiB/s      4.06 c/B      5424
        SIV enc |      6.27 ns/B     152.0 MiB/s     34.50 c/B      5500
        SIV dec |      6.27 ns/B     152.1 MiB/s     34.48 c/B      5500
       SIV auth |      5.81 ns/B     164.2 MiB/s     31.94 c/B      5500
    GCM-SIV enc |      5.63 ns/B     169.5 MiB/s     30.95 c/B      5500
    GCM-SIV dec |      5.63 ns/B     169.3 MiB/s     30.98 c/B      5500
   GCM-SIV auth |     0.034 ns/B     28060 MiB/s     0.187 c/B      5500

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/Makefile.am          |  17 +-
 cipher/serpent-avx2-amd64.S |   4 +-
 cipher/serpent-avx512-x86.c | 994 ++++++++++++++++++++++++++++++++++++
 cipher/serpent.c            | 218 +++++++-
 configure.ac                |  45 ++
 5 files changed, 1257 insertions(+), 21 deletions(-)
 create mode 100644 cipher/serpent-avx512-x86.c

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index e67b1ee2..8c7ec095 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -119,12 +119,12 @@ EXTRA_libcipher_la_SOURCES = \
 	salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \
 	scrypt.c \
 	seed.c \
-	serpent.c serpent-sse2-amd64.S \
+	serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S \
+	serpent-avx512-x86.c serpent-armv7-neon.S \
 	sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \
 	sm4-gfni-avx2-amd64.S sm4-gfni-avx512-amd64.S \
 	sm4-aarch64.S sm4-armv8-aarch64-ce.S sm4-armv9-aarch64-sve-ce.S \
 	sm4-ppc.c \
-	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
 	sha1-armv8-aarch64-ce.S sha1-intel-shaext.c \
@@ -316,3 +316,16 @@ sm4-ppc.o: $(srcdir)/sm4-ppc.c Makefile
 
 sm4-ppc.lo: $(srcdir)/sm4-ppc.c Makefile
 	`echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) `
+
+
+if ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS
+avx512f_cflags = -mavx512f
+else
+avx512f_cflags =
+endif
+
+serpent-avx512-x86.o: $(srcdir)/serpent-avx512-x86.c Makefile
+	`echo $(COMPILE) $(avx512f_cflags) -c $< | $(instrumentation_munging) `
+
+serpent-avx512-x86.lo: $(srcdir)/serpent-avx512-x86.c Makefile
+	`echo $(LTCOMPILE) $(avx512f_cflags) -c $< | $(instrumentation_munging) `
diff --git a/cipher/serpent-avx2-amd64.S b/cipher/serpent-avx2-amd64.S
index e25e7d3b..7aba235f 100644
--- a/cipher/serpent-avx2-amd64.S
+++ b/cipher/serpent-avx2-amd64.S
@@ -589,8 +589,8 @@ ELF(.type   _gcry_serpent_avx2_blk16, at function;)
 _gcry_serpent_avx2_blk16:
 	/* input:
 	 *	%rdi: ctx, CTX
-	 *	%rsi: dst (8 blocks)
-	 *	%rdx: src (8 blocks)
+	 *	%rsi: dst (16 blocks)
+	 *	%rdx: src (16 blocks)
 	 *	%ecx: encrypt
 	 */
 	CFI_STARTPROC();
diff --git a/cipher/serpent-avx512-x86.c b/cipher/serpent-avx512-x86.c
new file mode 100644
index 00000000..762c09e1
--- /dev/null
+++ b/cipher/serpent-avx512-x86.c
@@ -0,0 +1,994 @@
+/* serpent-avx512-x86.c  -  AVX512 implementation of Serpent cipher
+ *
+ * Copyright (C) 2023 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+
+#if defined(__x86_64) || defined(__i386)
+#if defined(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS) && \
+    defined(USE_SERPENT) && defined(ENABLE_AVX512_SUPPORT)
+
+#include <immintrin.h>
+#include <string.h>
+#include <stdio.h>
+
+#include "g10lib.h"
+#include "types.h"
+#include "cipher.h"
+#include "bithelp.h"
+#include "bufhelp.h"
+#include "cipher-internal.h"
+#include "bulkhelp.h"
+
+#define ALWAYS_INLINE inline __attribute__((always_inline))
+#define NO_INLINE __attribute__((noinline))
+
+/* Number of rounds per Serpent encrypt/decrypt operation.  */
+#define ROUNDS 32
+
+/* Serpent works on 128 bit blocks.  */
+typedef unsigned int serpent_block_t[4];
+
+/* The key schedule consists of 33 128 bit subkeys.  */
+typedef unsigned int serpent_subkeys_t[ROUNDS + 1][4];
+
+#define vpunpckhdq(a, b, o)  ((o) = _mm512_unpackhi_epi32((b), (a)))
+#define vpunpckldq(a, b, o)  ((o) = _mm512_unpacklo_epi32((b), (a)))
+#define vpunpckhqdq(a, b, o) ((o) = _mm512_unpackhi_epi64((b), (a)))
+#define vpunpcklqdq(a, b, o) ((o) = _mm512_unpacklo_epi64((b), (a)))
+
+#define vpbroadcastd(v) _mm512_set1_epi32(v)
+
+#define vrol(x, s) _mm512_rol_epi32((x), (s))
+#define vror(x, s) _mm512_ror_epi32((x), (s))
+#define vshl(x, s) _mm512_slli_epi32((x), (s))
+
+/* 4x4 32-bit integer matrix transpose */
+#define transpose_4x4(x0, x1, x2, x3, t1, t2, t3) \
+	vpunpckhdq(x1, x0, t2); \
+	vpunpckldq(x1, x0, x0); \
+	\
+	vpunpckldq(x3, x2, t1); \
+	vpunpckhdq(x3, x2, x2); \
+	\
+	vpunpckhqdq(t1, x0, x1); \
+	vpunpcklqdq(t1, x0, x0); \
+	\
+	vpunpckhqdq(x2, t2, x3); \
+	vpunpcklqdq(x2, t2, x2);
+
+/*
+ * These are the S-Boxes of Serpent from following research paper.
+ *
+ *  D. A. Osvik, “Speeding up Serpent,” in Third AES Candidate Conference,
+ *   (New York, New York, USA), p. 317–329, National Institute of Standards and
+ *   Technology, 2000.
+ *
+ * Paper is also available at: http://www.ii.uib.no/~osvik/pub/aes3.pdf
+ *
+ * --
+ *
+ * Following logic gets heavily optimized by compiler to use AVX512F
+ * 'vpternlogq' instruction. This gives higher performance increase than
+ * would be expected from simple wideing of vectors from AVX2/256bit to
+ * AVX512/512bit.
+ *
+ */
+
+#define SBOX0(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r3 ^= r0; r4 =  r1; \
+    r1 &= r3; r4 ^= r2; \
+    r1 ^= r0; r0 |= r3; \
+    r0 ^= r4; r4 ^= r3; \
+    r3 ^= r2; r2 |= r1; \
+    r2 ^= r4; r4 = ~r4; \
+    r4 |= r1; r1 ^= r3; \
+    r1 ^= r4; r3 |= r0; \
+    r1 ^= r3; r4 ^= r3; \
+    \
+    w = r1; x = r4; y = r2; z = r0; \
+  }
+
+#define SBOX0_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r2 = ~r2; r4 =  r1; \
+    r1 |= r0; r4 = ~r4; \
+    r1 ^= r2; r2 |= r4; \
+    r1 ^= r3; r0 ^= r4; \
+    r2 ^= r0; r0 &= r3; \
+    r4 ^= r0; r0 |= r1; \
+    r0 ^= r2; r3 ^= r4; \
+    r2 ^= r1; r3 ^= r0; \
+    r3 ^= r1; \
+    r2 &= r3; \
+    r4 ^= r2; \
+    \
+    w = r0; x = r4; y = r1; z = r3; \
+  }
+
+#define SBOX1(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r0 = ~r0; r2 = ~r2; \
+    r4 =  r0; r0 &= r1; \
+    r2 ^= r0; r0 |= r3; \
+    r3 ^= r2; r1 ^= r0; \
+    r0 ^= r4; r4 |= r1; \
+    r1 ^= r3; r2 |= r0; \
+    r2 &= r4; r0 ^= r1; \
+    r1 &= r2; \
+    r1 ^= r0; r0 &= r2; \
+    r0 ^= r4; \
+    \
+    w = r2; x = r0; y = r3; z = r1; \
+  }
+
+#define SBOX1_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r1; r1 ^= r3; \
+    r3 &= r1; r4 ^= r2; \
+    r3 ^= r0; r0 |= r1; \
+    r2 ^= r3; r0 ^= r4; \
+    r0 |= r2; r1 ^= r3; \
+    r0 ^= r1; r1 |= r3; \
+    r1 ^= r0; r4 = ~r4; \
+    r4 ^= r1; r1 |= r0; \
+    r1 ^= r0; \
+    r1 |= r4; \
+    r3 ^= r1; \
+    \
+    w = r4; x = r0; y = r3; z = r2; \
+  }
+
+#define SBOX2(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r0; r0 &= r2; \
+    r0 ^= r3; r2 ^= r1; \
+    r2 ^= r0; r3 |= r4; \
+    r3 ^= r1; r4 ^= r2; \
+    r1 =  r3; r3 |= r4; \
+    r3 ^= r0; r0 &= r1; \
+    r4 ^= r0; r1 ^= r3; \
+    r1 ^= r4; r4 = ~r4; \
+    \
+    w = r2; x = r3; y = r1; z = r4; \
+  }
+
+#define SBOX2_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r2 ^= r3; r3 ^= r0; \
+    r4 =  r3; r3 &= r2; \
+    r3 ^= r1; r1 |= r2; \
+    r1 ^= r4; r4 &= r3; \
+    r2 ^= r3; r4 &= r0; \
+    r4 ^= r2; r2 &= r1; \
+    r2 |= r0; r3 = ~r3; \
+    r2 ^= r3; r0 ^= r3; \
+    r0 &= r1; r3 ^= r4; \
+    r3 ^= r0; \
+    \
+    w = r1; x = r4; y = r2; z = r3; \
+  }
+
+#define SBOX3(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r0; r0 |= r3; \
+    r3 ^= r1; r1 &= r4; \
+    r4 ^= r2; r2 ^= r3; \
+    r3 &= r0; r4 |= r1; \
+    r3 ^= r4; r0 ^= r1; \
+    r4 &= r0; r1 ^= r3; \
+    r4 ^= r2; r1 |= r0; \
+    r1 ^= r2; r0 ^= r3; \
+    r2 =  r1; r1 |= r3; \
+    r1 ^= r0; \
+    \
+    w = r1; x = r2; y = r3; z = r4; \
+  }
+
+#define SBOX3_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r2; r2 ^= r1; \
+    r0 ^= r2; r4 &= r2; \
+    r4 ^= r0; r0 &= r1; \
+    r1 ^= r3; r3 |= r4; \
+    r2 ^= r3; r0 ^= r3; \
+    r1 ^= r4; r3 &= r2; \
+    r3 ^= r1; r1 ^= r0; \
+    r1 |= r2; r0 ^= r3; \
+    r1 ^= r4; \
+    r0 ^= r1; \
+    \
+    w = r2; x = r1; y = r3; z = r0; \
+  }
+
+#define SBOX4(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r1 ^= r3; r3 = ~r3; \
+    r2 ^= r3; r3 ^= r0; \
+    r4 =  r1; r1 &= r3; \
+    r1 ^= r2; r4 ^= r3; \
+    r0 ^= r4; r2 &= r4; \
+    r2 ^= r0; r0 &= r1; \
+    r3 ^= r0; r4 |= r1; \
+    r4 ^= r0; r0 |= r3; \
+    r0 ^= r2; r2 &= r3; \
+    r0 = ~r0; r4 ^= r2; \
+    \
+    w = r1; x = r4; y = r0; z = r3; \
+  }
+
+#define SBOX4_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r2; r2 &= r3; \
+    r2 ^= r1; r1 |= r3; \
+    r1 &= r0; r4 ^= r2; \
+    r4 ^= r1; r1 &= r2; \
+    r0 = ~r0; r3 ^= r4; \
+    r1 ^= r3; r3 &= r0; \
+    r3 ^= r2; r0 ^= r1; \
+    r2 &= r0; r3 ^= r0; \
+    r2 ^= r4; \
+    r2 |= r3; r3 ^= r0; \
+    r2 ^= r1; \
+    \
+    w = r0; x = r3; y = r2; z = r4; \
+  }
+
+#define SBOX5(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r0 ^= r1; r1 ^= r3; \
+    r3 = ~r3; r4 =  r1; \
+    r1 &= r0; r2 ^= r3; \
+    r1 ^= r2; r2 |= r4; \
+    r4 ^= r3; r3 &= r1; \
+    r3 ^= r0; r4 ^= r1; \
+    r4 ^= r2; r2 ^= r0; \
+    r0 &= r3; r2 = ~r2; \
+    r0 ^= r4; r4 |= r3; \
+    r2 ^= r4; \
+    \
+    w = r1; x = r3; y = r0; z = r2; \
+  }
+
+#define SBOX5_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r1 = ~r1; r4 =  r3; \
+    r2 ^= r1; r3 |= r0; \
+    r3 ^= r2; r2 |= r1; \
+    r2 &= r0; r4 ^= r3; \
+    r2 ^= r4; r4 |= r0; \
+    r4 ^= r1; r1 &= r2; \
+    r1 ^= r3; r4 ^= r2; \
+    r3 &= r4; r4 ^= r1; \
+    r3 ^= r4; r4 = ~r4; \
+    r3 ^= r0; \
+    \
+    w = r1; x = r4; y = r3; z = r2; \
+  }
+
+#define SBOX6(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r2 = ~r2; r4 =  r3; \
+    r3 &= r0; r0 ^= r4; \
+    r3 ^= r2; r2 |= r4; \
+    r1 ^= r3; r2 ^= r0; \
+    r0 |= r1; r2 ^= r1; \
+    r4 ^= r0; r0 |= r3; \
+    r0 ^= r2; r4 ^= r3; \
+    r4 ^= r0; r3 = ~r3; \
+    r2 &= r4; \
+    r2 ^= r3; \
+    \
+    w = r0; x = r1; y = r4; z = r2; \
+  }
+
+#define SBOX6_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r0 ^= r2; r4 =  r2; \
+    r2 &= r0; r4 ^= r3; \
+    r2 = ~r2; r3 ^= r1; \
+    r2 ^= r3; r4 |= r0; \
+    r0 ^= r2; r3 ^= r4; \
+    r4 ^= r1; r1 &= r3; \
+    r1 ^= r0; r0 ^= r3; \
+    r0 |= r2; r3 ^= r1; \
+    r4 ^= r0; \
+    \
+    w = r1; x = r2; y = r4; z = r3; \
+  }
+
+#define SBOX7(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r1; r1 |= r2; \
+    r1 ^= r3; r4 ^= r2; \
+    r2 ^= r1; r3 |= r4; \
+    r3 &= r0; r4 ^= r2; \
+    r3 ^= r1; r1 |= r4; \
+    r1 ^= r0; r0 |= r4; \
+    r0 ^= r2; r1 ^= r4; \
+    r2 ^= r1; r1 &= r0; \
+    r1 ^= r4; r2 = ~r2; \
+    r2 |= r0; \
+    r4 ^= r2; \
+    \
+    w = r4; x = r3; y = r1; z = r0; \
+  }
+
+#define SBOX7_INVERSE(r0, r1, r2, r3, w, x, y, z) \
+  { \
+    __m512i r4; \
+    \
+    r4 =  r2; r2 ^= r0; \
+    r0 &= r3; r4 |= r3; \
+    r2 = ~r2; r3 ^= r1; \
+    r1 |= r0; r0 ^= r2; \
+    r2 &= r4; r3 &= r4; \
+    r1 ^= r2; r2 ^= r0; \
+    r0 |= r2; r4 ^= r1; \
+    r0 ^= r3; r3 ^= r4; \
+    r4 |= r0; r3 ^= r2; \
+    r4 ^= r2; \
+    \
+    w = r3; x = r0; y = r1; z = r4; \
+  }
+
+/* XOR BLOCK1 into BLOCK0.  */
+#define BLOCK_XOR_KEY(block0, rkey)     \
+  {                                     \
+    block0[0] ^= vpbroadcastd(rkey[0]); \
+    block0[1] ^= vpbroadcastd(rkey[1]); \
+    block0[2] ^= vpbroadcastd(rkey[2]); \
+    block0[3] ^= vpbroadcastd(rkey[3]); \
+  }
+
+/* Copy BLOCK_SRC to BLOCK_DST.  */
+#define BLOCK_COPY(block_dst, block_src) \
+  {                                      \
+    block_dst[0] = block_src[0];         \
+    block_dst[1] = block_src[1];         \
+    block_dst[2] = block_src[2];         \
+    block_dst[3] = block_src[3];         \
+  }
+
+/* Apply SBOX number WHICH to to the block found in ARRAY0, writing
+   the output to the block found in ARRAY1.  */
+#define SBOX(which, array0, array1)                         \
+  SBOX##which (array0[0], array0[1], array0[2], array0[3],  \
+               array1[0], array1[1], array1[2], array1[3]);
+
+/* Apply inverse SBOX number WHICH to to the block found in ARRAY0, writing
+   the output to the block found in ARRAY1.  */
+#define SBOX_INVERSE(which, array0, array1)                           \
+  SBOX##which##_INVERSE (array0[0], array0[1], array0[2], array0[3],  \
+                         array1[0], array1[1], array1[2], array1[3]);
+
+/* Apply the linear transformation to BLOCK.  */
+#define LINEAR_TRANSFORMATION(block)                    \
+  {                                                     \
+    block[0] = vrol (block[0], 13);                     \
+    block[2] = vrol (block[2], 3);                      \
+    block[1] = block[1] ^ block[0] ^ block[2];          \
+    block[3] = block[3] ^ block[2] ^ vshl(block[0], 3); \
+    block[1] = vrol (block[1], 1);                      \
+    block[3] = vrol (block[3], 7);                      \
+    block[0] = block[0] ^ block[1] ^ block[3];          \
+    block[2] = block[2] ^ block[3] ^ vshl(block[1], 7); \
+    block[0] = vrol (block[0], 5);                      \
+    block[2] = vrol (block[2], 22);                     \
+  }
+
+/* Apply the inverse linear transformation to BLOCK.  */
+#define LINEAR_TRANSFORMATION_INVERSE(block)            \
+  {                                                     \
+    block[2] = vror (block[2], 22);                     \
+    block[0] = vror (block[0] , 5);                     \
+    block[2] = block[2] ^ block[3] ^ vshl(block[1], 7); \
+    block[0] = block[0] ^ block[1] ^ block[3];          \
+    block[3] = vror (block[3], 7);                      \
+    block[1] = vror (block[1], 1);                      \
+    block[3] = block[3] ^ block[2] ^ vshl(block[0], 3); \
+    block[1] = block[1] ^ block[0] ^ block[2];          \
+    block[2] = vror (block[2], 3);                      \
+    block[0] = vror (block[0], 13);                     \
+  }
+
+/* Apply a Serpent round to BLOCK, using the SBOX number WHICH and the
+   subkeys contained in SUBKEYS.  Use BLOCK_TMP as temporary storage.
+   This macro increments `round'.  */
+#define ROUND(which, subkeys, block, block_tmp) \
+  {                                             \
+    BLOCK_XOR_KEY (block, subkeys[round]);      \
+    SBOX (which, block, block_tmp);             \
+    LINEAR_TRANSFORMATION (block_tmp);          \
+    BLOCK_COPY (block, block_tmp);              \
+  }
+
+/* Apply the last Serpent round to BLOCK, using the SBOX number WHICH
+   and the subkeys contained in SUBKEYS.  Use BLOCK_TMP as temporary
+   storage.  The result will be stored in BLOCK_TMP.  This macro
+   increments `round'.  */
+#define ROUND_LAST(which, subkeys, block, block_tmp) \
+  {                                                  \
+    BLOCK_XOR_KEY (block, subkeys[round]);           \
+    SBOX (which, block, block_tmp);                  \
+    BLOCK_XOR_KEY (block_tmp, subkeys[round+1]);     \
+  }
+
+/* Apply an inverse Serpent round to BLOCK, using the SBOX number
+   WHICH and the subkeys contained in SUBKEYS.  Use BLOCK_TMP as
+   temporary storage.  This macro increments `round'.  */
+#define ROUND_INVERSE(which, subkey, block, block_tmp) \
+  {                                                    \
+    LINEAR_TRANSFORMATION_INVERSE (block);             \
+    SBOX_INVERSE (which, block, block_tmp);            \
+    BLOCK_XOR_KEY (block_tmp, subkey[round]);          \
+    BLOCK_COPY (block, block_tmp);                     \
+  }
+
+/* Apply the first Serpent round to BLOCK, using the SBOX number WHICH
+   and the subkeys contained in SUBKEYS.  Use BLOCK_TMP as temporary
+   storage.  The result will be stored in BLOCK_TMP.  This macro
+   increments `round'.  */
+#define ROUND_FIRST_INVERSE(which, subkeys, block, block_tmp) \
+  {                                                           \
+    BLOCK_XOR_KEY (block, subkeys[round]);                    \
+    SBOX_INVERSE (which, block, block_tmp);                   \
+    BLOCK_XOR_KEY (block_tmp, subkeys[round-1]);              \
+  }
+
+static ALWAYS_INLINE void
+serpent_encrypt_internal_avx512 (const serpent_subkeys_t keys,
+				 const __m512i vin[8], __m512i vout[8])
+{
+  __m512i b[4];
+  __m512i c[4];
+  __m512i b_next[4];
+  __m512i c_next[4];
+  int round = 0;
+
+  b_next[0] = vin[0];
+  b_next[1] = vin[1];
+  b_next[2] = vin[2];
+  b_next[3] = vin[3];
+  c_next[0] = vin[4];
+  c_next[1] = vin[5];
+  c_next[2] = vin[6];
+  c_next[3] = vin[7];
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+
+  b[0] = b_next[0];
+  b[1] = b_next[1];
+  b[2] = b_next[2];
+  b[3] = b_next[3];
+  c[0] = c_next[0];
+  c[1] = c_next[1];
+  c[2] = c_next[2];
+  c[3] = c_next[3];
+
+  while (1)
+    {
+      ROUND (0, keys, b, b_next); ROUND (0, keys, c, c_next); round++;
+      ROUND (1, keys, b, b_next); ROUND (1, keys, c, c_next); round++;
+      ROUND (2, keys, b, b_next); ROUND (2, keys, c, c_next); round++;
+      ROUND (3, keys, b, b_next); ROUND (3, keys, c, c_next); round++;
+      ROUND (4, keys, b, b_next); ROUND (4, keys, c, c_next); round++;
+      ROUND (5, keys, b, b_next); ROUND (5, keys, c, c_next); round++;
+      ROUND (6, keys, b, b_next); ROUND (6, keys, c, c_next); round++;
+      if (round >= ROUNDS - 1)
+	break;
+      ROUND (7, keys, b, b_next); ROUND (7, keys, c, c_next); round++;
+    }
+
+  ROUND_LAST (7, keys, b, b_next); ROUND_LAST (7, keys, c, c_next);
+
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+  vout[0] = b_next[0];
+  vout[1] = b_next[1];
+  vout[2] = b_next[2];
+  vout[3] = b_next[3];
+  vout[4] = c_next[0];
+  vout[5] = c_next[1];
+  vout[6] = c_next[2];
+  vout[7] = c_next[3];
+}
+
+static ALWAYS_INLINE void
+serpent_decrypt_internal_avx512 (const serpent_subkeys_t keys,
+				 const __m512i vin[8], __m512i vout[8])
+{
+  __m512i b[4];
+  __m512i c[4];
+  __m512i b_next[4];
+  __m512i c_next[4];
+  int round = ROUNDS;
+
+  b_next[0] = vin[0];
+  b_next[1] = vin[1];
+  b_next[2] = vin[2];
+  b_next[3] = vin[3];
+  c_next[0] = vin[4];
+  c_next[1] = vin[5];
+  c_next[2] = vin[6];
+  c_next[3] = vin[7];
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+
+  ROUND_FIRST_INVERSE (7, keys, b_next, b); ROUND_FIRST_INVERSE (7, keys, c_next, c);
+  round -= 2;
+
+  while (1)
+    {
+      ROUND_INVERSE (6, keys, b, b_next); ROUND_INVERSE (6, keys, c, c_next); round--;
+      ROUND_INVERSE (5, keys, b, b_next); ROUND_INVERSE (5, keys, c, c_next); round--;
+      ROUND_INVERSE (4, keys, b, b_next); ROUND_INVERSE (4, keys, c, c_next); round--;
+      ROUND_INVERSE (3, keys, b, b_next); ROUND_INVERSE (3, keys, c, c_next); round--;
+      ROUND_INVERSE (2, keys, b, b_next); ROUND_INVERSE (2, keys, c, c_next); round--;
+      ROUND_INVERSE (1, keys, b, b_next); ROUND_INVERSE (1, keys, c, c_next); round--;
+      ROUND_INVERSE (0, keys, b, b_next); ROUND_INVERSE (0, keys, c, c_next); round--;
+      if (round <= 0)
+	break;
+      ROUND_INVERSE (7, keys, b, b_next); ROUND_INVERSE (7, keys, c, c_next); round--;
+    }
+
+  transpose_4x4 (b_next[0], b_next[1], b_next[2], b_next[3], b[0], b[1], b[2]);
+  transpose_4x4 (c_next[0], c_next[1], c_next[2], c_next[3], c[0], c[1], c[2]);
+  vout[0] = b_next[0];
+  vout[1] = b_next[1];
+  vout[2] = b_next[2];
+  vout[3] = b_next[3];
+  vout[4] = c_next[0];
+  vout[5] = c_next[1];
+  vout[6] = c_next[2];
+  vout[7] = c_next[3];
+}
+
+enum crypt_mode_e
+{
+  ECB_ENC = 0,
+  ECB_DEC,
+  CBC_DEC,
+  CFB_DEC,
+  CTR_ENC,
+  OCB_ENC,
+  OCB_DEC
+};
+
+static ALWAYS_INLINE void
+ctr_generate(unsigned char *ctr, __m512i vin[8])
+{
+  const unsigned int blocksize = 16;
+  unsigned char ctr_low = ctr[15];
+
+  if (ctr_low + 32 <= 256)
+    {
+      const __m512i add0123 = _mm512_set_epi64(3LL << 56, 0,
+					       2LL << 56, 0,
+					       1LL << 56, 0,
+					       0LL << 56, 0);
+      const __m512i add4444 = _mm512_set_epi64(4LL << 56, 0,
+					       4LL << 56, 0,
+					       4LL << 56, 0,
+					       4LL << 56, 0);
+      const __m512i add4567 = _mm512_add_epi32(add0123, add4444);
+      const __m512i add8888 = _mm512_add_epi32(add4444, add4444);
+
+      // Fast path without carry handling.
+      __m512i vctr =
+	_mm512_broadcast_i32x4(_mm_loadu_si128((const void *)ctr));
+
+      cipher_block_add(ctr, 32, blocksize);
+      vin[0] = _mm512_add_epi32(vctr, add0123);
+      vin[1] = _mm512_add_epi32(vctr, add4567);
+      vin[2] = _mm512_add_epi32(vin[0], add8888);
+      vin[3] = _mm512_add_epi32(vin[1], add8888);
+      vin[4] = _mm512_add_epi32(vin[2], add8888);
+      vin[5] = _mm512_add_epi32(vin[3], add8888);
+      vin[6] = _mm512_add_epi32(vin[4], add8888);
+      vin[7] = _mm512_add_epi32(vin[5], add8888);
+    }
+  else
+    {
+      // Slow path.
+      u32 blocks[4][blocksize / sizeof(u32)];
+
+      cipher_block_cpy(blocks[0], ctr, blocksize);
+      cipher_block_cpy(blocks[1], ctr, blocksize);
+      cipher_block_cpy(blocks[2], ctr, blocksize);
+      cipher_block_cpy(blocks[3], ctr, blocksize);
+      cipher_block_add(ctr, 32, blocksize);
+      cipher_block_add(blocks[1], 1, blocksize);
+      cipher_block_add(blocks[2], 2, blocksize);
+      cipher_block_add(blocks[3], 3, blocksize);
+      vin[0] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[1] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[2] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[3] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[4] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[5] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[6] = _mm512_loadu_epi32 (blocks);
+      cipher_block_add(blocks[0], 4, blocksize);
+      cipher_block_add(blocks[1], 4, blocksize);
+      cipher_block_add(blocks[2], 4, blocksize);
+      cipher_block_add(blocks[3], 4, blocksize);
+      vin[7] = _mm512_loadu_epi32 (blocks);
+
+      wipememory(blocks, sizeof(blocks));
+    }
+}
+
+static ALWAYS_INLINE __m512i
+ocb_input(__m512i *vchecksum, __m128i *voffset, const unsigned char *input,
+	  unsigned char *output, const ocb_L_uintptr_t L[4])
+{
+  __m128i L0 = _mm_loadu_si128((const void *)(uintptr_t)L[0]);
+  __m128i L1 = _mm_loadu_si128((const void *)(uintptr_t)L[1]);
+  __m128i L2 = _mm_loadu_si128((const void *)(uintptr_t)L[2]);
+  __m128i L3 = _mm_loadu_si128((const void *)(uintptr_t)L[3]);
+  __m512i vin = _mm512_loadu_epi32 (input);
+  __m512i voffsets;
+
+  /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+  /* Checksum_i = Checksum_{i-1} xor P_i  */
+  /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+
+  if (vchecksum)
+    *vchecksum ^= _mm512_loadu_epi32 (input);
+
+  *voffset ^= L0;
+  voffsets = _mm512_castsi128_si512(*voffset);
+  *voffset ^= L1;
+  voffsets = _mm512_inserti32x4(voffsets, *voffset, 1);
+  *voffset ^= L2;
+  voffsets = _mm512_inserti32x4(voffsets, *voffset, 2);
+  *voffset ^= L3;
+  voffsets = _mm512_inserti32x4(voffsets, *voffset, 3);
+  _mm512_storeu_epi32 (output, voffsets);
+
+  return vin ^ voffsets;
+}
+
+static NO_INLINE void
+serpent_avx512_blk32(const void *c, unsigned char *output,
+		     const unsigned char *input, int mode,
+		     unsigned char *iv, unsigned char *checksum,
+		     const ocb_L_uintptr_t Ls[32])
+{
+  __m512i vin[8];
+  __m512i vout[8];
+  int encrypt = 1;
+
+  asm volatile ("vpxor %%ymm0, %%ymm0, %%ymm0;\n\t"
+		"vpopcntb %%zmm0, %%zmm6;\n\t" /* spec stop for old AVX512 CPUs */
+		"vpxor %%ymm6, %%ymm6, %%ymm6;\n\t"
+		:
+		: "m"(*input), "m"(*output)
+		: "xmm6", "xmm0", "memory", "cc");
+
+  // Input handling
+  switch (mode)
+    {
+      default:
+      case CBC_DEC:
+      case ECB_DEC:
+	encrypt = 0;
+	/* fall through */
+      case ECB_ENC:
+	vin[0] = _mm512_loadu_epi32 (input + 0 * 64);
+	vin[1] = _mm512_loadu_epi32 (input + 1 * 64);
+	vin[2] = _mm512_loadu_epi32 (input + 2 * 64);
+	vin[3] = _mm512_loadu_epi32 (input + 3 * 64);
+	vin[4] = _mm512_loadu_epi32 (input + 4 * 64);
+	vin[5] = _mm512_loadu_epi32 (input + 5 * 64);
+	vin[6] = _mm512_loadu_epi32 (input + 6 * 64);
+	vin[7] = _mm512_loadu_epi32 (input + 7 * 64);
+	break;
+
+      case CFB_DEC:
+      {
+	__m128i viv = _mm_loadu_si128((const void *)iv);
+	vin[0] = _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0),
+					  input - 1 * 64 + 48)
+		  ^ _mm512_castsi128_si512(viv);
+	vin[1] = _mm512_loadu_epi32(input + 0 * 64 + 48);
+	vin[2] = _mm512_loadu_epi32(input + 1 * 64 + 48);
+	vin[3] = _mm512_loadu_epi32(input + 2 * 64 + 48);
+	vin[4] = _mm512_loadu_epi32(input + 3 * 64 + 48);
+	vin[5] = _mm512_loadu_epi32(input + 4 * 64 + 48);
+	vin[6] = _mm512_loadu_epi32(input + 5 * 64 + 48);
+	vin[7] = _mm512_loadu_epi32(input + 6 * 64 + 48);
+	viv = _mm_loadu_si128((const void *)(input + 7 * 64 + 48));
+	_mm_storeu_si128((void *)iv, viv);
+	break;
+      }
+
+      case CTR_ENC:
+	ctr_generate(iv, vin);
+	break;
+
+      case OCB_ENC:
+      {
+	const ocb_L_uintptr_t *L = Ls;
+	__m512i vchecksum = _mm512_setzero_epi32();
+	__m128i vchecksum128 = _mm_loadu_si128((const void *)checksum);
+	__m128i voffset = _mm_loadu_si128((const void *)iv);
+	vin[0] = ocb_input(&vchecksum, &voffset, input + 0 * 64, output + 0 * 64, L); L += 4;
+	vin[1] = ocb_input(&vchecksum, &voffset, input + 1 * 64, output + 1 * 64, L); L += 4;
+	vin[2] = ocb_input(&vchecksum, &voffset, input + 2 * 64, output + 2 * 64, L); L += 4;
+	vin[3] = ocb_input(&vchecksum, &voffset, input + 3 * 64, output + 3 * 64, L); L += 4;
+	vin[4] = ocb_input(&vchecksum, &voffset, input + 4 * 64, output + 4 * 64, L); L += 4;
+	vin[5] = ocb_input(&vchecksum, &voffset, input + 5 * 64, output + 5 * 64, L); L += 4;
+	vin[6] = ocb_input(&vchecksum, &voffset, input + 6 * 64, output + 6 * 64, L); L += 4;
+	vin[7] = ocb_input(&vchecksum, &voffset, input + 7 * 64, output + 7 * 64, L);
+	vchecksum128 ^= _mm512_extracti32x4_epi32(vchecksum, 0)
+			^ _mm512_extracti32x4_epi32(vchecksum, 1)
+			^ _mm512_extracti32x4_epi32(vchecksum, 2)
+			^ _mm512_extracti32x4_epi32(vchecksum, 3);
+	_mm_storeu_si128((void *)checksum, vchecksum128);
+	_mm_storeu_si128((void *)iv, voffset);
+	break;
+      }
+
+      case OCB_DEC:
+      {
+	const ocb_L_uintptr_t *L = Ls;
+	__m128i voffset = _mm_loadu_si128((const void *)iv);
+	encrypt = 0;
+	vin[0] = ocb_input(NULL, &voffset, input + 0 * 64, output + 0 * 64, L); L += 4;
+	vin[1] = ocb_input(NULL, &voffset, input + 1 * 64, output + 1 * 64, L); L += 4;
+	vin[2] = ocb_input(NULL, &voffset, input + 2 * 64, output + 2 * 64, L); L += 4;
+	vin[3] = ocb_input(NULL, &voffset, input + 3 * 64, output + 3 * 64, L); L += 4;
+	vin[4] = ocb_input(NULL, &voffset, input + 4 * 64, output + 4 * 64, L); L += 4;
+	vin[5] = ocb_input(NULL, &voffset, input + 5 * 64, output + 5 * 64, L); L += 4;
+	vin[6] = ocb_input(NULL, &voffset, input + 6 * 64, output + 6 * 64, L); L += 4;
+	vin[7] = ocb_input(NULL, &voffset, input + 7 * 64, output + 7 * 64, L);
+	_mm_storeu_si128((void *)iv, voffset);
+	break;
+      }
+    }
+
+  if (encrypt)
+    serpent_encrypt_internal_avx512(c, vin, vout);
+  else
+    serpent_decrypt_internal_avx512(c, vin, vout);
+
+  switch (mode)
+    {
+      case CTR_ENC:
+      case CFB_DEC:
+	vout[0] ^= _mm512_loadu_epi32 (input + 0 * 64);
+	vout[1] ^= _mm512_loadu_epi32 (input + 1 * 64);
+	vout[2] ^= _mm512_loadu_epi32 (input + 2 * 64);
+	vout[3] ^= _mm512_loadu_epi32 (input + 3 * 64);
+	vout[4] ^= _mm512_loadu_epi32 (input + 4 * 64);
+	vout[5] ^= _mm512_loadu_epi32 (input + 5 * 64);
+	vout[6] ^= _mm512_loadu_epi32 (input + 6 * 64);
+	vout[7] ^= _mm512_loadu_epi32 (input + 7 * 64);
+	/* fall through */
+      default:
+      case ECB_DEC:
+      case ECB_ENC:
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	break;
+
+      case CBC_DEC:
+      {
+	__m128i viv = _mm_loadu_si128((const void *)iv);
+	vout[0] ^= _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0),
+					    input - 1 * 64 + 48)
+		    ^ _mm512_castsi128_si512(viv);
+	vout[1] ^= _mm512_loadu_epi32(input + 0 * 64 + 48);
+	vout[2] ^= _mm512_loadu_epi32(input + 1 * 64 + 48);
+	vout[3] ^= _mm512_loadu_epi32(input + 2 * 64 + 48);
+	vout[4] ^= _mm512_loadu_epi32(input + 3 * 64 + 48);
+	vout[5] ^= _mm512_loadu_epi32(input + 4 * 64 + 48);
+	vout[6] ^= _mm512_loadu_epi32(input + 5 * 64 + 48);
+	vout[7] ^= _mm512_loadu_epi32(input + 6 * 64 + 48);
+	viv = _mm_loadu_si128((const void *)(input + 7 * 64 + 48));
+	_mm_storeu_si128((void *)iv, viv);
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	break;
+      }
+
+      case OCB_ENC:
+	vout[0] ^= _mm512_loadu_epi32 (output + 0 * 64);
+	vout[1] ^= _mm512_loadu_epi32 (output + 1 * 64);
+	vout[2] ^= _mm512_loadu_epi32 (output + 2 * 64);
+	vout[3] ^= _mm512_loadu_epi32 (output + 3 * 64);
+	vout[4] ^= _mm512_loadu_epi32 (output + 4 * 64);
+	vout[5] ^= _mm512_loadu_epi32 (output + 5 * 64);
+	vout[6] ^= _mm512_loadu_epi32 (output + 6 * 64);
+	vout[7] ^= _mm512_loadu_epi32 (output + 7 * 64);
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	break;
+
+      case OCB_DEC:
+      {
+	__m512i vchecksum = _mm512_setzero_epi32();
+	__m128i vchecksum128 = _mm_loadu_si128((const void *)checksum);
+	vout[0] ^= _mm512_loadu_epi32 (output + 0 * 64);
+	vout[1] ^= _mm512_loadu_epi32 (output + 1 * 64);
+	vout[2] ^= _mm512_loadu_epi32 (output + 2 * 64);
+	vout[3] ^= _mm512_loadu_epi32 (output + 3 * 64);
+	vout[4] ^= _mm512_loadu_epi32 (output + 4 * 64);
+	vout[5] ^= _mm512_loadu_epi32 (output + 5 * 64);
+	vout[6] ^= _mm512_loadu_epi32 (output + 6 * 64);
+	vout[7] ^= _mm512_loadu_epi32 (output + 7 * 64);
+	vchecksum ^= vout[0];
+	vchecksum ^= vout[1];
+	vchecksum ^= vout[2];
+	vchecksum ^= vout[3];
+	vchecksum ^= vout[4];
+	vchecksum ^= vout[5];
+	vchecksum ^= vout[6];
+	vchecksum ^= vout[7];
+	_mm512_storeu_epi32 (output + 0 * 64, vout[0]);
+	_mm512_storeu_epi32 (output + 1 * 64, vout[1]);
+	_mm512_storeu_epi32 (output + 2 * 64, vout[2]);
+	_mm512_storeu_epi32 (output + 3 * 64, vout[3]);
+	_mm512_storeu_epi32 (output + 4 * 64, vout[4]);
+	_mm512_storeu_epi32 (output + 5 * 64, vout[5]);
+	_mm512_storeu_epi32 (output + 6 * 64, vout[6]);
+	_mm512_storeu_epi32 (output + 7 * 64, vout[7]);
+	vchecksum128 ^= _mm512_extracti32x4_epi32(vchecksum, 0)
+			^ _mm512_extracti32x4_epi32(vchecksum, 1)
+			^ _mm512_extracti32x4_epi32(vchecksum, 2)
+			^ _mm512_extracti32x4_epi32(vchecksum, 3);
+	_mm_storeu_si128((void *)checksum, vchecksum128);
+	break;
+      }
+    }
+
+  _mm256_zeroall();
+#ifdef __x86_64__
+  asm volatile (
+#define CLEAR(mm) "vpxord %%" #mm ", %%" #mm ", %%" #mm ";\n\t"
+		CLEAR(ymm16) CLEAR(ymm17) CLEAR(ymm18) CLEAR(ymm19)
+		CLEAR(ymm20) CLEAR(ymm21) CLEAR(ymm22) CLEAR(ymm23)
+		CLEAR(ymm24) CLEAR(ymm25) CLEAR(ymm26) CLEAR(ymm27)
+		CLEAR(ymm28) CLEAR(ymm29) CLEAR(ymm30) CLEAR(ymm31)
+#undef CLEAR
+		:
+		: "m"(*input), "m"(*output)
+		: "xmm16", "xmm17", "xmm18", "xmm19",
+		  "xmm20", "xmm21", "xmm22", "xmm23",
+		  "xmm24", "xmm25", "xmm26", "xmm27",
+		  "xmm28", "xmm29", "xmm30", "xmm31",
+		  "memory", "cc");
+#endif
+}
+
+void
+_gcry_serpent_avx512_blk32(const void *ctx, unsigned char *out,
+			   const unsigned char *in, int encrypt)
+{
+  serpent_avx512_blk32 (ctx, out, in, encrypt ? ECB_ENC : ECB_DEC,
+			NULL, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_cbc_dec(const void *ctx, unsigned char *out,
+			     const unsigned char *in, unsigned char *iv)
+{
+  serpent_avx512_blk32 (ctx, out, in, CBC_DEC, iv, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_cfb_dec(const void *ctx, unsigned char *out,
+			     const unsigned char *in, unsigned char *iv)
+{
+  serpent_avx512_blk32 (ctx, out, in, CFB_DEC, iv, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_ctr_enc(const void *ctx, unsigned char *out,
+			     const unsigned char *in, unsigned char *iv)
+{
+  serpent_avx512_blk32 (ctx, out, in, CTR_ENC, iv, NULL, NULL);
+}
+
+void
+_gcry_serpent_avx512_ocb_crypt(const void *ctx, unsigned char *out,
+			       const unsigned char *in, unsigned char *offset,
+			       unsigned char *checksum,
+			       const ocb_L_uintptr_t Ls[32], int encrypt)
+{
+  serpent_avx512_blk32 (ctx, out, in, encrypt ? OCB_ENC : OCB_DEC, offset,
+			checksum, Ls);
+}
+
+#endif /*defined(USE_SERPENT) && defined(ENABLE_AVX512_SUPPORT)*/
+#endif /*__x86_64 || __i386*/
diff --git a/cipher/serpent.c b/cipher/serpent.c
index 908523c2..2b951aba 100644
--- a/cipher/serpent.c
+++ b/cipher/serpent.c
@@ -32,14 +32,14 @@
 #include "bulkhelp.h"
 
 
-/* USE_SSE2 indicates whether to compile with AMD64 SSE2 code. */
+/* USE_SSE2 indicates whether to compile with x86-64 SSE2 code. */
 #undef USE_SSE2
 #if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
 # define USE_SSE2 1
 #endif
 
-/* USE_AVX2 indicates whether to compile with AMD64 AVX2 code. */
+/* USE_AVX2 indicates whether to compile with x86-64 AVX2 code. */
 #undef USE_AVX2
 #if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
@@ -48,6 +48,15 @@
 # endif
 #endif
 
+/* USE_AVX512 indicates whether to compile with x86 AVX512 code. */
+#undef USE_AVX512
+#if (defined(__x86_64) || defined(__i386)) && \
+    defined(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS)
+# if defined(ENABLE_AVX512_SUPPORT)
+#  define USE_AVX512 1
+# endif
+#endif
+
 /* USE_NEON indicates whether to enable ARM NEON assembly code. */
 #undef USE_NEON
 #ifdef ENABLE_NEON_SUPPORT
@@ -82,6 +91,9 @@ typedef struct serpent_context
 #ifdef USE_AVX2
   int use_avx2;
 #endif
+#ifdef USE_AVX512
+  int use_avx512;
+#endif
 #ifdef USE_NEON
   int use_neon;
 #endif
@@ -186,6 +198,38 @@ extern void _gcry_serpent_avx2_blk16(const serpent_context_t *c, byte *out,
 				     const byte *in, int encrypt) ASM_FUNC_ABI;
 #endif
 
+#ifdef USE_AVX512
+/* Assembler implementations of Serpent using AVX512.  Processing 32 blocks in
+   parallel.
+ */
+extern void _gcry_serpent_avx512_cbc_dec(const void *ctx,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *iv);
+
+extern void _gcry_serpent_avx512_cfb_dec(const void *ctx,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *iv);
+
+extern void _gcry_serpent_avx512_ctr_enc(const void *ctx,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *ctr);
+
+extern void _gcry_serpent_avx512_ocb_crypt(const void *ctx,
+					   unsigned char *out,
+					   const unsigned char *in,
+					   unsigned char *offset,
+					   unsigned char *checksum,
+					   const ocb_L_uintptr_t Ls[32],
+					   int encrypt);
+
+extern void _gcry_serpent_avx512_blk32(const void *c, byte *out,
+				       const byte *in,
+				       int encrypt);
+#endif
+
 #ifdef USE_NEON
 /* Assembler implementations of Serpent using ARM NEON.  Process 8 block in
    parallel.
@@ -758,6 +802,14 @@ serpent_setkey_internal (serpent_context_t *context,
   serpent_key_prepare (key, key_length, key_prepared);
   serpent_subkeys_generate (key_prepared, context->keys);
 
+#ifdef USE_AVX512
+  context->use_avx512 = 0;
+  if ((_gcry_get_hw_features () & HWF_INTEL_AVX512))
+    {
+      context->use_avx512 = 1;
+    }
+#endif
+
 #ifdef USE_AVX2
   context->use_avx2 = 0;
   if ((_gcry_get_hw_features () & HWF_INTEL_AVX2))
@@ -954,6 +1006,34 @@ _gcry_serpent_ctr_enc(void *context, unsigned char *ctr,
   unsigned char tmpbuf[sizeof(serpent_block_t)];
   int burn_stack_depth = 2 * sizeof (serpent_block_t);
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+
+      /* Process data in 32 block chunks. */
+      while (nblocks >= 32)
+        {
+          _gcry_serpent_avx512_ctr_enc(ctx, outbuf, inbuf, ctr);
+
+          nblocks -= 32;
+          outbuf += 32 * sizeof(serpent_block_t);
+          inbuf  += 32 * sizeof(serpent_block_t);
+          did_use_avx512 = 1;
+        }
+
+      if (did_use_avx512)
+        {
+          /* serpent-avx512 code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic/avx2/sse2 code to handle smaller chunks... */
+      /* TODO: use caching instead? */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1066,6 +1146,33 @@ _gcry_serpent_cbc_dec(void *context, unsigned char *iv,
   unsigned char savebuf[sizeof(serpent_block_t)];
   int burn_stack_depth = 2 * sizeof (serpent_block_t);
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+
+      /* Process data in 32 block chunks. */
+      while (nblocks >= 32)
+        {
+          _gcry_serpent_avx512_cbc_dec(ctx, outbuf, inbuf, iv);
+
+          nblocks -= 32;
+          outbuf += 32 * sizeof(serpent_block_t);
+          inbuf  += 32 * sizeof(serpent_block_t);
+          did_use_avx512 = 1;
+        }
+
+      if (did_use_avx512)
+        {
+          /* serpent-avx512 code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic/avx2/sse2 code to handle smaller chunks... */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1174,6 +1281,33 @@ _gcry_serpent_cfb_dec(void *context, unsigned char *iv,
   const unsigned char *inbuf = inbuf_arg;
   int burn_stack_depth = 2 * sizeof (serpent_block_t);
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+
+      /* Process data in 32 block chunks. */
+      while (nblocks >= 32)
+        {
+          _gcry_serpent_avx512_cfb_dec(ctx, outbuf, inbuf, iv);
+
+          nblocks -= 32;
+          outbuf += 32 * sizeof(serpent_block_t);
+          inbuf  += 32 * sizeof(serpent_block_t);
+          did_use_avx512 = 1;
+        }
+
+      if (did_use_avx512)
+        {
+          /* serpent-avx512 code does not use stack */
+          if (nblocks == 0)
+            burn_stack_depth = 0;
+        }
+
+      /* Use generic/avx2/sse2 code to handle smaller chunks... */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1270,7 +1404,8 @@ static size_t
 _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
 			const void *inbuf_arg, size_t nblocks, int encrypt)
 {
-#if defined(USE_AVX2) || defined(USE_SSE2) || defined(USE_NEON)
+#if defined(USE_AVX512) || defined(USE_AVX2) || defined(USE_SSE2) \
+    || defined(USE_NEON)
   serpent_context_t *ctx = (void *)&c->context.c;
   unsigned char *outbuf = outbuf_arg;
   const unsigned char *inbuf = inbuf_arg;
@@ -1283,6 +1418,44 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
   (void)encrypt;
 #endif
 
+#ifdef USE_AVX512
+  if (ctx->use_avx512)
+    {
+      int did_use_avx512 = 0;
+      ocb_L_uintptr_t Ls[32];
+      ocb_L_uintptr_t *l;
+
+      if (nblocks >= 32)
+	{
+          l = bulk_ocb_prepare_L_pointers_array_blk32 (c, Ls, blkn);
+
+	  /* Process data in 32 block chunks. */
+	  while (nblocks >= 32)
+	    {
+	      blkn += 32;
+	      *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 32);
+
+	      _gcry_serpent_avx512_ocb_crypt(ctx, outbuf, inbuf, c->u_iv.iv,
+					     c->u_ctr.ctr, Ls, encrypt);
+
+	      nblocks -= 32;
+	      outbuf += 32 * sizeof(serpent_block_t);
+	      inbuf  += 32 * sizeof(serpent_block_t);
+	      did_use_avx512 = 1;
+	    }
+	}
+
+      if (did_use_avx512)
+	{
+	  /* serpent-avx512 code does not use stack */
+	  if (nblocks == 0)
+	    burn_stack_depth = 0;
+	}
+
+      /* Use generic code to handle smaller chunks... */
+    }
+#endif
+
 #ifdef USE_AVX2
   if (ctx->use_avx2)
     {
@@ -1408,7 +1581,8 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
     }
 #endif
 
-#if defined(USE_AVX2) || defined(USE_SSE2) || defined(USE_NEON)
+#if defined(USE_AVX512) || defined(USE_AVX2) || defined(USE_SSE2) \
+    || defined(USE_NEON)
   c->u_mode.ocb.data_nblocks = blkn;
 
   if (burn_stack_depth)
@@ -1556,17 +1730,27 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
 
 
 static unsigned int
-serpent_crypt_blk1_16(void *context, byte *out, const byte *in,
+serpent_crypt_blk1_32(void *context, byte *out, const byte *in,
 		      size_t num_blks, int encrypt)
 {
   serpent_context_t *ctx = context;
   unsigned int burn, burn_stack_depth = 0;
 
+#ifdef USE_AVX512
+  if (num_blks == 32 && ctx->use_avx512)
+    {
+      _gcry_serpent_avx512_blk32 (ctx, out, in, encrypt);
+      return 0;
+    }
+#endif
+
 #ifdef USE_AVX2
-  if (num_blks == 16 && ctx->use_avx2)
+  while (num_blks == 16 && ctx->use_avx2)
     {
       _gcry_serpent_avx2_blk16 (ctx, out, in, encrypt);
-      return 0;
+      out += 16 * sizeof(serpent_block_t);
+      in += 16 * sizeof(serpent_block_t);
+      num_blks -= 16;
     }
 #endif
 
@@ -1611,17 +1795,17 @@ serpent_crypt_blk1_16(void *context, byte *out, const byte *in,
 }
 
 static unsigned int
-serpent_encrypt_blk1_16(void *ctx, byte *out, const byte *in,
+serpent_encrypt_blk1_32(void *ctx, byte *out, const byte *in,
 			size_t num_blks)
 {
-  return serpent_crypt_blk1_16 (ctx, out, in, num_blks, 1);
+  return serpent_crypt_blk1_32 (ctx, out, in, num_blks, 1);
 }
 
 static unsigned int
-serpent_decrypt_blk1_16(void *ctx, byte *out, const byte *in,
+serpent_decrypt_blk1_32(void *ctx, byte *out, const byte *in,
 			size_t num_blks)
 {
-  return serpent_crypt_blk1_16 (ctx, out, in, num_blks, 0);
+  return serpent_crypt_blk1_32 (ctx, out, in, num_blks, 0);
 }
 
 
@@ -1638,12 +1822,12 @@ _gcry_serpent_xts_crypt (void *context, unsigned char *tweak, void *outbuf_arg,
   /* Process remaining blocks. */
   if (nblocks)
     {
-      unsigned char tmpbuf[16 * 16];
+      unsigned char tmpbuf[32 * 16];
       unsigned int tmp_used = 16;
       size_t nburn;
 
-      nburn = bulk_xts_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_16
-                                              : serpent_decrypt_blk1_16,
+      nburn = bulk_xts_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_32
+                                              : serpent_decrypt_blk1_32,
                                  outbuf, inbuf, nblocks,
                                  tweak, tmpbuf, sizeof(tmpbuf) / 16,
                                  &tmp_used);
@@ -1672,9 +1856,9 @@ _gcry_serpent_ecb_crypt (void *context, void *outbuf_arg, const void *inbuf_arg,
     {
       size_t nburn;
 
-      nburn = bulk_ecb_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_16
-                                              : serpent_decrypt_blk1_16,
-                                 outbuf, inbuf, nblocks, 16);
+      nburn = bulk_ecb_crypt_128(ctx, encrypt ? serpent_encrypt_blk1_32
+                                              : serpent_decrypt_blk1_32,
+                                 outbuf, inbuf, nblocks, 32);
       burn_stack_depth = nburn > burn_stack_depth ? nburn : burn_stack_depth;
     }
 
diff --git a/configure.ac b/configure.ac
index 60fb1f75..572fe279 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1704,6 +1704,46 @@ if test "$gcry_cv_gcc_inline_asm_bmi2" = "yes" ; then
 fi
 
 
+#
+# Check whether compiler supports x86/AVX512 intrinsics
+#
+_gcc_cflags_save=$CFLAGS
+CFLAGS="$CFLAGS -mavx512f"
+
+AC_CACHE_CHECK([whether compiler supports x86/AVX512 intrinsics],
+      [gcry_cv_cc_x86_avx512_intrinsics],
+      [if test "$mpi_cpu_arch" != "x86" ||
+	  test "$try_asm_modules" != "yes" ; then
+	gcry_cv_cc_x86_avx512_intrinsics="n/a"
+      else
+	gcry_cv_cc_x86_avx512_intrinsics=no
+	AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+	[[#include <immintrin.h>
+	  __m512i fn(void *in, __m128i y)
+	  {
+	    __m512i x;
+	    x = _mm512_maskz_loadu_epi32(_cvtu32_mask16(0xfff0), in)
+		  ^ _mm512_castsi128_si512(y);
+	    asm volatile ("vinserti32x4 \$3, %0, %%zmm6, %%zmm6;\n\t"
+			  "vpxord %%zmm6, %%zmm6, %%zmm6"
+			  ::"x"(y),"r"(in):"memory","xmm6");
+	    return x;
+	  }
+	  ]])],
+	[gcry_cv_cc_x86_avx512_intrinsics=yes])
+      fi])
+if test "$gcry_cv_cc_x86_avx512_intrinsics" = "yes" ; then
+    AC_DEFINE(HAVE_COMPATIBLE_CC_X86_AVX512_INTRINSICS,1,
+	    [Defined if underlying compiler supports x86/AVX512 intrinsics])
+fi
+
+AM_CONDITIONAL(ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS,
+	       test "$gcry_cv_cc_x86_avx512_intrinsics" = "yes")
+
+# Restore flags.
+CFLAGS=$_gcc_cflags_save;
+
+
 #
 # Check whether GCC assembler needs "-Wa,--divide" to correctly handle
 # constant division
@@ -3034,6 +3074,11 @@ if test "$found" = "1" ; then
       GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-avx2-amd64.lo"
    fi
 
+   if test x"$avx512support" = xyes ; then
+      # Build with the AVX512 implementation
+      GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-avx512-x86.lo"
+   fi
+
    if test x"$neonsupport" = xyes ; then
       # Build with the NEON implementation
       GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS serpent-armv7-neon.lo"
-- 
2.39.2