[git] GCRYPT - branch, master, updated. libgcrypt-1.6.0-271-g74184c2
by Jussi Kivilinna
cvs at cvs.gnupg.org
Wed Oct 28 19:11:30 CET 2015
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "The GNU crypto library".
The branch, master has been updated
via 74184c28fbe7ff58cf57f0094ef957d94045da7d (commit)
via 909644ef5883927262366c356eed530e55aba478 (commit)
via 16fd540f4d01eb6dc23d9509ae549353617c7a67 (commit)
via ae40af427fd2a856b24ec2a41323ec8b80ffc9c0 (commit)
from f7505b550dd591e33d3a3fab9277c43c460f1bad (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
commit 74184c28fbe7ff58cf57f0094ef957d94045da7d
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Fri Oct 23 22:30:48 2015 +0300
keccak: rewrite for improved performance
* cipher/Makefile.am: Add 'keccak_permute_32.h' and
'keccak_permute_64.h'.
* cipher/hash-common.h [USE_SHA3] (MD_BLOCK_MAX_BLOCKSIZE): Remove.
* cipher/keccak.c (USE_64BIT, USE_32BIT, USE_64BIT_BMI2)
(USE_64BIT_SHLD, USE_32BIT_BMI2, NEED_COMMON64, NEED_COMMON32BI)
(keccak_ops_t): New.
(KECCAK_STATE): Add 'state64' and 'state32bi' members.
(KECCAK_CONTEXT): Remove 'bctx'; add 'blocksize', 'count' and 'ops'.
(rol64, keccak_f1600_state_permute): Remove.
[NEED_COMMON64] (round_consts_64bit, keccak_extract_inplace64): New.
[NEED_COMMON32BI] (round_consts_32bit, keccak_extract_inplace32bi)
(keccak_absorb_lane32bi): New.
[USE_64BIT] (ANDN64, ROL64, keccak_f1600_state_permute64)
(keccak_absorb_lanes64, keccak_generic64_ops): New.
[USE_64BIT_SHLD] (ANDN64, ROL64, keccak_f1600_state_permute64_shld)
(keccak_absorb_lanes64_shld, keccak_shld_64_ops): New.
[USE_64BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute64_bmi2)
(keccak_absorb_lanes64_bmi2, keccak_bmi2_64_ops): New.
[USE_32BIT] (ANDN64, ROL64, keccak_f1600_state_permute32bi)
(keccak_absorb_lanes32bi, keccak_generic32bi_ops): New.
[USE_32BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute32bi_bmi2)
(pext, pdep, keccak_absorb_lane32bi_bmi2, keccak_absorb_lanes32bi_bmi2)
(keccak_extract_inplace32bi_bmi2, keccak_bmi2_32bi_ops): New.
(keccak_write): New.
(keccak_init): Adjust to KECCAK_CONTEXT changes; add implementation
selection based on HWF features.
(keccak_final): Adjust to KECCAK_CONTEXT changes; use selected 'ops'
for state manipulation.
(keccak_read): Adjust to KECCAK_CONTEXT changes.
(_gcry_digest_spec_sha3_224, _gcry_digest_spec_sha3_256)
(_gcry_digest_spec_sha3_384, _gcry_digest_spec_sha3_512): Use
'keccak_write' instead of '_gcry_md_block_write'.
* cipher/keccak_permute_32.h: New.
* cipher/keccak_permute_64.h: New.
--
This patch adds new generic 64-bit and 32-bit implementations of
SHA3, along with optimized variants:
- Generic 64-bit implementation based on the 'simple' implementation
from the SUPERCOP package.
- Generic 32-bit bit-interleaved implementation based on the
'simple32bi' implementation from the SUPERCOP package (see the
sketch after this list).
- Intel BMI2 optimized variants of the 64-bit and 32-bit BI
implementations.
- Intel SHLD optimized variant of the 64-bit implementation.
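As a reference for the bit-interleaved representation mentioned in the
list above, here is a sketch (not part of the patch): a 64-bit lane is
stored as two 32-bit words holding its even- and odd-indexed bits, so a
64-bit rotate by 2*k decomposes into two independent 32-bit rotates by
k (an odd rotate amount additionally swaps the two words). The patch
itself performs this conversion with faster delta-swap sequences (see
keccak_absorb_lane32bi in the diff below); the loop here is the naive
reference form.

#include <stdint.h>

static void
lane_to_bit_interleaved (uint64_t lane, uint32_t *even, uint32_t *odd)
{
  uint32_t e = 0, o = 0;
  unsigned int i;

  for (i = 0; i < 32; i++)
    {
      e |= (uint32_t)((lane >> (2 * i)) & 1) << i;      /* bits 0,2,4,... */
      o |= (uint32_t)((lane >> (2 * i + 1)) & 1) << i;  /* bits 1,3,5,... */
    }

  *even = e;
  *odd = o;
}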
The patch also makes proper use of the sponge construction to avoid
the need for an additional input buffer.
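The sketch below shows what buffer-less sponge absorption means here:
input lanes are XORed straight into the state, and the permutation runs
whenever a full rate block has been absorbed, so no intermediate block
buffer is needed. This is a minimal sketch with a hypothetical
permutation name; the patch's keccak_absorb_lanes64 has the same shape.

#include <stdint.h>
#include <string.h>

/* Hypothetical name; stands in for keccak_f1600_state_permute64. */
extern void permute_f1600 (uint64_t state[25]);

static void
absorb_lanes (uint64_t state[25], unsigned int *pos,
              const unsigned char *in, size_t nlanes,
              unsigned int blocklanes)
{
  uint64_t lane;

  while (nlanes--)
    {
      memcpy (&lane, in, 8);       /* assumes a little-endian host;
                                      the patch uses buf_get_le64 */
      state[*pos] ^= lane;         /* XOR lane directly into state */
      in += 8;

      if (++*pos == blocklanes)    /* full rate block absorbed */
        {
          permute_f1600 (state);
          *pos = 0;
        }
    }
}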
Below are bench-slope benchmarks for the new 64-bit implementations,
measured on an Intel Core i5-4570 (no turbo, 3.2 GHz, gcc-4.9.2).
Before (amd64):
SHA3-224 | 3.92 ns/B 243.2 MiB/s 12.55 c/B
SHA3-256 | 4.15 ns/B 230.0 MiB/s 13.27 c/B
SHA3-384 | 5.40 ns/B 176.6 MiB/s 17.29 c/B
SHA3-512 | 7.77 ns/B 122.7 MiB/s 24.87 c/B
After (generic 64-bit, amd64, 1.10x faster):
SHA3-224 | 3.57 ns/B 267.4 MiB/s 11.42 c/B
SHA3-256 | 3.77 ns/B 252.8 MiB/s 12.07 c/B
SHA3-384 | 4.91 ns/B 194.1 MiB/s 15.72 c/B
SHA3-512 | 7.06 ns/B 135.0 MiB/s 22.61 c/B
After (Intel SHLD 64-bit, amd64, 1.13x faster):
SHA3-224 | 3.48 ns/B 273.7 MiB/s 11.15 c/B
SHA3-256 | 3.68 ns/B 258.9 MiB/s 11.79 c/B
SHA3-384 | 4.80 ns/B 198.7 MiB/s 15.36 c/B
SHA3-512 | 6.89 ns/B 138.4 MiB/s 22.05 c/B
After (Intel BMI2 64-bit, amd64, 1.45x faster):
SHA3-224 | 2.71 ns/B 352.1 MiB/s 8.67 c/B
SHA3-256 | 2.86 ns/B 333.2 MiB/s 9.16 c/B
SHA3-384 | 3.72 ns/B 256.2 MiB/s 11.91 c/B
SHA3-512 | 5.34 ns/B 178.5 MiB/s 17.10 c/B
Benchmarks of the new 32-bit implementations on Intel Core i5-4570
(no turbo, 3.2 GHz, gcc-4.9.2):
Before (win32):
SHA3-224 | 12.05 ns/B 79.16 MiB/s 38.56 c/B
SHA3-256 | 12.75 ns/B 74.78 MiB/s 40.82 c/B
SHA3-384 | 16.63 ns/B 57.36 MiB/s 53.22 c/B
SHA3-512 | 23.97 ns/B 39.79 MiB/s 76.72 c/B
After (generic 32-bit BI, win32, 1.23x to 1.29x faster):
SHA3-224 | 9.76 ns/B 97.69 MiB/s 31.25 c/B
SHA3-256 | 10.27 ns/B 92.82 MiB/s 32.89 c/B
SHA3-384 | 13.22 ns/B 72.16 MiB/s 42.31 c/B
SHA3-512 | 18.65 ns/B 51.13 MiB/s 59.70 c/B
After (Intel BMI2 32-bit BI, win32, 1.66x to 1.70x faster):
SHA3-224 | 7.26 ns/B 131.4 MiB/s 23.23 c/B
SHA3-256 | 7.65 ns/B 124.7 MiB/s 24.47 c/B
SHA3-384 | 9.87 ns/B 96.67 MiB/s 31.58 c/B
SHA3-512 | 14.05 ns/B 67.85 MiB/s 44.99 c/B
Benchmarks of the new 32-bit implementation on ARM Cortex-A8
(1008 MHz, gcc-4.9.1):
Before:
SHA3-224 | 148.6 ns/B 6.42 MiB/s 149.8 c/B
SHA3-256 | 157.2 ns/B 6.07 MiB/s 158.4 c/B
SHA3-384 | 205.3 ns/B 4.65 MiB/s 206.9 c/B
SHA3-512 | 296.3 ns/B 3.22 MiB/s 298.6 c/B
After (1.56x faster):
SHA3-224 | 96.12 ns/B 9.92 MiB/s 96.89 c/B
SHA3-256 | 101.5 ns/B 9.40 MiB/s 102.3 c/B
SHA3-384 | 131.4 ns/B 7.26 MiB/s 132.5 c/B
SHA3-512 | 188.2 ns/B 5.07 MiB/s 189.7 c/B
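All of the variants benchmarked above are selected once at init time
through a small table of function pointers stored in the hash context
(keccak_ops_t in the patch); every later absorb/permute/extract call
goes through that table. A self-contained sketch of the pattern, with
simplified, illustrative types and feature-flag values:

#include <stdint.h>

typedef struct { uint64_t lanes[25]; } sponge_state;

/* Simplified ops table; the patch's keccak_ops_t also carries
 * absorb and extract_inplace members. */
typedef struct
{
  void (*permute)(sponge_state *st);
} sponge_ops;

static void permute_generic (sponge_state *st) { (void)st; /* ... */ }
static void permute_bmi2    (sponge_state *st) { (void)st; /* ... */ }

static const sponge_ops generic_ops = { permute_generic };
static const sponge_ops bmi2_ops    = { permute_bmi2 };

#define HWF_BMI2_FLAG 1   /* illustrative feature bit */

static const sponge_ops *
select_ops (unsigned int features)
{
  const sponge_ops *ops = &generic_ops;  /* safe generic fallback */

  if (features & HWF_BMI2_FLAG)          /* override when HW allows */
    ops = &bmi2_ops;

  return ops;
}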
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index b08c9a9..be03d06 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -90,7 +90,7 @@ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \
sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \
sha512-armv7-neon.S \
-keccak.c \
+keccak.c keccak_permute_32.h keccak_permute_64.h \
stribog.c \
tiger.c \
whirlpool.c whirlpool-sse2-amd64.S \
diff --git a/cipher/hash-common.h b/cipher/hash-common.h
index e1ae5a2..27d670d 100644
--- a/cipher/hash-common.h
+++ b/cipher/hash-common.h
@@ -33,15 +33,9 @@ typedef unsigned int (*_gcry_md_block_write_t) (void *c,
const unsigned char *blks,
size_t nblks);
-#if defined(HAVE_U64_TYPEDEF) && (defined(USE_SHA512) || defined(USE_SHA3) || \
- defined(USE_WHIRLPOOL))
-/* SHA-512, SHA-3 and Whirlpool needs u64. SHA-512 and SHA3 need larger
- * buffer. */
-# ifdef USE_SHA3
-# define MD_BLOCK_MAX_BLOCKSIZE (1152 / 8)
-# else
-# define MD_BLOCK_MAX_BLOCKSIZE 128
-# endif
+#if defined(HAVE_U64_TYPEDEF) && (defined(USE_SHA512) || defined(USE_WHIRLPOOL))
+/* SHA-512 and Whirlpool need u64. SHA-512 needs a larger buffer. */
+# define MD_BLOCK_MAX_BLOCKSIZE 128
# define MD_NBLOCKS_TYPE u64
#else
# define MD_BLOCK_MAX_BLOCKSIZE 64
diff --git a/cipher/keccak.c b/cipher/keccak.c
index 4a9c1f2..3a72294 100644
--- a/cipher/keccak.c
+++ b/cipher/keccak.c
@@ -27,11 +27,45 @@
#include "hash-common.h"
-/* The code is based on public-domain/CC0 "Keccak-readable-and-compact.c"
- * implementation by the Keccak, Keyak and Ketje Teams, namely, Guido Bertoni,
- * Joan Daemen, Michaël Peeters, Gilles Van Assche and Ronny Van Keer. From:
- * https://github.com/gvanas/KeccakCodePackage
- */
+
+/* USE_64BIT indicates whether to use 64-bit generic implementation.
+ * USE_32BIT indicates whether to use 32-bit generic implementation. */
+#undef USE_64BIT
+#if defined(__x86_64__) || SIZEOF_UNSIGNED_LONG == 8
+# define USE_64BIT 1
+#else
+# define USE_32BIT 1
+#endif
+
+
+/* USE_64BIT_BMI2 indicates whether to compile with 64-bit Intel BMI2 code. */
+#undef USE_64BIT_BMI2
+#if defined(USE_64BIT) && defined(HAVE_GCC_INLINE_ASM_BMI2)
+# define USE_64BIT_BMI2 1
+#endif
+
+
+/* USE_64BIT_SHLD indicates whether to compile with 64-bit Intel SHLD code. */
+#undef USE_64BIT_SHLD
+#if defined(USE_64BIT) && defined (__GNUC__) && defined(__x86_64__)
+# define USE_64BIT_SHLD 1
+#endif
+
+
+/* USE_32BIT_BMI2 indicates whether to compile with 32-bit Intel BMI2 code. */
+#undef USE_32BIT_BMI2
+#if defined(USE_32BIT) && defined(HAVE_GCC_INLINE_ASM_BMI2)
+# define USE_32BIT_BMI2 1
+#endif
+
+
+#ifdef USE_64BIT
+# define NEED_COMMON64 1
+#endif
+
+#ifdef USE_32BIT
+# define NEED_COMMON32BI 1
+#endif
#define SHA3_DELIMITED_SUFFIX 0x06
@@ -40,220 +74,528 @@
typedef struct
{
- u64 state[5][5];
+ union {
+#ifdef NEED_COMMON64
+ u64 state64[25];
+#endif
+#ifdef NEED_COMMON32BI
+ u32 state32bi[50];
+#endif
+ } u;
} KECCAK_STATE;
typedef struct
{
- gcry_md_block_ctx_t bctx;
+ unsigned int (*permute)(KECCAK_STATE *hd);
+ unsigned int (*absorb)(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes);
+ unsigned int (*extract_inplace) (KECCAK_STATE *hd, unsigned int outlen);
+} keccak_ops_t;
+
+
+typedef struct KECCAK_CONTEXT_S
+{
KECCAK_STATE state;
unsigned int outlen;
+ unsigned int blocksize;
+ unsigned int count;
+ const keccak_ops_t *ops;
} KECCAK_CONTEXT;
-static inline u64
-rol64 (u64 x, unsigned int n)
+
+#ifdef NEED_COMMON64
+
+static const u64 round_consts_64bit[24] =
{
- return ((x << n) | (x >> (64 - n)));
-}
+ U64_C(0x0000000000000001), U64_C(0x0000000000008082),
+ U64_C(0x800000000000808A), U64_C(0x8000000080008000),
+ U64_C(0x000000000000808B), U64_C(0x0000000080000001),
+ U64_C(0x8000000080008081), U64_C(0x8000000000008009),
+ U64_C(0x000000000000008A), U64_C(0x0000000000000088),
+ U64_C(0x0000000080008009), U64_C(0x000000008000000A),
+ U64_C(0x000000008000808B), U64_C(0x800000000000008B),
+ U64_C(0x8000000000008089), U64_C(0x8000000000008003),
+ U64_C(0x8000000000008002), U64_C(0x8000000000000080),
+ U64_C(0x000000000000800A), U64_C(0x800000008000000A),
+ U64_C(0x8000000080008081), U64_C(0x8000000000008080),
+ U64_C(0x0000000080000001), U64_C(0x8000000080008008)
+};
-/* Function that computes the Keccak-f[1600] permutation on the given state. */
-static unsigned int keccak_f1600_state_permute(KECCAK_STATE *hd)
+static unsigned int
+keccak_extract_inplace64(KECCAK_STATE *hd, unsigned int outlen)
{
- static const u64 round_consts[24] =
- {
- U64_C(0x0000000000000001), U64_C(0x0000000000008082),
- U64_C(0x800000000000808A), U64_C(0x8000000080008000),
- U64_C(0x000000000000808B), U64_C(0x0000000080000001),
- U64_C(0x8000000080008081), U64_C(0x8000000000008009),
- U64_C(0x000000000000008A), U64_C(0x0000000000000088),
- U64_C(0x0000000080008009), U64_C(0x000000008000000A),
- U64_C(0x000000008000808B), U64_C(0x800000000000008B),
- U64_C(0x8000000000008089), U64_C(0x8000000000008003),
- U64_C(0x8000000000008002), U64_C(0x8000000000000080),
- U64_C(0x000000000000800A), U64_C(0x800000008000000A),
- U64_C(0x8000000080008081), U64_C(0x8000000000008080),
- U64_C(0x0000000080000001), U64_C(0x8000000080008008)
- };
- unsigned int round;
+ unsigned int i;
- for (round = 0; round < 24; round++)
+ for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
{
- {
- /* θ step (see [Keccak Reference, Section 2.3.2]) === */
- u64 C[5], D[5];
-
- /* Compute the parity of the columns */
- C[0] = hd->state[0][0] ^ hd->state[1][0] ^ hd->state[2][0]
- ^ hd->state[3][0] ^ hd->state[4][0];
- C[1] = hd->state[0][1] ^ hd->state[1][1] ^ hd->state[2][1]
- ^ hd->state[3][1] ^ hd->state[4][1];
- C[2] = hd->state[0][2] ^ hd->state[1][2] ^ hd->state[2][2]
- ^ hd->state[3][2] ^ hd->state[4][2];
- C[3] = hd->state[0][3] ^ hd->state[1][3] ^ hd->state[2][3]
- ^ hd->state[3][3] ^ hd->state[4][3];
- C[4] = hd->state[0][4] ^ hd->state[1][4] ^ hd->state[2][4]
- ^ hd->state[3][4] ^ hd->state[4][4];
-
- /* Compute the θ effect for a given column */
- D[0] = C[4] ^ rol64(C[1], 1);
- D[1] = C[0] ^ rol64(C[2], 1);
- D[2] = C[1] ^ rol64(C[3], 1);
- D[3] = C[2] ^ rol64(C[4], 1);
- D[4] = C[3] ^ rol64(C[0], 1);
-
- /* Add the θ effect to the whole column */
- hd->state[0][0] ^= D[0];
- hd->state[1][0] ^= D[0];
- hd->state[2][0] ^= D[0];
- hd->state[3][0] ^= D[0];
- hd->state[4][0] ^= D[0];
-
- /* Add the θ effect to the whole column */
- hd->state[0][1] ^= D[1];
- hd->state[1][1] ^= D[1];
- hd->state[2][1] ^= D[1];
- hd->state[3][1] ^= D[1];
- hd->state[4][1] ^= D[1];
-
- /* Add the θ effect to the whole column */
- hd->state[0][2] ^= D[2];
- hd->state[1][2] ^= D[2];
- hd->state[2][2] ^= D[2];
- hd->state[3][2] ^= D[2];
- hd->state[4][2] ^= D[2];
-
- /* Add the θ effect to the whole column */
- hd->state[0][3] ^= D[3];
- hd->state[1][3] ^= D[3];
- hd->state[2][3] ^= D[3];
- hd->state[3][3] ^= D[3];
- hd->state[4][3] ^= D[3];
-
- /* Add the θ effect to the whole column */
- hd->state[0][4] ^= D[4];
- hd->state[1][4] ^= D[4];
- hd->state[2][4] ^= D[4];
- hd->state[3][4] ^= D[4];
- hd->state[4][4] ^= D[4];
- }
-
- {
- /* ρ and π steps (see [Keccak Reference, Sections 2.3.3 and 2.3.4]) */
- u64 current, temp;
-
-#define do_swap_n_rol(x, y, r) \
- temp = hd->state[y][x]; \
- hd->state[y][x] = rol64(current, r); \
- current = temp;
-
- /* Start at coordinates (1 0) */
- current = hd->state[0][1];
-
- /* Iterate over ((0 1)(2 3))^t * (1 0) for 0 ≤ t ≤ 23 */
- do_swap_n_rol(0, 2, 1);
- do_swap_n_rol(2, 1, 3);
- do_swap_n_rol(1, 2, 6);
- do_swap_n_rol(2, 3, 10);
- do_swap_n_rol(3, 3, 15);
- do_swap_n_rol(3, 0, 21);
- do_swap_n_rol(0, 1, 28);
- do_swap_n_rol(1, 3, 36);
- do_swap_n_rol(3, 1, 45);
- do_swap_n_rol(1, 4, 55);
- do_swap_n_rol(4, 4, 2);
- do_swap_n_rol(4, 0, 14);
- do_swap_n_rol(0, 3, 27);
- do_swap_n_rol(3, 4, 41);
- do_swap_n_rol(4, 3, 56);
- do_swap_n_rol(3, 2, 8);
- do_swap_n_rol(2, 2, 25);
- do_swap_n_rol(2, 0, 43);
- do_swap_n_rol(0, 4, 62);
- do_swap_n_rol(4, 2, 18);
- do_swap_n_rol(2, 4, 39);
- do_swap_n_rol(4, 1, 61);
- do_swap_n_rol(1, 1, 20);
- do_swap_n_rol(1, 0, 44);
-
-#undef do_swap_n_rol
- }
-
- {
- /* χ step (see [Keccak Reference, Section 2.3.1]) */
- u64 temp[5];
-
-#define do_x_step_for_plane(y) \
- /* Take a copy of the plane */ \
- temp[0] = hd->state[y][0]; \
- temp[1] = hd->state[y][1]; \
- temp[2] = hd->state[y][2]; \
- temp[3] = hd->state[y][3]; \
- temp[4] = hd->state[y][4]; \
- \
- /* Compute χ on the plane */ \
- hd->state[y][0] = temp[0] ^ ((~temp[1]) & temp[2]); \
- hd->state[y][1] = temp[1] ^ ((~temp[2]) & temp[3]); \
- hd->state[y][2] = temp[2] ^ ((~temp[3]) & temp[4]); \
- hd->state[y][3] = temp[3] ^ ((~temp[4]) & temp[0]); \
- hd->state[y][4] = temp[4] ^ ((~temp[0]) & temp[1]);
-
- do_x_step_for_plane(0);
- do_x_step_for_plane(1);
- do_x_step_for_plane(2);
- do_x_step_for_plane(3);
- do_x_step_for_plane(4);
-
-#undef do_x_step_for_plane
- }
-
- {
- /* ι step (see [Keccak Reference, Section 2.3.5]) */
-
- hd->state[0][0] ^= round_consts[round];
- }
+ hd->u.state64[i] = le_bswap64(hd->u.state64[i]);
}
- return sizeof(void *) * 4 + sizeof(u64) * 10;
+ return 0;
}
+#endif /* NEED_COMMON64 */
+
+
+#ifdef NEED_COMMON32BI
+
+static const u32 round_consts_32bit[2 * 24] =
+{
+ 0x00000001UL, 0x00000000UL, 0x00000000UL, 0x00000089UL,
+ 0x00000000UL, 0x8000008bUL, 0x00000000UL, 0x80008080UL,
+ 0x00000001UL, 0x0000008bUL, 0x00000001UL, 0x00008000UL,
+ 0x00000001UL, 0x80008088UL, 0x00000001UL, 0x80000082UL,
+ 0x00000000UL, 0x0000000bUL, 0x00000000UL, 0x0000000aUL,
+ 0x00000001UL, 0x00008082UL, 0x00000000UL, 0x00008003UL,
+ 0x00000001UL, 0x0000808bUL, 0x00000001UL, 0x8000000bUL,
+ 0x00000001UL, 0x8000008aUL, 0x00000001UL, 0x80000081UL,
+ 0x00000000UL, 0x80000081UL, 0x00000000UL, 0x80000008UL,
+ 0x00000000UL, 0x00000083UL, 0x00000000UL, 0x80008003UL,
+ 0x00000001UL, 0x80008088UL, 0x00000000UL, 0x80000088UL,
+ 0x00000001UL, 0x00008000UL, 0x00000000UL, 0x80008082UL
+};
static unsigned int
-transform_blk (void *context, const unsigned char *data)
+keccak_extract_inplace32bi(KECCAK_STATE *hd, unsigned int outlen)
{
- KECCAK_CONTEXT *ctx = context;
- KECCAK_STATE *hd = &ctx->state;
- u64 *state = (u64 *)hd->state;
- const size_t bsize = ctx->bctx.blocksize;
unsigned int i;
+ u32 x0;
+ u32 x1;
+ u32 t;
+
+ for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
+ {
+ x0 = hd->u.state32bi[i * 2 + 0];
+ x1 = hd->u.state32bi[i * 2 + 1];
+
+ t = (x0 & 0x0000FFFFUL) + (x1 << 16);
+ x1 = (x0 >> 16) + (x1 & 0xFFFF0000UL);
+ x0 = t;
+ t = (x0 ^ (x0 >> 8)) & 0x0000FF00UL; x0 = x0 ^ t ^ (t << 8);
+ t = (x0 ^ (x0 >> 4)) & 0x00F000F0UL; x0 = x0 ^ t ^ (t << 4);
+ t = (x0 ^ (x0 >> 2)) & 0x0C0C0C0CUL; x0 = x0 ^ t ^ (t << 2);
+ t = (x0 ^ (x0 >> 1)) & 0x22222222UL; x0 = x0 ^ t ^ (t << 1);
+ t = (x1 ^ (x1 >> 8)) & 0x0000FF00UL; x1 = x1 ^ t ^ (t << 8);
+ t = (x1 ^ (x1 >> 4)) & 0x00F000F0UL; x1 = x1 ^ t ^ (t << 4);
+ t = (x1 ^ (x1 >> 2)) & 0x0C0C0C0CUL; x1 = x1 ^ t ^ (t << 2);
+ t = (x1 ^ (x1 >> 1)) & 0x22222222UL; x1 = x1 ^ t ^ (t << 1);
+
+ hd->u.state32bi[i * 2 + 0] = le_bswap32(x0);
+ hd->u.state32bi[i * 2 + 1] = le_bswap32(x1);
+ }
- /* Absorb input block. */
- for (i = 0; i < bsize / 8; i++)
- state[i] ^= buf_get_le64(data + i * 8);
+ return 0;
+}
- return keccak_f1600_state_permute(hd) + 4 * sizeof(void *);
+static inline void
+keccak_absorb_lane32bi(u32 *lane, u32 x0, u32 x1)
+{
+ u32 t;
+
+ t = (x0 ^ (x0 >> 1)) & 0x22222222UL; x0 = x0 ^ t ^ (t << 1);
+ t = (x0 ^ (x0 >> 2)) & 0x0C0C0C0CUL; x0 = x0 ^ t ^ (t << 2);
+ t = (x0 ^ (x0 >> 4)) & 0x00F000F0UL; x0 = x0 ^ t ^ (t << 4);
+ t = (x0 ^ (x0 >> 8)) & 0x0000FF00UL; x0 = x0 ^ t ^ (t << 8);
+ t = (x1 ^ (x1 >> 1)) & 0x22222222UL; x1 = x1 ^ t ^ (t << 1);
+ t = (x1 ^ (x1 >> 2)) & 0x0C0C0C0CUL; x1 = x1 ^ t ^ (t << 2);
+ t = (x1 ^ (x1 >> 4)) & 0x00F000F0UL; x1 = x1 ^ t ^ (t << 4);
+ t = (x1 ^ (x1 >> 8)) & 0x0000FF00UL; x1 = x1 ^ t ^ (t << 8);
+ lane[0] ^= (x0 & 0x0000FFFFUL) + (x1 << 16);
+ lane[1] ^= (x0 >> 16) + (x1 & 0xFFFF0000UL);
}
+#endif /* NEED_COMMON32BI */
+
+
+/* Construct generic 64-bit implementation. */
+#ifdef USE_64BIT
+
+# define ANDN64(x, y) (~(x) & (y))
+# define ROL64(x, n) (((x) << ((unsigned int)n & 63)) | \
+ ((x) >> ((64 - (unsigned int)(n)) & 63)))
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
static unsigned int
-transform (void *context, const unsigned char *data, size_t nblks)
+keccak_absorb_lanes64(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
{
- KECCAK_CONTEXT *ctx = context;
- const size_t bsize = ctx->bctx.blocksize;
- unsigned int burn;
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ hd->u.state64[pos] ^= buf_get_le64(lanes);
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute64(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static const keccak_ops_t keccak_generic64_ops =
+{
+ .permute = keccak_f1600_state_permute64,
+ .absorb = keccak_absorb_lanes64,
+ .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT */
+
+
+/* Construct 64-bit Intel SHLD implementation. */
+#ifdef USE_64BIT_SHLD
+
+# define ANDN64(x, y) (~(x) & (y))
+# define ROL64(x, n) ({ \
+ u64 tmp = (x); \
+ asm ("shldq %1, %0, %0" \
+ : "+r" (tmp) \
+ : "J" ((n) & 63) \
+ : "cc"); \
+ tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64_shld
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes64_shld(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ hd->u.state64[pos] ^= buf_get_le64(lanes);
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute64_shld(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static const keccak_ops_t keccak_shld_64_ops =
+{
+ .permute = keccak_f1600_state_permute64_shld,
+ .absorb = keccak_absorb_lanes64_shld,
+ .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT_SHLD */
+
+
+/* Construct 64-bit Intel BMI2 implementation. */
+#ifdef USE_64BIT_BMI2
+
+# define ANDN64(x, y) ({ \
+ u64 tmp; \
+ asm ("andnq %2, %1, %0" \
+ : "=r" (tmp) \
+ : "r0" (x), "rm" (y)); \
+ tmp; })
+
+# define ROL64(x, n) ({ \
+ u64 tmp; \
+ asm ("rorxq %2, %1, %0" \
+ : "=r" (tmp) \
+ : "rm0" (x), "J" (64 - ((n) & 63))); \
+ tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64_bmi2
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes64_bmi2(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ hd->u.state64[pos] ^= buf_get_le64(lanes);
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute64_bmi2(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static const keccak_ops_t keccak_bmi2_64_ops =
+{
+ .permute = keccak_f1600_state_permute64_bmi2,
+ .absorb = keccak_absorb_lanes64_bmi2,
+ .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT_BMI2 */
+
+
+/* Construct generic 32-bit implementation. */
+#ifdef USE_32BIT
+
+# define ANDN32(x, y) (~(x) & (y))
+# define ROL32(x, n) (((x) << ((unsigned int)n & 31)) | \
+ ((x) >> ((32 - (unsigned int)(n)) & 31)))
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute32bi
+# include "keccak_permute_32.h"
+
+# undef ANDN32
+# undef ROL32
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes32bi(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
- /* Absorb full blocks. */
- do
+ while (nlanes)
{
- burn = transform_blk (context, data);
- data += bsize;
+ keccak_absorb_lane32bi(&hd->u.state32bi[pos * 2],
+ buf_get_le32(lanes + 0),
+ buf_get_le32(lanes + 4));
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute32bi(hd);
+ pos = 0;
+ }
}
- while (--nblks);
return burn;
}
+static const keccak_ops_t keccak_generic32bi_ops =
+{
+ .permute = keccak_f1600_state_permute32bi,
+ .absorb = keccak_absorb_lanes32bi,
+ .extract_inplace = keccak_extract_inplace32bi,
+};
+
+#endif /* USE_32BIT */
+
+
+/* Construct 32-bit Intel BMI2 implementation. */
+#ifdef USE_32BIT_BMI2
+
+# define ANDN32(x, y) ({ \
+ u32 tmp; \
+ asm ("andnl %2, %1, %0" \
+ : "=r" (tmp) \
+ : "r0" (x), "rm" (y)); \
+ tmp; })
+
+# define ROL32(x, n) ({ \
+ u32 tmp; \
+ asm ("rorxl %2, %1, %0" \
+ : "=r" (tmp) \
+ : "rm0" (x), "J" (32 - ((n) & 31))); \
+ tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute32bi_bmi2
+# include "keccak_permute_32.h"
+
+# undef ANDN32
+# undef ROL32
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static inline u32 pext(u32 x, u32 mask)
+{
+ u32 tmp;
+ asm ("pextl %2, %1, %0" : "=r" (tmp) : "r0" (x), "rm" (mask));
+ return tmp;
+}
+
+static inline u32 pdep(u32 x, u32 mask)
+{
+ u32 tmp;
+ asm ("pdepl %2, %1, %0" : "=r" (tmp) : "r0" (x), "rm" (mask));
+ return tmp;
+}
+
+static inline void
+keccak_absorb_lane32bi_bmi2(u32 *lane, u32 x0, u32 x1)
+{
+ x0 = pdep(pext(x0, 0x55555555), 0x0000ffff) | (pext(x0, 0xaaaaaaaa) << 16);
+ x1 = pdep(pext(x1, 0x55555555), 0x0000ffff) | (pext(x1, 0xaaaaaaaa) << 16);
+
+ lane[0] ^= (x0 & 0x0000FFFFUL) + (x1 << 16);
+ lane[1] ^= (x0 >> 16) + (x1 & 0xFFFF0000UL);
+}
+
+static unsigned int
+keccak_absorb_lanes32bi_bmi2(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ keccak_absorb_lane32bi_bmi2(&hd->u.state32bi[pos * 2],
+ buf_get_le32(lanes + 0),
+ buf_get_le32(lanes + 4));
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute32bi_bmi2(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static unsigned int
+keccak_extract_inplace32bi_bmi2(KECCAK_STATE *hd, unsigned int outlen)
+{
+ unsigned int i;
+ u32 x0;
+ u32 x1;
+ u32 t;
+
+ for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
+ {
+ x0 = hd->u.state32bi[i * 2 + 0];
+ x1 = hd->u.state32bi[i * 2 + 1];
+
+ t = (x0 & 0x0000FFFFUL) + (x1 << 16);
+ x1 = (x0 >> 16) + (x1 & 0xFFFF0000UL);
+ x0 = t;
+
+ x0 = pdep(pext(x0, 0xffff0001), 0xaaaaaaab) | pdep(x0 >> 1, 0x55555554);
+ x1 = pdep(pext(x1, 0xffff0001), 0xaaaaaaab) | pdep(x1 >> 1, 0x55555554);
+
+ hd->u.state32bi[i * 2 + 0] = le_bswap32(x0);
+ hd->u.state32bi[i * 2 + 1] = le_bswap32(x1);
+ }
+
+ return 0;
+}
+
+static const keccak_ops_t keccak_bmi2_32bi_ops =
+{
+ .permute = keccak_f1600_state_permute32bi_bmi2,
+ .absorb = keccak_absorb_lanes32bi_bmi2,
+ .extract_inplace = keccak_extract_inplace32bi_bmi2,
+};
+
+#endif /* USE_32BIT_BMI2 */
+
+
+static void
+keccak_write (void *context, const void *inbuf_arg, size_t inlen)
+{
+ KECCAK_CONTEXT *ctx = context;
+ const size_t bsize = ctx->blocksize;
+ const size_t blocklanes = bsize / 8;
+ const byte *inbuf = inbuf_arg;
+ unsigned int nburn, burn = 0;
+ unsigned int count, i;
+ unsigned int pos, nlanes;
+
+ count = ctx->count;
+
+ if (inlen && (count % 8))
+ {
+ byte lane[8] = { 0, };
+
+ /* Complete absorbing partial input lane. */
+
+ pos = count / 8;
+
+ for (i = count % 8; inlen && i < 8; i++)
+ {
+ lane[i] = *inbuf++;
+ inlen--;
+ count++;
+ }
+
+ if (count == bsize)
+ count = 0;
+
+ nburn = ctx->ops->absorb(&ctx->state, pos, lane, 1,
+ (count % 8) ? -1 : blocklanes);
+ burn = nburn > burn ? nburn : burn;
+ }
+
+ /* Absorb full input lanes. */
+
+ pos = count / 8;
+ nlanes = inlen / 8;
+ if (nlanes > 0)
+ {
+ nburn = ctx->ops->absorb(&ctx->state, pos, inbuf, nlanes, blocklanes);
+ burn = nburn > burn ? nburn : burn;
+ inlen -= nlanes * 8;
+ inbuf += nlanes * 8;
+ count += nlanes * 8;
+ count = count % bsize;
+ }
+
+ if (inlen)
+ {
+ byte lane[8] = { 0, };
+
+ /* Absorb remaining partial input lane. */
+
+ pos = count / 8;
+
+ for (i = count % 8; inlen && i < 8; i++)
+ {
+ lane[i] = *inbuf++;
+ inlen--;
+ count++;
+ }
+
+ nburn = ctx->ops->absorb(&ctx->state, pos, lane, 1, -1);
+ burn = nburn > burn ? nburn : burn;
+
+ gcry_assert(count < bsize);
+ }
+
+ ctx->count = count;
+
+ if (burn)
+ _gcry_burn_stack (burn);
+}
+
static void
keccak_init (int algo, void *context, unsigned int flags)
@@ -267,29 +609,48 @@ keccak_init (int algo, void *context, unsigned int flags)
memset (hd, 0, sizeof *hd);
- ctx->bctx.nblocks = 0;
- ctx->bctx.nblocks_high = 0;
- ctx->bctx.count = 0;
- ctx->bctx.bwrite = transform;
+ ctx->count = 0;
+
+ /* Select generic implementation. */
+#ifdef USE_64BIT
+ ctx->ops = &keccak_generic64_ops;
+#elif defined USE_32BIT
+ ctx->ops = &keccak_generic32bi_ops;
+#endif
+
+ /* Select optimized implementation based on HW features. */
+ if (0) {}
+#ifdef USE_64BIT_BMI2
+ else if (features & HWF_INTEL_BMI2)
+ ctx->ops = &keccak_bmi2_64_ops;
+#endif
+#ifdef USE_32BIT_BMI2
+ else if (features & HWF_INTEL_BMI2)
+ ctx->ops = &keccak_bmi2_32bi_ops;
+#endif
+#ifdef USE_64BIT_SHLD
+ else if (features & HWF_INTEL_FAST_SHLD)
+ ctx->ops = &keccak_shld_64_ops;
+#endif
/* Set input block size, in Keccak terms this is called 'rate'. */
switch (algo)
{
case GCRY_MD_SHA3_224:
- ctx->bctx.blocksize = 1152 / 8;
+ ctx->blocksize = 1152 / 8;
ctx->outlen = 224 / 8;
break;
case GCRY_MD_SHA3_256:
- ctx->bctx.blocksize = 1088 / 8;
+ ctx->blocksize = 1088 / 8;
ctx->outlen = 256 / 8;
break;
case GCRY_MD_SHA3_384:
- ctx->bctx.blocksize = 832 / 8;
+ ctx->blocksize = 832 / 8;
ctx->outlen = 384 / 8;
break;
case GCRY_MD_SHA3_512:
- ctx->bctx.blocksize = 576 / 8;
+ ctx->blocksize = 576 / 8;
ctx->outlen = 512 / 8;
break;
default:
@@ -334,59 +695,37 @@ keccak_final (void *context)
{
KECCAK_CONTEXT *ctx = context;
KECCAK_STATE *hd = &ctx->state;
- const size_t bsize = ctx->bctx.blocksize;
+ const size_t bsize = ctx->blocksize;
const byte suffix = SHA3_DELIMITED_SUFFIX;
- u64 *state = (u64 *)hd->state;
- unsigned int stack_burn_depth;
+ unsigned int nburn, burn = 0;
unsigned int lastbytes;
- unsigned int i;
- byte *buf;
+ byte lane[8];
- _gcry_md_block_write (context, NULL, 0); /* flush */
-
- buf = ctx->bctx.buf;
- lastbytes = ctx->bctx.count;
-
- /* Absorb remaining bytes. */
- for (i = 0; i < lastbytes / 8; i++)
- {
- state[i] ^= buf_get_le64(buf);
- buf += 8;
- }
-
- for (i = 0; i < lastbytes % 8; i++)
- {
- state[lastbytes / 8] ^= (u64)*buf << (i * 8);
- buf++;
- }
+ lastbytes = ctx->count;
/* Do the padding and switch to the squeezing phase */
/* Absorb the last few bits and add the first bit of padding (which
coincides with the delimiter in delimited suffix) */
- state[lastbytes / 8] ^= (u64)suffix << ((lastbytes % 8) * 8);
+ buf_put_le64(lane, (u64)suffix << ((lastbytes % 8) * 8));
+ nburn = ctx->ops->absorb(&ctx->state, lastbytes / 8, lane, 1, -1);
+ burn = nburn > burn ? nburn : burn;
/* Add the second bit of padding. */
- state[(bsize - 1) / 8] ^= (u64)0x80 << (((bsize - 1) % 8) * 8);
+ buf_put_le64(lane, (u64)0x80 << (((bsize - 1) % 8) * 8));
+ nburn = ctx->ops->absorb(&ctx->state, (bsize - 1) / 8, lane, 1, -1);
+ burn = nburn > burn ? nburn : burn;
/* Switch to the squeezing phase. */
- stack_burn_depth = keccak_f1600_state_permute(hd);
+ nburn = ctx->ops->permute(hd);
+ burn = nburn > burn ? nburn : burn;
/* Squeeze out all the output blocks */
if (ctx->outlen < bsize)
{
/* Output SHA3 digest. */
- buf = ctx->bctx.buf;
- for (i = 0; i < ctx->outlen / 8; i++)
- {
- buf_put_le64(buf, state[i]);
- buf += 8;
- }
- for (i = 0; i < ctx->outlen % 8; i++)
- {
- *buf = state[ctx->outlen / 8] >> (i * 8);
- buf++;
- }
+ nburn = ctx->ops->extract_inplace(hd, ctx->outlen);
+ burn = nburn > burn ? nburn : burn;
}
else
{
@@ -394,15 +733,18 @@ keccak_final (void *context)
BUG();
}
- _gcry_burn_stack (stack_burn_depth);
+ wipememory(lane, sizeof(lane));
+ if (burn)
+ _gcry_burn_stack (burn);
}
static byte *
keccak_read (void *context)
{
- KECCAK_CONTEXT *hd = (KECCAK_CONTEXT *) context;
- return hd->bctx.buf;
+ KECCAK_CONTEXT *ctx = (KECCAK_CONTEXT *) context;
+ KECCAK_STATE *hd = &ctx->state;
+ return (byte *)&hd->u;
}
@@ -585,7 +927,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_224 =
{
GCRY_MD_SHA3_224, {0, 1},
"SHA3-224", sha3_224_asn, DIM (sha3_224_asn), oid_spec_sha3_224, 28,
- sha3_224_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_224_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
@@ -593,7 +935,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_256 =
{
GCRY_MD_SHA3_256, {0, 1},
"SHA3-256", sha3_256_asn, DIM (sha3_256_asn), oid_spec_sha3_256, 32,
- sha3_256_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_256_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
@@ -601,7 +943,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_384 =
{
GCRY_MD_SHA3_384, {0, 1},
"SHA3-384", sha3_384_asn, DIM (sha3_384_asn), oid_spec_sha3_384, 48,
- sha3_384_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_384_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
@@ -609,7 +951,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_512 =
{
GCRY_MD_SHA3_512, {0, 1},
"SHA3-512", sha3_512_asn, DIM (sha3_512_asn), oid_spec_sha3_512, 64,
- sha3_512_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_512_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
diff --git a/cipher/keccak_permute_32.h b/cipher/keccak_permute_32.h
new file mode 100644
index 0000000..fed9383
--- /dev/null
+++ b/cipher/keccak_permute_32.h
@@ -0,0 +1,535 @@
+/* keccak_permute_32.h - Keccak permute function (simple 32bit bit-interleaved)
+ * Copyright (C) 2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* The code is based on public-domain/CC0 "keccakc1024/simple32bi/
+ * Keccak-simple32BI.c" implementation by Ronny Van Keer from the SUPERCOP
+ * toolkit package.
+ */
+
+/* Function that computes the Keccak-f[1600] permutation on the given state. */
+static unsigned int
+KECCAK_F1600_PERMUTE_FUNC_NAME(KECCAK_STATE *hd)
+{
+ const u32 *round_consts = round_consts_32bit;
+ u32 Aba0, Abe0, Abi0, Abo0, Abu0;
+ u32 Aba1, Abe1, Abi1, Abo1, Abu1;
+ u32 Aga0, Age0, Agi0, Ago0, Agu0;
+ u32 Aga1, Age1, Agi1, Ago1, Agu1;
+ u32 Aka0, Ake0, Aki0, Ako0, Aku0;
+ u32 Aka1, Ake1, Aki1, Ako1, Aku1;
+ u32 Ama0, Ame0, Ami0, Amo0, Amu0;
+ u32 Ama1, Ame1, Ami1, Amo1, Amu1;
+ u32 Asa0, Ase0, Asi0, Aso0, Asu0;
+ u32 Asa1, Ase1, Asi1, Aso1, Asu1;
+ u32 BCa0, BCe0, BCi0, BCo0, BCu0;
+ u32 BCa1, BCe1, BCi1, BCo1, BCu1;
+ u32 Da0, De0, Di0, Do0, Du0;
+ u32 Da1, De1, Di1, Do1, Du1;
+ u32 Eba0, Ebe0, Ebi0, Ebo0, Ebu0;
+ u32 Eba1, Ebe1, Ebi1, Ebo1, Ebu1;
+ u32 Ega0, Ege0, Egi0, Ego0, Egu0;
+ u32 Ega1, Ege1, Egi1, Ego1, Egu1;
+ u32 Eka0, Eke0, Eki0, Eko0, Eku0;
+ u32 Eka1, Eke1, Eki1, Eko1, Eku1;
+ u32 Ema0, Eme0, Emi0, Emo0, Emu0;
+ u32 Ema1, Eme1, Emi1, Emo1, Emu1;
+ u32 Esa0, Ese0, Esi0, Eso0, Esu0;
+ u32 Esa1, Ese1, Esi1, Eso1, Esu1;
+ u32 *state = hd->u.state32bi;
+ unsigned int round;
+
+ Aba0 = state[0];
+ Aba1 = state[1];
+ Abe0 = state[2];
+ Abe1 = state[3];
+ Abi0 = state[4];
+ Abi1 = state[5];
+ Abo0 = state[6];
+ Abo1 = state[7];
+ Abu0 = state[8];
+ Abu1 = state[9];
+ Aga0 = state[10];
+ Aga1 = state[11];
+ Age0 = state[12];
+ Age1 = state[13];
+ Agi0 = state[14];
+ Agi1 = state[15];
+ Ago0 = state[16];
+ Ago1 = state[17];
+ Agu0 = state[18];
+ Agu1 = state[19];
+ Aka0 = state[20];
+ Aka1 = state[21];
+ Ake0 = state[22];
+ Ake1 = state[23];
+ Aki0 = state[24];
+ Aki1 = state[25];
+ Ako0 = state[26];
+ Ako1 = state[27];
+ Aku0 = state[28];
+ Aku1 = state[29];
+ Ama0 = state[30];
+ Ama1 = state[31];
+ Ame0 = state[32];
+ Ame1 = state[33];
+ Ami0 = state[34];
+ Ami1 = state[35];
+ Amo0 = state[36];
+ Amo1 = state[37];
+ Amu0 = state[38];
+ Amu1 = state[39];
+ Asa0 = state[40];
+ Asa1 = state[41];
+ Ase0 = state[42];
+ Ase1 = state[43];
+ Asi0 = state[44];
+ Asi1 = state[45];
+ Aso0 = state[46];
+ Aso1 = state[47];
+ Asu0 = state[48];
+ Asu1 = state[49];
+
+ for (round = 0; round < 24; round += 2)
+ {
+ /* prepareTheta */
+ BCa0 = Aba0 ^ Aga0 ^ Aka0 ^ Ama0 ^ Asa0;
+ BCa1 = Aba1 ^ Aga1 ^ Aka1 ^ Ama1 ^ Asa1;
+ BCe0 = Abe0 ^ Age0 ^ Ake0 ^ Ame0 ^ Ase0;
+ BCe1 = Abe1 ^ Age1 ^ Ake1 ^ Ame1 ^ Ase1;
+ BCi0 = Abi0 ^ Agi0 ^ Aki0 ^ Ami0 ^ Asi0;
+ BCi1 = Abi1 ^ Agi1 ^ Aki1 ^ Ami1 ^ Asi1;
+ BCo0 = Abo0 ^ Ago0 ^ Ako0 ^ Amo0 ^ Aso0;
+ BCo1 = Abo1 ^ Ago1 ^ Ako1 ^ Amo1 ^ Aso1;
+ BCu0 = Abu0 ^ Agu0 ^ Aku0 ^ Amu0 ^ Asu0;
+ BCu1 = Abu1 ^ Agu1 ^ Aku1 ^ Amu1 ^ Asu1;
+
+ /* thetaRhoPiChiIota(round , A, E) */
+ Da0 = BCu0 ^ ROL32(BCe1, 1);
+ Da1 = BCu1 ^ BCe0;
+ De0 = BCa0 ^ ROL32(BCi1, 1);
+ De1 = BCa1 ^ BCi0;
+ Di0 = BCe0 ^ ROL32(BCo1, 1);
+ Di1 = BCe1 ^ BCo0;
+ Do0 = BCi0 ^ ROL32(BCu1, 1);
+ Do1 = BCi1 ^ BCu0;
+ Du0 = BCo0 ^ ROL32(BCa1, 1);
+ Du1 = BCo1 ^ BCa0;
+
+ Aba0 ^= Da0;
+ BCa0 = Aba0;
+ Age0 ^= De0;
+ BCe0 = ROL32(Age0, 22);
+ Aki1 ^= Di1;
+ BCi0 = ROL32(Aki1, 22);
+ Amo1 ^= Do1;
+ BCo0 = ROL32(Amo1, 11);
+ Asu0 ^= Du0;
+ BCu0 = ROL32(Asu0, 7);
+ Eba0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Eba0 ^= round_consts[round * 2 + 0];
+ Ebe0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Ebi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ebo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Ebu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Aba1 ^= Da1;
+ BCa1 = Aba1;
+ Age1 ^= De1;
+ BCe1 = ROL32(Age1, 22);
+ Aki0 ^= Di0;
+ BCi1 = ROL32(Aki0, 21);
+ Amo0 ^= Do0;
+ BCo1 = ROL32(Amo0, 10);
+ Asu1 ^= Du1;
+ BCu1 = ROL32(Asu1, 7);
+ Eba1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Eba1 ^= round_consts[round * 2 + 1];
+ Ebe1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Ebi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ebo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Ebu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abo0 ^= Do0;
+ BCa0 = ROL32(Abo0, 14);
+ Agu0 ^= Du0;
+ BCe0 = ROL32(Agu0, 10);
+ Aka1 ^= Da1;
+ BCi0 = ROL32(Aka1, 2);
+ Ame1 ^= De1;
+ BCo0 = ROL32(Ame1, 23);
+ Asi1 ^= Di1;
+ BCu0 = ROL32(Asi1, 31);
+ Ega0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ege0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Egi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ego0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Egu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abo1 ^= Do1;
+ BCa1 = ROL32(Abo1, 14);
+ Agu1 ^= Du1;
+ BCe1 = ROL32(Agu1, 10);
+ Aka0 ^= Da0;
+ BCi1 = ROL32(Aka0, 1);
+ Ame0 ^= De0;
+ BCo1 = ROL32(Ame0, 22);
+ Asi0 ^= Di0;
+ BCu1 = ROL32(Asi0, 30);
+ Ega1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ege1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Egi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ego1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Egu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abe1 ^= De1;
+ BCa0 = ROL32(Abe1, 1);
+ Agi0 ^= Di0;
+ BCe0 = ROL32(Agi0, 3);
+ Ako1 ^= Do1;
+ BCi0 = ROL32(Ako1, 13);
+ Amu0 ^= Du0;
+ BCo0 = ROL32(Amu0, 4);
+ Asa0 ^= Da0;
+ BCu0 = ROL32(Asa0, 9);
+ Eka0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Eke0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Eki0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Eko0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Eku0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abe0 ^= De0;
+ BCa1 = Abe0;
+ Agi1 ^= Di1;
+ BCe1 = ROL32(Agi1, 3);
+ Ako0 ^= Do0;
+ BCi1 = ROL32(Ako0, 12);
+ Amu1 ^= Du1;
+ BCo1 = ROL32(Amu1, 4);
+ Asa1 ^= Da1;
+ BCu1 = ROL32(Asa1, 9);
+ Eka1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Eke1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Eki1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Eko1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Eku1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abu1 ^= Du1;
+ BCa0 = ROL32(Abu1, 14);
+ Aga0 ^= Da0;
+ BCe0 = ROL32(Aga0, 18);
+ Ake0 ^= De0;
+ BCi0 = ROL32(Ake0, 5);
+ Ami1 ^= Di1;
+ BCo0 = ROL32(Ami1, 8);
+ Aso0 ^= Do0;
+ BCu0 = ROL32(Aso0, 28);
+ Ema0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Eme0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Emi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Emo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Emu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abu0 ^= Du0;
+ BCa1 = ROL32(Abu0, 13);
+ Aga1 ^= Da1;
+ BCe1 = ROL32(Aga1, 18);
+ Ake1 ^= De1;
+ BCi1 = ROL32(Ake1, 5);
+ Ami0 ^= Di0;
+ BCo1 = ROL32(Ami0, 7);
+ Aso1 ^= Do1;
+ BCu1 = ROL32(Aso1, 28);
+ Ema1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Eme1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Emi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Emo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Emu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abi0 ^= Di0;
+ BCa0 = ROL32(Abi0, 31);
+ Ago1 ^= Do1;
+ BCe0 = ROL32(Ago1, 28);
+ Aku1 ^= Du1;
+ BCi0 = ROL32(Aku1, 20);
+ Ama1 ^= Da1;
+ BCo0 = ROL32(Ama1, 21);
+ Ase0 ^= De0;
+ BCu0 = ROL32(Ase0, 1);
+ Esa0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ese0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Esi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Eso0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Esu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abi1 ^= Di1;
+ BCa1 = ROL32(Abi1, 31);
+ Ago0 ^= Do0;
+ BCe1 = ROL32(Ago0, 27);
+ Aku0 ^= Du0;
+ BCi1 = ROL32(Aku0, 19);
+ Ama0 ^= Da0;
+ BCo1 = ROL32(Ama0, 20);
+ Ase1 ^= De1;
+ BCu1 = ROL32(Ase1, 1);
+ Esa1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ese1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Esi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Eso1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Esu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ /* prepareTheta */
+ BCa0 = Eba0 ^ Ega0 ^ Eka0 ^ Ema0 ^ Esa0;
+ BCa1 = Eba1 ^ Ega1 ^ Eka1 ^ Ema1 ^ Esa1;
+ BCe0 = Ebe0 ^ Ege0 ^ Eke0 ^ Eme0 ^ Ese0;
+ BCe1 = Ebe1 ^ Ege1 ^ Eke1 ^ Eme1 ^ Ese1;
+ BCi0 = Ebi0 ^ Egi0 ^ Eki0 ^ Emi0 ^ Esi0;
+ BCi1 = Ebi1 ^ Egi1 ^ Eki1 ^ Emi1 ^ Esi1;
+ BCo0 = Ebo0 ^ Ego0 ^ Eko0 ^ Emo0 ^ Eso0;
+ BCo1 = Ebo1 ^ Ego1 ^ Eko1 ^ Emo1 ^ Eso1;
+ BCu0 = Ebu0 ^ Egu0 ^ Eku0 ^ Emu0 ^ Esu0;
+ BCu1 = Ebu1 ^ Egu1 ^ Eku1 ^ Emu1 ^ Esu1;
+
+ /* thetaRhoPiChiIota(round+1, E, A) */
+ Da0 = BCu0 ^ ROL32(BCe1, 1);
+ Da1 = BCu1 ^ BCe0;
+ De0 = BCa0 ^ ROL32(BCi1, 1);
+ De1 = BCa1 ^ BCi0;
+ Di0 = BCe0 ^ ROL32(BCo1, 1);
+ Di1 = BCe1 ^ BCo0;
+ Do0 = BCi0 ^ ROL32(BCu1, 1);
+ Do1 = BCi1 ^ BCu0;
+ Du0 = BCo0 ^ ROL32(BCa1, 1);
+ Du1 = BCo1 ^ BCa0;
+
+ Eba0 ^= Da0;
+ BCa0 = Eba0;
+ Ege0 ^= De0;
+ BCe0 = ROL32(Ege0, 22);
+ Eki1 ^= Di1;
+ BCi0 = ROL32(Eki1, 22);
+ Emo1 ^= Do1;
+ BCo0 = ROL32(Emo1, 11);
+ Esu0 ^= Du0;
+ BCu0 = ROL32(Esu0, 7);
+ Aba0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Aba0 ^= round_consts[round * 2 + 2];
+ Abe0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Abi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Abo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Abu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Eba1 ^= Da1;
+ BCa1 = Eba1;
+ Ege1 ^= De1;
+ BCe1 = ROL32(Ege1, 22);
+ Eki0 ^= Di0;
+ BCi1 = ROL32(Eki0, 21);
+ Emo0 ^= Do0;
+ BCo1 = ROL32(Emo0, 10);
+ Esu1 ^= Du1;
+ BCu1 = ROL32(Esu1, 7);
+ Aba1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Aba1 ^= round_consts[round * 2 + 3];
+ Abe1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Abi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Abo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Abu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebo0 ^= Do0;
+ BCa0 = ROL32(Ebo0, 14);
+ Egu0 ^= Du0;
+ BCe0 = ROL32(Egu0, 10);
+ Eka1 ^= Da1;
+ BCi0 = ROL32(Eka1, 2);
+ Eme1 ^= De1;
+ BCo0 = ROL32(Eme1, 23);
+ Esi1 ^= Di1;
+ BCu0 = ROL32(Esi1, 31);
+ Aga0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Age0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Agi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ago0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Agu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebo1 ^= Do1;
+ BCa1 = ROL32(Ebo1, 14);
+ Egu1 ^= Du1;
+ BCe1 = ROL32(Egu1, 10);
+ Eka0 ^= Da0;
+ BCi1 = ROL32(Eka0, 1);
+ Eme0 ^= De0;
+ BCo1 = ROL32(Eme0, 22);
+ Esi0 ^= Di0;
+ BCu1 = ROL32(Esi0, 30);
+ Aga1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Age1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Agi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ago1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Agu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebe1 ^= De1;
+ BCa0 = ROL32(Ebe1, 1);
+ Egi0 ^= Di0;
+ BCe0 = ROL32(Egi0, 3);
+ Eko1 ^= Do1;
+ BCi0 = ROL32(Eko1, 13);
+ Emu0 ^= Du0;
+ BCo0 = ROL32(Emu0, 4);
+ Esa0 ^= Da0;
+ BCu0 = ROL32(Esa0, 9);
+ Aka0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ake0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Aki0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ako0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Aku0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebe0 ^= De0;
+ BCa1 = Ebe0;
+ Egi1 ^= Di1;
+ BCe1 = ROL32(Egi1, 3);
+ Eko0 ^= Do0;
+ BCi1 = ROL32(Eko0, 12);
+ Emu1 ^= Du1;
+ BCo1 = ROL32(Emu1, 4);
+ Esa1 ^= Da1;
+ BCu1 = ROL32(Esa1, 9);
+ Aka1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ake1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Aki1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ako1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Aku1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebu1 ^= Du1;
+ BCa0 = ROL32(Ebu1, 14);
+ Ega0 ^= Da0;
+ BCe0 = ROL32(Ega0, 18);
+ Eke0 ^= De0;
+ BCi0 = ROL32(Eke0, 5);
+ Emi1 ^= Di1;
+ BCo0 = ROL32(Emi1, 8);
+ Eso0 ^= Do0;
+ BCu0 = ROL32(Eso0, 28);
+ Ama0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ame0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Ami0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Amo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Amu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebu0 ^= Du0;
+ BCa1 = ROL32(Ebu0, 13);
+ Ega1 ^= Da1;
+ BCe1 = ROL32(Ega1, 18);
+ Eke1 ^= De1;
+ BCi1 = ROL32(Eke1, 5);
+ Emi0 ^= Di0;
+ BCo1 = ROL32(Emi0, 7);
+ Eso1 ^= Do1;
+ BCu1 = ROL32(Eso1, 28);
+ Ama1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ame1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Ami1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Amo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Amu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebi0 ^= Di0;
+ BCa0 = ROL32(Ebi0, 31);
+ Ego1 ^= Do1;
+ BCe0 = ROL32(Ego1, 28);
+ Eku1 ^= Du1;
+ BCi0 = ROL32(Eku1, 20);
+ Ema1 ^= Da1;
+ BCo0 = ROL32(Ema1, 21);
+ Ese0 ^= De0;
+ BCu0 = ROL32(Ese0, 1);
+ Asa0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ase0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Asi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Aso0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Asu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebi1 ^= Di1;
+ BCa1 = ROL32(Ebi1, 31);
+ Ego0 ^= Do0;
+ BCe1 = ROL32(Ego0, 27);
+ Eku0 ^= Du0;
+ BCi1 = ROL32(Eku0, 19);
+ Ema0 ^= Da0;
+ BCo1 = ROL32(Ema0, 20);
+ Ese1 ^= De1;
+ BCu1 = ROL32(Ese1, 1);
+ Asa1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ase1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Asi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Aso1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Asu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+ }
+
+ state[0] = Aba0;
+ state[1] = Aba1;
+ state[2] = Abe0;
+ state[3] = Abe1;
+ state[4] = Abi0;
+ state[5] = Abi1;
+ state[6] = Abo0;
+ state[7] = Abo1;
+ state[8] = Abu0;
+ state[9] = Abu1;
+ state[10] = Aga0;
+ state[11] = Aga1;
+ state[12] = Age0;
+ state[13] = Age1;
+ state[14] = Agi0;
+ state[15] = Agi1;
+ state[16] = Ago0;
+ state[17] = Ago1;
+ state[18] = Agu0;
+ state[19] = Agu1;
+ state[20] = Aka0;
+ state[21] = Aka1;
+ state[22] = Ake0;
+ state[23] = Ake1;
+ state[24] = Aki0;
+ state[25] = Aki1;
+ state[26] = Ako0;
+ state[27] = Ako1;
+ state[28] = Aku0;
+ state[29] = Aku1;
+ state[30] = Ama0;
+ state[31] = Ama1;
+ state[32] = Ame0;
+ state[33] = Ame1;
+ state[34] = Ami0;
+ state[35] = Ami1;
+ state[36] = Amo0;
+ state[37] = Amo1;
+ state[38] = Amu0;
+ state[39] = Amu1;
+ state[40] = Asa0;
+ state[41] = Asa1;
+ state[42] = Ase0;
+ state[43] = Ase1;
+ state[44] = Asi0;
+ state[45] = Asi1;
+ state[46] = Aso0;
+ state[47] = Aso1;
+ state[48] = Asu0;
+ state[49] = Asu1;
+
+ return sizeof(void *) * 4 + sizeof(u32) * 12 * 5 * 2;
+}
diff --git a/cipher/keccak_permute_64.h b/cipher/keccak_permute_64.h
new file mode 100644
index 0000000..1264f19
--- /dev/null
+++ b/cipher/keccak_permute_64.h
@@ -0,0 +1,290 @@
+/* keccak_permute_64.h - Keccak permute function (simple 64bit)
+ * Copyright (C) 2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* The code is based on public-domain/CC0 "keccakc1024/simple/Keccak-simple.c"
+ * implementation by Ronny Van Keer from the SUPERCOP toolkit package.
+ */
+
+/* Function that computes the Keccak-f[1600] permutation on the given state. */
+static unsigned int
+KECCAK_F1600_PERMUTE_FUNC_NAME(KECCAK_STATE *hd)
+{
+ const u64 *round_consts = round_consts_64bit;
+ u64 Aba, Abe, Abi, Abo, Abu;
+ u64 Aga, Age, Agi, Ago, Agu;
+ u64 Aka, Ake, Aki, Ako, Aku;
+ u64 Ama, Ame, Ami, Amo, Amu;
+ u64 Asa, Ase, Asi, Aso, Asu;
+ u64 BCa, BCe, BCi, BCo, BCu;
+ u64 Da, De, Di, Do, Du;
+ u64 Eba, Ebe, Ebi, Ebo, Ebu;
+ u64 Ega, Ege, Egi, Ego, Egu;
+ u64 Eka, Eke, Eki, Eko, Eku;
+ u64 Ema, Eme, Emi, Emo, Emu;
+ u64 Esa, Ese, Esi, Eso, Esu;
+ u64 *state = hd->u.state64;
+ unsigned int round;
+
+ Aba = state[0];
+ Abe = state[1];
+ Abi = state[2];
+ Abo = state[3];
+ Abu = state[4];
+ Aga = state[5];
+ Age = state[6];
+ Agi = state[7];
+ Ago = state[8];
+ Agu = state[9];
+ Aka = state[10];
+ Ake = state[11];
+ Aki = state[12];
+ Ako = state[13];
+ Aku = state[14];
+ Ama = state[15];
+ Ame = state[16];
+ Ami = state[17];
+ Amo = state[18];
+ Amu = state[19];
+ Asa = state[20];
+ Ase = state[21];
+ Asi = state[22];
+ Aso = state[23];
+ Asu = state[24];
+
+ for (round = 0; round < 24; round += 2)
+ {
+ /* prepareTheta */
+ BCa = Aba ^ Aga ^ Aka ^ Ama ^ Asa;
+ BCe = Abe ^ Age ^ Ake ^ Ame ^ Ase;
+ BCi = Abi ^ Agi ^ Aki ^ Ami ^ Asi;
+ BCo = Abo ^ Ago ^ Ako ^ Amo ^ Aso;
+ BCu = Abu ^ Agu ^ Aku ^ Amu ^ Asu;
+
+ /* thetaRhoPiChiIotaPrepareTheta(round , A, E) */
+ Da = BCu ^ ROL64(BCe, 1);
+ De = BCa ^ ROL64(BCi, 1);
+ Di = BCe ^ ROL64(BCo, 1);
+ Do = BCi ^ ROL64(BCu, 1);
+ Du = BCo ^ ROL64(BCa, 1);
+
+ Aba ^= Da;
+ BCa = Aba;
+ Age ^= De;
+ BCe = ROL64(Age, 44);
+ Aki ^= Di;
+ BCi = ROL64(Aki, 43);
+ Amo ^= Do;
+ BCo = ROL64(Amo, 21);
+ Asu ^= Du;
+ BCu = ROL64(Asu, 14);
+ Eba = BCa ^ ANDN64(BCe, BCi);
+ Eba ^= (u64)round_consts[round];
+ Ebe = BCe ^ ANDN64(BCi, BCo);
+ Ebi = BCi ^ ANDN64(BCo, BCu);
+ Ebo = BCo ^ ANDN64(BCu, BCa);
+ Ebu = BCu ^ ANDN64(BCa, BCe);
+
+ Abo ^= Do;
+ BCa = ROL64(Abo, 28);
+ Agu ^= Du;
+ BCe = ROL64(Agu, 20);
+ Aka ^= Da;
+ BCi = ROL64(Aka, 3);
+ Ame ^= De;
+ BCo = ROL64(Ame, 45);
+ Asi ^= Di;
+ BCu = ROL64(Asi, 61);
+ Ega = BCa ^ ANDN64(BCe, BCi);
+ Ege = BCe ^ ANDN64(BCi, BCo);
+ Egi = BCi ^ ANDN64(BCo, BCu);
+ Ego = BCo ^ ANDN64(BCu, BCa);
+ Egu = BCu ^ ANDN64(BCa, BCe);
+
+ Abe ^= De;
+ BCa = ROL64(Abe, 1);
+ Agi ^= Di;
+ BCe = ROL64(Agi, 6);
+ Ako ^= Do;
+ BCi = ROL64(Ako, 25);
+ Amu ^= Du;
+ BCo = ROL64(Amu, 8);
+ Asa ^= Da;
+ BCu = ROL64(Asa, 18);
+ Eka = BCa ^ ANDN64(BCe, BCi);
+ Eke = BCe ^ ANDN64(BCi, BCo);
+ Eki = BCi ^ ANDN64(BCo, BCu);
+ Eko = BCo ^ ANDN64(BCu, BCa);
+ Eku = BCu ^ ANDN64(BCa, BCe);
+
+ Abu ^= Du;
+ BCa = ROL64(Abu, 27);
+ Aga ^= Da;
+ BCe = ROL64(Aga, 36);
+ Ake ^= De;
+ BCi = ROL64(Ake, 10);
+ Ami ^= Di;
+ BCo = ROL64(Ami, 15);
+ Aso ^= Do;
+ BCu = ROL64(Aso, 56);
+ Ema = BCa ^ ANDN64(BCe, BCi);
+ Eme = BCe ^ ANDN64(BCi, BCo);
+ Emi = BCi ^ ANDN64(BCo, BCu);
+ Emo = BCo ^ ANDN64(BCu, BCa);
+ Emu = BCu ^ ANDN64(BCa, BCe);
+
+ Abi ^= Di;
+ BCa = ROL64(Abi, 62);
+ Ago ^= Do;
+ BCe = ROL64(Ago, 55);
+ Aku ^= Du;
+ BCi = ROL64(Aku, 39);
+ Ama ^= Da;
+ BCo = ROL64(Ama, 41);
+ Ase ^= De;
+ BCu = ROL64(Ase, 2);
+ Esa = BCa ^ ANDN64(BCe, BCi);
+ Ese = BCe ^ ANDN64(BCi, BCo);
+ Esi = BCi ^ ANDN64(BCo, BCu);
+ Eso = BCo ^ ANDN64(BCu, BCa);
+ Esu = BCu ^ ANDN64(BCa, BCe);
+
+ /* prepareTheta */
+ BCa = Eba ^ Ega ^ Eka ^ Ema ^ Esa;
+ BCe = Ebe ^ Ege ^ Eke ^ Eme ^ Ese;
+ BCi = Ebi ^ Egi ^ Eki ^ Emi ^ Esi;
+ BCo = Ebo ^ Ego ^ Eko ^ Emo ^ Eso;
+ BCu = Ebu ^ Egu ^ Eku ^ Emu ^ Esu;
+
+ /* thetaRhoPiChiIotaPrepareTheta(round+1, E, A) */
+ Da = BCu ^ ROL64(BCe, 1);
+ De = BCa ^ ROL64(BCi, 1);
+ Di = BCe ^ ROL64(BCo, 1);
+ Do = BCi ^ ROL64(BCu, 1);
+ Du = BCo ^ ROL64(BCa, 1);
+
+ Eba ^= Da;
+ BCa = Eba;
+ Ege ^= De;
+ BCe = ROL64(Ege, 44);
+ Eki ^= Di;
+ BCi = ROL64(Eki, 43);
+ Emo ^= Do;
+ BCo = ROL64(Emo, 21);
+ Esu ^= Du;
+ BCu = ROL64(Esu, 14);
+ Aba = BCa ^ ANDN64(BCe, BCi);
+ Aba ^= (u64)round_consts[round + 1];
+ Abe = BCe ^ ANDN64(BCi, BCo);
+ Abi = BCi ^ ANDN64(BCo, BCu);
+ Abo = BCo ^ ANDN64(BCu, BCa);
+ Abu = BCu ^ ANDN64(BCa, BCe);
+
+ Ebo ^= Do;
+ BCa = ROL64(Ebo, 28);
+ Egu ^= Du;
+ BCe = ROL64(Egu, 20);
+ Eka ^= Da;
+ BCi = ROL64(Eka, 3);
+ Eme ^= De;
+ BCo = ROL64(Eme, 45);
+ Esi ^= Di;
+ BCu = ROL64(Esi, 61);
+ Aga = BCa ^ ANDN64(BCe, BCi);
+ Age = BCe ^ ANDN64(BCi, BCo);
+ Agi = BCi ^ ANDN64(BCo, BCu);
+ Ago = BCo ^ ANDN64(BCu, BCa);
+ Agu = BCu ^ ANDN64(BCa, BCe);
+
+ Ebe ^= De;
+ BCa = ROL64(Ebe, 1);
+ Egi ^= Di;
+ BCe = ROL64(Egi, 6);
+ Eko ^= Do;
+ BCi = ROL64(Eko, 25);
+ Emu ^= Du;
+ BCo = ROL64(Emu, 8);
+ Esa ^= Da;
+ BCu = ROL64(Esa, 18);
+ Aka = BCa ^ ANDN64(BCe, BCi);
+ Ake = BCe ^ ANDN64(BCi, BCo);
+ Aki = BCi ^ ANDN64(BCo, BCu);
+ Ako = BCo ^ ANDN64(BCu, BCa);
+ Aku = BCu ^ ANDN64(BCa, BCe);
+
+ Ebu ^= Du;
+ BCa = ROL64(Ebu, 27);
+ Ega ^= Da;
+ BCe = ROL64(Ega, 36);
+ Eke ^= De;
+ BCi = ROL64(Eke, 10);
+ Emi ^= Di;
+ BCo = ROL64(Emi, 15);
+ Eso ^= Do;
+ BCu = ROL64(Eso, 56);
+ Ama = BCa ^ ANDN64(BCe, BCi);
+ Ame = BCe ^ ANDN64(BCi, BCo);
+ Ami = BCi ^ ANDN64(BCo, BCu);
+ Amo = BCo ^ ANDN64(BCu, BCa);
+ Amu = BCu ^ ANDN64(BCa, BCe);
+
+ Ebi ^= Di;
+ BCa = ROL64(Ebi, 62);
+ Ego ^= Do;
+ BCe = ROL64(Ego, 55);
+ Eku ^= Du;
+ BCi = ROL64(Eku, 39);
+ Ema ^= Da;
+ BCo = ROL64(Ema, 41);
+ Ese ^= De;
+ BCu = ROL64(Ese, 2);
+ Asa = BCa ^ ANDN64(BCe, BCi);
+ Ase = BCe ^ ANDN64(BCi, BCo);
+ Asi = BCi ^ ANDN64(BCo, BCu);
+ Aso = BCo ^ ANDN64(BCu, BCa);
+ Asu = BCu ^ ANDN64(BCa, BCe);
+ }
+
+ state[0] = Aba;
+ state[1] = Abe;
+ state[2] = Abi;
+ state[3] = Abo;
+ state[4] = Abu;
+ state[5] = Aga;
+ state[6] = Age;
+ state[7] = Agi;
+ state[8] = Ago;
+ state[9] = Agu;
+ state[10] = Aka;
+ state[11] = Ake;
+ state[12] = Aki;
+ state[13] = Ako;
+ state[14] = Aku;
+ state[15] = Ama;
+ state[16] = Ame;
+ state[17] = Ami;
+ state[18] = Amo;
+ state[19] = Amu;
+ state[20] = Asa;
+ state[21] = Ase;
+ state[22] = Asi;
+ state[23] = Aso;
+ state[24] = Asu;
+
+ return sizeof(void *) * 4 + sizeof(u64) * 12 * 5;
+}
commit 909644ef5883927262366c356eed530e55aba478
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Fri Oct 23 22:39:47 2015 +0300
hwf-x86: add detection for Intel CPUs with fast SHLD instruction
* cipher/sha1.c (sha1_init): Use HWF_INTEL_FAST_SHLD instead of
HWF_INTEL_CPU.
* cipher/sha256.c (sha256_init, sha224_init): Ditto.
* cipher/sha512.c (sha512_init, sha384_init): Ditto.
* src/g10lib.h (HWF_INTEL_FAST_SHLD): New.
(HWF_INTEL_BMI2, HWF_INTEL_SSSE3, HWF_INTEL_PCLMUL, HWF_INTEL_AESNI)
(HWF_INTEL_RDRAND, HWF_INTEL_AVX, HWF_INTEL_AVX2)
(HWF_ARM_NEON): Update.
* src/hwf-x86.c (detect_x86_gnuc): Add detection of Intel Core
CPUs with fast SHLD/SHRD instruction.
* src/hwfeatures.c (hwflist): Add "intel-fast-shld".
--
Intel Core CPUs since codename Sandy Bridge have been able to
execute SHLD/SHRD instructions faster than the rotate instructions
ROL/ROR. Since SHLD/SHRD can be used to do rotation, some
optimized implementations (SHA1/SHA256/SHA512) use SHLD/SHRD
instructions in place of ROL/ROR.
This patch provides more accurate detection of CPUs with a
fast SHLD implementation.
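For reference, the rotate-via-SHLD trick looks like this in GCC inline
assembly (this mirrors the ROL64 macro of the SHLD variant in the
keccak patch above): SHLD shifts the destination left by n and fills
the vacated low bits from the source operand, so using the same
register as both source and destination yields a left rotate.

#include <stdint.h>

/* "shldq $n, %r, %r" computes (x << n) | (x >> (64 - n)).  The "J"
 * constraint requires n to be a compile-time constant in 0..63,
 * which holds for the fixed rotation amounts in these hash functions. */
#define ROL64_SHLD(x, n) ({                    \
    uint64_t tmp_ = (x);                       \
    asm ("shldq %1, %0, %0"                    \
         : "+r" (tmp_)                         \
         : "J" ((n) & 63)                      \
         : "cc");                              \
    tmp_; })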
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
diff --git a/cipher/sha1.c b/cipher/sha1.c
index eb42883..554d55c 100644
--- a/cipher/sha1.c
+++ b/cipher/sha1.c
@@ -136,7 +136,7 @@ sha1_init (void *context, unsigned int flags)
#ifdef USE_AVX
/* AVX implementation uses SHLD which is known to be slow on non-Intel CPUs.
* Therefore use this implementation on Intel CPUs only. */
- hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_BMI2
hd->use_bmi2 = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_BMI2);
diff --git a/cipher/sha256.c b/cipher/sha256.c
index 59ffa43..63869d5 100644
--- a/cipher/sha256.c
+++ b/cipher/sha256.c
@@ -124,7 +124,7 @@ sha256_init (void *context, unsigned int flags)
#ifdef USE_AVX
/* AVX implementation uses SHLD which is known to be slow on non-Intel CPUs.
* Therefore use this implementation on Intel CPUs only. */
- hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
@@ -162,7 +162,7 @@ sha224_init (void *context, unsigned int flags)
#ifdef USE_AVX
/* AVX implementation uses SHLD which is known to be slow on non-Intel CPUs.
* Therefore use this implementation on Intel CPUs only. */
- hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
diff --git a/cipher/sha512.c b/cipher/sha512.c
index 029f8f0..4be1cab 100644
--- a/cipher/sha512.c
+++ b/cipher/sha512.c
@@ -154,7 +154,7 @@ sha512_init (void *context, unsigned int flags)
ctx->use_ssse3 = (features & HWF_INTEL_SSSE3) != 0;
#endif
#ifdef USE_AVX
- ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
ctx->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
@@ -194,7 +194,7 @@ sha384_init (void *context, unsigned int flags)
ctx->use_ssse3 = (features & HWF_INTEL_SSSE3) != 0;
#endif
#ifdef USE_AVX
- ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
ctx->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
diff --git a/src/g10lib.h b/src/g10lib.h
index d1f9426..a579e94 100644
--- a/src/g10lib.h
+++ b/src/g10lib.h
@@ -197,16 +197,17 @@ int _gcry_log_verbosity( int level );
#define HWF_PADLOCK_SHA 4
#define HWF_PADLOCK_MMUL 8
-#define HWF_INTEL_CPU 16
-#define HWF_INTEL_BMI2 32
-#define HWF_INTEL_SSSE3 64
-#define HWF_INTEL_PCLMUL 128
-#define HWF_INTEL_AESNI 256
-#define HWF_INTEL_RDRAND 512
-#define HWF_INTEL_AVX 1024
-#define HWF_INTEL_AVX2 2048
-
-#define HWF_ARM_NEON 4096
+#define HWF_INTEL_CPU 16
+#define HWF_INTEL_FAST_SHLD 32
+#define HWF_INTEL_BMI2 64
+#define HWF_INTEL_SSSE3 128
+#define HWF_INTEL_PCLMUL 256
+#define HWF_INTEL_AESNI 512
+#define HWF_INTEL_RDRAND 1024
+#define HWF_INTEL_AVX 2048
+#define HWF_INTEL_AVX2 4096
+
+#define HWF_ARM_NEON 8192
gpg_err_code_t _gcry_disable_hw_feature (const char *name);
diff --git a/src/hwf-x86.c b/src/hwf-x86.c
index 399952c..fbd6331 100644
--- a/src/hwf-x86.c
+++ b/src/hwf-x86.c
@@ -174,6 +174,7 @@ detect_x86_gnuc (void)
unsigned int features;
unsigned int os_supports_avx_avx2_registers = 0;
unsigned int max_cpuid_level;
+ unsigned int fms, family, model;
unsigned int result = 0;
(void)os_supports_avx_avx2_registers;
@@ -236,8 +237,37 @@ detect_x86_gnuc (void)
/* Detect Intel features, that might also be supported by other
vendors. */
- /* Get CPU info and Intel feature flags (ECX). */
- get_cpuid(1, NULL, NULL, &features, NULL);
+ /* Get CPU family/model/stepping (EAX) and Intel feature flags (ECX). */
+ get_cpuid(1, &fms, NULL, &features, NULL);
+
+ family = ((fms & 0xf00) >> 8) + ((fms & 0xff00000) >> 20);
+ model = ((fms & 0xf0) >> 4) + ((fms & 0xf0000) >> 12);
+
+ if ((result & HWF_INTEL_CPU) && family == 6)
+ {
+ /* These Intel Core processor models have a SHLD/SHRD instruction that
+ * can do integer rotation faster than the actual ROL/ROR instructions. */
+ switch (model)
+ {
+ case 0x2A:
+ case 0x2D:
+ case 0x3A:
+ case 0x3C:
+ case 0x3F:
+ case 0x45:
+ case 0x46:
+ case 0x3D:
+ case 0x4F:
+ case 0x56:
+ case 0x47:
+ case 0x4E:
+ case 0x5E:
+ case 0x55:
+ case 0x66:
+ result |= HWF_INTEL_FAST_SHLD;
+ break;
+ }
+ }
#ifdef ENABLE_PCLMUL_SUPPORT
/* Test bit 1 for PCLMUL. */
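For readers unfamiliar with the CPUID encoding, the family/model
computation above combines the base and extended fields of leaf 1 EAX.
A minimal standalone sketch of the same decoding (decode_fms is an
illustrative helper, not libgcrypt code):

  #include <stdint.h>

  /* CPUID leaf 1, EAX layout:
   *   bits  3:0   stepping
   *   bits  7:4   base model
   *   bits 11:8   base family
   *   bits 19:16  extended model (high nibble of the model)
   *   bits 27:20  extended family (added to the base family)
   */
  static void
  decode_fms (uint32_t fms, unsigned int *family, unsigned int *model)
  {
    *family = ((fms >> 8) & 0xf) + ((fms >> 20) & 0xff);
    *model  = ((fms >> 4) & 0xf) + (((fms >> 16) & 0xf) << 4);
  }

For example, a Haswell Core i5-4570 reports EAX = 0x000306C3, which
decodes to family 6, model 0x3C -- one of the models listed in the
switch above.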
diff --git a/src/hwfeatures.c b/src/hwfeatures.c
index 58099c4..e7c55cc 100644
--- a/src/hwfeatures.c
+++ b/src/hwfeatures.c
@@ -42,19 +42,20 @@ static struct
const char *desc;
} hwflist[] =
{
- { HWF_PADLOCK_RNG, "padlock-rng" },
- { HWF_PADLOCK_AES, "padlock-aes" },
- { HWF_PADLOCK_SHA, "padlock-sha" },
- { HWF_PADLOCK_MMUL,"padlock-mmul"},
- { HWF_INTEL_CPU, "intel-cpu" },
- { HWF_INTEL_BMI2, "intel-bmi2" },
- { HWF_INTEL_SSSE3, "intel-ssse3" },
- { HWF_INTEL_PCLMUL,"intel-pclmul" },
- { HWF_INTEL_AESNI, "intel-aesni" },
- { HWF_INTEL_RDRAND,"intel-rdrand" },
- { HWF_INTEL_AVX, "intel-avx" },
- { HWF_INTEL_AVX2, "intel-avx2" },
- { HWF_ARM_NEON, "arm-neon" }
+ { HWF_PADLOCK_RNG, "padlock-rng" },
+ { HWF_PADLOCK_AES, "padlock-aes" },
+ { HWF_PADLOCK_SHA, "padlock-sha" },
+ { HWF_PADLOCK_MMUL, "padlock-mmul"},
+ { HWF_INTEL_CPU, "intel-cpu" },
+ { HWF_INTEL_FAST_SHLD, "intel-fast-shld" },
+ { HWF_INTEL_BMI2, "intel-bmi2" },
+ { HWF_INTEL_SSSE3, "intel-ssse3" },
+ { HWF_INTEL_PCLMUL, "intel-pclmul" },
+ { HWF_INTEL_AESNI, "intel-aesni" },
+ { HWF_INTEL_RDRAND, "intel-rdrand" },
+ { HWF_INTEL_AVX, "intel-avx" },
+ { HWF_INTEL_AVX2, "intel-avx2" },
+ { HWF_ARM_NEON, "arm-neon" }
};
/* A bit vector with the hardware features which shall not be used.
commit 16fd540f4d01eb6dc23d9509ae549353617c7a67
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Sat Oct 24 12:41:23 2015 +0300
Fix OCB amd64 assembly implementations for x32
* cipher/camellia-glue.c (_gcry_camellia_aesni_avx_ocb_enc)
(_gcry_camellia_aesni_avx_ocb_dec, _gcry_camellia_aesni_avx_ocb_auth)
(_gcry_camellia_aesni_avx2_ocb_enc, _gcry_camellia_aesni_avx2_ocb_dec)
(_gcry_camellia_aesni_avx2_ocb_auth, _gcry_camellia_ocb_crypt)
(_gcry_camellia_ocb_auth): Change 'Ls' from pointer array to u64 array.
* cipher/serpent.c (_gcry_serpent_sse2_ocb_enc)
(_gcry_serpent_sse2_ocb_dec, _gcry_serpent_sse2_ocb_auth)
(_gcry_serpent_avx2_ocb_enc, _gcry_serpent_avx2_ocb_dec)
(_gcry_serpent_ocb_crypt, _gcry_serpent_ocb_auth): Ditto.
* cipher/twofish.c (_gcry_twofish_amd64_ocb_enc)
(_gcry_twofish_amd64_ocb_dec, _gcry_twofish_amd64_ocb_auth)
(twofish_amd64_ocb_enc, twofish_amd64_ocb_dec, twofish_amd64_ocb_auth)
(_gcry_twofish_ocb_crypt, _gcry_twofish_ocb_auth): Ditto.
--
Pointers on x32 are 32-bit, but the amd64 assembly implementations
expect 64-bit pointers. Pass the 'Ls' array as 64-bit integers so
that the input array has the correct format for the assembly functions.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
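The recurring cast in the hunks below widens each pointer through
uintptr_t into a u64 slot: on x32, sizeof(void *) is 4, while the
assembly loads 8 bytes per table entry, so the value must be
zero-extended. A minimal sketch of the pattern (fill_ls is
illustrative; u64 stands for libgcrypt's 64-bit type):

  #include <stdint.h>

  typedef uint64_t u64;

  /* Store pointers zero-extended to 64 bits, as the amd64 assembly
   * expects, regardless of whether the C pointer is 4 or 8 bytes. */
  static void
  fill_ls (u64 Ls[3], const void *l0, const void *l1, const void *l2)
  {
    Ls[0] = (uintptr_t)l0;
    Ls[1] = (uintptr_t)l1;
    Ls[2] = (uintptr_t)l2;
  }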
diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c
index dee0169..dfddb4a 100644
--- a/cipher/camellia-glue.c
+++ b/cipher/camellia-glue.c
@@ -141,20 +141,20 @@ extern void _gcry_camellia_aesni_avx_ocb_enc(CAMELLIA_context *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx_ocb_dec(CAMELLIA_context *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx_ocb_auth(CAMELLIA_context *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx_keygen(CAMELLIA_context *ctx,
const unsigned char *key,
@@ -185,20 +185,20 @@ extern void _gcry_camellia_aesni_avx2_ocb_enc(CAMELLIA_context *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[32]) ASM_FUNC_ABI;
+ const u64 Ls[32]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx2_ocb_dec(CAMELLIA_context *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[32]) ASM_FUNC_ABI;
+ const u64 Ls[32]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx2_ocb_auth(CAMELLIA_context *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[32]) ASM_FUNC_ABI;
+ const u64 Ls[32]) ASM_FUNC_ABI;
#endif
static const char *selftest(void);
@@ -630,27 +630,29 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
if (ctx->use_aesni_avx2)
{
int did_use_aesni_avx2 = 0;
- const void *Ls[32];
+ u64 Ls[32];
unsigned int n = 32 - (blkn % 32);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 32)
{
for (i = 0; i < 32; i += 8)
{
- Ls[(i + 0 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 32] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 32] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 32] = c->u_mode.ocb.L[3];
- Ls[(15 + n) % 32] = c->u_mode.ocb.L[4];
- Ls[(23 + n) % 32] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+ Ls[(15 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[4];
+ Ls[(23 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(31 + n) % 32];
/* Process data in 32 block chunks. */
@@ -658,7 +660,7 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 32;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 32);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 32);
if (encrypt)
_gcry_camellia_aesni_avx2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -691,25 +693,27 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
if (ctx->use_aesni_avx)
{
int did_use_aesni_avx = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -717,7 +721,7 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
if (encrypt)
_gcry_camellia_aesni_avx_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -780,27 +784,29 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
if (ctx->use_aesni_avx2)
{
int did_use_aesni_avx2 = 0;
- const void *Ls[32];
+ u64 Ls[32];
unsigned int n = 32 - (blkn % 32);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 32)
{
for (i = 0; i < 32; i += 8)
{
- Ls[(i + 0 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 32] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 32] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 32] = c->u_mode.ocb.L[3];
- Ls[(15 + n) % 32] = c->u_mode.ocb.L[4];
- Ls[(23 + n) % 32] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+ Ls[(15 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[4];
+ Ls[(23 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(31 + n) % 32];
/* Process data in 32 block chunks. */
@@ -808,7 +814,7 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 32;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 32);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 32);
_gcry_camellia_aesni_avx2_ocb_auth(ctx, abuf,
c->u_mode.ocb.aad_offset,
@@ -837,25 +843,27 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
if (ctx->use_aesni_avx)
{
int did_use_aesni_avx = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -863,7 +871,7 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
_gcry_camellia_aesni_avx_ocb_auth(ctx, abuf,
c->u_mode.ocb.aad_offset,
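The fixed fill pattern above -- L[0], L[1], L[0], L[2], L[0], L[1],
L[0], then L[3] (and L[4] at position 15) -- is the OCB offset
schedule: block i (counting from 1) uses L[ntz(i)], where ntz is the
number of trailing zero bits, and only every 65536th block needs a
value outside the precomputed table (hence l_tmp). A small standalone
check of that correspondence (assumes a GCC/Clang builtin; the helper
names are illustrative):

  #include <assert.h>

  /* OCB uses L[ntz(i)] for the i-th block (1-based). */
  static int ntz (unsigned int x) { return __builtin_ctz (x); }

  int main (void)
  {
    /* The first eight block numbers give L[0],L[1],L[0],L[2],
     * L[0],L[1],L[0],L[3] -- the same pattern the loops unroll. */
    static const int expected[8] = { 0, 1, 0, 2, 0, 1, 0, 3 };
    unsigned int i;
    for (i = 1; i <= 8; i++)
      assert (ntz (i) == expected[i - 1]);
    return 0;
  }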
diff --git a/cipher/serpent.c b/cipher/serpent.c
index fc3afa6..4ef7f52 100644
--- a/cipher/serpent.c
+++ b/cipher/serpent.c
@@ -125,20 +125,20 @@ extern void _gcry_serpent_sse2_ocb_enc(serpent_context_t *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[8]) ASM_FUNC_ABI;
+ const u64 Ls[8]) ASM_FUNC_ABI;
extern void _gcry_serpent_sse2_ocb_dec(serpent_context_t *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[8]) ASM_FUNC_ABI;
+ const u64 Ls[8]) ASM_FUNC_ABI;
extern void _gcry_serpent_sse2_ocb_auth(serpent_context_t *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[8]) ASM_FUNC_ABI;
+ const u64 Ls[8]) ASM_FUNC_ABI;
#endif
#ifdef USE_AVX2
@@ -165,20 +165,20 @@ extern void _gcry_serpent_avx2_ocb_enc(serpent_context_t *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_serpent_avx2_ocb_dec(serpent_context_t *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_serpent_avx2_ocb_auth(serpent_context_t *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
#endif
#ifdef USE_NEON
@@ -1249,25 +1249,27 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
if (ctx->use_avx2)
{
int did_use_avx2 = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -1275,7 +1277,7 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
if (encrypt)
_gcry_serpent_avx2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -1305,19 +1307,21 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
#ifdef USE_SSE2
{
int did_use_sse2 = 0;
- const void *Ls[8];
+ u64 Ls[8];
unsigned int n = 8 - (blkn % 8);
- const void **l;
+ u64 *l;
if (nblocks >= 8)
{
- Ls[(0 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(1 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(2 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(3 + n) % 8] = c->u_mode.ocb.L[2];
- Ls[(4 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(5 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(6 + n) % 8] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
l = &Ls[(7 + n) % 8];
/* Process data in 8 block chunks. */
@@ -1325,7 +1329,7 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 8;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 8);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 8);
if (encrypt)
_gcry_serpent_sse2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -1435,25 +1439,27 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
if (ctx->use_avx2)
{
int did_use_avx2 = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -1461,7 +1467,7 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
_gcry_serpent_avx2_ocb_auth(ctx, abuf, c->u_mode.ocb.aad_offset,
c->u_mode.ocb.aad_sum, Ls);
@@ -1486,19 +1492,21 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
#ifdef USE_SSE2
{
int did_use_sse2 = 0;
- const void *Ls[8];
+ u64 Ls[8];
unsigned int n = 8 - (blkn % 8);
- const void **l;
+ u64 *l;
if (nblocks >= 8)
{
- Ls[(0 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(1 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(2 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(3 + n) % 8] = c->u_mode.ocb.L[2];
- Ls[(4 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(5 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(6 + n) % 8] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
l = &Ls[(7 + n) % 8];
/* Process data in 8 block chunks. */
@@ -1506,7 +1514,7 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 8;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 8);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 8);
_gcry_serpent_sse2_ocb_auth(ctx, abuf, c->u_mode.ocb.aad_offset,
c->u_mode.ocb.aad_sum, Ls);
diff --git a/cipher/twofish.c b/cipher/twofish.c
index 7f361c9..f6ecd67 100644
--- a/cipher/twofish.c
+++ b/cipher/twofish.c
@@ -734,15 +734,15 @@ extern void _gcry_twofish_amd64_cfb_dec(const TWOFISH_context *c, byte *out,
extern void _gcry_twofish_amd64_ocb_enc(const TWOFISH_context *ctx, byte *out,
const byte *in, byte *offset,
- byte *checksum, const void *Ls[3]);
+ byte *checksum, const u64 Ls[3]);
extern void _gcry_twofish_amd64_ocb_dec(const TWOFISH_context *ctx, byte *out,
const byte *in, byte *offset,
- byte *checksum, const void *Ls[3]);
+ byte *checksum, const u64 Ls[3]);
extern void _gcry_twofish_amd64_ocb_auth(const TWOFISH_context *ctx,
const byte *abuf, byte *offset,
- byte *checksum, const void *Ls[3]);
+ byte *checksum, const u64 Ls[3]);
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
static inline void
@@ -854,7 +854,7 @@ twofish_amd64_cfb_dec(const TWOFISH_context *c, byte *out, const byte *in,
static inline void
twofish_amd64_ocb_enc(const TWOFISH_context *ctx, byte *out, const byte *in,
- byte *offset, byte *checksum, const void *Ls[3])
+ byte *offset, byte *checksum, const u64 Ls[3])
{
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
call_sysv_fn6(_gcry_twofish_amd64_ocb_enc, ctx, out, in, offset, checksum, Ls);
@@ -865,7 +865,7 @@ twofish_amd64_ocb_enc(const TWOFISH_context *ctx, byte *out, const byte *in,
static inline void
twofish_amd64_ocb_dec(const TWOFISH_context *ctx, byte *out, const byte *in,
- byte *offset, byte *checksum, const void *Ls[3])
+ byte *offset, byte *checksum, const u64 Ls[3])
{
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
call_sysv_fn6(_gcry_twofish_amd64_ocb_dec, ctx, out, in, offset, checksum, Ls);
@@ -876,7 +876,7 @@ twofish_amd64_ocb_dec(const TWOFISH_context *ctx, byte *out, const byte *in,
static inline void
twofish_amd64_ocb_auth(const TWOFISH_context *ctx, const byte *abuf,
- byte *offset, byte *checksum, const void *Ls[3])
+ byte *offset, byte *checksum, const u64 Ls[3])
{
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
call_sysv_fn5(_gcry_twofish_amd64_ocb_auth, ctx, abuf, offset, checksum, Ls);
@@ -1261,15 +1261,17 @@ _gcry_twofish_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
u64 blkn = c->u_mode.ocb.data_nblocks;
{
- const void *Ls[3];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ u64 Ls[3];
/* Process data in 3 block chunks. */
while (nblocks >= 3)
{
/* l_tmp will be used only every 65536-th block. */
- Ls[0] = ocb_get_l(c, l_tmp, blkn + 1);
- Ls[1] = ocb_get_l(c, l_tmp, blkn + 2);
- Ls[2] = ocb_get_l(c, l_tmp, blkn + 3);
+ Ls[0] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 1);
+ Ls[1] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 2);
+ Ls[2] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 3);
blkn += 3;
if (encrypt)
@@ -1320,15 +1322,17 @@ _gcry_twofish_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
u64 blkn = c->u_mode.ocb.aad_nblocks;
{
- const void *Ls[3];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ u64 Ls[3];
/* Process data in 3 block chunks. */
while (nblocks >= 3)
{
/* l_tmp will be used only every 65536-th block. */
- Ls[0] = ocb_get_l(c, l_tmp, blkn + 1);
- Ls[1] = ocb_get_l(c, l_tmp, blkn + 2);
- Ls[2] = ocb_get_l(c, l_tmp, blkn + 3);
+ Ls[0] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 1);
+ Ls[1] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 2);
+ Ls[2] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 3);
blkn += 3;
twofish_amd64_ocb_auth(ctx, abuf, c->u_mode.ocb.aad_offset,
commit ae40af427fd2a856b24ec2a41323ec8b80ffc9c0
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Fri Oct 23 22:24:47 2015 +0300
bench-slope: add KDF/PBKDF2 benchmark
* tests/bench-slope.c (bench_kdf_mode, bench_kdf_init, bench_kdf_free)
(bench_kdf_do_bench, kdf_ops, kdf_bench_one, kdf_bench): New.
(print_help): Add 'kdf'.
(main): Add KDF benchmarks.
--
Introduce KDF benchmarking to bench-slope. Output is given as
nanosecs/iter (and cycles/iter if --cpu-mhz is used). Only PBKDF2
is supported with this initial patch.
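(The cycles/iter column is simply ns/iter multiplied by the CPU clock
in GHz: e.g. 882.4 ns × 3.201 GHz ≈ 2825 c/iter, matching the
PBKDF2-HMAC-MD5 row of the first table below up to rounding.)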
For example, below is the output of KDF bench-slope before
and after the commit "md: keep contexts for HMAC in GcryDigestEntry",
on an Intel Core i5-4570 @ 3.2 GHz:
Before:
$ tests/bench-slope --cpu-mhz 3201 kdf
KDF:
| nanosecs/iter cycles/iter
PBKDF2-HMAC-MD5 | 882.4 2824.7
PBKDF2-HMAC-SHA1 | 832.6 2665.0
PBKDF2-HMAC-RIPEMD160 | 1148.3 3675.6
PBKDF2-HMAC-TIGER192 | 1339.6 4288.2
PBKDF2-HMAC-SHA256 | 1460.5 4675.1
PBKDF2-HMAC-SHA384 | 1723.2 5515.8
PBKDF2-HMAC-SHA512 | 1729.1 5534.7
PBKDF2-HMAC-SHA224 | 1424.0 4558.3
PBKDF2-HMAC-WHIRLPOOL | 2459.7 7873.5
PBKDF2-HMAC-TIGER | 1350.2 4322.1
PBKDF2-HMAC-TIGER2 | 1348.7 4317.3
PBKDF2-HMAC-GOSTR3411_94 | 7374.1 23604.4
PBKDF2-HMAC-STRIBOG256 | 6060.0 19398.1
PBKDF2-HMAC-STRIBOG512 | 7512.8 24048.3
PBKDF2-HMAC-GOSTR3411_CP | 7378.3 23618.0
PBKDF2-HMAC-SHA3-224 | 2789.6 8929.5
PBKDF2-HMAC-SHA3-256 | 2785.1 8915.0
PBKDF2-HMAC-SHA3-384 | 2955.5 9460.5
PBKDF2-HMAC-SHA3-512 | 2859.7 9153.9
=
After:
$ tests/bench-slope --cpu-mhz 3201 kdf
KDF:
| nanosecs/iter cycles/iter
PBKDF2-HMAC-MD5 | 405.9 1299.2
PBKDF2-HMAC-SHA1 | 392.1 1255.0
PBKDF2-HMAC-RIPEMD160 | 540.9 1731.5
PBKDF2-HMAC-TIGER192 | 637.1 2039.4
PBKDF2-HMAC-SHA256 | 691.8 2214.3
PBKDF2-HMAC-SHA384 | 848.0 2714.3
PBKDF2-HMAC-SHA512 | 875.7 2803.1
PBKDF2-HMAC-SHA224 | 689.2 2206.0
PBKDF2-HMAC-WHIRLPOOL | 1535.6 4915.5
PBKDF2-HMAC-TIGER | 636.3 2036.7
PBKDF2-HMAC-TIGER2 | 636.6 2037.7
PBKDF2-HMAC-GOSTR3411_94 | 5311.5 17002.2
PBKDF2-HMAC-STRIBOG256 | 4308.0 13790.0
PBKDF2-HMAC-STRIBOG512 | 5767.4 18461.4
PBKDF2-HMAC-GOSTR3411_CP | 5309.4 16995.4
PBKDF2-HMAC-SHA3-224 | 1333.1 4267.2
PBKDF2-HMAC-SHA3-256 | 1327.8 4250.4
PBKDF2-HMAC-SHA3-384 | 1392.8 4458.3
PBKDF2-HMAC-SHA3-512 | 1428.5 4572.7
=
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
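The call being timed is gcry_kdf_derive; a minimal standalone sketch
of the benchmarked operation (the passphrase and salt mirror the
values hard-coded in bench_kdf_do_bench below, where buflen serves as
the iteration count):

  #include <gcrypt.h>
  #include <stdio.h>

  int main (void)
  {
    char keybuf[16];
    gcry_error_t err;

    /* PBKDF2 with HMAC-SHA1: 1000 iterations, 16-byte derived key. */
    err = gcry_kdf_derive ("qwerty", 6, GCRY_KDF_PBKDF2, GCRY_MD_SHA1,
                           "01234567", 8, 1000, sizeof keybuf, keybuf);
    if (err)
      fprintf (stderr, "gcry_kdf_derive: %s\n", gcry_strerror (err));
    return 0;
  }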
diff --git a/tests/bench-slope.c b/tests/bench-slope.c
index 394d7fc..2679556 100644
--- a/tests/bench-slope.c
+++ b/tests/bench-slope.c
@@ -1571,13 +1571,176 @@ mac_bench (char **argv, int argc)
}
+/************************************************************ KDF benchmarks. */
+
+struct bench_kdf_mode
+{
+ struct bench_ops *ops;
+
+ int algo;
+ int subalgo;
+};
+
+
+static int
+bench_kdf_init (struct bench_obj *obj)
+{
+ struct bench_kdf_mode *mode = obj->priv;
+
+ if (mode->algo == GCRY_KDF_PBKDF2)
+ {
+ obj->min_bufsize = 2;
+ obj->max_bufsize = 2 * 32;
+ obj->step_size = 2;
+ }
+
+ obj->num_measure_repetitions = num_measurement_repetitions;
+
+ return 0;
+}
+
+static void
+bench_kdf_free (struct bench_obj *obj)
+{
+ (void)obj;
+}
+
+static void
+bench_kdf_do_bench (struct bench_obj *obj, void *buf, size_t buflen)
+{
+ struct bench_kdf_mode *mode = obj->priv;
+ char keybuf[16];
+
+ (void)buf;
+
+ if (mode->algo == GCRY_KDF_PBKDF2)
+ {
+ gcry_kdf_derive("qwerty", 6, mode->algo, mode->subalgo, "01234567", 8,
+ buflen, sizeof(keybuf), keybuf);
+ }
+}
+
+static struct bench_ops kdf_ops = {
+ &bench_kdf_init,
+ &bench_kdf_free,
+ &bench_kdf_do_bench
+};
+
+
+static void
+kdf_bench_one (int algo, int subalgo)
+{
+ struct bench_kdf_mode mode = { &kdf_ops };
+ struct bench_obj obj = { 0 };
+ double nsecs_per_iteration;
+ double cycles_per_iteration;
+ char algo_name[32];
+ char nsecpiter_buf[16];
+ char cpiter_buf[16];
+
+ mode.algo = algo;
+ mode.subalgo = subalgo;
+
+ switch (subalgo)
+ {
+ case GCRY_MD_CRC32:
+ case GCRY_MD_CRC32_RFC1510:
+ case GCRY_MD_CRC24_RFC2440:
+ case GCRY_MD_MD4:
+ /* Skip CRC32s. */
+ return;
+ }
+
+ *algo_name = 0;
+
+ if (algo == GCRY_KDF_PBKDF2)
+ {
+ snprintf (algo_name, sizeof(algo_name), "PBKDF2-HMAC-%s",
+ gcry_md_algo_name (subalgo));
+ }
+
+ bench_print_algo (-24, algo_name);
+
+ obj.ops = mode.ops;
+ obj.priv = &mode;
+
+ nsecs_per_iteration = do_slope_benchmark (&obj);
+
+ strcpy(cpiter_buf, csv_mode ? "" : "-");
+
+ double_to_str (nsecpiter_buf, sizeof (nsecpiter_buf), nsecs_per_iteration);
+
+ /* If user didn't provide CPU speed, we cannot show cycles/iter results. */
+ if (cpu_ghz > 0.0)
+ {
+ cycles_per_iteration = nsecs_per_iteration * cpu_ghz;
+ double_to_str (cpiter_buf, sizeof (cpiter_buf), cycles_per_iteration);
+ }
+
+ if (csv_mode)
+ {
+ printf ("%s,%s,%s,,,,,,,,,%s,ns/iter,%s,c/iter\n",
+ current_section_name,
+ current_algo_name ? current_algo_name : "",
+ current_mode_name ? current_mode_name : "",
+ nsecpiter_buf,
+ cpiter_buf);
+ }
+ else
+ {
+ printf ("%14s %13s\n", nsecpiter_buf, cpiter_buf);
+ }
+}
+
+void
+kdf_bench (char **argv, int argc)
+{
+ char algo_name[32];
+ int i, j;
+
+ bench_print_section ("kdf", "KDF");
+
+ if (!csv_mode)
+ {
+ printf (" %-*s | ", 24, "");
+ printf ("%14s %13s\n", "nanosecs/iter", "cycles/iter");
+ }
+
+ if (argv && argc)
+ {
+ for (i = 0; i < argc; i++)
+ {
+ for (j = 1; j < 400; j++)
+ {
+ if (gcry_md_test_algo (j))
+ continue;
+
+ snprintf (algo_name, sizeof(algo_name), "PBKDF2-HMAC-%s",
+ gcry_md_algo_name (j));
+
+ if (!strcmp(argv[i], algo_name))
+ kdf_bench_one (GCRY_KDF_PBKDF2, j);
+ }
+ }
+ }
+ else
+ {
+ for (i = 1; i < 400; i++)
+ if (!gcry_md_test_algo (i))
+ kdf_bench_one (GCRY_KDF_PBKDF2, i);
+ }
+
+ bench_print_footer (24);
+}
+
+
/************************************************************** Main program. */
void
print_help (void)
{
static const char *help_lines[] = {
- "usage: bench-slope [options] [hash|mac|cipher [algonames]]",
+ "usage: bench-slope [options] [hash|mac|cipher|kdf [algonames]]",
"",
" options:",
" --cpu-mhz <mhz> Set CPU speed for calculating cycles",
@@ -1744,6 +1907,7 @@ main (int argc, char **argv)
hash_bench (NULL, 0);
mac_bench (NULL, 0);
cipher_bench (NULL, 0);
+ kdf_bench (NULL, 0);
}
else if (!strcmp (*argv, "hash"))
{
@@ -1769,6 +1933,14 @@ main (int argc, char **argv)
warm_up_cpu ();
cipher_bench ((argc == 0) ? NULL : argv, argc);
}
+ else if (!strcmp (*argv, "kdf"))
+ {
+ argc--;
+ argv++;
+
+ warm_up_cpu ();
+ kdf_bench ((argc == 0) ? NULL : argv, argc);
+ }
else
{
fprintf (stderr, PGM ": unknown argument: %s\n", *argv);
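With the new mode in place, individual algorithms can also be
benchmarked by name; an illustrative invocation (output format as in
the tables above):

  $ tests/bench-slope --cpu-mhz 3201 kdf PBKDF2-HMAC-SHA256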
-----------------------------------------------------------------------
Summary of changes:
cipher/Makefile.am | 2 +-
cipher/camellia-glue.c | 116 ++++---
cipher/hash-common.h | 12 +-
cipher/keccak.c | 808 ++++++++++++++++++++++++++++++++-------------
cipher/keccak_permute_32.h | 535 ++++++++++++++++++++++++++++++
cipher/keccak_permute_64.h | 290 ++++++++++++++++
cipher/serpent.c | 104 +++---
cipher/sha1.c | 2 +-
cipher/sha256.c | 4 +-
cipher/sha512.c | 4 +-
cipher/twofish.c | 32 +-
src/g10lib.h | 21 +-
src/hwf-x86.c | 34 +-
src/hwfeatures.c | 27 +-
tests/bench-slope.c | 174 +++++++++-
15 files changed, 1775 insertions(+), 390 deletions(-)
create mode 100644 cipher/keccak_permute_32.h
create mode 100644 cipher/keccak_permute_64.h
hooks/post-receive
--
The GNU crypto library
http://git.gnupg.org