[git] GCRYPT - branch, master, updated. libgcrypt-1.6.0-271-g74184c2
by Jussi Kivilinna
cvs at cvs.gnupg.org
Wed Oct 28 19:11:30 CET 2015
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "The GNU crypto library".
The branch, master has been updated
via 74184c28fbe7ff58cf57f0094ef957d94045da7d (commit)
via 909644ef5883927262366c356eed530e55aba478 (commit)
via 16fd540f4d01eb6dc23d9509ae549353617c7a67 (commit)
via ae40af427fd2a856b24ec2a41323ec8b80ffc9c0 (commit)
from f7505b550dd591e33d3a3fab9277c43c460f1bad (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
commit 74184c28fbe7ff58cf57f0094ef957d94045da7d
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Fri Oct 23 22:30:48 2015 +0300
keccak: rewrite for improved performance
* cipher/Makefile.am: Add 'keccak_permute_32.h' and
'keccak_permute_64.h'.
* cipher/hash-common.h [USE_SHA3] (MD_BLOCK_MAX_BLOCKSIZE): Remove.
* cipher/keccak.c (USE_64BIT, USE_32BIT, USE_64BIT_BMI2)
(USE_64BIT_SHLD, USE_32BIT_BMI2, NEED_COMMON64, NEED_COMMON32BI)
(keccak_ops_t): New.
(KECCAK_STATE): Add 'state64' and 'state32bi' members.
(KECCAK_CONTEXT): Remove 'bctx'; add 'blocksize', 'count' and 'ops'.
(rol64, keccak_f1600_state_permute): Remove.
[NEED_COMMON64] (round_consts_64bit, keccak_extract_inplace64): New.
[NEED_COMMON32BI] (round_consts_32bit, keccak_extract_inplace32bi)
(keccak_absorb_lane32bi): New.
[USE_64BIT] (ANDN64, ROL64, keccak_f1600_state_permute64)
(keccak_absorb_lanes64, keccak_generic64_ops): New.
[USE_64BIT_SHLD] (ANDN64, ROL64, keccak_f1600_state_permute64_shld)
(keccak_absorb_lanes64_shld, keccak_shld_64_ops): New.
[USE_64BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute64_bmi2)
(keccak_absorb_lanes64_bmi2, keccak_bmi2_64_ops): New.
[USE_32BIT] (ANDN64, ROL64, keccak_f1600_state_permute32bi)
(keccak_absorb_lanes32bi, keccak_generic32bi_ops): New.
[USE_32BIT_BMI2] (ANDN64, ROL64, keccak_f1600_state_permute32bi_bmi2)
(pext, pdep, keccak_absorb_lane32bi_bmi2, keccak_absorb_lanes32bi_bmi2)
(keccak_extract_inplace32bi_bmi2, keccak_bmi2_32bi_ops): New.
(keccak_write): New.
(keccak_init): Adjust to KECCAK_CONTEXT changes; add implementation
selection based on HWF features.
(keccak_final): Adjust to KECCAK_CONTEXT changes; use selected 'ops'
for state manipulation.
(keccak_read): Adjust to KECCAK_CONTEXT changes.
(_gcry_digest_spec_sha3_224, _gcry_digest_spec_sha3_256)
(_gcry_digest_spec_sha3_384, _gcry_digest_spec_sha3_512): Use
'keccak_write' instead of '_gcry_md_block_write'.
* cipher/keccak_permute_32.h: New.
* cipher/keccak_permute_64.h: New.
--
This patch adds new generic 64-bit and 32-bit implementations of
SHA3, along with optimized variants:
- Generic 64-bit implementation based on the 'simple' implementation
from the SUPERCOP package.
- Generic 32-bit bit-interleaved implementation based on the
'simple32bi' implementation from the SUPERCOP package (see the
sketch after this list).
- Intel BMI2 optimized variants of the 64-bit and 32-bit BI
implementations.
- Intel SHLD optimized variant of the 64-bit implementation.
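As a reference for the bit-interleaved representation mentioned in the
list above, here is a sketch (not part of the patch): a 64-bit lane is
stored as two 32-bit words holding its even- and odd-indexed bits, so a
64-bit rotate by 2*k decomposes into two independent 32-bit rotates by
k (an odd rotate amount additionally swaps the two words). The patch
itself performs this conversion with faster delta-swap sequences (see
keccak_absorb_lane32bi in the diff below); the loop here is the naive
reference form.

#include <stdint.h>

static void
lane_to_bit_interleaved (uint64_t lane, uint32_t *even, uint32_t *odd)
{
  uint32_t e = 0, o = 0;
  unsigned int i;

  for (i = 0; i < 32; i++)
    {
      e |= (uint32_t)((lane >> (2 * i)) & 1) << i;      /* bits 0,2,4,... */
      o |= (uint32_t)((lane >> (2 * i + 1)) & 1) << i;  /* bits 1,3,5,... */
    }

  *even = e;
  *odd = o;
}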
The patch also makes proper use of the sponge construction to avoid
the need for an additional input buffer.
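The sketch below shows what buffer-less sponge absorption means here:
input lanes are XORed straight into the state, and the permutation runs
whenever a full rate block has been absorbed, so no intermediate block
buffer is needed. This is a minimal sketch with a hypothetical
permutation name; the patch's keccak_absorb_lanes64 has the same shape.

#include <stdint.h>
#include <string.h>

/* Hypothetical name; stands in for keccak_f1600_state_permute64. */
extern void permute_f1600 (uint64_t state[25]);

static void
absorb_lanes (uint64_t state[25], unsigned int *pos,
              const unsigned char *in, size_t nlanes,
              unsigned int blocklanes)
{
  uint64_t lane;

  while (nlanes--)
    {
      memcpy (&lane, in, 8);       /* assumes a little-endian host;
                                      the patch uses buf_get_le64 */
      state[*pos] ^= lane;         /* XOR lane directly into state */
      in += 8;

      if (++*pos == blocklanes)    /* full rate block absorbed */
        {
          permute_f1600 (state);
          *pos = 0;
        }
    }
}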
Below are bench-slope benchmarks for the new 64-bit implementations,
measured on an Intel Core i5-4570 (no turbo, 3.2 GHz, gcc-4.9.2).
Before (amd64):
SHA3-224 | 3.92 ns/B 243.2 MiB/s 12.55 c/B
SHA3-256 | 4.15 ns/B 230.0 MiB/s 13.27 c/B
SHA3-384 | 5.40 ns/B 176.6 MiB/s 17.29 c/B
SHA3-512 | 7.77 ns/B 122.7 MiB/s 24.87 c/B
After (generic 64-bit, amd64, 1.10x faster):
SHA3-224 | 3.57 ns/B 267.4 MiB/s 11.42 c/B
SHA3-256 | 3.77 ns/B 252.8 MiB/s 12.07 c/B
SHA3-384 | 4.91 ns/B 194.1 MiB/s 15.72 c/B
SHA3-512 | 7.06 ns/B 135.0 MiB/s 22.61 c/B
After (Intel SHLD 64-bit, amd64, 1.13x faster):
SHA3-224 | 3.48 ns/B 273.7 MiB/s 11.15 c/B
SHA3-256 | 3.68 ns/B 258.9 MiB/s 11.79 c/B
SHA3-384 | 4.80 ns/B 198.7 MiB/s 15.36 c/B
SHA3-512 | 6.89 ns/B 138.4 MiB/s 22.05 c/B
After (Intel BMI2 64-bit, amd64, 1.45x faster):
SHA3-224 | 2.71 ns/B 352.1 MiB/s 8.67 c/B
SHA3-256 | 2.86 ns/B 333.2 MiB/s 9.16 c/B
SHA3-384 | 3.72 ns/B 256.2 MiB/s 11.91 c/B
SHA3-512 | 5.34 ns/B 178.5 MiB/s 17.10 c/B
Benchmarks of the new 32-bit implementations on Intel Core i5-4570
(no turbo, 3.2 GHz, gcc-4.9.2):
Before (win32):
SHA3-224 | 12.05 ns/B 79.16 MiB/s 38.56 c/B
SHA3-256 | 12.75 ns/B 74.78 MiB/s 40.82 c/B
SHA3-384 | 16.63 ns/B 57.36 MiB/s 53.22 c/B
SHA3-512 | 23.97 ns/B 39.79 MiB/s 76.72 c/B
After (generic 32-bit BI, win32, 1.23x to 1.29x faster):
SHA3-224 | 9.76 ns/B 97.69 MiB/s 31.25 c/B
SHA3-256 | 10.27 ns/B 92.82 MiB/s 32.89 c/B
SHA3-384 | 13.22 ns/B 72.16 MiB/s 42.31 c/B
SHA3-512 | 18.65 ns/B 51.13 MiB/s 59.70 c/B
After (Intel BMI2 32-bit BI, win32, 1.66x to 1.70x faster):
SHA3-224 | 7.26 ns/B 131.4 MiB/s 23.23 c/B
SHA3-256 | 7.65 ns/B 124.7 MiB/s 24.47 c/B
SHA3-384 | 9.87 ns/B 96.67 MiB/s 31.58 c/B
SHA3-512 | 14.05 ns/B 67.85 MiB/s 44.99 c/B
Benchmarks of the new 32-bit implementation on ARM Cortex-A8
(1008 MHz, gcc-4.9.1):
Before:
SHA3-224 | 148.6 ns/B 6.42 MiB/s 149.8 c/B
SHA3-256 | 157.2 ns/B 6.07 MiB/s 158.4 c/B
SHA3-384 | 205.3 ns/B 4.65 MiB/s 206.9 c/B
SHA3-512 | 296.3 ns/B 3.22 MiB/s 298.6 c/B
After (1.56x faster):
SHA3-224 | 96.12 ns/B 9.92 MiB/s 96.89 c/B
SHA3-256 | 101.5 ns/B 9.40 MiB/s 102.3 c/B
SHA3-384 | 131.4 ns/B 7.26 MiB/s 132.5 c/B
SHA3-512 | 188.2 ns/B 5.07 MiB/s 189.7 c/B
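All of the variants benchmarked above are selected once at init time
through a small table of function pointers stored in the hash context
(keccak_ops_t in the patch); every later absorb/permute/extract call
goes through that table. A self-contained sketch of the pattern, with
simplified, illustrative types and feature-flag values:

#include <stdint.h>

typedef struct { uint64_t lanes[25]; } sponge_state;

/* Simplified ops table; the patch's keccak_ops_t also carries
 * absorb and extract_inplace members. */
typedef struct
{
  void (*permute)(sponge_state *st);
} sponge_ops;

static void permute_generic (sponge_state *st) { (void)st; /* ... */ }
static void permute_bmi2    (sponge_state *st) { (void)st; /* ... */ }

static const sponge_ops generic_ops = { permute_generic };
static const sponge_ops bmi2_ops    = { permute_bmi2 };

#define HWF_BMI2_FLAG 1   /* illustrative feature bit */

static const sponge_ops *
select_ops (unsigned int features)
{
  const sponge_ops *ops = &generic_ops;  /* safe generic fallback */

  if (features & HWF_BMI2_FLAG)          /* override when HW allows */
    ops = &bmi2_ops;

  return ops;
}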
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index b08c9a9..be03d06 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -90,7 +90,7 @@ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S sha256-avx2-bmi2-amd64.S \
sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S sha512-avx2-bmi2-amd64.S \
sha512-armv7-neon.S \
-keccak.c \
+keccak.c keccak_permute_32.h keccak_permute_64.h \
stribog.c \
tiger.c \
whirlpool.c whirlpool-sse2-amd64.S \
diff --git a/cipher/hash-common.h b/cipher/hash-common.h
index e1ae5a2..27d670d 100644
--- a/cipher/hash-common.h
+++ b/cipher/hash-common.h
@@ -33,15 +33,9 @@ typedef unsigned int (*_gcry_md_block_write_t) (void *c,
const unsigned char *blks,
size_t nblks);
-#if defined(HAVE_U64_TYPEDEF) && (defined(USE_SHA512) || defined(USE_SHA3) || \
- defined(USE_WHIRLPOOL))
-/* SHA-512, SHA-3 and Whirlpool needs u64. SHA-512 and SHA3 need larger
- * buffer. */
-# ifdef USE_SHA3
-# define MD_BLOCK_MAX_BLOCKSIZE (1152 / 8)
-# else
-# define MD_BLOCK_MAX_BLOCKSIZE 128
-# endif
+#if defined(HAVE_U64_TYPEDEF) && (defined(USE_SHA512) || defined(USE_WHIRLPOOL))
+/* SHA-512 and Whirlpool need u64. SHA-512 needs a larger buffer. */
+# define MD_BLOCK_MAX_BLOCKSIZE 128
# define MD_NBLOCKS_TYPE u64
#else
# define MD_BLOCK_MAX_BLOCKSIZE 64
diff --git a/cipher/keccak.c b/cipher/keccak.c
index 4a9c1f2..3a72294 100644
--- a/cipher/keccak.c
+++ b/cipher/keccak.c
@@ -27,11 +27,45 @@
#include "hash-common.h"
-/* The code is based on public-domain/CC0 "Keccak-readable-and-compact.c"
- * implementation by the Keccak, Keyak and Ketje Teams, namely, Guido Bertoni,
- * Joan Daemen, Michaël Peeters, Gilles Van Assche and Ronny Van Keer. From:
- * https://github.com/gvanas/KeccakCodePackage
- */
+
+/* USE_64BIT indicates whether to use 64-bit generic implementation.
+ * USE_32BIT indicates whether to use 32-bit generic implementation. */
+#undef USE_64BIT
+#if defined(__x86_64__) || SIZEOF_UNSIGNED_LONG == 8
+# define USE_64BIT 1
+#else
+# define USE_32BIT 1
+#endif
+
+
+/* USE_64BIT_BMI2 indicates whether to compile with 64-bit Intel BMI2 code. */
+#undef USE_64BIT_BMI2
+#if defined(USE_64BIT) && defined(HAVE_GCC_INLINE_ASM_BMI2)
+# define USE_64BIT_BMI2 1
+#endif
+
+
+/* USE_64BIT_SHLD indicates whether to compile with 64-bit Intel SHLD code. */
+#undef USE_64BIT_SHLD
+#if defined(USE_64BIT) && defined (__GNUC__) && defined(__x86_64__)
+# define USE_64BIT_SHLD 1
+#endif
+
+
+/* USE_32BIT_BMI2 indicates whether to compile with 32-bit Intel BMI2 code. */
+#undef USE_32BIT_BMI2
+#if defined(USE_32BIT) && defined(HAVE_GCC_INLINE_ASM_BMI2)
+# define USE_32BIT_BMI2 1
+#endif
+
+
+#ifdef USE_64BIT
+# define NEED_COMMON64 1
+#endif
+
+#ifdef USE_32BIT
+# define NEED_COMMON32BI 1
+#endif
#define SHA3_DELIMITED_SUFFIX 0x06
@@ -40,220 +74,528 @@
typedef struct
{
- u64 state[5][5];
+ union {
+#ifdef NEED_COMMON64
+ u64 state64[25];
+#endif
+#ifdef NEED_COMMON32BI
+ u32 state32bi[50];
+#endif
+ } u;
} KECCAK_STATE;
typedef struct
{
- gcry_md_block_ctx_t bctx;
+ unsigned int (*permute)(KECCAK_STATE *hd);
+ unsigned int (*absorb)(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes);
+ unsigned int (*extract_inplace) (KECCAK_STATE *hd, unsigned int outlen);
+} keccak_ops_t;
+
+
+typedef struct KECCAK_CONTEXT_S
+{
KECCAK_STATE state;
unsigned int outlen;
+ unsigned int blocksize;
+ unsigned int count;
+ const keccak_ops_t *ops;
} KECCAK_CONTEXT;
-static inline u64
-rol64 (u64 x, unsigned int n)
+
+#ifdef NEED_COMMON64
+
+static const u64 round_consts_64bit[24] =
{
- return ((x << n) | (x >> (64 - n)));
-}
+ U64_C(0x0000000000000001), U64_C(0x0000000000008082),
+ U64_C(0x800000000000808A), U64_C(0x8000000080008000),
+ U64_C(0x000000000000808B), U64_C(0x0000000080000001),
+ U64_C(0x8000000080008081), U64_C(0x8000000000008009),
+ U64_C(0x000000000000008A), U64_C(0x0000000000000088),
+ U64_C(0x0000000080008009), U64_C(0x000000008000000A),
+ U64_C(0x000000008000808B), U64_C(0x800000000000008B),
+ U64_C(0x8000000000008089), U64_C(0x8000000000008003),
+ U64_C(0x8000000000008002), U64_C(0x8000000000000080),
+ U64_C(0x000000000000800A), U64_C(0x800000008000000A),
+ U64_C(0x8000000080008081), U64_C(0x8000000000008080),
+ U64_C(0x0000000080000001), U64_C(0x8000000080008008)
+};
-/* Function that computes the Keccak-f[1600] permutation on the given state. */
-static unsigned int keccak_f1600_state_permute(KECCAK_STATE *hd)
+static unsigned int
+keccak_extract_inplace64(KECCAK_STATE *hd, unsigned int outlen)
{
- static const u64 round_consts[24] =
- {
- U64_C(0x0000000000000001), U64_C(0x0000000000008082),
- U64_C(0x800000000000808A), U64_C(0x8000000080008000),
- U64_C(0x000000000000808B), U64_C(0x0000000080000001),
- U64_C(0x8000000080008081), U64_C(0x8000000000008009),
- U64_C(0x000000000000008A), U64_C(0x0000000000000088),
- U64_C(0x0000000080008009), U64_C(0x000000008000000A),
- U64_C(0x000000008000808B), U64_C(0x800000000000008B),
- U64_C(0x8000000000008089), U64_C(0x8000000000008003),
- U64_C(0x8000000000008002), U64_C(0x8000000000000080),
- U64_C(0x000000000000800A), U64_C(0x800000008000000A),
- U64_C(0x8000000080008081), U64_C(0x8000000000008080),
- U64_C(0x0000000080000001), U64_C(0x8000000080008008)
- };
- unsigned int round;
+ unsigned int i;
- for (round = 0; round < 24; round++)
+ for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
{
- {
- /* θ step (see [Keccak Reference, Section 2.3.2]) === */
- u64 C[5], D[5];
-
- /* Compute the parity of the columns */
- C[0] = hd->state[0][0] ^ hd->state[1][0] ^ hd->state[2][0]
- ^ hd->state[3][0] ^ hd->state[4][0];
- C[1] = hd->state[0][1] ^ hd->state[1][1] ^ hd->state[2][1]
- ^ hd->state[3][1] ^ hd->state[4][1];
- C[2] = hd->state[0][2] ^ hd->state[1][2] ^ hd->state[2][2]
- ^ hd->state[3][2] ^ hd->state[4][2];
- C[3] = hd->state[0][3] ^ hd->state[1][3] ^ hd->state[2][3]
- ^ hd->state[3][3] ^ hd->state[4][3];
- C[4] = hd->state[0][4] ^ hd->state[1][4] ^ hd->state[2][4]
- ^ hd->state[3][4] ^ hd->state[4][4];
-
- /* Compute the θ effect for a given column */
- D[0] = C[4] ^ rol64(C[1], 1);
- D[1] = C[0] ^ rol64(C[2], 1);
- D[2] = C[1] ^ rol64(C[3], 1);
- D[3] = C[2] ^ rol64(C[4], 1);
- D[4] = C[3] ^ rol64(C[0], 1);
-
- /* Add the θ effect to the whole column */
- hd->state[0][0] ^= D[0];
- hd->state[1][0] ^= D[0];
- hd->state[2][0] ^= D[0];
- hd->state[3][0] ^= D[0];
- hd->state[4][0] ^= D[0];
-
- /* Add the θ effect to the whole column */
- hd->state[0][1] ^= D[1];
- hd->state[1][1] ^= D[1];
- hd->state[2][1] ^= D[1];
- hd->state[3][1] ^= D[1];
- hd->state[4][1] ^= D[1];
-
- /* Add the θ effect to the whole column */
- hd->state[0][2] ^= D[2];
- hd->state[1][2] ^= D[2];
- hd->state[2][2] ^= D[2];
- hd->state[3][2] ^= D[2];
- hd->state[4][2] ^= D[2];
-
- /* Add the θ effect to the whole column */
- hd->state[0][3] ^= D[3];
- hd->state[1][3] ^= D[3];
- hd->state[2][3] ^= D[3];
- hd->state[3][3] ^= D[3];
- hd->state[4][3] ^= D[3];
-
- /* Add the θ effect to the whole column */
- hd->state[0][4] ^= D[4];
- hd->state[1][4] ^= D[4];
- hd->state[2][4] ^= D[4];
- hd->state[3][4] ^= D[4];
- hd->state[4][4] ^= D[4];
- }
-
- {
- /* ρ and π steps (see [Keccak Reference, Sections 2.3.3 and 2.3.4]) */
- u64 current, temp;
-
-#define do_swap_n_rol(x, y, r) \
- temp = hd->state[y][x]; \
- hd->state[y][x] = rol64(current, r); \
- current = temp;
-
- /* Start at coordinates (1 0) */
- current = hd->state[0][1];
-
- /* Iterate over ((0 1)(2 3))^t * (1 0) for 0 ≤ t ≤ 23 */
- do_swap_n_rol(0, 2, 1);
- do_swap_n_rol(2, 1, 3);
- do_swap_n_rol(1, 2, 6);
- do_swap_n_rol(2, 3, 10);
- do_swap_n_rol(3, 3, 15);
- do_swap_n_rol(3, 0, 21);
- do_swap_n_rol(0, 1, 28);
- do_swap_n_rol(1, 3, 36);
- do_swap_n_rol(3, 1, 45);
- do_swap_n_rol(1, 4, 55);
- do_swap_n_rol(4, 4, 2);
- do_swap_n_rol(4, 0, 14);
- do_swap_n_rol(0, 3, 27);
- do_swap_n_rol(3, 4, 41);
- do_swap_n_rol(4, 3, 56);
- do_swap_n_rol(3, 2, 8);
- do_swap_n_rol(2, 2, 25);
- do_swap_n_rol(2, 0, 43);
- do_swap_n_rol(0, 4, 62);
- do_swap_n_rol(4, 2, 18);
- do_swap_n_rol(2, 4, 39);
- do_swap_n_rol(4, 1, 61);
- do_swap_n_rol(1, 1, 20);
- do_swap_n_rol(1, 0, 44);
-
-#undef do_swap_n_rol
- }
-
- {
- /* χ step (see [Keccak Reference, Section 2.3.1]) */
- u64 temp[5];
-
-#define do_x_step_for_plane(y) \
- /* Take a copy of the plane */ \
- temp[0] = hd->state[y][0]; \
- temp[1] = hd->state[y][1]; \
- temp[2] = hd->state[y][2]; \
- temp[3] = hd->state[y][3]; \
- temp[4] = hd->state[y][4]; \
- \
- /* Compute χ on the plane */ \
- hd->state[y][0] = temp[0] ^ ((~temp[1]) & temp[2]); \
- hd->state[y][1] = temp[1] ^ ((~temp[2]) & temp[3]); \
- hd->state[y][2] = temp[2] ^ ((~temp[3]) & temp[4]); \
- hd->state[y][3] = temp[3] ^ ((~temp[4]) & temp[0]); \
- hd->state[y][4] = temp[4] ^ ((~temp[0]) & temp[1]);
-
- do_x_step_for_plane(0);
- do_x_step_for_plane(1);
- do_x_step_for_plane(2);
- do_x_step_for_plane(3);
- do_x_step_for_plane(4);
-
-#undef do_x_step_for_plane
- }
-
- {
- /* ι step (see [Keccak Reference, Section 2.3.5]) */
-
- hd->state[0][0] ^= round_consts[round];
- }
+ hd->u.state64[i] = le_bswap64(hd->u.state64[i]);
}
- return sizeof(void *) * 4 + sizeof(u64) * 10;
+ return 0;
}
+#endif /* NEED_COMMON64 */
+
+
+#ifdef NEED_COMMON32BI
+
+static const u32 round_consts_32bit[2 * 24] =
+{
+ 0x00000001UL, 0x00000000UL, 0x00000000UL, 0x00000089UL,
+ 0x00000000UL, 0x8000008bUL, 0x00000000UL, 0x80008080UL,
+ 0x00000001UL, 0x0000008bUL, 0x00000001UL, 0x00008000UL,
+ 0x00000001UL, 0x80008088UL, 0x00000001UL, 0x80000082UL,
+ 0x00000000UL, 0x0000000bUL, 0x00000000UL, 0x0000000aUL,
+ 0x00000001UL, 0x00008082UL, 0x00000000UL, 0x00008003UL,
+ 0x00000001UL, 0x0000808bUL, 0x00000001UL, 0x8000000bUL,
+ 0x00000001UL, 0x8000008aUL, 0x00000001UL, 0x80000081UL,
+ 0x00000000UL, 0x80000081UL, 0x00000000UL, 0x80000008UL,
+ 0x00000000UL, 0x00000083UL, 0x00000000UL, 0x80008003UL,
+ 0x00000001UL, 0x80008088UL, 0x00000000UL, 0x80000088UL,
+ 0x00000001UL, 0x00008000UL, 0x00000000UL, 0x80008082UL
+};
static unsigned int
-transform_blk (void *context, const unsigned char *data)
+keccak_extract_inplace32bi(KECCAK_STATE *hd, unsigned int outlen)
{
- KECCAK_CONTEXT *ctx = context;
- KECCAK_STATE *hd = &ctx->state;
- u64 *state = (u64 *)hd->state;
- const size_t bsize = ctx->bctx.blocksize;
unsigned int i;
+ u32 x0;
+ u32 x1;
+ u32 t;
+
+ for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
+ {
+ x0 = hd->u.state32bi[i * 2 + 0];
+ x1 = hd->u.state32bi[i * 2 + 1];
+
+ t = (x0 & 0x0000FFFFUL) + (x1 << 16);
+ x1 = (x0 >> 16) + (x1 & 0xFFFF0000UL);
+ x0 = t;
+ t = (x0 ^ (x0 >> 8)) & 0x0000FF00UL; x0 = x0 ^ t ^ (t << 8);
+ t = (x0 ^ (x0 >> 4)) & 0x00F000F0UL; x0 = x0 ^ t ^ (t << 4);
+ t = (x0 ^ (x0 >> 2)) & 0x0C0C0C0CUL; x0 = x0 ^ t ^ (t << 2);
+ t = (x0 ^ (x0 >> 1)) & 0x22222222UL; x0 = x0 ^ t ^ (t << 1);
+ t = (x1 ^ (x1 >> 8)) & 0x0000FF00UL; x1 = x1 ^ t ^ (t << 8);
+ t = (x1 ^ (x1 >> 4)) & 0x00F000F0UL; x1 = x1 ^ t ^ (t << 4);
+ t = (x1 ^ (x1 >> 2)) & 0x0C0C0C0CUL; x1 = x1 ^ t ^ (t << 2);
+ t = (x1 ^ (x1 >> 1)) & 0x22222222UL; x1 = x1 ^ t ^ (t << 1);
+
+ hd->u.state32bi[i * 2 + 0] = le_bswap32(x0);
+ hd->u.state32bi[i * 2 + 1] = le_bswap32(x1);
+ }
- /* Absorb input block. */
- for (i = 0; i < bsize / 8; i++)
- state[i] ^= buf_get_le64(data + i * 8);
+ return 0;
+}
- return keccak_f1600_state_permute(hd) + 4 * sizeof(void *);
+static inline void
+keccak_absorb_lane32bi(u32 *lane, u32 x0, u32 x1)
+{
+ u32 t;
+
+ t = (x0 ^ (x0 >> 1)) & 0x22222222UL; x0 = x0 ^ t ^ (t << 1);
+ t = (x0 ^ (x0 >> 2)) & 0x0C0C0C0CUL; x0 = x0 ^ t ^ (t << 2);
+ t = (x0 ^ (x0 >> 4)) & 0x00F000F0UL; x0 = x0 ^ t ^ (t << 4);
+ t = (x0 ^ (x0 >> 8)) & 0x0000FF00UL; x0 = x0 ^ t ^ (t << 8);
+ t = (x1 ^ (x1 >> 1)) & 0x22222222UL; x1 = x1 ^ t ^ (t << 1);
+ t = (x1 ^ (x1 >> 2)) & 0x0C0C0C0CUL; x1 = x1 ^ t ^ (t << 2);
+ t = (x1 ^ (x1 >> 4)) & 0x00F000F0UL; x1 = x1 ^ t ^ (t << 4);
+ t = (x1 ^ (x1 >> 8)) & 0x0000FF00UL; x1 = x1 ^ t ^ (t << 8);
+ lane[0] ^= (x0 & 0x0000FFFFUL) + (x1 << 16);
+ lane[1] ^= (x0 >> 16) + (x1 & 0xFFFF0000UL);
}
+#endif /* NEED_COMMON32BI */
+
+
+/* Construct generic 64-bit implementation. */
+#ifdef USE_64BIT
+
+# define ANDN64(x, y) (~(x) & (y))
+# define ROL64(x, n) (((x) << ((unsigned int)n & 63)) | \
+ ((x) >> ((64 - (unsigned int)(n)) & 63)))
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
static unsigned int
-transform (void *context, const unsigned char *data, size_t nblks)
+keccak_absorb_lanes64(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
{
- KECCAK_CONTEXT *ctx = context;
- const size_t bsize = ctx->bctx.blocksize;
- unsigned int burn;
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ hd->u.state64[pos] ^= buf_get_le64(lanes);
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute64(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static const keccak_ops_t keccak_generic64_ops =
+{
+ .permute = keccak_f1600_state_permute64,
+ .absorb = keccak_absorb_lanes64,
+ .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT */
+
+
+/* Construct 64-bit Intel SHLD implementation. */
+#ifdef USE_64BIT_SHLD
+
+# define ANDN64(x, y) (~(x) & (y))
+# define ROL64(x, n) ({ \
+ u64 tmp = (x); \
+ asm ("shldq %1, %0, %0" \
+ : "+r" (tmp) \
+ : "J" ((n) & 63) \
+ : "cc"); \
+ tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64_shld
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes64_shld(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ hd->u.state64[pos] ^= buf_get_le64(lanes);
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute64_shld(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static const keccak_ops_t keccak_shld_64_ops =
+{
+ .permute = keccak_f1600_state_permute64_shld,
+ .absorb = keccak_absorb_lanes64_shld,
+ .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT_SHLD */
+
+
+/* Construct 64-bit Intel BMI2 implementation. */
+#ifdef USE_64BIT_BMI2
+
+# define ANDN64(x, y) ({ \
+ u64 tmp; \
+ asm ("andnq %2, %1, %0" \
+ : "=r" (tmp) \
+ : "r0" (x), "rm" (y)); \
+ tmp; })
+
+# define ROL64(x, n) ({ \
+ u64 tmp; \
+ asm ("rorxq %2, %1, %0" \
+ : "=r" (tmp) \
+ : "rm0" (x), "J" (64 - ((n) & 63))); \
+ tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute64_bmi2
+# include "keccak_permute_64.h"
+
+# undef ANDN64
+# undef ROL64
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes64_bmi2(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ hd->u.state64[pos] ^= buf_get_le64(lanes);
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute64_bmi2(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static const keccak_ops_t keccak_bmi2_64_ops =
+{
+ .permute = keccak_f1600_state_permute64_bmi2,
+ .absorb = keccak_absorb_lanes64_bmi2,
+ .extract_inplace = keccak_extract_inplace64,
+};
+
+#endif /* USE_64BIT_BMI2 */
+
+
+/* Construct generic 32-bit implementation. */
+#ifdef USE_32BIT
+
+# define ANDN32(x, y) (~(x) & (y))
+# define ROL32(x, n) (((x) << ((unsigned int)n & 31)) | \
+ ((x) >> ((32 - (unsigned int)(n)) & 31)))
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute32bi
+# include "keccak_permute_32.h"
+
+# undef ANDN32
+# undef ROL32
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static unsigned int
+keccak_absorb_lanes32bi(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
- /* Absorb full blocks. */
- do
+ while (nlanes)
{
- burn = transform_blk (context, data);
- data += bsize;
+ keccak_absorb_lane32bi(&hd->u.state32bi[pos * 2],
+ buf_get_le32(lanes + 0),
+ buf_get_le32(lanes + 4));
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute32bi(hd);
+ pos = 0;
+ }
}
- while (--nblks);
return burn;
}
+static const keccak_ops_t keccak_generic32bi_ops =
+{
+ .permute = keccak_f1600_state_permute32bi,
+ .absorb = keccak_absorb_lanes32bi,
+ .extract_inplace = keccak_extract_inplace32bi,
+};
+
+#endif /* USE_32BIT */
+
+
+/* Construct 32-bit Intel BMI2 implementation. */
+#ifdef USE_32BIT_BMI2
+
+# define ANDN32(x, y) ({ \
+ u32 tmp; \
+ asm ("andnl %2, %1, %0" \
+ : "=r" (tmp) \
+ : "r0" (x), "rm" (y)); \
+ tmp; })
+
+# define ROL32(x, n) ({ \
+ u32 tmp; \
+ asm ("rorxl %2, %1, %0" \
+ : "=r" (tmp) \
+ : "rm0" (x), "J" (32 - ((n) & 31))); \
+ tmp; })
+
+# define KECCAK_F1600_PERMUTE_FUNC_NAME keccak_f1600_state_permute32bi_bmi2
+# include "keccak_permute_32.h"
+
+# undef ANDN32
+# undef ROL32
+# undef KECCAK_F1600_PERMUTE_FUNC_NAME
+
+static inline u32 pext(u32 x, u32 mask)
+{
+ u32 tmp;
+ asm ("pextl %2, %1, %0" : "=r" (tmp) : "r0" (x), "rm" (mask));
+ return tmp;
+}
+
+static inline u32 pdep(u32 x, u32 mask)
+{
+ u32 tmp;
+ asm ("pdepl %2, %1, %0" : "=r" (tmp) : "r0" (x), "rm" (mask));
+ return tmp;
+}
+
+static inline void
+keccak_absorb_lane32bi_bmi2(u32 *lane, u32 x0, u32 x1)
+{
+ x0 = pdep(pext(x0, 0x55555555), 0x0000ffff) | (pext(x0, 0xaaaaaaaa) << 16);
+ x1 = pdep(pext(x1, 0x55555555), 0x0000ffff) | (pext(x1, 0xaaaaaaaa) << 16);
+
+ lane[0] ^= (x0 & 0x0000FFFFUL) + (x1 << 16);
+ lane[1] ^= (x0 >> 16) + (x1 & 0xFFFF0000UL);
+}
+
+static unsigned int
+keccak_absorb_lanes32bi_bmi2(KECCAK_STATE *hd, int pos, const byte *lanes,
+ unsigned int nlanes, int blocklanes)
+{
+ unsigned int burn = 0;
+
+ while (nlanes)
+ {
+ keccak_absorb_lane32bi_bmi2(&hd->u.state32bi[pos * 2],
+ buf_get_le32(lanes + 0),
+ buf_get_le32(lanes + 4));
+ lanes += 8;
+ nlanes--;
+
+ if (++pos == blocklanes)
+ {
+ burn = keccak_f1600_state_permute32bi_bmi2(hd);
+ pos = 0;
+ }
+ }
+
+ return burn;
+}
+
+static unsigned int
+keccak_extract_inplace32bi_bmi2(KECCAK_STATE *hd, unsigned int outlen)
+{
+ unsigned int i;
+ u32 x0;
+ u32 x1;
+ u32 t;
+
+ for (i = 0; i < outlen / 8 + !!(outlen % 8); i++)
+ {
+ x0 = hd->u.state32bi[i * 2 + 0];
+ x1 = hd->u.state32bi[i * 2 + 1];
+
+ t = (x0 & 0x0000FFFFUL) + (x1 << 16);
+ x1 = (x0 >> 16) + (x1 & 0xFFFF0000UL);
+ x0 = t;
+
+ x0 = pdep(pext(x0, 0xffff0001), 0xaaaaaaab) | pdep(x0 >> 1, 0x55555554);
+ x1 = pdep(pext(x1, 0xffff0001), 0xaaaaaaab) | pdep(x1 >> 1, 0x55555554);
+
+ hd->u.state32bi[i * 2 + 0] = le_bswap32(x0);
+ hd->u.state32bi[i * 2 + 1] = le_bswap32(x1);
+ }
+
+ return 0;
+}
+
+static const keccak_ops_t keccak_bmi2_32bi_ops =
+{
+ .permute = keccak_f1600_state_permute32bi_bmi2,
+ .absorb = keccak_absorb_lanes32bi_bmi2,
+ .extract_inplace = keccak_extract_inplace32bi_bmi2,
+};
+
+#endif /* USE_32BIT_BMI2 */
+
+
+static void
+keccak_write (void *context, const void *inbuf_arg, size_t inlen)
+{
+ KECCAK_CONTEXT *ctx = context;
+ const size_t bsize = ctx->blocksize;
+ const size_t blocklanes = bsize / 8;
+ const byte *inbuf = inbuf_arg;
+ unsigned int nburn, burn = 0;
+ unsigned int count, i;
+ unsigned int pos, nlanes;
+
+ count = ctx->count;
+
+ if (inlen && (count % 8))
+ {
+ byte lane[8] = { 0, };
+
+ /* Complete absorbing partial input lane. */
+
+ pos = count / 8;
+
+ for (i = count % 8; inlen && i < 8; i++)
+ {
+ lane[i] = *inbuf++;
+ inlen--;
+ count++;
+ }
+
+ if (count == bsize)
+ count = 0;
+
+ nburn = ctx->ops->absorb(&ctx->state, pos, lane, 1,
+ (count % 8) ? -1 : blocklanes);
+ burn = nburn > burn ? nburn : burn;
+ }
+
+ /* Absorb full input lanes. */
+
+ pos = count / 8;
+ nlanes = inlen / 8;
+ if (nlanes > 0)
+ {
+ nburn = ctx->ops->absorb(&ctx->state, pos, inbuf, nlanes, blocklanes);
+ burn = nburn > burn ? nburn : burn;
+ inlen -= nlanes * 8;
+ inbuf += nlanes * 8;
+ count += nlanes * 8;
+ count = count % bsize;
+ }
+
+ if (inlen)
+ {
+ byte lane[8] = { 0, };
+
+ /* Absorb remaining partial input lane. */
+
+ pos = count / 8;
+
+ for (i = count % 8; inlen && i < 8; i++)
+ {
+ lane[i] = *inbuf++;
+ inlen--;
+ count++;
+ }
+
+ nburn = ctx->ops->absorb(&ctx->state, pos, lane, 1, -1);
+ burn = nburn > burn ? nburn : burn;
+
+ gcry_assert(count < bsize);
+ }
+
+ ctx->count = count;
+
+ if (burn)
+ _gcry_burn_stack (burn);
+}
+
static void
keccak_init (int algo, void *context, unsigned int flags)
@@ -267,29 +609,48 @@ keccak_init (int algo, void *context, unsigned int flags)
memset (hd, 0, sizeof *hd);
- ctx->bctx.nblocks = 0;
- ctx->bctx.nblocks_high = 0;
- ctx->bctx.count = 0;
- ctx->bctx.bwrite = transform;
+ ctx->count = 0;
+
+ /* Select generic implementation. */
+#ifdef USE_64BIT
+ ctx->ops = &keccak_generic64_ops;
+#elif defined USE_32BIT
+ ctx->ops = &keccak_generic32bi_ops;
+#endif
+
+ /* Select optimized implementation based on HW features. */
+ if (0) {}
+#ifdef USE_64BIT_BMI2
+ else if (features & HWF_INTEL_BMI2)
+ ctx->ops = &keccak_bmi2_64_ops;
+#endif
+#ifdef USE_32BIT_BMI2
+ else if (features & HWF_INTEL_BMI2)
+ ctx->ops = &keccak_bmi2_32bi_ops;
+#endif
+#ifdef USE_64BIT_SHLD
+ else if (features & HWF_INTEL_FAST_SHLD)
+ ctx->ops = &keccak_shld_64_ops;
+#endif
/* Set input block size, in Keccak terms this is called 'rate'. */
switch (algo)
{
case GCRY_MD_SHA3_224:
- ctx->bctx.blocksize = 1152 / 8;
+ ctx->blocksize = 1152 / 8;
ctx->outlen = 224 / 8;
break;
case GCRY_MD_SHA3_256:
- ctx->bctx.blocksize = 1088 / 8;
+ ctx->blocksize = 1088 / 8;
ctx->outlen = 256 / 8;
break;
case GCRY_MD_SHA3_384:
- ctx->bctx.blocksize = 832 / 8;
+ ctx->blocksize = 832 / 8;
ctx->outlen = 384 / 8;
break;
case GCRY_MD_SHA3_512:
- ctx->bctx.blocksize = 576 / 8;
+ ctx->blocksize = 576 / 8;
ctx->outlen = 512 / 8;
break;
default:
@@ -334,59 +695,37 @@ keccak_final (void *context)
{
KECCAK_CONTEXT *ctx = context;
KECCAK_STATE *hd = &ctx->state;
- const size_t bsize = ctx->bctx.blocksize;
+ const size_t bsize = ctx->blocksize;
const byte suffix = SHA3_DELIMITED_SUFFIX;
- u64 *state = (u64 *)hd->state;
- unsigned int stack_burn_depth;
+ unsigned int nburn, burn = 0;
unsigned int lastbytes;
- unsigned int i;
- byte *buf;
+ byte lane[8];
- _gcry_md_block_write (context, NULL, 0); /* flush */
-
- buf = ctx->bctx.buf;
- lastbytes = ctx->bctx.count;
-
- /* Absorb remaining bytes. */
- for (i = 0; i < lastbytes / 8; i++)
- {
- state[i] ^= buf_get_le64(buf);
- buf += 8;
- }
-
- for (i = 0; i < lastbytes % 8; i++)
- {
- state[lastbytes / 8] ^= (u64)*buf << (i * 8);
- buf++;
- }
+ lastbytes = ctx->count;
/* Do the padding and switch to the squeezing phase */
/* Absorb the last few bits and add the first bit of padding (which
coincides with the delimiter in delimited suffix) */
- state[lastbytes / 8] ^= (u64)suffix << ((lastbytes % 8) * 8);
+ buf_put_le64(lane, (u64)suffix << ((lastbytes % 8) * 8));
+ nburn = ctx->ops->absorb(&ctx->state, lastbytes / 8, lane, 1, -1);
+ burn = nburn > burn ? nburn : burn;
/* Add the second bit of padding. */
- state[(bsize - 1) / 8] ^= (u64)0x80 << (((bsize - 1) % 8) * 8);
+ buf_put_le64(lane, (u64)0x80 << (((bsize - 1) % 8) * 8));
+ nburn = ctx->ops->absorb(&ctx->state, (bsize - 1) / 8, lane, 1, -1);
+ burn = nburn > burn ? nburn : burn;
/* Switch to the squeezing phase. */
- stack_burn_depth = keccak_f1600_state_permute(hd);
+ nburn = ctx->ops->permute(hd);
+ burn = nburn > burn ? nburn : burn;
/* Squeeze out all the output blocks */
if (ctx->outlen < bsize)
{
/* Output SHA3 digest. */
- buf = ctx->bctx.buf;
- for (i = 0; i < ctx->outlen / 8; i++)
- {
- buf_put_le64(buf, state[i]);
- buf += 8;
- }
- for (i = 0; i < ctx->outlen % 8; i++)
- {
- *buf = state[ctx->outlen / 8] >> (i * 8);
- buf++;
- }
+ nburn = ctx->ops->extract_inplace(hd, ctx->outlen);
+ burn = nburn > burn ? nburn : burn;
}
else
{
@@ -394,15 +733,18 @@ keccak_final (void *context)
BUG();
}
- _gcry_burn_stack (stack_burn_depth);
+ wipememory(lane, sizeof(lane));
+ if (burn)
+ _gcry_burn_stack (burn);
}
static byte *
keccak_read (void *context)
{
- KECCAK_CONTEXT *hd = (KECCAK_CONTEXT *) context;
- return hd->bctx.buf;
+ KECCAK_CONTEXT *ctx = (KECCAK_CONTEXT *) context;
+ KECCAK_STATE *hd = &ctx->state;
+ return (byte *)&hd->u;
}
@@ -585,7 +927,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_224 =
{
GCRY_MD_SHA3_224, {0, 1},
"SHA3-224", sha3_224_asn, DIM (sha3_224_asn), oid_spec_sha3_224, 28,
- sha3_224_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_224_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
@@ -593,7 +935,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_256 =
{
GCRY_MD_SHA3_256, {0, 1},
"SHA3-256", sha3_256_asn, DIM (sha3_256_asn), oid_spec_sha3_256, 32,
- sha3_256_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_256_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
@@ -601,7 +943,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_384 =
{
GCRY_MD_SHA3_384, {0, 1},
"SHA3-384", sha3_384_asn, DIM (sha3_384_asn), oid_spec_sha3_384, 48,
- sha3_384_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_384_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
@@ -609,7 +951,7 @@ gcry_md_spec_t _gcry_digest_spec_sha3_512 =
{
GCRY_MD_SHA3_512, {0, 1},
"SHA3-512", sha3_512_asn, DIM (sha3_512_asn), oid_spec_sha3_512, 64,
- sha3_512_init, _gcry_md_block_write, keccak_final, keccak_read,
+ sha3_512_init, keccak_write, keccak_final, keccak_read,
sizeof (KECCAK_CONTEXT),
run_selftests
};
diff --git a/cipher/keccak_permute_32.h b/cipher/keccak_permute_32.h
new file mode 100644
index 0000000..fed9383
--- /dev/null
+++ b/cipher/keccak_permute_32.h
@@ -0,0 +1,535 @@
+/* keccak_permute_32.h - Keccak permute function (simple 32bit bit-interleaved)
+ * Copyright (C) 2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* The code is based on public-domain/CC0 "keccakc1024/simple32bi/
+ * Keccak-simple32BI.c" implementation by Ronny Van Keer from the SUPERCOP
+ * toolkit package.
+ */
+
+/* Function that computes the Keccak-f[1600] permutation on the given state. */
+static unsigned int
+KECCAK_F1600_PERMUTE_FUNC_NAME(KECCAK_STATE *hd)
+{
+ const u32 *round_consts = round_consts_32bit;
+ u32 Aba0, Abe0, Abi0, Abo0, Abu0;
+ u32 Aba1, Abe1, Abi1, Abo1, Abu1;
+ u32 Aga0, Age0, Agi0, Ago0, Agu0;
+ u32 Aga1, Age1, Agi1, Ago1, Agu1;
+ u32 Aka0, Ake0, Aki0, Ako0, Aku0;
+ u32 Aka1, Ake1, Aki1, Ako1, Aku1;
+ u32 Ama0, Ame0, Ami0, Amo0, Amu0;
+ u32 Ama1, Ame1, Ami1, Amo1, Amu1;
+ u32 Asa0, Ase0, Asi0, Aso0, Asu0;
+ u32 Asa1, Ase1, Asi1, Aso1, Asu1;
+ u32 BCa0, BCe0, BCi0, BCo0, BCu0;
+ u32 BCa1, BCe1, BCi1, BCo1, BCu1;
+ u32 Da0, De0, Di0, Do0, Du0;
+ u32 Da1, De1, Di1, Do1, Du1;
+ u32 Eba0, Ebe0, Ebi0, Ebo0, Ebu0;
+ u32 Eba1, Ebe1, Ebi1, Ebo1, Ebu1;
+ u32 Ega0, Ege0, Egi0, Ego0, Egu0;
+ u32 Ega1, Ege1, Egi1, Ego1, Egu1;
+ u32 Eka0, Eke0, Eki0, Eko0, Eku0;
+ u32 Eka1, Eke1, Eki1, Eko1, Eku1;
+ u32 Ema0, Eme0, Emi0, Emo0, Emu0;
+ u32 Ema1, Eme1, Emi1, Emo1, Emu1;
+ u32 Esa0, Ese0, Esi0, Eso0, Esu0;
+ u32 Esa1, Ese1, Esi1, Eso1, Esu1;
+ u32 *state = hd->u.state32bi;
+ unsigned int round;
+
+ Aba0 = state[0];
+ Aba1 = state[1];
+ Abe0 = state[2];
+ Abe1 = state[3];
+ Abi0 = state[4];
+ Abi1 = state[5];
+ Abo0 = state[6];
+ Abo1 = state[7];
+ Abu0 = state[8];
+ Abu1 = state[9];
+ Aga0 = state[10];
+ Aga1 = state[11];
+ Age0 = state[12];
+ Age1 = state[13];
+ Agi0 = state[14];
+ Agi1 = state[15];
+ Ago0 = state[16];
+ Ago1 = state[17];
+ Agu0 = state[18];
+ Agu1 = state[19];
+ Aka0 = state[20];
+ Aka1 = state[21];
+ Ake0 = state[22];
+ Ake1 = state[23];
+ Aki0 = state[24];
+ Aki1 = state[25];
+ Ako0 = state[26];
+ Ako1 = state[27];
+ Aku0 = state[28];
+ Aku1 = state[29];
+ Ama0 = state[30];
+ Ama1 = state[31];
+ Ame0 = state[32];
+ Ame1 = state[33];
+ Ami0 = state[34];
+ Ami1 = state[35];
+ Amo0 = state[36];
+ Amo1 = state[37];
+ Amu0 = state[38];
+ Amu1 = state[39];
+ Asa0 = state[40];
+ Asa1 = state[41];
+ Ase0 = state[42];
+ Ase1 = state[43];
+ Asi0 = state[44];
+ Asi1 = state[45];
+ Aso0 = state[46];
+ Aso1 = state[47];
+ Asu0 = state[48];
+ Asu1 = state[49];
+
+ for (round = 0; round < 24; round += 2)
+ {
+ /* prepareTheta */
+ BCa0 = Aba0 ^ Aga0 ^ Aka0 ^ Ama0 ^ Asa0;
+ BCa1 = Aba1 ^ Aga1 ^ Aka1 ^ Ama1 ^ Asa1;
+ BCe0 = Abe0 ^ Age0 ^ Ake0 ^ Ame0 ^ Ase0;
+ BCe1 = Abe1 ^ Age1 ^ Ake1 ^ Ame1 ^ Ase1;
+ BCi0 = Abi0 ^ Agi0 ^ Aki0 ^ Ami0 ^ Asi0;
+ BCi1 = Abi1 ^ Agi1 ^ Aki1 ^ Ami1 ^ Asi1;
+ BCo0 = Abo0 ^ Ago0 ^ Ako0 ^ Amo0 ^ Aso0;
+ BCo1 = Abo1 ^ Ago1 ^ Ako1 ^ Amo1 ^ Aso1;
+ BCu0 = Abu0 ^ Agu0 ^ Aku0 ^ Amu0 ^ Asu0;
+ BCu1 = Abu1 ^ Agu1 ^ Aku1 ^ Amu1 ^ Asu1;
+
+ /* thetaRhoPiChiIota(round , A, E) */
+ Da0 = BCu0 ^ ROL32(BCe1, 1);
+ Da1 = BCu1 ^ BCe0;
+ De0 = BCa0 ^ ROL32(BCi1, 1);
+ De1 = BCa1 ^ BCi0;
+ Di0 = BCe0 ^ ROL32(BCo1, 1);
+ Di1 = BCe1 ^ BCo0;
+ Do0 = BCi0 ^ ROL32(BCu1, 1);
+ Do1 = BCi1 ^ BCu0;
+ Du0 = BCo0 ^ ROL32(BCa1, 1);
+ Du1 = BCo1 ^ BCa0;
+
+ Aba0 ^= Da0;
+ BCa0 = Aba0;
+ Age0 ^= De0;
+ BCe0 = ROL32(Age0, 22);
+ Aki1 ^= Di1;
+ BCi0 = ROL32(Aki1, 22);
+ Amo1 ^= Do1;
+ BCo0 = ROL32(Amo1, 11);
+ Asu0 ^= Du0;
+ BCu0 = ROL32(Asu0, 7);
+ Eba0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Eba0 ^= round_consts[round * 2 + 0];
+ Ebe0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Ebi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ebo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Ebu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Aba1 ^= Da1;
+ BCa1 = Aba1;
+ Age1 ^= De1;
+ BCe1 = ROL32(Age1, 22);
+ Aki0 ^= Di0;
+ BCi1 = ROL32(Aki0, 21);
+ Amo0 ^= Do0;
+ BCo1 = ROL32(Amo0, 10);
+ Asu1 ^= Du1;
+ BCu1 = ROL32(Asu1, 7);
+ Eba1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Eba1 ^= round_consts[round * 2 + 1];
+ Ebe1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Ebi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ebo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Ebu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abo0 ^= Do0;
+ BCa0 = ROL32(Abo0, 14);
+ Agu0 ^= Du0;
+ BCe0 = ROL32(Agu0, 10);
+ Aka1 ^= Da1;
+ BCi0 = ROL32(Aka1, 2);
+ Ame1 ^= De1;
+ BCo0 = ROL32(Ame1, 23);
+ Asi1 ^= Di1;
+ BCu0 = ROL32(Asi1, 31);
+ Ega0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ege0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Egi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ego0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Egu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abo1 ^= Do1;
+ BCa1 = ROL32(Abo1, 14);
+ Agu1 ^= Du1;
+ BCe1 = ROL32(Agu1, 10);
+ Aka0 ^= Da0;
+ BCi1 = ROL32(Aka0, 1);
+ Ame0 ^= De0;
+ BCo1 = ROL32(Ame0, 22);
+ Asi0 ^= Di0;
+ BCu1 = ROL32(Asi0, 30);
+ Ega1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ege1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Egi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ego1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Egu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abe1 ^= De1;
+ BCa0 = ROL32(Abe1, 1);
+ Agi0 ^= Di0;
+ BCe0 = ROL32(Agi0, 3);
+ Ako1 ^= Do1;
+ BCi0 = ROL32(Ako1, 13);
+ Amu0 ^= Du0;
+ BCo0 = ROL32(Amu0, 4);
+ Asa0 ^= Da0;
+ BCu0 = ROL32(Asa0, 9);
+ Eka0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Eke0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Eki0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Eko0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Eku0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abe0 ^= De0;
+ BCa1 = Abe0;
+ Agi1 ^= Di1;
+ BCe1 = ROL32(Agi1, 3);
+ Ako0 ^= Do0;
+ BCi1 = ROL32(Ako0, 12);
+ Amu1 ^= Du1;
+ BCo1 = ROL32(Amu1, 4);
+ Asa1 ^= Da1;
+ BCu1 = ROL32(Asa1, 9);
+ Eka1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Eke1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Eki1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Eko1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Eku1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abu1 ^= Du1;
+ BCa0 = ROL32(Abu1, 14);
+ Aga0 ^= Da0;
+ BCe0 = ROL32(Aga0, 18);
+ Ake0 ^= De0;
+ BCi0 = ROL32(Ake0, 5);
+ Ami1 ^= Di1;
+ BCo0 = ROL32(Ami1, 8);
+ Aso0 ^= Do0;
+ BCu0 = ROL32(Aso0, 28);
+ Ema0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Eme0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Emi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Emo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Emu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abu0 ^= Du0;
+ BCa1 = ROL32(Abu0, 13);
+ Aga1 ^= Da1;
+ BCe1 = ROL32(Aga1, 18);
+ Ake1 ^= De1;
+ BCi1 = ROL32(Ake1, 5);
+ Ami0 ^= Di0;
+ BCo1 = ROL32(Ami0, 7);
+ Aso1 ^= Do1;
+ BCu1 = ROL32(Aso1, 28);
+ Ema1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Eme1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Emi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Emo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Emu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Abi0 ^= Di0;
+ BCa0 = ROL32(Abi0, 31);
+ Ago1 ^= Do1;
+ BCe0 = ROL32(Ago1, 28);
+ Aku1 ^= Du1;
+ BCi0 = ROL32(Aku1, 20);
+ Ama1 ^= Da1;
+ BCo0 = ROL32(Ama1, 21);
+ Ase0 ^= De0;
+ BCu0 = ROL32(Ase0, 1);
+ Esa0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ese0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Esi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Eso0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Esu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Abi1 ^= Di1;
+ BCa1 = ROL32(Abi1, 31);
+ Ago0 ^= Do0;
+ BCe1 = ROL32(Ago0, 27);
+ Aku0 ^= Du0;
+ BCi1 = ROL32(Aku0, 19);
+ Ama0 ^= Da0;
+ BCo1 = ROL32(Ama0, 20);
+ Ase1 ^= De1;
+ BCu1 = ROL32(Ase1, 1);
+ Esa1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ese1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Esi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Eso1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Esu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ /* prepareTheta */
+ BCa0 = Eba0 ^ Ega0 ^ Eka0 ^ Ema0 ^ Esa0;
+ BCa1 = Eba1 ^ Ega1 ^ Eka1 ^ Ema1 ^ Esa1;
+ BCe0 = Ebe0 ^ Ege0 ^ Eke0 ^ Eme0 ^ Ese0;
+ BCe1 = Ebe1 ^ Ege1 ^ Eke1 ^ Eme1 ^ Ese1;
+ BCi0 = Ebi0 ^ Egi0 ^ Eki0 ^ Emi0 ^ Esi0;
+ BCi1 = Ebi1 ^ Egi1 ^ Eki1 ^ Emi1 ^ Esi1;
+ BCo0 = Ebo0 ^ Ego0 ^ Eko0 ^ Emo0 ^ Eso0;
+ BCo1 = Ebo1 ^ Ego1 ^ Eko1 ^ Emo1 ^ Eso1;
+ BCu0 = Ebu0 ^ Egu0 ^ Eku0 ^ Emu0 ^ Esu0;
+ BCu1 = Ebu1 ^ Egu1 ^ Eku1 ^ Emu1 ^ Esu1;
+
+ /* thetaRhoPiChiIota(round+1, E, A) */
+ Da0 = BCu0 ^ ROL32(BCe1, 1);
+ Da1 = BCu1 ^ BCe0;
+ De0 = BCa0 ^ ROL32(BCi1, 1);
+ De1 = BCa1 ^ BCi0;
+ Di0 = BCe0 ^ ROL32(BCo1, 1);
+ Di1 = BCe1 ^ BCo0;
+ Do0 = BCi0 ^ ROL32(BCu1, 1);
+ Do1 = BCi1 ^ BCu0;
+ Du0 = BCo0 ^ ROL32(BCa1, 1);
+ Du1 = BCo1 ^ BCa0;
+
+ Eba0 ^= Da0;
+ BCa0 = Eba0;
+ Ege0 ^= De0;
+ BCe0 = ROL32(Ege0, 22);
+ Eki1 ^= Di1;
+ BCi0 = ROL32(Eki1, 22);
+ Emo1 ^= Do1;
+ BCo0 = ROL32(Emo1, 11);
+ Esu0 ^= Du0;
+ BCu0 = ROL32(Esu0, 7);
+ Aba0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Aba0 ^= round_consts[round * 2 + 2];
+ Abe0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Abi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Abo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Abu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Eba1 ^= Da1;
+ BCa1 = Eba1;
+ Ege1 ^= De1;
+ BCe1 = ROL32(Ege1, 22);
+ Eki0 ^= Di0;
+ BCi1 = ROL32(Eki0, 21);
+ Emo0 ^= Do0;
+ BCo1 = ROL32(Emo0, 10);
+ Esu1 ^= Du1;
+ BCu1 = ROL32(Esu1, 7);
+ Aba1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Aba1 ^= round_consts[round * 2 + 3];
+ Abe1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Abi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Abo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Abu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebo0 ^= Do0;
+ BCa0 = ROL32(Ebo0, 14);
+ Egu0 ^= Du0;
+ BCe0 = ROL32(Egu0, 10);
+ Eka1 ^= Da1;
+ BCi0 = ROL32(Eka1, 2);
+ Eme1 ^= De1;
+ BCo0 = ROL32(Eme1, 23);
+ Esi1 ^= Di1;
+ BCu0 = ROL32(Esi1, 31);
+ Aga0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Age0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Agi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ago0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Agu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebo1 ^= Do1;
+ BCa1 = ROL32(Ebo1, 14);
+ Egu1 ^= Du1;
+ BCe1 = ROL32(Egu1, 10);
+ Eka0 ^= Da0;
+ BCi1 = ROL32(Eka0, 1);
+ Eme0 ^= De0;
+ BCo1 = ROL32(Eme0, 22);
+ Esi0 ^= Di0;
+ BCu1 = ROL32(Esi0, 30);
+ Aga1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Age1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Agi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ago1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Agu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebe1 ^= De1;
+ BCa0 = ROL32(Ebe1, 1);
+ Egi0 ^= Di0;
+ BCe0 = ROL32(Egi0, 3);
+ Eko1 ^= Do1;
+ BCi0 = ROL32(Eko1, 13);
+ Emu0 ^= Du0;
+ BCo0 = ROL32(Emu0, 4);
+ Esa0 ^= Da0;
+ BCu0 = ROL32(Esa0, 9);
+ Aka0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ake0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Aki0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Ako0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Aku0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebe0 ^= De0;
+ BCa1 = Ebe0;
+ Egi1 ^= Di1;
+ BCe1 = ROL32(Egi1, 3);
+ Eko0 ^= Do0;
+ BCi1 = ROL32(Eko0, 12);
+ Emu1 ^= Du1;
+ BCo1 = ROL32(Emu1, 4);
+ Esa1 ^= Da1;
+ BCu1 = ROL32(Esa1, 9);
+ Aka1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ake1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Aki1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Ako1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Aku1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebu1 ^= Du1;
+ BCa0 = ROL32(Ebu1, 14);
+ Ega0 ^= Da0;
+ BCe0 = ROL32(Ega0, 18);
+ Eke0 ^= De0;
+ BCi0 = ROL32(Eke0, 5);
+ Emi1 ^= Di1;
+ BCo0 = ROL32(Emi1, 8);
+ Eso0 ^= Do0;
+ BCu0 = ROL32(Eso0, 28);
+ Ama0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ame0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Ami0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Amo0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Amu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebu0 ^= Du0;
+ BCa1 = ROL32(Ebu0, 13);
+ Ega1 ^= Da1;
+ BCe1 = ROL32(Ega1, 18);
+ Eke1 ^= De1;
+ BCi1 = ROL32(Eke1, 5);
+ Emi0 ^= Di0;
+ BCo1 = ROL32(Emi0, 7);
+ Eso1 ^= Do1;
+ BCu1 = ROL32(Eso1, 28);
+ Ama1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ame1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Ami1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Amo1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Amu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+
+ Ebi0 ^= Di0;
+ BCa0 = ROL32(Ebi0, 31);
+ Ego1 ^= Do1;
+ BCe0 = ROL32(Ego1, 28);
+ Eku1 ^= Du1;
+ BCi0 = ROL32(Eku1, 20);
+ Ema1 ^= Da1;
+ BCo0 = ROL32(Ema1, 21);
+ Ese0 ^= De0;
+ BCu0 = ROL32(Ese0, 1);
+ Asa0 = BCa0 ^ ANDN32(BCe0, BCi0);
+ Ase0 = BCe0 ^ ANDN32(BCi0, BCo0);
+ Asi0 = BCi0 ^ ANDN32(BCo0, BCu0);
+ Aso0 = BCo0 ^ ANDN32(BCu0, BCa0);
+ Asu0 = BCu0 ^ ANDN32(BCa0, BCe0);
+
+ Ebi1 ^= Di1;
+ BCa1 = ROL32(Ebi1, 31);
+ Ego0 ^= Do0;
+ BCe1 = ROL32(Ego0, 27);
+ Eku0 ^= Du0;
+ BCi1 = ROL32(Eku0, 19);
+ Ema0 ^= Da0;
+ BCo1 = ROL32(Ema0, 20);
+ Ese1 ^= De1;
+ BCu1 = ROL32(Ese1, 1);
+ Asa1 = BCa1 ^ ANDN32(BCe1, BCi1);
+ Ase1 = BCe1 ^ ANDN32(BCi1, BCo1);
+ Asi1 = BCi1 ^ ANDN32(BCo1, BCu1);
+ Aso1 = BCo1 ^ ANDN32(BCu1, BCa1);
+ Asu1 = BCu1 ^ ANDN32(BCa1, BCe1);
+ }
+
+ state[0] = Aba0;
+ state[1] = Aba1;
+ state[2] = Abe0;
+ state[3] = Abe1;
+ state[4] = Abi0;
+ state[5] = Abi1;
+ state[6] = Abo0;
+ state[7] = Abo1;
+ state[8] = Abu0;
+ state[9] = Abu1;
+ state[10] = Aga0;
+ state[11] = Aga1;
+ state[12] = Age0;
+ state[13] = Age1;
+ state[14] = Agi0;
+ state[15] = Agi1;
+ state[16] = Ago0;
+ state[17] = Ago1;
+ state[18] = Agu0;
+ state[19] = Agu1;
+ state[20] = Aka0;
+ state[21] = Aka1;
+ state[22] = Ake0;
+ state[23] = Ake1;
+ state[24] = Aki0;
+ state[25] = Aki1;
+ state[26] = Ako0;
+ state[27] = Ako1;
+ state[28] = Aku0;
+ state[29] = Aku1;
+ state[30] = Ama0;
+ state[31] = Ama1;
+ state[32] = Ame0;
+ state[33] = Ame1;
+ state[34] = Ami0;
+ state[35] = Ami1;
+ state[36] = Amo0;
+ state[37] = Amo1;
+ state[38] = Amu0;
+ state[39] = Amu1;
+ state[40] = Asa0;
+ state[41] = Asa1;
+ state[42] = Ase0;
+ state[43] = Ase1;
+ state[44] = Asi0;
+ state[45] = Asi1;
+ state[46] = Aso0;
+ state[47] = Aso1;
+ state[48] = Asu0;
+ state[49] = Asu1;
+
+ return sizeof(void *) * 4 + sizeof(u32) * 12 * 5 * 2;
+}
diff --git a/cipher/keccak_permute_64.h b/cipher/keccak_permute_64.h
new file mode 100644
index 0000000..1264f19
--- /dev/null
+++ b/cipher/keccak_permute_64.h
@@ -0,0 +1,290 @@
+/* keccak_permute_64.h - Keccak permute function (simple 64bit)
+ * Copyright (C) 2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* The code is based on public-domain/CC0 "keccakc1024/simple/Keccak-simple.c"
+ * implementation by Ronny Van Keer from the SUPERCOP toolkit package.
+ */
+
+/* Function that computes the Keccak-f[1600] permutation on the given state. */
+static unsigned int
+KECCAK_F1600_PERMUTE_FUNC_NAME(KECCAK_STATE *hd)
+{
+ const u64 *round_consts = round_consts_64bit;
+ u64 Aba, Abe, Abi, Abo, Abu;
+ u64 Aga, Age, Agi, Ago, Agu;
+ u64 Aka, Ake, Aki, Ako, Aku;
+ u64 Ama, Ame, Ami, Amo, Amu;
+ u64 Asa, Ase, Asi, Aso, Asu;
+ u64 BCa, BCe, BCi, BCo, BCu;
+ u64 Da, De, Di, Do, Du;
+ u64 Eba, Ebe, Ebi, Ebo, Ebu;
+ u64 Ega, Ege, Egi, Ego, Egu;
+ u64 Eka, Eke, Eki, Eko, Eku;
+ u64 Ema, Eme, Emi, Emo, Emu;
+ u64 Esa, Ese, Esi, Eso, Esu;
+ u64 *state = hd->u.state64;
+ unsigned int round;
+
+ Aba = state[0];
+ Abe = state[1];
+ Abi = state[2];
+ Abo = state[3];
+ Abu = state[4];
+ Aga = state[5];
+ Age = state[6];
+ Agi = state[7];
+ Ago = state[8];
+ Agu = state[9];
+ Aka = state[10];
+ Ake = state[11];
+ Aki = state[12];
+ Ako = state[13];
+ Aku = state[14];
+ Ama = state[15];
+ Ame = state[16];
+ Ami = state[17];
+ Amo = state[18];
+ Amu = state[19];
+ Asa = state[20];
+ Ase = state[21];
+ Asi = state[22];
+ Aso = state[23];
+ Asu = state[24];
+
+ for (round = 0; round < 24; round += 2)
+ {
+ /* prepareTheta */
+ BCa = Aba ^ Aga ^ Aka ^ Ama ^ Asa;
+ BCe = Abe ^ Age ^ Ake ^ Ame ^ Ase;
+ BCi = Abi ^ Agi ^ Aki ^ Ami ^ Asi;
+ BCo = Abo ^ Ago ^ Ako ^ Amo ^ Aso;
+ BCu = Abu ^ Agu ^ Aku ^ Amu ^ Asu;
+
+ /* thetaRhoPiChiIotaPrepareTheta(round , A, E) */
+ Da = BCu ^ ROL64(BCe, 1);
+ De = BCa ^ ROL64(BCi, 1);
+ Di = BCe ^ ROL64(BCo, 1);
+ Do = BCi ^ ROL64(BCu, 1);
+ Du = BCo ^ ROL64(BCa, 1);
+
+ Aba ^= Da;
+ BCa = Aba;
+ Age ^= De;
+ BCe = ROL64(Age, 44);
+ Aki ^= Di;
+ BCi = ROL64(Aki, 43);
+ Amo ^= Do;
+ BCo = ROL64(Amo, 21);
+ Asu ^= Du;
+ BCu = ROL64(Asu, 14);
+ Eba = BCa ^ ANDN64(BCe, BCi);
+ Eba ^= (u64)round_consts[round];
+ Ebe = BCe ^ ANDN64(BCi, BCo);
+ Ebi = BCi ^ ANDN64(BCo, BCu);
+ Ebo = BCo ^ ANDN64(BCu, BCa);
+ Ebu = BCu ^ ANDN64(BCa, BCe);
+
+ Abo ^= Do;
+ BCa = ROL64(Abo, 28);
+ Agu ^= Du;
+ BCe = ROL64(Agu, 20);
+ Aka ^= Da;
+ BCi = ROL64(Aka, 3);
+ Ame ^= De;
+ BCo = ROL64(Ame, 45);
+ Asi ^= Di;
+ BCu = ROL64(Asi, 61);
+ Ega = BCa ^ ANDN64(BCe, BCi);
+ Ege = BCe ^ ANDN64(BCi, BCo);
+ Egi = BCi ^ ANDN64(BCo, BCu);
+ Ego = BCo ^ ANDN64(BCu, BCa);
+ Egu = BCu ^ ANDN64(BCa, BCe);
+
+ Abe ^= De;
+ BCa = ROL64(Abe, 1);
+ Agi ^= Di;
+ BCe = ROL64(Agi, 6);
+ Ako ^= Do;
+ BCi = ROL64(Ako, 25);
+ Amu ^= Du;
+ BCo = ROL64(Amu, 8);
+ Asa ^= Da;
+ BCu = ROL64(Asa, 18);
+ Eka = BCa ^ ANDN64(BCe, BCi);
+ Eke = BCe ^ ANDN64(BCi, BCo);
+ Eki = BCi ^ ANDN64(BCo, BCu);
+ Eko = BCo ^ ANDN64(BCu, BCa);
+ Eku = BCu ^ ANDN64(BCa, BCe);
+
+ Abu ^= Du;
+ BCa = ROL64(Abu, 27);
+ Aga ^= Da;
+ BCe = ROL64(Aga, 36);
+ Ake ^= De;
+ BCi = ROL64(Ake, 10);
+ Ami ^= Di;
+ BCo = ROL64(Ami, 15);
+ Aso ^= Do;
+ BCu = ROL64(Aso, 56);
+ Ema = BCa ^ ANDN64(BCe, BCi);
+ Eme = BCe ^ ANDN64(BCi, BCo);
+ Emi = BCi ^ ANDN64(BCo, BCu);
+ Emo = BCo ^ ANDN64(BCu, BCa);
+ Emu = BCu ^ ANDN64(BCa, BCe);
+
+ Abi ^= Di;
+ BCa = ROL64(Abi, 62);
+ Ago ^= Do;
+ BCe = ROL64(Ago, 55);
+ Aku ^= Du;
+ BCi = ROL64(Aku, 39);
+ Ama ^= Da;
+ BCo = ROL64(Ama, 41);
+ Ase ^= De;
+ BCu = ROL64(Ase, 2);
+ Esa = BCa ^ ANDN64(BCe, BCi);
+ Ese = BCe ^ ANDN64(BCi, BCo);
+ Esi = BCi ^ ANDN64(BCo, BCu);
+ Eso = BCo ^ ANDN64(BCu, BCa);
+ Esu = BCu ^ ANDN64(BCa, BCe);
+
+ /* prepareTheta */
+ BCa = Eba ^ Ega ^ Eka ^ Ema ^ Esa;
+ BCe = Ebe ^ Ege ^ Eke ^ Eme ^ Ese;
+ BCi = Ebi ^ Egi ^ Eki ^ Emi ^ Esi;
+ BCo = Ebo ^ Ego ^ Eko ^ Emo ^ Eso;
+ BCu = Ebu ^ Egu ^ Eku ^ Emu ^ Esu;
+
+ /* thetaRhoPiChiIotaPrepareTheta(round+1, E, A) */
+ Da = BCu ^ ROL64(BCe, 1);
+ De = BCa ^ ROL64(BCi, 1);
+ Di = BCe ^ ROL64(BCo, 1);
+ Do = BCi ^ ROL64(BCu, 1);
+ Du = BCo ^ ROL64(BCa, 1);
+
+ Eba ^= Da;
+ BCa = Eba;
+ Ege ^= De;
+ BCe = ROL64(Ege, 44);
+ Eki ^= Di;
+ BCi = ROL64(Eki, 43);
+ Emo ^= Do;
+ BCo = ROL64(Emo, 21);
+ Esu ^= Du;
+ BCu = ROL64(Esu, 14);
+ Aba = BCa ^ ANDN64(BCe, BCi);
+ Aba ^= (u64)round_consts[round + 1];
+ Abe = BCe ^ ANDN64(BCi, BCo);
+ Abi = BCi ^ ANDN64(BCo, BCu);
+ Abo = BCo ^ ANDN64(BCu, BCa);
+ Abu = BCu ^ ANDN64(BCa, BCe);
+
+ Ebo ^= Do;
+ BCa = ROL64(Ebo, 28);
+ Egu ^= Du;
+ BCe = ROL64(Egu, 20);
+ Eka ^= Da;
+ BCi = ROL64(Eka, 3);
+ Eme ^= De;
+ BCo = ROL64(Eme, 45);
+ Esi ^= Di;
+ BCu = ROL64(Esi, 61);
+ Aga = BCa ^ ANDN64(BCe, BCi);
+ Age = BCe ^ ANDN64(BCi, BCo);
+ Agi = BCi ^ ANDN64(BCo, BCu);
+ Ago = BCo ^ ANDN64(BCu, BCa);
+ Agu = BCu ^ ANDN64(BCa, BCe);
+
+ Ebe ^= De;
+ BCa = ROL64(Ebe, 1);
+ Egi ^= Di;
+ BCe = ROL64(Egi, 6);
+ Eko ^= Do;
+ BCi = ROL64(Eko, 25);
+ Emu ^= Du;
+ BCo = ROL64(Emu, 8);
+ Esa ^= Da;
+ BCu = ROL64(Esa, 18);
+ Aka = BCa ^ ANDN64(BCe, BCi);
+ Ake = BCe ^ ANDN64(BCi, BCo);
+ Aki = BCi ^ ANDN64(BCo, BCu);
+ Ako = BCo ^ ANDN64(BCu, BCa);
+ Aku = BCu ^ ANDN64(BCa, BCe);
+
+ Ebu ^= Du;
+ BCa = ROL64(Ebu, 27);
+ Ega ^= Da;
+ BCe = ROL64(Ega, 36);
+ Eke ^= De;
+ BCi = ROL64(Eke, 10);
+ Emi ^= Di;
+ BCo = ROL64(Emi, 15);
+ Eso ^= Do;
+ BCu = ROL64(Eso, 56);
+ Ama = BCa ^ ANDN64(BCe, BCi);
+ Ame = BCe ^ ANDN64(BCi, BCo);
+ Ami = BCi ^ ANDN64(BCo, BCu);
+ Amo = BCo ^ ANDN64(BCu, BCa);
+ Amu = BCu ^ ANDN64(BCa, BCe);
+
+ Ebi ^= Di;
+ BCa = ROL64(Ebi, 62);
+ Ego ^= Do;
+ BCe = ROL64(Ego, 55);
+ Eku ^= Du;
+ BCi = ROL64(Eku, 39);
+ Ema ^= Da;
+ BCo = ROL64(Ema, 41);
+ Ese ^= De;
+ BCu = ROL64(Ese, 2);
+ Asa = BCa ^ ANDN64(BCe, BCi);
+ Ase = BCe ^ ANDN64(BCi, BCo);
+ Asi = BCi ^ ANDN64(BCo, BCu);
+ Aso = BCo ^ ANDN64(BCu, BCa);
+ Asu = BCu ^ ANDN64(BCa, BCe);
+ }
+
+ state[0] = Aba;
+ state[1] = Abe;
+ state[2] = Abi;
+ state[3] = Abo;
+ state[4] = Abu;
+ state[5] = Aga;
+ state[6] = Age;
+ state[7] = Agi;
+ state[8] = Ago;
+ state[9] = Agu;
+ state[10] = Aka;
+ state[11] = Ake;
+ state[12] = Aki;
+ state[13] = Ako;
+ state[14] = Aku;
+ state[15] = Ama;
+ state[16] = Ame;
+ state[17] = Ami;
+ state[18] = Amo;
+ state[19] = Amu;
+ state[20] = Asa;
+ state[21] = Ase;
+ state[22] = Asi;
+ state[23] = Aso;
+ state[24] = Asu;
+
+ return sizeof(void *) * 4 + sizeof(u64) * 12 * 5;
+}
commit 909644ef5883927262366c356eed530e55aba478
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Fri Oct 23 22:39:47 2015 +0300
hwf-x86: add detection for Intel CPUs with fast SHLD instruction
* cipher/sha1.c (sha1_init): Use HWF_INTEL_FAST_SHLD instead of
HWF_INTEL_CPU.
* cipher/sha256.c (sha256_init, sha224_init): Ditto.
* cipher/sha512.c (sha512_init, sha384_init): Ditto.
* src/g10lib.h (HWF_INTEL_FAST_SHLD): New.
(HWF_INTEL_BMI2, HWF_INTEL_SSSE3, HWF_INTEL_PCLMUL, HWF_INTEL_AESNI)
(HWF_INTEL_RDRAND, HWF_INTEL_AVX, HWF_INTEL_AVX2)
(HWF_ARM_NEON): Update.
* src/hwf-x86.c (detect_x86_gnuc): Add detection of Intel Core
CPUs with fast SHLD/SHRD instruction.
* src/hwfeatures.c (hwflist): Add "intel-fast-shld".
--
Intel Core CPUs since codename Sandy Bridge have been able to
execute SHLD/SHRD instructions faster than the rotate instructions
ROL/ROR. Since SHLD/SHRD can be used to do rotation, some
optimized implementations (SHA1/SHA256/SHA512) use SHLD/SHRD
instructions in place of ROL/ROR.
This patch provides more accurate detection of CPUs with a
fast SHLD implementation.
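For reference, the rotate-via-SHLD trick looks like this in GCC inline
assembly (this mirrors the ROL64 macro of the SHLD variant in the
keccak patch above): SHLD shifts the destination left by n and fills
the vacated low bits from the source operand, so using the same
register as both source and destination yields a left rotate.

#include <stdint.h>

/* "shldq $n, %r, %r" computes (x << n) | (x >> (64 - n)).  The "J"
 * constraint requires n to be a compile-time constant in 0..63,
 * which holds for the fixed rotation amounts in these hash functions. */
#define ROL64_SHLD(x, n) ({                    \
    uint64_t tmp_ = (x);                       \
    asm ("shldq %1, %0, %0"                    \
         : "+r" (tmp_)                         \
         : "J" ((n) & 63)                      \
         : "cc");                              \
    tmp_; })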
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
diff --git a/cipher/sha1.c b/cipher/sha1.c
index eb42883..554d55c 100644
--- a/cipher/sha1.c
+++ b/cipher/sha1.c
@@ -136,7 +136,7 @@ sha1_init (void *context, unsigned int flags)
#ifdef USE_AVX
/* AVX implementation uses SHLD which is known to be slow on non-Intel CPUs.
* Therefore use this implementation on Intel CPUs only. */
- hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_BMI2
hd->use_bmi2 = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_BMI2);
diff --git a/cipher/sha256.c b/cipher/sha256.c
index 59ffa43..63869d5 100644
--- a/cipher/sha256.c
+++ b/cipher/sha256.c
@@ -124,7 +124,7 @@ sha256_init (void *context, unsigned int flags)
#ifdef USE_AVX
/* AVX implementation uses SHLD which is known to be slow on non-Intel CPUs.
* Therefore use this implementation on Intel CPUs only. */
- hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
@@ -162,7 +162,7 @@ sha224_init (void *context, unsigned int flags)
#ifdef USE_AVX
/* AVX implementation uses SHLD which is known to be slow on non-Intel CPUs.
* Therefore use this implementation on Intel CPUs only. */
- hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ hd->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
hd->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
diff --git a/cipher/sha512.c b/cipher/sha512.c
index 029f8f0..4be1cab 100644
--- a/cipher/sha512.c
+++ b/cipher/sha512.c
@@ -154,7 +154,7 @@ sha512_init (void *context, unsigned int flags)
ctx->use_ssse3 = (features & HWF_INTEL_SSSE3) != 0;
#endif
#ifdef USE_AVX
- ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
ctx->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
@@ -194,7 +194,7 @@ sha384_init (void *context, unsigned int flags)
ctx->use_ssse3 = (features & HWF_INTEL_SSSE3) != 0;
#endif
#ifdef USE_AVX
- ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_CPU);
+ ctx->use_avx = (features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD);
#endif
#ifdef USE_AVX2
ctx->use_avx2 = (features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2);
diff --git a/src/g10lib.h b/src/g10lib.h
index d1f9426..a579e94 100644
--- a/src/g10lib.h
+++ b/src/g10lib.h
@@ -197,16 +197,17 @@ int _gcry_log_verbosity( int level );
#define HWF_PADLOCK_SHA 4
#define HWF_PADLOCK_MMUL 8
-#define HWF_INTEL_CPU 16
-#define HWF_INTEL_BMI2 32
-#define HWF_INTEL_SSSE3 64
-#define HWF_INTEL_PCLMUL 128
-#define HWF_INTEL_AESNI 256
-#define HWF_INTEL_RDRAND 512
-#define HWF_INTEL_AVX 1024
-#define HWF_INTEL_AVX2 2048
-
-#define HWF_ARM_NEON 4096
+#define HWF_INTEL_CPU 16
+#define HWF_INTEL_FAST_SHLD 32
+#define HWF_INTEL_BMI2 64
+#define HWF_INTEL_SSSE3 128
+#define HWF_INTEL_PCLMUL 256
+#define HWF_INTEL_AESNI 512
+#define HWF_INTEL_RDRAND 1024
+#define HWF_INTEL_AVX 2048
+#define HWF_INTEL_AVX2 4096
+
+#define HWF_ARM_NEON 8192
gpg_err_code_t _gcry_disable_hw_feature (const char *name);
diff --git a/src/hwf-x86.c b/src/hwf-x86.c
index 399952c..fbd6331 100644
--- a/src/hwf-x86.c
+++ b/src/hwf-x86.c
@@ -174,6 +174,7 @@ detect_x86_gnuc (void)
unsigned int features;
unsigned int os_supports_avx_avx2_registers = 0;
unsigned int max_cpuid_level;
+ unsigned int fms, family, model;
unsigned int result = 0;
(void)os_supports_avx_avx2_registers;
@@ -236,8 +237,37 @@ detect_x86_gnuc (void)
/* Detect Intel features, that might also be supported by other
vendors. */
- /* Get CPU info and Intel feature flags (ECX). */
- get_cpuid(1, NULL, NULL, &features, NULL);
+ /* Get CPU family/model/stepping (EAX) and Intel feature flags (ECX). */
+ get_cpuid(1, &fms, NULL, &features, NULL);
+
+ family = ((fms & 0xf00) >> 8) + ((fms & 0xff00000) >> 20);
+ model = ((fms & 0xf0) >> 4) + ((fms & 0xf0000) >> 12);
+
+ if ((result & HWF_INTEL_CPU) && family == 6)
+ {
+ /* These Intel Core processor models have a SHLD/SHRD instruction that
+ * can do integer rotation faster than the actual ROL/ROR instructions. */
+ switch (model)
+ {
+ case 0x2A:
+ case 0x2D:
+ case 0x3A:
+ case 0x3C:
+ case 0x3F:
+ case 0x45:
+ case 0x46:
+ case 0x3D:
+ case 0x4F:
+ case 0x56:
+ case 0x47:
+ case 0x4E:
+ case 0x5E:
+ case 0x55:
+ case 0x66:
+ result |= HWF_INTEL_FAST_SHLD;
+ break;
+ }
+ }
#ifdef ENABLE_PCLMUL_SUPPORT
/* Test bit 1 for PCLMUL. */
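For readers unfamiliar with the CPUID encoding, the family/model
computation above combines the base and extended fields of leaf 1 EAX.
A minimal standalone sketch of the same decoding (decode_fms is an
illustrative helper, not libgcrypt code):

  #include <stdint.h>

  /* CPUID leaf 1, EAX layout:
   *   bits  3:0   stepping
   *   bits  7:4   base model
   *   bits 11:8   base family
   *   bits 19:16  extended model (high nibble of the model)
   *   bits 27:20  extended family (added to the base family)
   */
  static void
  decode_fms (uint32_t fms, unsigned int *family, unsigned int *model)
  {
    *family = ((fms >> 8) & 0xf) + ((fms >> 20) & 0xff);
    *model  = ((fms >> 4) & 0xf) + (((fms >> 16) & 0xf) << 4);
  }

For example, a Haswell Core i5-4570 reports EAX = 0x000306C3, which
decodes to family 6, model 0x3C -- one of the models listed in the
switch above.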
diff --git a/src/hwfeatures.c b/src/hwfeatures.c
index 58099c4..e7c55cc 100644
--- a/src/hwfeatures.c
+++ b/src/hwfeatures.c
@@ -42,19 +42,20 @@ static struct
const char *desc;
} hwflist[] =
{
- { HWF_PADLOCK_RNG, "padlock-rng" },
- { HWF_PADLOCK_AES, "padlock-aes" },
- { HWF_PADLOCK_SHA, "padlock-sha" },
- { HWF_PADLOCK_MMUL,"padlock-mmul"},
- { HWF_INTEL_CPU, "intel-cpu" },
- { HWF_INTEL_BMI2, "intel-bmi2" },
- { HWF_INTEL_SSSE3, "intel-ssse3" },
- { HWF_INTEL_PCLMUL,"intel-pclmul" },
- { HWF_INTEL_AESNI, "intel-aesni" },
- { HWF_INTEL_RDRAND,"intel-rdrand" },
- { HWF_INTEL_AVX, "intel-avx" },
- { HWF_INTEL_AVX2, "intel-avx2" },
- { HWF_ARM_NEON, "arm-neon" }
+ { HWF_PADLOCK_RNG, "padlock-rng" },
+ { HWF_PADLOCK_AES, "padlock-aes" },
+ { HWF_PADLOCK_SHA, "padlock-sha" },
+ { HWF_PADLOCK_MMUL, "padlock-mmul"},
+ { HWF_INTEL_CPU, "intel-cpu" },
+ { HWF_INTEL_FAST_SHLD, "intel-fast-shld" },
+ { HWF_INTEL_BMI2, "intel-bmi2" },
+ { HWF_INTEL_SSSE3, "intel-ssse3" },
+ { HWF_INTEL_PCLMUL, "intel-pclmul" },
+ { HWF_INTEL_AESNI, "intel-aesni" },
+ { HWF_INTEL_RDRAND, "intel-rdrand" },
+ { HWF_INTEL_AVX, "intel-avx" },
+ { HWF_INTEL_AVX2, "intel-avx2" },
+ { HWF_ARM_NEON, "arm-neon" }
};
/* A bit vector with the hardware features which shall not be used.
commit 16fd540f4d01eb6dc23d9509ae549353617c7a67
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Sat Oct 24 12:41:23 2015 +0300
Fix OCB amd64 assembly implementations for x32
* cipher/camellia-glue.c (_gcry_camellia_aesni_avx_ocb_enc)
(_gcry_camellia_aesni_avx_ocb_dec, _gcry_camellia_aesni_avx_ocb_auth)
(_gcry_camellia_aesni_avx2_ocb_enc, _gcry_camellia_aesni_avx2_ocb_dec)
(_gcry_camellia_aesni_avx2_ocb_auth, _gcry_camellia_ocb_crypt)
(_gcry_camellia_ocb_auth): Change 'Ls' from pointer array to u64 array.
* cipher/serpent.c (_gcry_serpent_sse2_ocb_enc)
(_gcry_serpent_sse2_ocb_dec, _gcry_serpent_sse2_ocb_auth)
(_gcry_serpent_avx2_ocb_enc, _gcry_serpent_avx2_ocb_dec)
(_gcry_serpent_ocb_crypt, _gcry_serpent_ocb_auth): Ditto.
* cipher/twofish.c (_gcry_twofish_amd64_ocb_enc)
(_gcry_twofish_amd64_ocb_dec, _gcry_twofish_amd64_ocb_auth)
(twofish_amd64_ocb_enc, twofish_amd64_ocb_dec, twofish_amd64_ocb_auth)
(_gcry_twofish_ocb_crypt, _gcry_twofish_ocb_auth): Ditto.
--
Pointers on x32 are 32-bit, but the amd64 assembly implementations
expect 64-bit pointers. Pass the 'Ls' array as 64-bit integers so
that the input array has the correct format for the assembly functions.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
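The recurring cast in the hunks below widens each pointer through
uintptr_t into a u64 slot: on x32, sizeof(void *) is 4, while the
assembly loads 8 bytes per table entry, so the value must be
zero-extended. A minimal sketch of the pattern (fill_ls is
illustrative; u64 stands for libgcrypt's 64-bit type):

  #include <stdint.h>

  typedef uint64_t u64;

  /* Store pointers zero-extended to 64 bits, as the amd64 assembly
   * expects, regardless of whether the C pointer is 4 or 8 bytes. */
  static void
  fill_ls (u64 Ls[3], const void *l0, const void *l1, const void *l2)
  {
    Ls[0] = (uintptr_t)l0;
    Ls[1] = (uintptr_t)l1;
    Ls[2] = (uintptr_t)l2;
  }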
diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c
index dee0169..dfddb4a 100644
--- a/cipher/camellia-glue.c
+++ b/cipher/camellia-glue.c
@@ -141,20 +141,20 @@ extern void _gcry_camellia_aesni_avx_ocb_enc(CAMELLIA_context *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx_ocb_dec(CAMELLIA_context *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx_ocb_auth(CAMELLIA_context *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx_keygen(CAMELLIA_context *ctx,
const unsigned char *key,
@@ -185,20 +185,20 @@ extern void _gcry_camellia_aesni_avx2_ocb_enc(CAMELLIA_context *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[32]) ASM_FUNC_ABI;
+ const u64 Ls[32]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx2_ocb_dec(CAMELLIA_context *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[32]) ASM_FUNC_ABI;
+ const u64 Ls[32]) ASM_FUNC_ABI;
extern void _gcry_camellia_aesni_avx2_ocb_auth(CAMELLIA_context *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[32]) ASM_FUNC_ABI;
+ const u64 Ls[32]) ASM_FUNC_ABI;
#endif
static const char *selftest(void);
@@ -630,27 +630,29 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
if (ctx->use_aesni_avx2)
{
int did_use_aesni_avx2 = 0;
- const void *Ls[32];
+ u64 Ls[32];
unsigned int n = 32 - (blkn % 32);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 32)
{
for (i = 0; i < 32; i += 8)
{
- Ls[(i + 0 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 32] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 32] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 32] = c->u_mode.ocb.L[3];
- Ls[(15 + n) % 32] = c->u_mode.ocb.L[4];
- Ls[(23 + n) % 32] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+ Ls[(15 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[4];
+ Ls[(23 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(31 + n) % 32];
/* Process data in 32 block chunks. */
@@ -658,7 +660,7 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 32;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 32);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 32);
if (encrypt)
_gcry_camellia_aesni_avx2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -691,25 +693,27 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
if (ctx->use_aesni_avx)
{
int did_use_aesni_avx = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -717,7 +721,7 @@ _gcry_camellia_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
if (encrypt)
_gcry_camellia_aesni_avx_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -780,27 +784,29 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
if (ctx->use_aesni_avx2)
{
int did_use_aesni_avx2 = 0;
- const void *Ls[32];
+ u64 Ls[32];
unsigned int n = 32 - (blkn % 32);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 32)
{
for (i = 0; i < 32; i += 8)
{
- Ls[(i + 0 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 32] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 32] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 32] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 32] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 32] = c->u_mode.ocb.L[3];
- Ls[(15 + n) % 32] = c->u_mode.ocb.L[4];
- Ls[(23 + n) % 32] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+ Ls[(15 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[4];
+ Ls[(23 + n) % 32] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(31 + n) % 32];
/* Process data in 32 block chunks. */
@@ -808,7 +814,7 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 32;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 32);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 32);
_gcry_camellia_aesni_avx2_ocb_auth(ctx, abuf,
c->u_mode.ocb.aad_offset,
@@ -837,25 +843,27 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
if (ctx->use_aesni_avx)
{
int did_use_aesni_avx = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -863,7 +871,7 @@ _gcry_camellia_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
_gcry_camellia_aesni_avx_ocb_auth(ctx, abuf,
c->u_mode.ocb.aad_offset,
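The fixed fill pattern above -- L[0], L[1], L[0], L[2], L[0], L[1],
L[0], then L[3] (and L[4] at position 15) -- is the OCB offset
schedule: block i (counting from 1) uses L[ntz(i)], where ntz is the
number of trailing zero bits, and only every 65536th block needs a
value outside the precomputed table (hence l_tmp). A small standalone
check of that correspondence (assumes a GCC/Clang builtin; the helper
names are illustrative):

  #include <assert.h>

  /* OCB uses L[ntz(i)] for the i-th block (1-based). */
  static int ntz (unsigned int x) { return __builtin_ctz (x); }

  int main (void)
  {
    /* The first eight block numbers give L[0],L[1],L[0],L[2],
     * L[0],L[1],L[0],L[3] -- the same pattern the loops unroll. */
    static const int expected[8] = { 0, 1, 0, 2, 0, 1, 0, 3 };
    unsigned int i;
    for (i = 1; i <= 8; i++)
      assert (ntz (i) == expected[i - 1]);
    return 0;
  }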
diff --git a/cipher/serpent.c b/cipher/serpent.c
index fc3afa6..4ef7f52 100644
--- a/cipher/serpent.c
+++ b/cipher/serpent.c
@@ -125,20 +125,20 @@ extern void _gcry_serpent_sse2_ocb_enc(serpent_context_t *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[8]) ASM_FUNC_ABI;
+ const u64 Ls[8]) ASM_FUNC_ABI;
extern void _gcry_serpent_sse2_ocb_dec(serpent_context_t *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[8]) ASM_FUNC_ABI;
+ const u64 Ls[8]) ASM_FUNC_ABI;
extern void _gcry_serpent_sse2_ocb_auth(serpent_context_t *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[8]) ASM_FUNC_ABI;
+ const u64 Ls[8]) ASM_FUNC_ABI;
#endif
#ifdef USE_AVX2
@@ -165,20 +165,20 @@ extern void _gcry_serpent_avx2_ocb_enc(serpent_context_t *ctx,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_serpent_avx2_ocb_dec(serpent_context_t *ctx,
unsigned char *out,
const unsigned char *in,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
extern void _gcry_serpent_avx2_ocb_auth(serpent_context_t *ctx,
const unsigned char *abuf,
unsigned char *offset,
unsigned char *checksum,
- const void *Ls[16]) ASM_FUNC_ABI;
+ const u64 Ls[16]) ASM_FUNC_ABI;
#endif
#ifdef USE_NEON
@@ -1249,25 +1249,27 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
if (ctx->use_avx2)
{
int did_use_avx2 = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -1275,7 +1277,7 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
if (encrypt)
_gcry_serpent_avx2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -1305,19 +1307,21 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
#ifdef USE_SSE2
{
int did_use_sse2 = 0;
- const void *Ls[8];
+ u64 Ls[8];
unsigned int n = 8 - (blkn % 8);
- const void **l;
+ u64 *l;
if (nblocks >= 8)
{
- Ls[(0 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(1 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(2 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(3 + n) % 8] = c->u_mode.ocb.L[2];
- Ls[(4 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(5 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(6 + n) % 8] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
l = &Ls[(7 + n) % 8];
/* Process data in 8 block chunks. */
@@ -1325,7 +1329,7 @@ _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 8;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 8);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 8);
if (encrypt)
_gcry_serpent_sse2_ocb_enc(ctx, outbuf, inbuf, c->u_iv.iv,
@@ -1435,25 +1439,27 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
if (ctx->use_avx2)
{
int did_use_avx2 = 0;
- const void *Ls[16];
+ u64 Ls[16];
unsigned int n = 16 - (blkn % 16);
- const void **l;
+ u64 *l;
int i;
if (nblocks >= 16)
{
for (i = 0; i < 16; i += 8)
{
- Ls[(i + 0 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 1 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 2 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 3 + n) % 16] = c->u_mode.ocb.L[2];
- Ls[(i + 4 + n) % 16] = c->u_mode.ocb.L[0];
- Ls[(i + 5 + n) % 16] = c->u_mode.ocb.L[1];
- Ls[(i + 6 + n) % 16] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
}
- Ls[(7 + n) % 16] = c->u_mode.ocb.L[3];
+ Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
l = &Ls[(15 + n) % 16];
/* Process data in 16 block chunks. */
@@ -1461,7 +1467,7 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 16;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 16);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 16);
_gcry_serpent_avx2_ocb_auth(ctx, abuf, c->u_mode.ocb.aad_offset,
c->u_mode.ocb.aad_sum, Ls);
@@ -1486,19 +1492,21 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
#ifdef USE_SSE2
{
int did_use_sse2 = 0;
- const void *Ls[8];
+ u64 Ls[8];
unsigned int n = 8 - (blkn % 8);
- const void **l;
+ u64 *l;
if (nblocks >= 8)
{
- Ls[(0 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(1 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(2 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(3 + n) % 8] = c->u_mode.ocb.L[2];
- Ls[(4 + n) % 8] = c->u_mode.ocb.L[0];
- Ls[(5 + n) % 8] = c->u_mode.ocb.L[1];
- Ls[(6 + n) % 8] = c->u_mode.ocb.L[0];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+ Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+ Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+ Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
l = &Ls[(7 + n) % 8];
/* Process data in 8 block chunks. */
@@ -1506,7 +1514,7 @@ _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
{
/* l_tmp will be used only every 65536-th block. */
blkn += 8;
- *l = ocb_get_l(c, l_tmp, blkn - blkn % 8);
+ *l = (uintptr_t)(void *)ocb_get_l(c, l_tmp, blkn - blkn % 8);
_gcry_serpent_sse2_ocb_auth(ctx, abuf, c->u_mode.ocb.aad_offset,
c->u_mode.ocb.aad_sum, Ls);
diff --git a/cipher/twofish.c b/cipher/twofish.c
index 7f361c9..f6ecd67 100644
--- a/cipher/twofish.c
+++ b/cipher/twofish.c
@@ -734,15 +734,15 @@ extern void _gcry_twofish_amd64_cfb_dec(const TWOFISH_context *c, byte *out,
extern void _gcry_twofish_amd64_ocb_enc(const TWOFISH_context *ctx, byte *out,
const byte *in, byte *offset,
- byte *checksum, const void *Ls[3]);
+ byte *checksum, const u64 Ls[3]);
extern void _gcry_twofish_amd64_ocb_dec(const TWOFISH_context *ctx, byte *out,
const byte *in, byte *offset,
- byte *checksum, const void *Ls[3]);
+ byte *checksum, const u64 Ls[3]);
extern void _gcry_twofish_amd64_ocb_auth(const TWOFISH_context *ctx,
const byte *abuf, byte *offset,
- byte *checksum, const void *Ls[3]);
+ byte *checksum, const u64 Ls[3]);
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
static inline void
@@ -854,7 +854,7 @@ twofish_amd64_cfb_dec(const TWOFISH_context *c, byte *out, const byte *in,
static inline void
twofish_amd64_ocb_enc(const TWOFISH_context *ctx, byte *out, const byte *in,
- byte *offset, byte *checksum, const void *Ls[3])
+ byte *offset, byte *checksum, const u64 Ls[3])
{
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
call_sysv_fn6(_gcry_twofish_amd64_ocb_enc, ctx, out, in, offset, checksum, Ls);
@@ -865,7 +865,7 @@ twofish_amd64_ocb_enc(const TWOFISH_context *ctx, byte *out, const byte *in,
static inline void
twofish_amd64_ocb_dec(const TWOFISH_context *ctx, byte *out, const byte *in,
- byte *offset, byte *checksum, const void *Ls[3])
+ byte *offset, byte *checksum, const u64 Ls[3])
{
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
call_sysv_fn6(_gcry_twofish_amd64_ocb_dec, ctx, out, in, offset, checksum, Ls);
@@ -876,7 +876,7 @@ twofish_amd64_ocb_dec(const TWOFISH_context *ctx, byte *out, const byte *in,
static inline void
twofish_amd64_ocb_auth(const TWOFISH_context *ctx, const byte *abuf,
- byte *offset, byte *checksum, const void *Ls[3])
+ byte *offset, byte *checksum, const u64 Ls[3])
{
#ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
call_sysv_fn5(_gcry_twofish_amd64_ocb_auth, ctx, abuf, offset, checksum, Ls);
@@ -1261,15 +1261,17 @@ _gcry_twofish_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
u64 blkn = c->u_mode.ocb.data_nblocks;
{
- const void *Ls[3];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ u64 Ls[3];
/* Process data in 3 block chunks. */
while (nblocks >= 3)
{
/* l_tmp will be used only every 65536-th block. */
- Ls[0] = ocb_get_l(c, l_tmp, blkn + 1);
- Ls[1] = ocb_get_l(c, l_tmp, blkn + 2);
- Ls[2] = ocb_get_l(c, l_tmp, blkn + 3);
+ Ls[0] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 1);
+ Ls[1] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 2);
+ Ls[2] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 3);
blkn += 3;
if (encrypt)
@@ -1320,15 +1322,17 @@ _gcry_twofish_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
u64 blkn = c->u_mode.ocb.aad_nblocks;
{
- const void *Ls[3];
+ /* Use u64 to store pointers for x32 support (assembly function
+ * assumes 64-bit pointers). */
+ u64 Ls[3];
/* Process data in 3 block chunks. */
while (nblocks >= 3)
{
/* l_tmp will be used only every 65536-th block. */
- Ls[0] = ocb_get_l(c, l_tmp, blkn + 1);
- Ls[1] = ocb_get_l(c, l_tmp, blkn + 2);
- Ls[2] = ocb_get_l(c, l_tmp, blkn + 3);
+ Ls[0] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 1);
+ Ls[1] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 2);
+ Ls[2] = (uintptr_t)(const void *)ocb_get_l(c, l_tmp, blkn + 3);
blkn += 3;
twofish_amd64_ocb_auth(ctx, abuf, c->u_mode.ocb.aad_offset,
commit ae40af427fd2a856b24ec2a41323ec8b80ffc9c0
Author: Jussi Kivilinna <jussi.kivilinna at iki.fi>
Date: Fri Oct 23 22:24:47 2015 +0300
bench-slope: add KDF/PBKDF2 benchmark
* tests/bench-slope.c (bench_kdf_mode, bench_kdf_init, bench_kdf_free)
(bench_kdf_do_bench, kdf_ops, kdf_bench_one, kdf_bench): New.
(print_help): Add 'kdf'.
(main): Add KDF benchmarks.
--
Introduce KDF benchmarking to bench-slope. Output is given as
nanosecs/iter (and cycles/iter if --cpu-mhz is used). Only PBKDF2
is supported with this initial patch.
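(The cycles/iter column is simply ns/iter multiplied by the CPU clock
in GHz: e.g. 882.4 ns × 3.201 GHz ≈ 2825 c/iter, matching the
PBKDF2-HMAC-MD5 row of the first table below up to rounding.)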
For example, below is the output of KDF bench-slope before
and after the commit "md: keep contexts for HMAC in GcryDigestEntry",
on an Intel Core i5-4570 @ 3.2 GHz:
Before:
$ tests/bench-slope --cpu-mhz 3201 kdf
KDF:
| nanosecs/iter cycles/iter
PBKDF2-HMAC-MD5 | 882.4 2824.7
PBKDF2-HMAC-SHA1 | 832.6 2665.0
PBKDF2-HMAC-RIPEMD160 | 1148.3 3675.6
PBKDF2-HMAC-TIGER192 | 1339.6 4288.2
PBKDF2-HMAC-SHA256 | 1460.5 4675.1
PBKDF2-HMAC-SHA384 | 1723.2 5515.8
PBKDF2-HMAC-SHA512 | 1729.1 5534.7
PBKDF2-HMAC-SHA224 | 1424.0 4558.3
PBKDF2-HMAC-WHIRLPOOL | 2459.7 7873.5
PBKDF2-HMAC-TIGER | 1350.2 4322.1
PBKDF2-HMAC-TIGER2 | 1348.7 4317.3
PBKDF2-HMAC-GOSTR3411_94 | 7374.1 23604.4
PBKDF2-HMAC-STRIBOG256 | 6060.0 19398.1
PBKDF2-HMAC-STRIBOG512 | 7512.8 24048.3
PBKDF2-HMAC-GOSTR3411_CP | 7378.3 23618.0
PBKDF2-HMAC-SHA3-224 | 2789.6 8929.5
PBKDF2-HMAC-SHA3-256 | 2785.1 8915.0
PBKDF2-HMAC-SHA3-384 | 2955.5 9460.5
PBKDF2-HMAC-SHA3-512 | 2859.7 9153.9
=
After:
$ tests/bench-slope --cpu-mhz 3201 kdf
KDF:
| nanosecs/iter cycles/iter
PBKDF2-HMAC-MD5 | 405.9 1299.2
PBKDF2-HMAC-SHA1 | 392.1 1255.0
PBKDF2-HMAC-RIPEMD160 | 540.9 1731.5
PBKDF2-HMAC-TIGER192 | 637.1 2039.4
PBKDF2-HMAC-SHA256 | 691.8 2214.3
PBKDF2-HMAC-SHA384 | 848.0 2714.3
PBKDF2-HMAC-SHA512 | 875.7 2803.1
PBKDF2-HMAC-SHA224 | 689.2 2206.0
PBKDF2-HMAC-WHIRLPOOL | 1535.6 4915.5
PBKDF2-HMAC-TIGER | 636.3 2036.7
PBKDF2-HMAC-TIGER2 | 636.6 2037.7
PBKDF2-HMAC-GOSTR3411_94 | 5311.5 17002.2
PBKDF2-HMAC-STRIBOG256 | 4308.0 13790.0
PBKDF2-HMAC-STRIBOG512 | 5767.4 18461.4
PBKDF2-HMAC-GOSTR3411_CP | 5309.4 16995.4
PBKDF2-HMAC-SHA3-224 | 1333.1 4267.2
PBKDF2-HMAC-SHA3-256 | 1327.8 4250.4
PBKDF2-HMAC-SHA3-384 | 1392.8 4458.3
PBKDF2-HMAC-SHA3-512 | 1428.5 4572.7
=
Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
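The call being timed is gcry_kdf_derive; a minimal standalone sketch
of the benchmarked operation (the passphrase and salt mirror the
values hard-coded in bench_kdf_do_bench below, where buflen serves as
the iteration count):

  #include <gcrypt.h>
  #include <stdio.h>

  int main (void)
  {
    char keybuf[16];
    gcry_error_t err;

    /* PBKDF2 with HMAC-SHA1: 1000 iterations, 16-byte derived key. */
    err = gcry_kdf_derive ("qwerty", 6, GCRY_KDF_PBKDF2, GCRY_MD_SHA1,
                           "01234567", 8, 1000, sizeof keybuf, keybuf);
    if (err)
      fprintf (stderr, "gcry_kdf_derive: %s\n", gcry_strerror (err));
    return 0;
  }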
diff --git a/tests/bench-slope.c b/tests/bench-slope.c
index 394d7fc..2679556 100644
--- a/tests/bench-slope.c
+++ b/tests/bench-slope.c
@@ -1571,13 +1571,176 @@ mac_bench (char **argv, int argc)
}
+/************************************************************ KDF benchmarks. */
+
+struct bench_kdf_mode
+{
+ struct bench_ops *ops;
+
+ int algo;
+ int subalgo;
+};
+
+
+static int
+bench_kdf_init (struct bench_obj *obj)
+{
+ struct bench_kdf_mode *mode = obj->priv;
+
+ if (mode->algo == GCRY_KDF_PBKDF2)
+ {
+ obj->min_bufsize = 2;
+ obj->max_bufsize = 2 * 32;
+ obj->step_size = 2;
+ }
+
+ obj->num_measure_repetitions = num_measurement_repetitions;
+
+ return 0;
+}
+
+static void
+bench_kdf_free (struct bench_obj *obj)
+{
+ (void)obj;
+}
+
+static void
+bench_kdf_do_bench (struct bench_obj *obj, void *buf, size_t buflen)
+{
+ struct bench_kdf_mode *mode = obj->priv;
+ char keybuf[16];
+
+ (void)buf;
+
+ if (mode->algo == GCRY_KDF_PBKDF2)
+ {
+ gcry_kdf_derive("qwerty", 6, mode->algo, mode->subalgo, "01234567", 8,
+ buflen, sizeof(keybuf), keybuf);
+ }
+}
+
+static struct bench_ops kdf_ops = {
+ &bench_kdf_init,
+ &bench_kdf_free,
+ &bench_kdf_do_bench
+};
+
+
+static void
+kdf_bench_one (int algo, int subalgo)
+{
+ struct bench_kdf_mode mode = { &kdf_ops };
+ struct bench_obj obj = { 0 };
+ double nsecs_per_iteration;
+ double cycles_per_iteration;
+ char algo_name[32];
+ char nsecpiter_buf[16];
+ char cpiter_buf[16];
+
+ mode.algo = algo;
+ mode.subalgo = subalgo;
+
+ switch (subalgo)
+ {
+ case GCRY_MD_CRC32:
+ case GCRY_MD_CRC32_RFC1510:
+ case GCRY_MD_CRC24_RFC2440:
+ case GCRY_MD_MD4:
+ /* Skip CRC32s. */
+ return;
+ }
+
+ *algo_name = 0;
+
+ if (algo == GCRY_KDF_PBKDF2)
+ {
+ snprintf (algo_name, sizeof(algo_name), "PBKDF2-HMAC-%s",
+ gcry_md_algo_name (subalgo));
+ }
+
+ bench_print_algo (-24, algo_name);
+
+ obj.ops = mode.ops;
+ obj.priv = &mode;
+
+ nsecs_per_iteration = do_slope_benchmark (&obj);
+
+ strcpy(cpiter_buf, csv_mode ? "" : "-");
+
+ double_to_str (nsecpiter_buf, sizeof (nsecpiter_buf), nsecs_per_iteration);
+
+ /* If user didn't provide CPU speed, we cannot show cycles/iter results. */
+ if (cpu_ghz > 0.0)
+ {
+ cycles_per_iteration = nsecs_per_iteration * cpu_ghz;
+ double_to_str (cpiter_buf, sizeof (cpiter_buf), cycles_per_iteration);
+ }
+
+ if (csv_mode)
+ {
+ printf ("%s,%s,%s,,,,,,,,,%s,ns/iter,%s,c/iter\n",
+ current_section_name,
+ current_algo_name ? current_algo_name : "",
+ current_mode_name ? current_mode_name : "",
+ nsecpiter_buf,
+ cpiter_buf);
+ }
+ else
+ {
+ printf ("%14s %13s\n", nsecpiter_buf, cpiter_buf);
+ }
+}
+
+void
+kdf_bench (char **argv, int argc)
+{
+ char algo_name[32];
+ int i, j;
+
+ bench_print_section ("kdf", "KDF");
+
+ if (!csv_mode)
+ {
+ printf (" %-*s | ", 24, "");
+ printf ("%14s %13s\n", "nanosecs/iter", "cycles/iter");
+ }
+
+ if (argv && argc)
+ {
+ for (i = 0; i < argc; i++)
+ {
+ for (j = 1; j < 400; j++)
+ {
+ if (gcry_md_test_algo (j))
+ continue;
+
+ snprintf (algo_name, sizeof(algo_name), "PBKDF2-HMAC-%s",
+ gcry_md_algo_name (j));
+
+ if (!strcmp(argv[i], algo_name))
+ kdf_bench_one (GCRY_KDF_PBKDF2, j);
+ }
+ }
+ }
+ else
+ {
+ for (i = 1; i < 400; i++)
+ if (!gcry_md_test_algo (i))
+ kdf_bench_one (GCRY_KDF_PBKDF2, i);
+ }
+
+ bench_print_footer (24);
+}
+
+
/************************************************************** Main program. */
void
print_help (void)
{
static const char *help_lines[] = {
- "usage: bench-slope [options] [hash|mac|cipher [algonames]]",
+ "usage: bench-slope [options] [hash|mac|cipher|kdf [algonames]]",
"",
" options:",
" --cpu-mhz <mhz> Set CPU speed for calculating cycles",
@@ -1744,6 +1907,7 @@ main (int argc, char **argv)
hash_bench (NULL, 0);
mac_bench (NULL, 0);
cipher_bench (NULL, 0);
+ kdf_bench (NULL, 0);
}
else if (!strcmp (*argv, "hash"))
{
@@ -1769,6 +1933,14 @@ main (int argc, char **argv)
warm_up_cpu ();
cipher_bench ((argc == 0) ? NULL : argv, argc);
}
+ else if (!strcmp (*argv, "kdf"))
+ {
+ argc--;
+ argv++;
+
+ warm_up_cpu ();
+ kdf_bench ((argc == 0) ? NULL : argv, argc);
+ }
else
{
fprintf (stderr, PGM ": unknown argument: %s\n", *argv);
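With the new mode in place, individual algorithms can also be
benchmarked by name; an illustrative invocation (output format as in
the tables above):

  $ tests/bench-slope --cpu-mhz 3201 kdf PBKDF2-HMAC-SHA256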
-----------------------------------------------------------------------
Summary of changes:
cipher/Makefile.am | 2 +-
cipher/camellia-glue.c | 116 ++++---
cipher/hash-common.h | 12 +-
cipher/keccak.c | 808 ++++++++++++++++++++++++++++++++-------------
cipher/keccak_permute_32.h | 535 ++++++++++++++++++++++++++++++
cipher/keccak_permute_64.h | 290 ++++++++++++++++
cipher/serpent.c | 104 +++---
cipher/sha1.c | 2 +-
cipher/sha256.c | 4 +-
cipher/sha512.c | 4 +-
cipher/twofish.c | 32 +-
src/g10lib.h | 21 +-
src/hwf-x86.c | 34 +-
src/hwfeatures.c | 27 +-
tests/bench-slope.c | 174 +++++++++-
15 files changed, 1775 insertions(+), 390 deletions(-)
create mode 100644 cipher/keccak_permute_32.h
create mode 100644 cipher/keccak_permute_64.h
hooks/post-receive
--
The GNU crypto library
http://git.gnupg.org