From gniibe at fsij.org  Mon Sep 2 05:03:03 2019
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Mon, 02 Sep 2019 12:03:03 +0900
Subject: [Patch] use Requires.private to avoid unnecessary linkage
In-Reply-To: <20190830171307.GC14673@argenau.bebt.de>
References: <20190830171307.GC14673@argenau.bebt.de>
Message-ID: <87v9ubl7rs.fsf@iwagami.gniibe.org>

Andreas Metzler wrote:
> find attached a one-line patch to avoid unnecessary linkage when using
> pkg-config. For dynamic linking it is not necessary to link libgpg-error
> when linking against libgcrypt (unless functions from libgpg-error are
> used directly, obviously).
>
> With this patch this works.
> (sid)ametzler at argenau:$ pkg-config --libs libgcrypt
> -lgcrypt
> (sid)ametzler at argenau:$ pkg-config --libs --static libgcrypt
> -lgcrypt -lgpg-error

Thanks.  It's better to support the Requires.private field.  I will do
that.

Let me explain why it currently works the way it does.

The libraries for GnuPG (libgpg-error, libassuan, libgcrypt, npth,
libksba, ntbtls) each had their own *-config program, and GnuPG used to
use each *-config program for its build.  (Note that before pkg-config
was introduced, having a per-library *-config program was common
practice.)

Now, we support *.pc files as well.  This is both for (1) developers
who use pkg-config and for (2) the build of GnuPG itself.  We are going
to deprecate the *-config programs, except gpgrt-config.

For the build of GnuPG, we don't use pkg-config, to minimize
dependencies.  We keep using a single program, gpgrt-config, which
handles *.pc files.  (We moved from using multiple *-config programs to
the single program gpgrt-config.)

In the first phase (the current one), our *.pc files are made to behave
exactly the same as the old *-config programs.
--

From jussi.kivilinna at iki.fi  Fri Sep 6 21:46:05 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Fri, 6 Sep 2019 22:46:05 +0300
Subject: [PATCH] poly1305: add fast addition macro for ppc64
Message-ID: <156779916559.26561.8201518439887543961.stgit@localhost.localdomain>

* cipher/poly1305.c [USE_MPI_64BIT && __powerpc__] (ADD_1305_64): New.
--

Benchmark on POWER8 (~3.8Ghz):

Before:
              |  nanosecs/byte   mebibytes/sec   cycles/byte
 POLY1305     |     0.547 ns/B      1742 MiB/s      2.08 c/B

After (~8% faster):
              |  nanosecs/byte   mebibytes/sec   cycles/byte
 POLY1305     |     0.502 ns/B      1901 MiB/s      1.91 c/B

Benchmark on POWER9 (~3.8Ghz):

Before:
              |  nanosecs/byte   mebibytes/sec   cycles/byte
 POLY1305     |     0.493 ns/B      1934 MiB/s      1.87 c/B

After (~7% faster):
              |  nanosecs/byte   mebibytes/sec   cycles/byte
 POLY1305     |     0.459 ns/B      2077 MiB/s      1.74 c/B

Signed-off-by: Jussi Kivilinna
---
 0 files changed

diff --git a/cipher/poly1305.c b/cipher/poly1305.c
index cded7cb2e..698050851 100644
--- a/cipher/poly1305.c
+++ b/cipher/poly1305.c
@@ -99,6 +99,19 @@ static void poly1305_init (poly1305_context_t *ctx,
 #endif /* __x86_64__ */
 
+#if defined (__powerpc__) && __GNUC__ >= 4
+
+/* A += B (ppc64) */
+#define ADD_1305_64(A2, A1, A0, B2, B1, B0) \
+      __asm__ ("addc %0, %3, %0\n" \
+               "adde %1, %4, %1\n" \
+               "adde %2, %5, %2\n" \
+               : "+r" (A0), "+r" (A1), "+r" (A2) \
+               : "r" (B0), "r" (B1), "r" (B2) \
+               : "cc" )
+
+#endif /* __powerpc__ */
+
 #ifndef ADD_1305_64
 
 /* A += B (generic, mpi) */
 # define ADD_1305_64(A2, A1, A0, B2, B1, B0) do { \

From jussi.kivilinna at iki.fi  Tue Sep 10 22:57:40 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 10 Sep 2019 23:57:40 +0300
Subject: [PATCH] Add PowerPC vector implementation of ChaCha20
Message-ID: <156814906060.28680.566587237212070963.stgit@localhost.localdomain>

* cipher/Makefile.am: Add 'chacha20-ppc.c'.
* cipher/chacha20-ppc.c: New. * cipher/chacha20.c (USE_PPC_VEC, _gcry_chacha20_ppc8_blocks4) (_gcry_chacha20_ppc8_blocks1, USE_PPC_VEC_POLY1305) (_gcry_chacha20_poly1305_ppc8_blocks4): New. (CHACHA20_context_t): Add 'use_ppc'. (chacha20_blocks, chacha20_keysetup) (do_chacha20_encrypt_stream_tail): Add USE_PPC_VEC code. (_gcry_chacha20_poly1305_encrypt, _gcry_chacha20_poly1305_decrypt): Add USE_PPC_VEC_POLY1305 code. * configure.ac: Add 'chacha20-ppc.lo'. * src/g10lib.h (HWF_PPC_ARCH_2_07): New. * src/hwf-ppc.c (PPC_FEATURE2_ARCH_2_07): New. (ppc_features): Add HWF_PPC_ARCH_2_07. * src/hwfeatures.c (hwflist): Add 'ppc-arch_2_07'. -- This patch adds 1-way, 2-way and 4-way ChaCha20 vector implementations and 4-way stitched ChaCha20+Poly1305 implementation for PowerPC. Benchmark on POWER8 (~3.8Ghz): Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 2.60 ns/B 366.2 MiB/s 9.90 c/B STREAM dec | 2.61 ns/B 366.1 MiB/s 9.90 c/B POLY1305 enc | 3.11 ns/B 307.1 MiB/s 11.80 c/B POLY1305 dec | 3.11 ns/B 307.0 MiB/s 11.80 c/B POLY1305 auth | 0.502 ns/B 1900 MiB/s 1.91 c/B After (~4x faster): CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 0.619 ns/B 1540 MiB/s 2.35 c/B STREAM dec | 0.619 ns/B 1541 MiB/s 2.35 c/B POLY1305 enc | 0.785 ns/B 1215 MiB/s 2.98 c/B POLY1305 dec | 0.769 ns/B 1240 MiB/s 2.92 c/B POLY1305 auth | 0.502 ns/B 1901 MiB/s 1.91 c/B Benchmark on POWER9 (~3.8Ghz): Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 2.27 ns/B 419.9 MiB/s 8.63 c/B STREAM dec | 2.27 ns/B 419.8 MiB/s 8.63 c/B POLY1305 enc | 2.73 ns/B 349.1 MiB/s 10.38 c/B POLY1305 dec | 2.73 ns/B 349.3 MiB/s 10.37 c/B POLY1305 auth | 0.459 ns/B 2076 MiB/s 1.75 c/B After (chacha20 ~3x faster, chacha20+poly1305 ~2.5x faster): CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 0.690 ns/B 1381 MiB/s 2.62 c/B STREAM dec | 0.690 ns/B 1382 MiB/s 2.62 c/B POLY1305 enc | 1.09 ns/B 878.2 MiB/s 4.13 c/B POLY1305 dec | 1.07 ns/B 887.8 MiB/s 4.08 c/B POLY1305 auth | 0.459 ns/B 2076 MiB/s 1.75 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 4a5110d65..2e90506ed 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -78,6 +78,7 @@ EXTRA_libcipher_la_SOURCES = \ cast5.c cast5-amd64.S cast5-arm.S \ chacha20.c chacha20-amd64-ssse3.S chacha20-amd64-avx2.S \ chacha20-armv7-neon.S chacha20-aarch64.S \ + chacha20-ppc.c \ crc.c crc-intel-pclmul.c crc-armv8-ce.c \ crc-armv8-aarch64-ce.S \ des.c des-amd64.S \ diff --git a/cipher/chacha20-ppc.c b/cipher/chacha20-ppc.c new file mode 100644 index 000000000..17e2f0902 --- /dev/null +++ b/cipher/chacha20-ppc.c @@ -0,0 +1,624 @@ +/* chacha20-ppc.c - PowerPC vector implementation of ChaCha20 + * Copyright (C) 2019 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#if defined(ENABLE_PPC_CRYPTO_SUPPORT) && \ + defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) && \ + defined(USE_CHACHA20) && \ + __GNUC__ >= 4 + +#include +#include "bufhelp.h" +#include "poly1305-internal.h" + +#include "mpi-internal.h" +#include "longlong.h" + + +typedef vector unsigned char vector16x_u8; +typedef vector unsigned int vector4x_u32; +typedef vector unsigned long long vector2x_u64; + + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) +#define NO_INSTRUMENT_FUNCTION __attribute__((no_instrument_function)) + +#define ASM_FUNC_ATTR NO_INSTRUMENT_FUNCTION +#define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define ASM_FUNC_ATTR_NOINLINE ASM_FUNC_ATTR NO_INLINE + + +#ifdef WORDS_BIGENDIAN +static const vector16x_u8 le_bswap_const = + { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }; +#endif + + +static ASM_FUNC_ATTR_INLINE vector4x_u32 +vec_rol_elems(vector4x_u32 v, unsigned int idx) +{ +#ifndef WORDS_BIGENDIAN + return vec_sld (v, v, (16 - (4 * idx)) & 15); +#else + return vec_sld (v, v, (4 * idx) & 15); +#endif +} + + +static ASM_FUNC_ATTR_INLINE vector4x_u32 +vec_load_le(unsigned long offset, const unsigned char *ptr) +{ + vector4x_u32 vec; + vec = vec_vsx_ld (offset, (const u32 *)ptr); +#ifdef WORDS_BIGENDIAN + vec = (vector4x_u32)vec_perm((vector16x_u8)vec, (vector16x_u8)vec, + le_bswap_const); +#endif + return vec; +} + + +static ASM_FUNC_ATTR_INLINE void +vec_store_le(vector4x_u32 vec, unsigned long offset, unsigned char *ptr) +{ +#ifdef WORDS_BIGENDIAN + vec = (vector4x_u32)vec_perm((vector16x_u8)vec, (vector16x_u8)vec, + le_bswap_const); +#endif + vec_vsx_st (vec, offset, (u32 *)ptr); +} + + +/********************************************************************** + 2-way && 1-way chacha20 + **********************************************************************/ + +#define ROTATE(v1,rolv) \ + __asm__ ("vrlw %0,%1,%2\n\t" : "=v" (v1) : "v" (v1), "v" (rolv)) + +#define WORD_ROL(v1,c) \ + ((v1) = vec_rol_elems((v1), (c))) + +#define XOR(ds,s) \ + ((ds) ^= (s)) + +#define PLUS(ds,s) \ + ((ds) += (s)) + +#define QUARTERROUND4(x0,x1,x2,x3,rol_x1,rol_x2,rol_x3) \ + PLUS(x0, x1); XOR(x3, x0); ROTATE(x3, rotate_16); \ + PLUS(x2, x3); XOR(x1, x2); ROTATE(x1, rotate_12); \ + PLUS(x0, x1); XOR(x3, x0); ROTATE(x3, rotate_8); \ + PLUS(x2, x3); \ + WORD_ROL(x3, rol_x3); \ + XOR(x1, x2); \ + WORD_ROL(x2, rol_x2); \ + ROTATE(x1, rotate_7); \ + WORD_ROL(x1, rol_x1); + +unsigned int ASM_FUNC_ATTR +_gcry_chacha20_ppc8_blocks1(u32 *state, byte *dst, const byte *src, + size_t nblks) +{ + vector4x_u32 counter_1 = { 1, 0, 0, 0 }; + vector4x_u32 rotate_16 = { 16, 16, 16, 16 }; + vector4x_u32 rotate_12 = { 12, 12, 12, 12 }; + vector4x_u32 rotate_8 = { 8, 8, 8, 8 }; + vector4x_u32 rotate_7 = { 7, 7, 7, 7 }; + vector4x_u32 state0, state1, state2, state3; + vector4x_u32 v0, v1, v2, v3; + vector4x_u32 v4, v5, v6, v7; + int i; + + /* force preload of constants to vector registers */ + __asm__ ("": "+v" (counter_1) :: "memory"); + __asm__ ("": "+v" (rotate_16) :: "memory"); + __asm__ ("": "+v" (rotate_12) :: "memory"); + __asm__ ("": "+v" (rotate_8) :: "memory"); + __asm__ ("": "+v" (rotate_7) :: "memory"); + + state0 = vec_vsx_ld(0 * 16, state); + state1 = vec_vsx_ld(1 * 16, state); + state2 = vec_vsx_ld(2 * 16, state); + state3 = vec_vsx_ld(3 * 16, state); + + while (nblks >= 2) + { + v0 = state0; + v1 = state1; + v2 = state2; + v3 = state3; + + v4 = state0; + v5 = 
state1; + v6 = state2; + v7 = state3; + v7 += counter_1; + + for (i = 20; i > 0; i -= 2) + { + QUARTERROUND4(v0, v1, v2, v3, 1, 2, 3); + QUARTERROUND4(v4, v5, v6, v7, 1, 2, 3); + QUARTERROUND4(v0, v1, v2, v3, 3, 2, 1); + QUARTERROUND4(v4, v5, v6, v7, 3, 2, 1); + } + + v0 += state0; + v1 += state1; + v2 += state2; + v3 += state3; + state3 += counter_1; /* update counter */ + v4 += state0; + v5 += state1; + v6 += state2; + v7 += state3; + state3 += counter_1; /* update counter */ + + v0 ^= vec_load_le(0 * 16, src); + v1 ^= vec_load_le(1 * 16, src); + v2 ^= vec_load_le(2 * 16, src); + v3 ^= vec_load_le(3 * 16, src); + vec_store_le(v0, 0 * 16, dst); + vec_store_le(v1, 1 * 16, dst); + vec_store_le(v2, 2 * 16, dst); + vec_store_le(v3, 3 * 16, dst); + src += 64; + dst += 64; + v4 ^= vec_load_le(0 * 16, src); + v5 ^= vec_load_le(1 * 16, src); + v6 ^= vec_load_le(2 * 16, src); + v7 ^= vec_load_le(3 * 16, src); + vec_store_le(v4, 0 * 16, dst); + vec_store_le(v5, 1 * 16, dst); + vec_store_le(v6, 2 * 16, dst); + vec_store_le(v7, 3 * 16, dst); + src += 64; + dst += 64; + + nblks -= 2; + } + + while (nblks) + { + v0 = state0; + v1 = state1; + v2 = state2; + v3 = state3; + + for (i = 20; i > 0; i -= 2) + { + QUARTERROUND4(v0, v1, v2, v3, 1, 2, 3); + QUARTERROUND4(v0, v1, v2, v3, 3, 2, 1); + } + + v0 += state0; + v1 += state1; + v2 += state2; + v3 += state3; + state3 += counter_1; /* update counter */ + + v0 ^= vec_load_le(0 * 16, src); + v1 ^= vec_load_le(1 * 16, src); + v2 ^= vec_load_le(2 * 16, src); + v3 ^= vec_load_le(3 * 16, src); + vec_store_le(v0, 0 * 16, dst); + vec_store_le(v1, 1 * 16, dst); + vec_store_le(v2, 2 * 16, dst); + vec_store_le(v3, 3 * 16, dst); + src += 64; + dst += 64; + + nblks--; + } + + vec_vsx_st(state3, 3 * 16, state); /* store counter */ + + return 0; +} + + +/********************************************************************** + 4-way chacha20 + **********************************************************************/ + +/* 4x4 32-bit integer matrix transpose */ +#define transpose_4x4(x0, x1, x2, x3) ({ \ + vector4x_u32 t1 = vec_mergeh(x0, x2); \ + vector4x_u32 t2 = vec_mergel(x0, x2); \ + vector4x_u32 t3 = vec_mergeh(x1, x3); \ + x3 = vec_mergel(x1, x3); \ + x0 = vec_mergeh(t1, t3); \ + x1 = vec_mergel(t1, t3); \ + x2 = vec_mergeh(t2, x3); \ + x3 = vec_mergel(t2, x3); \ + }) + +#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2) \ + PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2); \ + ROTATE(d1, rotate_16); ROTATE(d2, rotate_16); \ + PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2); \ + ROTATE(b1, rotate_12); ROTATE(b2, rotate_12); \ + PLUS(a1,b1); PLUS(a2,b2); XOR(d1,a1); XOR(d2,a2); \ + ROTATE(d1, rotate_8); ROTATE(d2, rotate_8); \ + PLUS(c1,d1); PLUS(c2,d2); XOR(b1,c1); XOR(b2,c2); \ + ROTATE(b1, rotate_7); ROTATE(b2, rotate_7); + +unsigned int ASM_FUNC_ATTR +_gcry_chacha20_ppc8_blocks4(u32 *state, byte *dst, const byte *src, + size_t nblks) +{ + vector4x_u32 counters_0123 = { 0, 1, 2, 3 }; + vector4x_u32 counter_4 = { 4, 0, 0, 0 }; + vector4x_u32 rotate_16 = { 16, 16, 16, 16 }; + vector4x_u32 rotate_12 = { 12, 12, 12, 12 }; + vector4x_u32 rotate_8 = { 8, 8, 8, 8 }; + vector4x_u32 rotate_7 = { 7, 7, 7, 7 }; + vector4x_u32 state0, state1, state2, state3; + vector4x_u32 v0, v1, v2, v3, v4, v5, v6, v7; + vector4x_u32 v8, v9, v10, v11, v12, v13, v14, v15; + vector4x_u32 tmp; + int i; + + /* force preload of constants to vector registers */ + __asm__ ("": "+v" (counters_0123) :: "memory"); + __asm__ ("": "+v" (counter_4) :: "memory"); + __asm__ ("": "+v" (rotate_16) :: "memory"); 
+ __asm__ ("": "+v" (rotate_12) :: "memory"); + __asm__ ("": "+v" (rotate_8) :: "memory"); + __asm__ ("": "+v" (rotate_7) :: "memory"); + + state0 = vec_vsx_ld(0 * 16, state); + state1 = vec_vsx_ld(1 * 16, state); + state2 = vec_vsx_ld(2 * 16, state); + state3 = vec_vsx_ld(3 * 16, state); + + do + { + v0 = vec_splat(state0, 0); + v1 = vec_splat(state0, 1); + v2 = vec_splat(state0, 2); + v3 = vec_splat(state0, 3); + v4 = vec_splat(state1, 0); + v5 = vec_splat(state1, 1); + v6 = vec_splat(state1, 2); + v7 = vec_splat(state1, 3); + v8 = vec_splat(state2, 0); + v9 = vec_splat(state2, 1); + v10 = vec_splat(state2, 2); + v11 = vec_splat(state2, 3); + v12 = vec_splat(state3, 0); + v13 = vec_splat(state3, 1); + v14 = vec_splat(state3, 2); + v15 = vec_splat(state3, 3); + + v12 += counters_0123; + v13 -= vec_cmplt(v12, counters_0123); + + for (i = 20; i > 0; i -= 2) + { + QUARTERROUND2(v0, v4, v8, v12, v1, v5, v9, v13) + QUARTERROUND2(v2, v6, v10, v14, v3, v7, v11, v15) + QUARTERROUND2(v0, v5, v10, v15, v1, v6, v11, v12) + QUARTERROUND2(v2, v7, v8, v13, v3, v4, v9, v14) + } + + v0 += vec_splat(state0, 0); + v1 += vec_splat(state0, 1); + v2 += vec_splat(state0, 2); + v3 += vec_splat(state0, 3); + v4 += vec_splat(state1, 0); + v5 += vec_splat(state1, 1); + v6 += vec_splat(state1, 2); + v7 += vec_splat(state1, 3); + v8 += vec_splat(state2, 0); + v9 += vec_splat(state2, 1); + v10 += vec_splat(state2, 2); + v11 += vec_splat(state2, 3); + tmp = vec_splat(state3, 0); + tmp += counters_0123; + v12 += tmp; + v13 += vec_splat(state3, 1) - vec_cmplt(tmp, counters_0123); + v14 += vec_splat(state3, 2); + v15 += vec_splat(state3, 3); + state3 += counter_4; /* update counter */ + + transpose_4x4(v0, v1, v2, v3); + transpose_4x4(v4, v5, v6, v7); + transpose_4x4(v8, v9, v10, v11); + transpose_4x4(v12, v13, v14, v15); + + v0 ^= vec_load_le((64 * 0 + 16 * 0), src); + v1 ^= vec_load_le((64 * 1 + 16 * 0), src); + v2 ^= vec_load_le((64 * 2 + 16 * 0), src); + v3 ^= vec_load_le((64 * 3 + 16 * 0), src); + + v4 ^= vec_load_le((64 * 0 + 16 * 1), src); + v5 ^= vec_load_le((64 * 1 + 16 * 1), src); + v6 ^= vec_load_le((64 * 2 + 16 * 1), src); + v7 ^= vec_load_le((64 * 3 + 16 * 1), src); + + v8 ^= vec_load_le((64 * 0 + 16 * 2), src); + v9 ^= vec_load_le((64 * 1 + 16 * 2), src); + v10 ^= vec_load_le((64 * 2 + 16 * 2), src); + v11 ^= vec_load_le((64 * 3 + 16 * 2), src); + + v12 ^= vec_load_le((64 * 0 + 16 * 3), src); + v13 ^= vec_load_le((64 * 1 + 16 * 3), src); + v14 ^= vec_load_le((64 * 2 + 16 * 3), src); + v15 ^= vec_load_le((64 * 3 + 16 * 3), src); + + vec_store_le(v0, (64 * 0 + 16 * 0), dst); + vec_store_le(v1, (64 * 1 + 16 * 0), dst); + vec_store_le(v2, (64 * 2 + 16 * 0), dst); + vec_store_le(v3, (64 * 3 + 16 * 0), dst); + + vec_store_le(v4, (64 * 0 + 16 * 1), dst); + vec_store_le(v5, (64 * 1 + 16 * 1), dst); + vec_store_le(v6, (64 * 2 + 16 * 1), dst); + vec_store_le(v7, (64 * 3 + 16 * 1), dst); + + vec_store_le(v8, (64 * 0 + 16 * 2), dst); + vec_store_le(v9, (64 * 1 + 16 * 2), dst); + vec_store_le(v10, (64 * 2 + 16 * 2), dst); + vec_store_le(v11, (64 * 3 + 16 * 2), dst); + + vec_store_le(v12, (64 * 0 + 16 * 3), dst); + vec_store_le(v13, (64 * 1 + 16 * 3), dst); + vec_store_le(v14, (64 * 2 + 16 * 3), dst); + vec_store_le(v15, (64 * 3 + 16 * 3), dst); + + src += 4*64; + dst += 4*64; + + nblks -= 4; + } + while (nblks); + + vec_vsx_st(state3, 3 * 16, state); /* store counter */ + + return 0; +} + + +#if SIZEOF_UNSIGNED_LONG == 8 + +/********************************************************************** + 4-way stitched 
chacha20-poly1305 + **********************************************************************/ + +#define ADD_1305_64(A2, A1, A0, B2, B1, B0) \ + __asm__ ("addc %0, %3, %0\n" \ + "adde %1, %4, %1\n" \ + "adde %2, %5, %2\n" \ + : "+r" (A0), "+r" (A1), "+r" (A2) \ + : "r" (B0), "r" (B1), "r" (B2) \ + : "cc" ) + +#define MUL_MOD_1305_64_PART1(H2, H1, H0, R1, R0, R1_MULT5) do { \ + /* x = a * r (partial mod 2^130-5) */ \ + umul_ppmm(x0_hi, x0_lo, H0, R0); /* h0 * r0 */ \ + umul_ppmm(x1_hi, x1_lo, H0, R1); /* h0 * r1 */ \ + \ + umul_ppmm(t0_hi, t0_lo, H1, R1_MULT5); /* h1 * r1 mod 2^130-5 */ \ + } while (0) + +#define MUL_MOD_1305_64_PART2(H2, H1, H0, R1, R0, R1_MULT5) do { \ + add_ssaaaa(x0_hi, x0_lo, x0_hi, x0_lo, t0_hi, t0_lo); \ + umul_ppmm(t1_hi, t1_lo, H1, R0); /* h1 * r0 */ \ + add_ssaaaa(x1_hi, x1_lo, x1_hi, x1_lo, t1_hi, t1_lo); \ + \ + t1_lo = H2 * R1_MULT5; /* h2 * r1 mod 2^130-5 */ \ + t1_hi = H2 * R0; /* h2 * r0 */ \ + add_ssaaaa(H0, H1, x1_hi, x1_lo, t1_hi, t1_lo); \ + \ + /* carry propagation */ \ + H2 = H0 & 3; \ + H0 = (H0 >> 2) * 5; /* msb mod 2^130-5 */ \ + ADD_1305_64(H2, H1, H0, (u64)0, x0_hi, x0_lo); \ + } while (0) + +#define POLY1305_BLOCK_PART1(in_pos) do { \ + m0 = buf_get_le64(poly1305_src + (in_pos) + 0); \ + m1 = buf_get_le64(poly1305_src + (in_pos) + 8); \ + /* a = h + m */ \ + ADD_1305_64(h2, h1, h0, m2, m1, m0); \ + /* h = a * r (partial mod 2^130-5) */ \ + MUL_MOD_1305_64_PART1(h2, h1, h0, r1, r0, r1_mult5); \ + } while (0) + +#define POLY1305_BLOCK_PART2(in_pos) do { \ + MUL_MOD_1305_64_PART2(h2, h1, h0, r1, r0, r1_mult5); \ + } while (0) + +unsigned int ASM_FUNC_ATTR +_gcry_chacha20_poly1305_ppc8_blocks4(u32 *state, byte *dst, const byte *src, + size_t nblks, POLY1305_STATE *st, + const byte *poly1305_src) +{ + vector4x_u32 counters_0123 = { 0, 1, 2, 3 }; + vector4x_u32 counter_4 = { 4, 0, 0, 0 }; + vector4x_u32 rotate_16 = { 16, 16, 16, 16 }; + vector4x_u32 rotate_12 = { 12, 12, 12, 12 }; + vector4x_u32 rotate_8 = { 8, 8, 8, 8 }; + vector4x_u32 rotate_7 = { 7, 7, 7, 7 }; + vector4x_u32 state0, state1, state2, state3; + vector4x_u32 v0, v1, v2, v3, v4, v5, v6, v7; + vector4x_u32 v8, v9, v10, v11, v12, v13, v14, v15; + vector4x_u32 tmp; + u64 r0, r1, r1_mult5; + u64 h0, h1, h2; + u64 m0, m1, m2; + u64 x0_lo, x0_hi, x1_lo, x1_hi; + u64 t0_lo, t0_hi, t1_lo, t1_hi; + int i; + + /* load poly1305 state */ + m2 = 1; + h0 = st->h[0] + ((u64)st->h[1] << 32); + h1 = st->h[2] + ((u64)st->h[3] << 32); + h2 = st->h[4]; + r0 = st->r[0] + ((u64)st->r[1] << 32); + r1 = st->r[2] + ((u64)st->r[3] << 32); + r1_mult5 = (r1 >> 2) + r1; + + /* force preload of constants to vector registers */ + __asm__ ("": "+v" (counters_0123) :: "memory"); + __asm__ ("": "+v" (counter_4) :: "memory"); + __asm__ ("": "+v" (rotate_16) :: "memory"); + __asm__ ("": "+v" (rotate_12) :: "memory"); + __asm__ ("": "+v" (rotate_8) :: "memory"); + __asm__ ("": "+v" (rotate_7) :: "memory"); + + state0 = vec_vsx_ld(0 * 16, state); + state1 = vec_vsx_ld(1 * 16, state); + state2 = vec_vsx_ld(2 * 16, state); + state3 = vec_vsx_ld(3 * 16, state); + + do + { + v0 = vec_splat(state0, 0); + v1 = vec_splat(state0, 1); + v2 = vec_splat(state0, 2); + v3 = vec_splat(state0, 3); + v4 = vec_splat(state1, 0); + v5 = vec_splat(state1, 1); + v6 = vec_splat(state1, 2); + v7 = vec_splat(state1, 3); + v8 = vec_splat(state2, 0); + v9 = vec_splat(state2, 1); + v10 = vec_splat(state2, 2); + v11 = vec_splat(state2, 3); + v12 = vec_splat(state3, 0); + v13 = vec_splat(state3, 1); + v14 = vec_splat(state3, 2); + v15 = vec_splat(state3, 
3); + + v12 += counters_0123; + v13 -= vec_cmplt(v12, counters_0123); + + for (i = 0; i < 16; i += 2) + { + POLY1305_BLOCK_PART1((i + 0) * 16); + QUARTERROUND2(v0, v4, v8, v12, v1, v5, v9, v13) + POLY1305_BLOCK_PART2(); + QUARTERROUND2(v2, v6, v10, v14, v3, v7, v11, v15) + POLY1305_BLOCK_PART1((i + 1) * 16); + QUARTERROUND2(v0, v5, v10, v15, v1, v6, v11, v12) + POLY1305_BLOCK_PART2(); + QUARTERROUND2(v2, v7, v8, v13, v3, v4, v9, v14) + } + for (; i < 20; i += 2) + { + QUARTERROUND2(v0, v4, v8, v12, v1, v5, v9, v13) + QUARTERROUND2(v2, v6, v10, v14, v3, v7, v11, v15) + QUARTERROUND2(v0, v5, v10, v15, v1, v6, v11, v12) + QUARTERROUND2(v2, v7, v8, v13, v3, v4, v9, v14) + } + + v0 += vec_splat(state0, 0); + v1 += vec_splat(state0, 1); + v2 += vec_splat(state0, 2); + v3 += vec_splat(state0, 3); + v4 += vec_splat(state1, 0); + v5 += vec_splat(state1, 1); + v6 += vec_splat(state1, 2); + v7 += vec_splat(state1, 3); + v8 += vec_splat(state2, 0); + v9 += vec_splat(state2, 1); + v10 += vec_splat(state2, 2); + v11 += vec_splat(state2, 3); + tmp = vec_splat(state3, 0); + tmp += counters_0123; + v12 += tmp; + v13 += vec_splat(state3, 1) - vec_cmplt(tmp, counters_0123); + v14 += vec_splat(state3, 2); + v15 += vec_splat(state3, 3); + state3 += counter_4; /* update counter */ + + transpose_4x4(v0, v1, v2, v3); + transpose_4x4(v4, v5, v6, v7); + transpose_4x4(v8, v9, v10, v11); + transpose_4x4(v12, v13, v14, v15); + + v0 ^= vec_load_le((64 * 0 + 16 * 0), src); + v1 ^= vec_load_le((64 * 1 + 16 * 0), src); + v2 ^= vec_load_le((64 * 2 + 16 * 0), src); + v3 ^= vec_load_le((64 * 3 + 16 * 0), src); + + v4 ^= vec_load_le((64 * 0 + 16 * 1), src); + v5 ^= vec_load_le((64 * 1 + 16 * 1), src); + v6 ^= vec_load_le((64 * 2 + 16 * 1), src); + v7 ^= vec_load_le((64 * 3 + 16 * 1), src); + + v8 ^= vec_load_le((64 * 0 + 16 * 2), src); + v9 ^= vec_load_le((64 * 1 + 16 * 2), src); + v10 ^= vec_load_le((64 * 2 + 16 * 2), src); + v11 ^= vec_load_le((64 * 3 + 16 * 2), src); + + v12 ^= vec_load_le((64 * 0 + 16 * 3), src); + v13 ^= vec_load_le((64 * 1 + 16 * 3), src); + v14 ^= vec_load_le((64 * 2 + 16 * 3), src); + v15 ^= vec_load_le((64 * 3 + 16 * 3), src); + + vec_store_le(v0, (64 * 0 + 16 * 0), dst); + vec_store_le(v1, (64 * 1 + 16 * 0), dst); + vec_store_le(v2, (64 * 2 + 16 * 0), dst); + vec_store_le(v3, (64 * 3 + 16 * 0), dst); + + vec_store_le(v4, (64 * 0 + 16 * 1), dst); + vec_store_le(v5, (64 * 1 + 16 * 1), dst); + vec_store_le(v6, (64 * 2 + 16 * 1), dst); + vec_store_le(v7, (64 * 3 + 16 * 1), dst); + + vec_store_le(v8, (64 * 0 + 16 * 2), dst); + vec_store_le(v9, (64 * 1 + 16 * 2), dst); + vec_store_le(v10, (64 * 2 + 16 * 2), dst); + vec_store_le(v11, (64 * 3 + 16 * 2), dst); + + vec_store_le(v12, (64 * 0 + 16 * 3), dst); + vec_store_le(v13, (64 * 1 + 16 * 3), dst); + vec_store_le(v14, (64 * 2 + 16 * 3), dst); + vec_store_le(v15, (64 * 3 + 16 * 3), dst); + + src += 4*64; + dst += 4*64; + poly1305_src += 16*16; + + nblks -= 4; + } + while (nblks); + + vec_vsx_st(state3, 3 * 16, state); /* store counter */ + + /* store poly1305 state */ + st->h[0] = h0; + st->h[1] = h0 >> 32; + st->h[2] = h1; + st->h[3] = h1 >> 32; + st->h[4] = h2; + + return 0; +} + +#endif /* SIZEOF_UNSIGNED_LONG == 8 */ + +#endif /* ENABLE_PPC_CRYPTO_SUPPORT */ diff --git a/cipher/chacha20.c b/cipher/chacha20.c index 48fff6250..b34d8d197 100644 --- a/cipher/chacha20.c +++ b/cipher/chacha20.c @@ -85,6 +85,18 @@ # endif #endif +/* USE_PPC_VEC indicates whether to enable PowerPC vector + * accelerated code. 
*/ +#undef USE_PPC_VEC +#ifdef ENABLE_PPC_CRYPTO_SUPPORT +# if defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) +# if __GNUC__ >= 4 +# define USE_PPC_VEC 1 +# endif +# endif +#endif + /* Assembly implementations use SystemV ABI, ABI conversion and additional * stack to store XMM6-XMM15 needed on Win64. */ #undef ASM_FUNC_ABI @@ -104,6 +116,7 @@ typedef struct CHACHA20_context_s int use_ssse3:1; int use_avx2:1; int use_neon:1; + int use_ppc:1; } CHACHA20_context_t; @@ -139,6 +152,26 @@ unsigned int _gcry_chacha20_poly1305_amd64_avx2_blocks8( #endif /* USE_AVX2 */ +#ifdef USE_PPC_VEC + +unsigned int _gcry_chacha20_ppc8_blocks4(u32 *state, byte *dst, + const byte *src, + size_t nblks); + +unsigned int _gcry_chacha20_ppc8_blocks1(u32 *state, byte *dst, + const byte *src, + size_t nblks); + +#undef USE_PPC_VEC_POLY1305 +#if SIZEOF_UNSIGNED_LONG == 8 +#define USE_PPC_VEC_POLY1305 1 +unsigned int _gcry_chacha20_poly1305_ppc8_blocks4( + u32 *state, byte *dst, const byte *src, size_t nblks, + POLY1305_STATE *st, const byte *poly1305_src); +#endif + +#endif /* USE_PPC_VEC */ + #ifdef USE_ARMV7_NEON unsigned int _gcry_chacha20_armv7_neon_blocks4(u32 *state, byte *dst, @@ -267,6 +300,13 @@ chacha20_blocks (CHACHA20_context_t *ctx, byte *dst, const byte *src, } #endif +#ifdef USE_PPC_VEC + if (ctx->use_ppc) + { + return _gcry_chacha20_ppc8_blocks1(ctx->input, dst, src, nblks); + } +#endif + return do_chacha20_blocks (ctx->input, dst, src, nblks); } @@ -391,6 +431,9 @@ chacha20_do_setkey (CHACHA20_context_t *ctx, #ifdef USE_AARCH64_SIMD ctx->use_neon = (features & HWF_ARM_NEON) != 0; #endif +#ifdef USE_PPC_VEC + ctx->use_ppc = (features & HWF_PPC_ARCH_2_07) != 0; +#endif (void)features; @@ -478,6 +521,19 @@ do_chacha20_encrypt_stream_tail (CHACHA20_context_t *ctx, byte *outbuf, } #endif +#ifdef USE_PPC_VEC + if (ctx->use_ppc && length >= CHACHA20_BLOCK_SIZE * 4) + { + size_t nblocks = length / CHACHA20_BLOCK_SIZE; + nblocks -= nblocks % 4; + nburn = _gcry_chacha20_ppc8_blocks4(ctx->input, outbuf, inbuf, nblocks); + burn = nburn > burn ? nburn : burn; + length -= nblocks * CHACHA20_BLOCK_SIZE; + outbuf += nblocks * CHACHA20_BLOCK_SIZE; + inbuf += nblocks * CHACHA20_BLOCK_SIZE; + } +#endif + if (length >= CHACHA20_BLOCK_SIZE) { size_t nblocks = length / CHACHA20_BLOCK_SIZE; @@ -632,6 +688,18 @@ _gcry_chacha20_poly1305_encrypt(gcry_cipher_hd_t c, byte *outbuf, inbuf += 1 * CHACHA20_BLOCK_SIZE; } #endif +#ifdef USE_PPC_VEC_POLY1305 + else if (ctx->use_ppc && length >= CHACHA20_BLOCK_SIZE * 4) + { + nburn = _gcry_chacha20_ppc8_blocks4(ctx->input, outbuf, inbuf, 4); + burn = nburn > burn ? nburn : burn; + + authptr = outbuf; + length -= 4 * CHACHA20_BLOCK_SIZE; + outbuf += 4 * CHACHA20_BLOCK_SIZE; + inbuf += 4 * CHACHA20_BLOCK_SIZE; + } +#endif if (authptr) { @@ -695,6 +763,26 @@ _gcry_chacha20_poly1305_encrypt(gcry_cipher_hd_t c, byte *outbuf, } #endif +#ifdef USE_PPC_VEC_POLY1305 + if (ctx->use_ppc && + length >= 4 * CHACHA20_BLOCK_SIZE && + authoffset >= 4 * CHACHA20_BLOCK_SIZE) + { + size_t nblocks = length / CHACHA20_BLOCK_SIZE; + nblocks -= nblocks % 4; + + nburn = _gcry_chacha20_poly1305_ppc8_blocks4( + ctx->input, outbuf, inbuf, nblocks, + &c->u_mode.poly1305.ctx.state, authptr); + burn = nburn > burn ? 
nburn : burn; + + length -= nblocks * CHACHA20_BLOCK_SIZE; + outbuf += nblocks * CHACHA20_BLOCK_SIZE; + inbuf += nblocks * CHACHA20_BLOCK_SIZE; + authptr += nblocks * CHACHA20_BLOCK_SIZE; + } +#endif + if (authoffset > 0) { _gcry_poly1305_update (&c->u_mode.poly1305.ctx, authptr, authoffset); @@ -825,6 +913,23 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, } #endif +#ifdef USE_PPC_VEC_POLY1305 + if (ctx->use_ppc && length >= 4 * CHACHA20_BLOCK_SIZE) + { + size_t nblocks = length / CHACHA20_BLOCK_SIZE; + nblocks -= nblocks % 4; + + nburn = _gcry_chacha20_poly1305_ppc8_blocks4( + ctx->input, outbuf, inbuf, nblocks, + &c->u_mode.poly1305.ctx.state, inbuf); + burn = nburn > burn ? nburn : burn; + + length -= nblocks * CHACHA20_BLOCK_SIZE; + outbuf += nblocks * CHACHA20_BLOCK_SIZE; + inbuf += nblocks * CHACHA20_BLOCK_SIZE; + } +#endif + while (length) { size_t currlen = length; diff --git a/configure.ac b/configure.ac index 7b8f6cf41..6333076b7 100644 --- a/configure.ac +++ b/configure.ac @@ -2503,6 +2503,18 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_CIPHERS="$GCRYPT_CIPHERS chacha20-aarch64.lo" ;; + powerpc64le-*-*) + # Build with the ppc8 vector implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS chacha20-ppc.lo" + ;; + powerpc64-*-*) + # Build with the ppc8 vector implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS chacha20-ppc.lo" + ;; + powerpc-*-*) + # Build with the ppc8 vector implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS chacha20-ppc.lo" + ;; esac if test x"$neonsupport" = xyes ; then diff --git a/src/g10lib.h b/src/g10lib.h index bbdaf58be..c85e66492 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -238,6 +238,7 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_PPC_VCRYPTO (1 << 22) #define HWF_PPC_ARCH_3_00 (1 << 23) +#define HWF_PPC_ARCH_2_07 (1 << 24) gpg_err_code_t _gcry_disable_hw_feature (const char *name); void _gcry_detect_hw_features (void); diff --git a/src/hwf-ppc.c b/src/hwf-ppc.c index 2ed60c0f1..7477a71bd 100644 --- a/src/hwf-ppc.c +++ b/src/hwf-ppc.c @@ -83,6 +83,9 @@ struct feature_map_s # define AT_HWCAP2 26 #endif +#ifndef PPC_FEATURE2_ARCH_2_07 +# define PPC_FEATURE2_ARCH_2_07 0x80000000 +#endif #ifndef PPC_FEATURE2_VEC_CRYPTO # define PPC_FEATURE2_VEC_CRYPTO 0x02000000 #endif @@ -92,6 +95,7 @@ struct feature_map_s static const struct feature_map_s ppc_features[] = { + { 0, PPC_FEATURE2_ARCH_2_07, HWF_PPC_ARCH_2_07 }, #ifdef ENABLE_PPC_CRYPTO_SUPPORT { 0, PPC_FEATURE2_VEC_CRYPTO, HWF_PPC_VCRYPTO }, #endif diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 1021bd3b1..bff0f1c84 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -69,6 +69,7 @@ static struct #elif defined(HAVE_CPU_ARCH_PPC) { HWF_PPC_VCRYPTO, "ppc-vcrypto" }, { HWF_PPC_ARCH_3_00, "ppc-arch_3_00" }, + { HWF_PPC_ARCH_2_07, "ppc-arch_2_07" }, #endif }; From jowerstc at gmail.com Fri Sep 13 18:38:18 2019 From: jowerstc at gmail.com (Tyler Jowers) Date: Fri, 13 Sep 2019 12:38:18 -0400 Subject: ECC shared secret Message-ID: Good Friday, I'm using gcrypt_pk_genkey() to generate a public-key from a private-key by supplying the d-mpi. After that, I'm struggling to figure how the shared secret should be generated using the same function. In particular, I'm using NIST P-192 and my SEXP to generate the initial public-key is "(genkey (ecc (curve "NIST P-192") (d %M)))" supplied with the u8-pointer private-key converted to an MPI. 
For the shared-secret, I've tried "(genkey (ecc (curve "NIST P-192") %S (d %M)))"
with the q-point of the other party and the local private-key as an MPI.
I did this for both the client and the server, but unfortunately I don't
get matching shared-secrets (and yes, I made certain to use the right
public-key and private-key on the server side and on the client side).

Another attempt was setting the other party's q-point (public key) as
the generator point with the local d-mpi:

gcry_pk_genkey(..., "(genkey (ecc (curve "NIST P-192") (g %S) (d %M)))"
    gcry_sexp_cdr(gcry_sexp_find_token(other_party_pubkeydata, "q", 1)),
    local_privkey_mpi);

This gave me a public-key (q-point) that matched the supplied q-point
(supplied as the G parameter) verbatim.

How should I generate the shared secret after generating the public
keys on both sides?

Regards,
Tyler Jowers

From jussi.kivilinna at iki.fi  Sun Sep 15 21:28:42 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 15 Sep 2019 22:28:42 +0300
Subject: [PATCH] Add PowerPC vpmsum implementation of CRC
Message-ID: <156857572202.26439.2481519561803625589.stgit@localhost.localdomain>

* cipher/Makefile.am: Add 'crc-ppc.c'.
* cipher/crc-armv8-ce.c: Remove 'USE_INTEL_PCLMUL' comment.
* cipher/crc-ppc.c: New.
* cipher/crc.c (USE_PPC_VPMSUM): New.
(CRC_CONTEXT): Add 'use_vpmsum'.
(_gcry_crc32_ppc8_vpmsum, _gcry_crc24rfc2440_ppc8_vpmsum): New.
(crc32_init, crc24rfc2440_init): Add HWF check for 'use_vpmsum'.
(crc32_write, crc24rfc2440_write): Add 'use_vpmsum' code-path.
* configure.ac: Add 'vpmsumd' instruction to PowerPC VSX inline
assembly check; Add 'crc-ppc.lo'.
--

Benchmark on POWER8 (ppc64le, ~3.8Ghz):

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 CRC32          |     0.978 ns/B     975.0 MiB/s      3.72 c/B
 CRC24RFC2440   |     0.974 ns/B     978.8 MiB/s      3.70 c/B

After (~20x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 CRC32          |     0.045 ns/B     20990 MiB/s     0.173 c/B
 CRC24RFC2440   |     0.047 ns/B     20260 MiB/s     0.179 c/B

Benchmark on POWER9 (ppc64le, ~3.8Ghz):

Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 CRC32          |      1.01 ns/B     943.7 MiB/s      3.84 c/B
 CRC24RFC2440   |     0.993 ns/B     960.6 MiB/s      3.77 c/B

After (~17x-20x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 CRC32          |     0.046 ns/B     20741 MiB/s     0.175 c/B
 CRC24RFC2440   |     0.056 ns/B     17081 MiB/s     0.212 c/B

Signed-off-by: Jussi Kivilinna
---
 0 files changed

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 2e90506ed..b283d2400 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -81,6 +81,7 @@ EXTRA_libcipher_la_SOURCES = \
 	chacha20-ppc.c \
 	crc.c crc-intel-pclmul.c crc-armv8-ce.c \
 	crc-armv8-aarch64-ce.S \
+	crc-ppc.c \
 	des.c des-amd64.S \
 	dsa.c \
 	elgamal.c \
diff --git a/cipher/crc-armv8-ce.c b/cipher/crc-armv8-ce.c
index 8dd07cce6..17e555482 100644
--- a/cipher/crc-armv8-ce.c
+++ b/cipher/crc-armv8-ce.c
@@ -226,4 +226,4 @@ _gcry_crc24rfc2440_armv8_ce_pmull (u32 *pcrc, const byte *inbuf, size_t inlen)
     crc32_less_than_16 (pcrc, inbuf, inlen, consts);
 }
 
-#endif /* USE_INTEL_PCLMUL */
+#endif
diff --git a/cipher/crc-ppc.c b/cipher/crc-ppc.c
new file mode 100644
index 000000000..70852772d
--- /dev/null
+++ b/cipher/crc-ppc.c
@@ -0,0 +1,594 @@
+/* crc-ppc.c - POWER8 vpmsum accelerated CRC implementation
+ * Copyright (C) 2019 Jussi Kivilinna
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA + * + */ + +#include +#include +#include +#include + +#include "g10lib.h" + +#include "bithelp.h" +#include "bufhelp.h" + + +#if defined(ENABLE_PPC_CRYPTO_SUPPORT) && \ + defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) && \ + __GNUC__ >= 4 + +#include +#include "bufhelp.h" + + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) +#define NO_INSTRUMENT_FUNCTION __attribute__((no_instrument_function)) + +#define ASM_FUNC_ATTR NO_INSTRUMENT_FUNCTION +#define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define ASM_FUNC_ATTR_NOINLINE ASM_FUNC_ATTR NO_INLINE + +#define ALIGNED_64 __attribute__ ((aligned (64))) + + +typedef vector unsigned char vector16x_u8; +typedef vector unsigned int vector4x_u32; +typedef vector unsigned long long vector2x_u64; + + +/* Constants structure for generic reflected/non-reflected CRC32 PMULL + * functions. */ +struct crc32_consts_s +{ + /* k: { x^(32*17), x^(32*15), x^(32*5), x^(32*3), x^(32*2), 0 } mod P(x) */ + unsigned long long k[6]; + /* my_p: { floor(x^64 / P(x)), P(x) } */ + unsigned long long my_p[2]; +}; + +/* PMULL constants for CRC32 and CRC32RFC1510. */ +static const struct crc32_consts_s crc32_consts ALIGNED_64 = +{ + { /* k[6] = reverse_33bits( x^(32*y) mod P(x) ) */ + U64_C(0x154442bd4), U64_C(0x1c6e41596), /* y = { 17, 15 } */ + U64_C(0x1751997d0), U64_C(0x0ccaa009e), /* y = { 5, 3 } */ + U64_C(0x163cd6124), 0 /* y = 2 */ + }, + { /* my_p[2] = reverse_33bits ( { floor(x^64 / P(x)), P(x) } ) */ + U64_C(0x1f7011641), U64_C(0x1db710641) + } +}; + +/* PMULL constants for CRC24RFC2440 (polynomial multiplied with x?). 
*/ +static const struct crc32_consts_s crc24rfc2440_consts ALIGNED_64 = +{ + { /* k[6] = x^(32*y) mod P(x) << 32*/ + U64_C(0x08289a00) << 32, U64_C(0x74b44a00) << 32, /* y = { 17, 15 } */ + U64_C(0xc4b14d00) << 32, U64_C(0xfd7e0c00) << 32, /* y = { 5, 3 } */ + U64_C(0xd9fe8c00) << 32, 0 /* y = 2 */ + }, + { /* my_p[2] = { floor(x^64 / P(x)), P(x) } */ + U64_C(0x1f845fe24), U64_C(0x1864cfb00) + } +}; + + +static ASM_FUNC_ATTR_INLINE vector2x_u64 +asm_vpmsumd(vector2x_u64 a, vector2x_u64 b) +{ + __asm__("vpmsumd %0, %1, %2" + : "=v" (a) + : "v" (a), "v" (b)); + return a; +} + + +static ASM_FUNC_ATTR_INLINE vector2x_u64 +asm_swap_u64(vector2x_u64 a) +{ + __asm__("xxswapd %x0, %x1" + : "=wa" (a) + : "wa" (a)); + return a; +} + + +static ASM_FUNC_ATTR_INLINE vector4x_u32 +vec_sld_u32(vector4x_u32 a, vector4x_u32 b, unsigned int idx) +{ + return vec_sld (a, b, (4 * idx) & 15); +} + + +static const byte crc32_partial_fold_input_mask[16 + 16] ALIGNED_64 = + { + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, + 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, + 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, + }; +static const byte crc32_shuf_shift[3 * 16] ALIGNED_64 = + { + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + 0x0f, 0x0e, 0x0d, 0x0c, 0x0b, 0x0a, 0x09, 0x08, + 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00, + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + }; +static const byte crc32_refl_shuf_shift[3 * 16] ALIGNED_64 = + { + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, + 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, 0x1f, + }; +static const vector16x_u8 bswap_const ALIGNED_64 = + { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; + + +#define CRC_VEC_SWAP(v) ({ vector2x_u64 __vecu64 = (v); \ + vec_perm(__vecu64, __vecu64, bswap_const); }) + +#ifdef WORDS_BIGENDIAN +# define CRC_VEC_U64_DEF(lo, hi) { (hi), (lo) } +# define CRC_VEC_U64_LOAD(offs, ptr) \ + asm_swap_u64(vec_vsx_ld((offs), (const unsigned long long *)(ptr))) +# define CRC_VEC_U64_LOAD_LE(offs, ptr) \ + CRC_VEC_SWAP(vec_vsx_ld((offs), (const unsigned long long *)(ptr))) +# define CRC_VEC_U64_LOAD_BE(offs, ptr) \ + vec_vsx_ld((offs), (const unsigned long long *)(ptr)) +# define CRC_VEC_SWAP_TO_LE(v) CRC_VEC_SWAP(v) +# define CRC_VEC_SWAP_TO_BE(v) (v) +# define VEC_U64_LO 1 +# define VEC_U64_HI 0 +#else +# define CRC_VEC_U64_DEF(lo, hi) { (lo), (hi) } +# define CRC_VEC_U64_LOAD(offs, ptr) \ + vec_vsx_ld((offs), (const unsigned long long *)(ptr)) +# define CRC_VEC_U64_LOAD_LE(offs, ptr) CRC_VEC_U64_LOAD((offs), (ptr)) +# define CRC_VEC_U64_LOAD_BE(offs, ptr) \ + CRC_VEC_SWAP(CRC_VEC_U64_LOAD((offs), (ptr))) +# define CRC_VEC_SWAP_TO_LE(v) (v) +# define CRC_VEC_SWAP_TO_BE(v) CRC_VEC_SWAP(v) +# define VEC_U64_LO 0 +# define VEC_U64_HI 1 +#endif + + +static ASM_FUNC_ATTR_INLINE void +crc32r_ppc8_ce_bulk (u32 *pcrc, const byte *inbuf, size_t inlen, + const struct crc32_consts_s *consts) +{ + vector4x_u32 zero = { 0, 0, 0, 0 }; + vector2x_u64 low_64bit_mask = CRC_VEC_U64_DEF((u64)-1, 0); + vector2x_u64 low_32bit_mask = CRC_VEC_U64_DEF((u32)-1, 0); + vector2x_u64 my_p = CRC_VEC_U64_LOAD(0, &consts->my_p[0]); + vector2x_u64 k1k2 = CRC_VEC_U64_LOAD(0, &consts->k[1 - 1]); + 
vector2x_u64 k3k4 = CRC_VEC_U64_LOAD(0, &consts->k[3 - 1]); + vector2x_u64 k4lo = CRC_VEC_U64_DEF(k3k4[VEC_U64_HI], 0); + vector2x_u64 k5lo = CRC_VEC_U64_LOAD(0, &consts->k[5 - 1]); + vector2x_u64 crc = CRC_VEC_U64_DEF(*pcrc, 0); + vector2x_u64 crc0, crc1, crc2, crc3; + vector2x_u64 v0; + + if (inlen >= 8 * 16) + { + crc0 = CRC_VEC_U64_LOAD_LE(0 * 16, inbuf); + crc0 ^= crc; + crc1 = CRC_VEC_U64_LOAD_LE(1 * 16, inbuf); + crc2 = CRC_VEC_U64_LOAD_LE(2 * 16, inbuf); + crc3 = CRC_VEC_U64_LOAD_LE(3 * 16, inbuf); + + inbuf += 4 * 16; + inlen -= 4 * 16; + + /* Fold by 4. */ + while (inlen >= 4 * 16) + { + v0 = CRC_VEC_U64_LOAD_LE(0 * 16, inbuf); + crc0 = asm_vpmsumd(crc0, k1k2) ^ v0; + + v0 = CRC_VEC_U64_LOAD_LE(1 * 16, inbuf); + crc1 = asm_vpmsumd(crc1, k1k2) ^ v0; + + v0 = CRC_VEC_U64_LOAD_LE(2 * 16, inbuf); + crc2 = asm_vpmsumd(crc2, k1k2) ^ v0; + + v0 = CRC_VEC_U64_LOAD_LE(3 * 16, inbuf); + crc3 = asm_vpmsumd(crc3, k1k2) ^ v0; + + inbuf += 4 * 16; + inlen -= 4 * 16; + } + + /* Fold 4 to 1. */ + crc1 ^= asm_vpmsumd(crc0, k3k4); + crc2 ^= asm_vpmsumd(crc1, k3k4); + crc3 ^= asm_vpmsumd(crc2, k3k4); + crc = crc3; + } + else + { + v0 = CRC_VEC_U64_LOAD_LE(0, inbuf); + crc ^= v0; + + inbuf += 16; + inlen -= 16; + } + + /* Fold by 1. */ + while (inlen >= 16) + { + v0 = CRC_VEC_U64_LOAD_LE(0, inbuf); + crc = asm_vpmsumd(k3k4, crc); + crc ^= v0; + + inbuf += 16; + inlen -= 16; + } + + /* Partial fold. */ + if (inlen) + { + /* Load last input and add padding zeros. */ + vector2x_u64 mask = CRC_VEC_U64_LOAD_LE(inlen, crc32_partial_fold_input_mask); + vector2x_u64 shl_shuf = CRC_VEC_U64_LOAD_LE(inlen, crc32_refl_shuf_shift); + vector2x_u64 shr_shuf = CRC_VEC_U64_LOAD_LE(inlen + 16, crc32_refl_shuf_shift); + + v0 = CRC_VEC_U64_LOAD_LE(inlen - 16, inbuf); + v0 &= mask; + + crc = CRC_VEC_SWAP_TO_LE(crc); + v0 |= (vector2x_u64)vec_perm((vector16x_u8)crc, (vector16x_u8)zero, + (vector16x_u8)shr_shuf); + crc = (vector2x_u64)vec_perm((vector16x_u8)crc, (vector16x_u8)zero, + (vector16x_u8)shl_shuf); + crc = asm_vpmsumd(k3k4, crc); + crc ^= v0; + + inbuf += inlen; + inlen -= inlen; + } + + /* Final fold. 
*/ + + /* reduce 128-bits to 96-bits */ + v0 = asm_swap_u64(crc); + v0 &= low_64bit_mask; + crc = asm_vpmsumd(k4lo, crc); + crc ^= v0; + + /* reduce 96-bits to 64-bits */ + v0 = (vector2x_u64)vec_sld_u32((vector4x_u32)crc, + (vector4x_u32)crc, 3); /* [x0][x3][x2][x1] */ + v0 &= low_64bit_mask; /* [00][00][x2][x1] */ + crc = crc & low_32bit_mask; /* [00][00][00][x0] */ + crc = v0 ^ asm_vpmsumd(k5lo, crc); /* [00][00][xx][xx] */ + + /* barrett reduction */ + v0 = crc << 32; /* [00][00][x0][00] */ + v0 = asm_vpmsumd(my_p, v0); + v0 = asm_swap_u64(v0); + v0 = asm_vpmsumd(my_p, v0); + crc = (vector2x_u64)vec_sld_u32((vector4x_u32)crc, + zero, 1); /* [00][x1][x0][00] */ + crc ^= v0; + + *pcrc = (u32)crc[VEC_U64_HI]; +} + + +static ASM_FUNC_ATTR_INLINE u32 +crc32r_ppc8_ce_reduction_4 (u32 data, u32 crc, + const struct crc32_consts_s *consts) +{ + vector4x_u32 zero = { 0, 0, 0, 0 }; + vector2x_u64 my_p = CRC_VEC_U64_LOAD(0, &consts->my_p[0]); + vector2x_u64 v0 = CRC_VEC_U64_DEF((u64)data, 0); + v0 = asm_vpmsumd(v0, my_p); /* [00][00][xx][xx] */ + v0 = (vector2x_u64)vec_sld_u32((vector4x_u32)v0, + zero, 3); /* [x0][00][00][00] */ + v0 = (vector2x_u64)vec_sld_u32((vector4x_u32)v0, + (vector4x_u32)v0, 3); /* [00][x0][00][00] */ + v0 = asm_vpmsumd(v0, my_p); /* [00][00][xx][xx] */ + return (v0[VEC_U64_LO] >> 32) ^ crc; +} + + +static ASM_FUNC_ATTR_INLINE void +crc32r_less_than_16 (u32 *pcrc, const byte *inbuf, size_t inlen, + const struct crc32_consts_s *consts) +{ + u32 crc = *pcrc; + u32 data; + + while (inlen >= 4) + { + data = buf_get_le32(inbuf); + data ^= crc; + + inlen -= 4; + inbuf += 4; + + crc = crc32r_ppc8_ce_reduction_4 (data, 0, consts); + } + + switch (inlen) + { + case 0: + break; + case 1: + data = inbuf[0]; + data ^= crc; + data <<= 24; + crc >>= 8; + crc = crc32r_ppc8_ce_reduction_4 (data, crc, consts); + break; + case 2: + data = inbuf[0] << 0; + data |= inbuf[1] << 8; + data ^= crc; + data <<= 16; + crc >>= 16; + crc = crc32r_ppc8_ce_reduction_4 (data, crc, consts); + break; + case 3: + data = inbuf[0] << 0; + data |= inbuf[1] << 8; + data |= inbuf[2] << 16; + data ^= crc; + data <<= 8; + crc >>= 24; + crc = crc32r_ppc8_ce_reduction_4 (data, crc, consts); + break; + } + + *pcrc = crc; +} + + +static ASM_FUNC_ATTR_INLINE void +crc32_ppc8_ce_bulk (u32 *pcrc, const byte *inbuf, size_t inlen, + const struct crc32_consts_s *consts) +{ + vector4x_u32 zero = { 0, 0, 0, 0 }; + vector2x_u64 low_96bit_mask = CRC_VEC_U64_DEF(~0, ~((u64)(u32)-1 << 32)); + vector2x_u64 p_my = asm_swap_u64(CRC_VEC_U64_LOAD(0, &consts->my_p[0])); + vector2x_u64 p_my_lo, p_my_hi; + vector2x_u64 k2k1 = asm_swap_u64(CRC_VEC_U64_LOAD(0, &consts->k[1 - 1])); + vector2x_u64 k4k3 = asm_swap_u64(CRC_VEC_U64_LOAD(0, &consts->k[3 - 1])); + vector2x_u64 k4hi = CRC_VEC_U64_DEF(0, consts->k[4 - 1]); + vector2x_u64 k5hi = CRC_VEC_U64_DEF(0, consts->k[5 - 1]); + vector2x_u64 crc = CRC_VEC_U64_DEF(0, _gcry_bswap64(*pcrc)); + vector2x_u64 crc0, crc1, crc2, crc3; + vector2x_u64 v0; + + if (inlen >= 8 * 16) + { + crc0 = CRC_VEC_U64_LOAD_BE(0 * 16, inbuf); + crc0 ^= crc; + crc1 = CRC_VEC_U64_LOAD_BE(1 * 16, inbuf); + crc2 = CRC_VEC_U64_LOAD_BE(2 * 16, inbuf); + crc3 = CRC_VEC_U64_LOAD_BE(3 * 16, inbuf); + + inbuf += 4 * 16; + inlen -= 4 * 16; + + /* Fold by 4. 
*/ + while (inlen >= 4 * 16) + { + v0 = CRC_VEC_U64_LOAD_BE(0 * 16, inbuf); + crc0 = asm_vpmsumd(crc0, k2k1) ^ v0; + + v0 = CRC_VEC_U64_LOAD_BE(1 * 16, inbuf); + crc1 = asm_vpmsumd(crc1, k2k1) ^ v0; + + v0 = CRC_VEC_U64_LOAD_BE(2 * 16, inbuf); + crc2 = asm_vpmsumd(crc2, k2k1) ^ v0; + + v0 = CRC_VEC_U64_LOAD_BE(3 * 16, inbuf); + crc3 = asm_vpmsumd(crc3, k2k1) ^ v0; + + inbuf += 4 * 16; + inlen -= 4 * 16; + } + + /* Fold 4 to 1. */ + crc1 ^= asm_vpmsumd(crc0, k4k3); + crc2 ^= asm_vpmsumd(crc1, k4k3); + crc3 ^= asm_vpmsumd(crc2, k4k3); + crc = crc3; + } + else + { + v0 = CRC_VEC_U64_LOAD_BE(0, inbuf); + crc ^= v0; + + inbuf += 16; + inlen -= 16; + } + + /* Fold by 1. */ + while (inlen >= 16) + { + v0 = CRC_VEC_U64_LOAD_BE(0, inbuf); + crc = asm_vpmsumd(k4k3, crc); + crc ^= v0; + + inbuf += 16; + inlen -= 16; + } + + /* Partial fold. */ + if (inlen) + { + /* Load last input and add padding zeros. */ + vector2x_u64 mask = CRC_VEC_U64_LOAD_LE(inlen, crc32_partial_fold_input_mask); + vector2x_u64 shl_shuf = CRC_VEC_U64_LOAD_LE(32 - inlen, crc32_refl_shuf_shift); + vector2x_u64 shr_shuf = CRC_VEC_U64_LOAD_LE(inlen + 16, crc32_shuf_shift); + + v0 = CRC_VEC_U64_LOAD_LE(inlen - 16, inbuf); + v0 &= mask; + + crc = CRC_VEC_SWAP_TO_LE(crc); + crc2 = (vector2x_u64)vec_perm((vector16x_u8)crc, (vector16x_u8)zero, + (vector16x_u8)shr_shuf); + v0 |= crc2; + v0 = CRC_VEC_SWAP(v0); + crc = (vector2x_u64)vec_perm((vector16x_u8)crc, (vector16x_u8)zero, + (vector16x_u8)shl_shuf); + crc = asm_vpmsumd(k4k3, crc); + crc ^= v0; + + inbuf += inlen; + inlen -= inlen; + } + + /* Final fold. */ + + /* reduce 128-bits to 96-bits */ + v0 = (vector2x_u64)vec_sld_u32((vector4x_u32)crc, + (vector4x_u32)zero, 2); + crc = asm_vpmsumd(k4hi, crc); + crc ^= v0; /* bottom 32-bit are zero */ + + /* reduce 96-bits to 64-bits */ + v0 = crc & low_96bit_mask; /* [00][x2][x1][00] */ + crc >>= 32; /* [00][x3][00][x0] */ + crc = asm_vpmsumd(k5hi, crc); /* [00][xx][xx][00] */ + crc ^= v0; /* top and bottom 32-bit are zero */ + + /* barrett reduction */ + p_my_hi = p_my; + p_my_lo = p_my; + p_my_hi[VEC_U64_LO] = 0; + p_my_lo[VEC_U64_HI] = 0; + v0 = crc >> 32; /* [00][00][00][x1] */ + crc = asm_vpmsumd(p_my_hi, crc); /* [00][xx][xx][xx] */ + crc = (vector2x_u64)vec_sld_u32((vector4x_u32)crc, + (vector4x_u32)crc, 3); /* [x0][00][x2][x1] */ + crc = asm_vpmsumd(p_my_lo, crc); /* [00][xx][xx][xx] */ + crc ^= v0; + + *pcrc = _gcry_bswap32(crc[VEC_U64_LO]); +} + + +static ASM_FUNC_ATTR_INLINE u32 +crc32_ppc8_ce_reduction_4 (u32 data, u32 crc, + const struct crc32_consts_s *consts) +{ + vector2x_u64 my_p = CRC_VEC_U64_LOAD(0, &consts->my_p[0]); + vector2x_u64 v0 = CRC_VEC_U64_DEF((u64)data << 32, 0); + v0 = asm_vpmsumd(v0, my_p); /* [00][x1][x0][00] */ + v0[VEC_U64_LO] = 0; /* [00][x1][00][00] */ + v0 = asm_vpmsumd(v0, my_p); /* [00][00][xx][xx] */ + return _gcry_bswap32(v0[VEC_U64_LO]) ^ crc; +} + + +static ASM_FUNC_ATTR_INLINE void +crc32_less_than_16 (u32 *pcrc, const byte *inbuf, size_t inlen, + const struct crc32_consts_s *consts) +{ + u32 crc = *pcrc; + u32 data; + + while (inlen >= 4) + { + data = buf_get_le32(inbuf); + data ^= crc; + data = _gcry_bswap32(data); + + inlen -= 4; + inbuf += 4; + + crc = crc32_ppc8_ce_reduction_4 (data, 0, consts); + } + + switch (inlen) + { + case 0: + break; + case 1: + data = inbuf[0]; + data ^= crc; + data = data & 0xffU; + crc = crc >> 8; + crc = crc32_ppc8_ce_reduction_4 (data, crc, consts); + break; + case 2: + data = inbuf[0] << 0; + data |= inbuf[1] << 8; + data ^= crc; + data = _gcry_bswap32(data << 
16); + crc = crc >> 16; + crc = crc32_ppc8_ce_reduction_4 (data, crc, consts); + break; + case 3: + data = inbuf[0] << 0; + data |= inbuf[1] << 8; + data |= inbuf[2] << 16; + data ^= crc; + data = _gcry_bswap32(data << 8); + crc = crc >> 24; + crc = crc32_ppc8_ce_reduction_4 (data, crc, consts); + break; + } + + *pcrc = crc; +} + +void ASM_FUNC_ATTR +_gcry_crc32_ppc8_vpmsum (u32 *pcrc, const byte *inbuf, size_t inlen) +{ + const struct crc32_consts_s *consts = &crc32_consts; + + if (!inlen) + return; + + if (inlen >= 16) + crc32r_ppc8_ce_bulk (pcrc, inbuf, inlen, consts); + else + crc32r_less_than_16 (pcrc, inbuf, inlen, consts); +} + +void ASM_FUNC_ATTR +_gcry_crc24rfc2440_ppc8_vpmsum (u32 *pcrc, const byte *inbuf, size_t inlen) +{ + const struct crc32_consts_s *consts = &crc24rfc2440_consts; + + if (!inlen) + return; + + /* Note: *pcrc in input endian. */ + + if (inlen >= 16) + crc32_ppc8_ce_bulk (pcrc, inbuf, inlen, consts); + else + crc32_less_than_16 (pcrc, inbuf, inlen, consts); +} + +#endif diff --git a/cipher/crc.c b/cipher/crc.c index 2abbab288..6d70f644f 100644 --- a/cipher/crc.c +++ b/cipher/crc.c @@ -52,6 +52,19 @@ # endif #endif /* USE_ARM_PMULL */ +/* USE_PPC_VPMSUM indicates whether to enable PowerPC vector + * accelerated code. */ +#undef USE_PPC_VPMSUM +#ifdef ENABLE_PPC_CRYPTO_SUPPORT +# if defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) +# if __GNUC__ >= 4 +# define USE_PPC_VPMSUM 1 +# endif +# endif +#endif /* USE_PPC_VPMSUM */ + + typedef struct { u32 CRC; @@ -60,6 +73,9 @@ typedef struct #endif #ifdef USE_ARM_PMULL unsigned int use_pmull:1; /* ARMv8 PMULL shall be used. */ +#endif +#ifdef USE_PPC_VPMSUM + unsigned int use_vpmsum:1; /* POWER vpmsum shall be used. */ #endif byte buf[4]; } @@ -80,6 +96,13 @@ void _gcry_crc24rfc2440_armv8_ce_pmull (u32 *pcrc, const byte *inbuf, size_t inlen); #endif +#ifdef USE_PPC_VPMSUM +/*-- crc-ppc.c --*/ +void _gcry_crc32_ppc8_vpmsum (u32 *pcrc, const byte *inbuf, size_t inlen); +void _gcry_crc24rfc2440_ppc8_vpmsum (u32 *pcrc, const byte *inbuf, + size_t inlen); +#endif + /* * Code generated by universal_crc by Danjel McGougan @@ -388,6 +411,9 @@ crc32_init (void *context, unsigned int flags) #ifdef USE_ARM_PMULL ctx->use_pmull = (hwf & HWF_ARM_NEON) && (hwf & HWF_ARM_PMULL); #endif +#ifdef USE_PPC_VPMSUM + ctx->use_vpmsum = !!(hwf & HWF_PPC_ARCH_2_07); +#endif (void)flags; (void)hwf; @@ -416,6 +442,13 @@ crc32_write (void *context, const void *inbuf_arg, size_t inlen) return; } #endif +#ifdef USE_PPC_VPMSUM + if (ctx->use_vpmsum) + { + _gcry_crc32_ppc8_vpmsum(&ctx->CRC, inbuf, inlen); + return; + } +#endif if (!inbuf || !inlen) return; @@ -477,6 +510,9 @@ crc32rfc1510_init (void *context, unsigned int flags) #ifdef USE_ARM_PMULL ctx->use_pmull = (hwf & HWF_ARM_NEON) && (hwf & HWF_ARM_PMULL); #endif +#ifdef USE_PPC_VPMSUM + ctx->use_vpmsum = !!(hwf & HWF_PPC_ARCH_2_07); +#endif (void)flags; (void)hwf; @@ -811,6 +847,9 @@ crc24rfc2440_init (void *context, unsigned int flags) #ifdef USE_ARM_PMULL ctx->use_pmull = (hwf & HWF_ARM_NEON) && (hwf & HWF_ARM_PMULL); #endif +#ifdef USE_PPC_VPMSUM + ctx->use_vpmsum = !!(hwf & HWF_PPC_ARCH_2_07); +#endif (void)hwf; (void)flags; @@ -839,6 +878,13 @@ crc24rfc2440_write (void *context, const void *inbuf_arg, size_t inlen) return; } #endif +#ifdef USE_PPC_VPMSUM + if (ctx->use_vpmsum) + { + _gcry_crc24rfc2440_ppc8_vpmsum(&ctx->CRC, inbuf, inlen); + return; + } +#endif if (!inbuf || !inlen) return; diff --git a/configure.ac b/configure.ac index 
6333076b7..75ec82160 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1908,6 +1908,7 @@ AC_CACHE_CHECK([whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto
 		  "vadduwm %v0, %v1, %v22;\n"
 		  "vshasigmaw %v0, %v1, 0, 15;\n"
 		  "vshasigmad %v0, %v1, 0, 15;\n"
+		  "vpmsumd %v11, %v11, %v11;\n"
 		);
 	      ]])],
 	    [gcry_cv_gcc_inline_asm_ppc_altivec=yes])
@@ -2564,6 +2565,15 @@ if test "$found" = "1" ; then
          GCRYPT_CIPHERS="$GCRYPT_CIPHERS crc-armv8-ce.lo"
          GCRYPT_CIPHERS="$GCRYPT_CIPHERS crc-armv8-aarch64-ce.lo"
       ;;
+      powerpc64le-*-*)
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS crc-ppc.lo"
+      ;;
+      powerpc64-*-*)
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS crc-ppc.lo"
+      ;;
+      powerpc-*-*)
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS crc-ppc.lo"
+      ;;
    esac
 fi
 

From stefbon at gmail.com  Mon Sep 16 10:34:55 2019
From: stefbon at gmail.com (Stef Bon)
Date: Mon, 16 Sep 2019 10:34:55 +0200
Subject: ECC shared secret
In-Reply-To:
References:
Message-ID:

Hi,

yes, these methods are hard to find. Does
https://tools.ietf.org/html/rfc5656 offer something?

S.

From jussi.kivilinna at iki.fi  Mon Sep 16 17:59:33 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Mon, 16 Sep 2019 18:59:33 +0300
Subject: [PATCH] Add PowerPC extra CFLAGS also for chacha20-ppc and crc-ppc
Message-ID: <156864957330.30010.3694749927884618165.stgit@localhost.localdomain>

* cipher/Makefile.am: Add 'ppc_vcrypto_cflags' for chacha20-ppc.o/.lo
and crc-ppc.o/.lo.
--

Signed-off-by: Jussi Kivilinna
---
 0 files changed

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index b283d2400..bf13c199a 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -225,3 +225,15 @@ sha512-ppc.o: $(srcdir)/sha512-ppc.c Makefile
 
 sha512-ppc.lo: $(srcdir)/sha512-ppc.c Makefile
 	`echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) `
+
+chacha20-ppc.o: $(srcdir)/chacha20-ppc.c Makefile
+	`echo $(COMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) `
+
+chacha20-ppc.lo: $(srcdir)/chacha20-ppc.c Makefile
+	`echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) `
+
+crc-ppc.o: $(srcdir)/crc-ppc.c Makefile
+	`echo $(COMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) `
+
+crc-ppc.lo: $(srcdir)/crc-ppc.c Makefile
+	`echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) `

From jussi.kivilinna at iki.fi  Thu Sep 19 21:20:25 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Thu, 19 Sep 2019 22:20:25 +0300
Subject: [PATCH] Reduce size of x86-64 stitched Chacha20-Poly1305 implementations
Message-ID: <156892082514.7371.15125050938051411098.stgit@localhost.localdomain>

* cipher/chacha20-amd64-avx2.S
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): De-unroll round loop.
* cipher/chacha20-amd64-ssse3.S
(_gcry_chacha20_poly1305_amd64_ssse3_blocks4):
(_gcry_chacha20_poly1305_amd64_ssse3_blocks1): Ditto.
-- Object size before: text data bss dec hex filename 13428 0 0 13428 3474 cipher/.libs/chacha20-amd64-avx2.o 23175 0 0 23175 5a87 cipher/.libs/chacha20-amd64-ssse3.o Object size after: text data bss dec hex filename 4815 0 0 4815 12cf cipher/.libs/chacha20-amd64-avx2.o 9284 0 0 9284 2444 cipher/.libs/chacha20-amd64-ssse3.o Benchmark on AMD Ryzen 3700X (AVX2 impl.): Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.267 ns/B 3575 MiB/s 1.15 c/B 4318 STREAM dec | 0.266 ns/B 3586 MiB/s 1.15 c/B 4329 POLY1305 enc | 0.315 ns/B 3024 MiB/s 1.36 c/B 4315?1 POLY1305 dec | 0.296 ns/B 3220 MiB/s 1.28 c/B 4310 POLY1305 auth | 0.223 ns/B 4270 MiB/s 0.968 c/B 4335 After: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.266 ns/B 3583 MiB/s 1.15 c/B 4327 STREAM dec | 0.265 ns/B 3603 MiB/s 1.16 c/B 4371?1 POLY1305 enc | 0.293 ns/B 3251 MiB/s 1.27 c/B 4315 POLY1305 dec | 0.279 ns/B 3418 MiB/s 1.19 c/B 4282?3 POLY1305 auth | 0.225 ns/B 4241 MiB/s 0.978 c/B 4351 Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/chacha20-amd64-avx2.S b/cipher/chacha20-amd64-avx2.S index de6263b69..053638d02 100644 --- a/cipher/chacha20-amd64-avx2.S +++ b/cipher/chacha20-amd64-avx2.S @@ -331,6 +331,8 @@ ELF(.size _gcry_chacha20_amd64_avx2_blocks8, 8-way stitched chacha20-poly1305 **********************************************************************/ +#define _ /*_*/ + .align 8 .globl _gcry_chacha20_poly1305_amd64_avx2_blocks8 ELF(.type _gcry_chacha20_poly1305_amd64_avx2_blocks8, at function;) @@ -353,7 +355,7 @@ _gcry_chacha20_poly1305_amd64_avx2_blocks8: vzeroupper; - subq $(8 * 8) + STACK_MAX + 32, %rsp; + subq $(9 * 8) + STACK_MAX + 32, %rsp; andq $~31, %rsp; movq %rbx, (STACK_MAX + 0 * 8)(%rsp); @@ -406,33 +408,14 @@ _gcry_chacha20_poly1305_amd64_avx2_blocks8: vpbroadcastd (15 * 4)(INPUT), X15; vmovdqa X15, (STACK_TMP)(%rsp); - # rounds 0,1 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART1(0 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(1 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(2 * 16), - POLY1305_BLOCK_PART2()) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(3 * 16)) + /* Process eight ChaCha20 blocks and 32 Poly1305 blocks. 
*/ - # rounds 2,3 + movl $20, (STACK_MAX + 8 * 8 + 4)(%rsp); +.Lround8_with_poly1305_outer: + movl $8, (STACK_MAX + 8 * 8)(%rsp); +.Lround8_with_poly1305_inner: + /* rounds 0-7 & 10-17 */ + POLY1305_BLOCK_PART1(0 * 16) QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, POLY1305_BLOCK_PART2(), POLY1305_BLOCK_PART3(), @@ -440,231 +423,59 @@ _gcry_chacha20_poly1305_amd64_avx2_blocks8: POLY1305_BLOCK_PART5()) vmovdqa (STACK_TMP)(%rsp), X15; vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART1(4 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(5 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(6 * 16), - POLY1305_BLOCK_PART2()) - - # rounds 4,5 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(7 * 16)) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); + POLY1305_BLOCK_PART1(1 * 16) QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, POLY1305_BLOCK_PART2(), POLY1305_BLOCK_PART3(), POLY1305_BLOCK_PART4(), POLY1305_BLOCK_PART5()) + POLY1305_BLOCK_PART1(2 * 16) QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART1(8 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(9 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - - # rounds 6,7 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(10 * 16), - POLY1305_BLOCK_PART2()) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(11 * 16)) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART1(12 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - - # rounds 8,9 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(13 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(14 * 16), - POLY1305_BLOCK_PART2()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(15 * 16)) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, POLY1305_BLOCK_PART2(), POLY1305_BLOCK_PART3(), 
POLY1305_BLOCK_PART4(), POLY1305_BLOCK_PART5()) - - # rounds 10,11 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART1(16 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(17 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(18 * 16), - POLY1305_BLOCK_PART2()) vmovdqa (STACK_TMP)(%rsp), X8; vmovdqa X15, (STACK_TMP)(%rsp); + POLY1305_BLOCK_PART1(3 * 16) + lea (4 * 16)(POLY_RSRC), POLY_RSRC; QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(19 * 16)) - - # rounds 12,13 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, POLY1305_BLOCK_PART2(), POLY1305_BLOCK_PART3(), POLY1305_BLOCK_PART4(), POLY1305_BLOCK_PART5()) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART1(20 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(21 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(22 * 16), - POLY1305_BLOCK_PART2()) - # rounds 14,15 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(23 * 16)) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART1(24 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(25 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) + subl $2, (STACK_MAX + 8 * 8)(%rsp); + jnz .Lround8_with_poly1305_inner; - # rounds 16,17 + /* rounds 8-9 & 18-19 */ QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(26 * 16), - POLY1305_BLOCK_PART2()) + _, + _, + _, + _) vmovdqa (STACK_TMP)(%rsp), X15; vmovdqa X8, (STACK_TMP)(%rsp); QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(27 * 16)) + _, + _, + _, + _) QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) + _, + _, + _, + _) vmovdqa (STACK_TMP)(%rsp), X8; vmovdqa X15, (STACK_TMP)(%rsp); QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART1(28 * 16), - POLY1305_BLOCK_PART2(), - 
POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) + _, + _, + _, + _) - # rounds 18,19 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(29 * 16), - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - vmovdqa (STACK_TMP)(%rsp), X15; - vmovdqa X8, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(30 * 16), - POLY1305_BLOCK_PART2()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(31 * 16)) - vmovdqa (STACK_TMP)(%rsp), X8; - vmovdqa X15, (STACK_TMP)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X15, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) + subl $10, (STACK_MAX + 8 * 8 + 4)(%rsp); + jnz .Lround8_with_poly1305_outer; movq (STACK_MAX + 5 * 8)(%rsp), SRC; movq (STACK_MAX + 6 * 8)(%rsp), DST; @@ -741,7 +552,6 @@ _gcry_chacha20_poly1305_amd64_avx2_blocks8: subq $8, (STACK_MAX + 7 * 8)(%rsp); # NBLKS - lea (32 * 16)(POLY_RSRC), POLY_RSRC; lea (8 * 64)(DST), DST; lea (8 * 64)(SRC), SRC; movq SRC, (STACK_MAX + 5 * 8)(%rsp); diff --git a/cipher/chacha20-amd64-ssse3.S b/cipher/chacha20-amd64-ssse3.S index 6bbf12fc1..77a27d349 100644 --- a/cipher/chacha20-amd64-ssse3.S +++ b/cipher/chacha20-amd64-ssse3.S @@ -511,6 +511,8 @@ ELF(.size _gcry_chacha20_amd64_ssse3_blocks1, 4-way stitched chacha20-poly1305 **********************************************************************/ +#define _ /*_*/ + .align 8 .globl _gcry_chacha20_poly1305_amd64_ssse3_blocks4 ELF(.type _gcry_chacha20_poly1305_amd64_ssse3_blocks4, at function;) @@ -531,7 +533,7 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks4: movq %rsp, %rbp; CFI_DEF_CFA_REGISTER(%rbp); - subq $(8 * 8) + STACK_MAX + 16, %rsp; + subq $(9 * 8) + STACK_MAX + 16, %rsp; andq $~15, %rsp; movq %rbx, (STACK_MAX + 0 * 8)(%rsp); @@ -586,51 +588,14 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks4: movdqa X11, (STACK_TMP)(%rsp); movdqa X15, (STACK_TMP1)(%rsp); - /* rounds 0,1 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART1(0 * 16), - POLY1305_BLOCK_PART2()) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(1 * 16)) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - - /* rounds 2,3 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART1(2 * 16), - POLY1305_BLOCK_PART2()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, 
(STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(3 * 16)) + /* Process four ChaCha20 blocks and sixteen Poly1305 blocks. */ - /* rounds 4,5 */ + movl $20, (STACK_MAX + 8 * 8 + 4)(%rsp); +.Lround4_with_poly1305_outer: + movl $8, (STACK_MAX + 8 * 8)(%rsp); +.Lround4_with_poly1305_inner: + /* rounds 0-7 & 10-17 */ + POLY1305_BLOCK_PART1(0 * 16) QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, POLY1305_BLOCK_PART2(), POLY1305_BLOCK_PART3()) @@ -641,50 +606,8 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks4: QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, POLY1305_BLOCK_PART4(), POLY1305_BLOCK_PART5()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART1(4 * 16), - POLY1305_BLOCK_PART2()) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - - /* rounds 6,7 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(5 * 16)) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART1(6 * 16), - POLY1305_BLOCK_PART2()) - - /* rounds 8,9 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(7 * 16)) + POLY1305_BLOCK_PART1(1 * 16) + lea (2 * 16)(POLY_RSRC), POLY_RSRC; QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, POLY1305_BLOCK_PART2(), POLY1305_BLOCK_PART3()) @@ -696,115 +619,33 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks4: POLY1305_BLOCK_PART4(), POLY1305_BLOCK_PART5()) - /* rounds 10,11 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART1(8 * 16), - POLY1305_BLOCK_PART2()) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(9 * 16)) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - - /* rounds 12,13 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - 
movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART1(10 * 16), - POLY1305_BLOCK_PART2()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(11 * 16)) - - /* rounds 14,15 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART1(12 * 16), - POLY1305_BLOCK_PART2()) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) + subl $2, (STACK_MAX + 8 * 8)(%rsp); + jnz .Lround4_with_poly1305_inner; - /* rounds 16,17 */ + /* rounds 8-9 & 18-19 */ QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(13 * 16)) + _, + _) movdqa (STACK_TMP)(%rsp), X11; movdqa (STACK_TMP1)(%rsp), X15; movdqa X8, (STACK_TMP)(%rsp); movdqa X9, (STACK_TMP1)(%rsp); QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) + _, + _) QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) + _, + _) movdqa (STACK_TMP)(%rsp), X8; movdqa (STACK_TMP1)(%rsp), X9; movdqa X11, (STACK_TMP)(%rsp); movdqa X15, (STACK_TMP1)(%rsp); QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART1(14 * 16), - POLY1305_BLOCK_PART2()) + _, + _) - /* rounds 18,19 */ - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,X11,X15, - POLY1305_BLOCK_PART3(), - POLY1305_BLOCK_PART4()) - movdqa (STACK_TMP)(%rsp), X11; - movdqa (STACK_TMP1)(%rsp), X15; - movdqa X8, (STACK_TMP)(%rsp); - movdqa X9, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,X8,X9, - POLY1305_BLOCK_PART5(), - POLY1305_BLOCK_PART1(15 * 16)) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,X8,X9, - POLY1305_BLOCK_PART2(), - POLY1305_BLOCK_PART3()) - movdqa (STACK_TMP)(%rsp), X8; - movdqa (STACK_TMP1)(%rsp), X9; - movdqa X11, (STACK_TMP)(%rsp); - movdqa X15, (STACK_TMP1)(%rsp); - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,X11,X15, - POLY1305_BLOCK_PART4(), - POLY1305_BLOCK_PART5()) + subl $10, (STACK_MAX + 8 * 8 + 4)(%rsp); + jnz .Lround4_with_poly1305_outer; /* tmp := X15 */ movdqa (STACK_TMP)(%rsp), X11; @@ -877,7 +718,6 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks4: subq $4, (STACK_MAX + 7 * 8)(%rsp); # NBLKS - lea (16 * 16)(POLY_RSRC), POLY_RSRC; lea (4 * 64)(DST), DST; lea (4 * 64)(SRC), SRC; movq SRC, (STACK_MAX + 5 * 8)(%rsp); @@ -954,7 +794,7 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks1: movq %rsp, %rbp; CFI_DEF_CFA_REGISTER(%rbp); - subq $(8 * 8), %rsp; + subq $(9 * 8), %rsp; movq %rbx, (0 * 8)(%rsp); movq %r12, (1 * 8)(%rsp); movq %r13, (2 * 8)(%rsp); @@ -999,95 +839,31 @@ 
_gcry_chacha20_poly1305_amd64_ssse3_blocks1: /* Process two ChaCha20 blocks and eight Poly1305 blocks. */ + movl $20, (8 * 8 + 4)(%rsp); +.Lround2_with_poly1305_outer: + movl $8, (8 * 8)(%rsp); +.Lround2_with_poly1305_inner: POLY1305_BLOCK_PART1(0 * 16); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); + lea (1 * 16)(POLY_RSRC), POLY_RSRC; POLY1305_BLOCK_PART2(); QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); POLY1305_BLOCK_PART3(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); POLY1305_BLOCK_PART4(); QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART1(1 * 16); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART1(2 * 16); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART1(3 * 16); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART1(4 * 16); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART1(5 * 16); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART1(6 * 16); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); + subl $2, (8 * 8)(%rsp); + jnz .Lround2_with_poly1305_inner; - POLY1305_BLOCK_PART3(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART4(); QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART5(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART1(7 * 16); QUARTERROUND4(X8, 
X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X8, X9, X14, X15, X5, X6, X7, 0x93, 0x4e, 0x39); + subl $10, (8 * 8 + 4)(%rsp); + jnz .Lround2_with_poly1305_outer; movq (5 * 8)(%rsp), SRC; movq (6 * 8)(%rsp), DST; @@ -1123,7 +899,6 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks1: clear(X15); subq $2, (7 * 8)(%rsp); # NBLKS - lea (2 * 64)(POLY_RSRC), POLY_RSRC; lea (2 * 64)(SRC), SRC; lea (2 * 64)(DST), DST; movq SRC, (5 * 8)(%rsp); @@ -1137,55 +912,31 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks1: movdqa X13, X3; /* Process one ChaCha20 block and four Poly1305 blocks. */ + + movl $20, (8 * 8 + 4)(%rsp); +.Lround1_with_poly1305_outer: + movl $8, (8 * 8)(%rsp); +.Lround1_with_poly1305_inner: POLY1305_BLOCK_PART1(0 * 16); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); POLY1305_BLOCK_PART2(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); + lea (1 * 16)(POLY_RSRC), POLY_RSRC; POLY1305_BLOCK_PART3(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); POLY1305_BLOCK_PART4(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART1(1 * 16); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); POLY1305_BLOCK_PART5(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART1(2 * 16); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART2(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART3(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART1(3 * 16); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); + subl $4, (8 * 8)(%rsp); + jnz .Lround1_with_poly1305_inner; - POLY1305_BLOCK_PART2(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART3(); QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); - POLY1305_BLOCK_PART4(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x39, 0x4e, 0x93); - POLY1305_BLOCK_PART5(); - QUARTERROUND4(X0, X1, X2, X3, X5, X6, X7, 0x93, 0x4e, 0x39); + subl $10, (8 * 8 + 4)(%rsp); + jnz .Lround1_with_poly1305_outer; movq (5 * 8)(%rsp), SRC; movq (6 * 8)(%rsp), DST; @@ -1204,7 +955,6 @@ _gcry_chacha20_poly1305_amd64_ssse3_blocks1: xor_src_dst(DST, SRC, 12 * 4, X3, X7); subq $1, (7 * 8)(%rsp); # NBLKS - lea (64)(POLY_RSRC), POLY_RSRC; lea (64)(SRC), SRC; lea (64)(DST), DST; movq SRC, (5 * 8)(%rsp); From jussi.kivilinna at iki.fi Thu Sep 19 21:36:29 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Thu, 19 Sep 2019 22:36:29 +0300 Subject: [PATCH] Small tweak for PowerPC Chacha20-Poly1305 round loop Message-ID: <156892178962.28778.3980500179028590554.stgit@localhost.localdomain> * 
cipher/chacha20-ppc.c (_gcry_chacha20_poly1305_ppc8_block4): Use inner/outer round loop structure instead of two separate loops for stitched and non-stitched parts. -- Benchmark on POWER8 ~3.8Ghz: Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 0.619 ns/B 1541 MiB/s 2.35 c/B STREAM dec | 0.619 ns/B 1541 MiB/s 2.35 c/B POLY1305 enc | 0.784 ns/B 1216 MiB/s 2.98 c/B POLY1305 dec | 0.770 ns/B 1239 MiB/s 2.93 c/B POLY1305 auth | 0.502 ns/B 1898 MiB/s 1.91 c/B After (~2% faster): CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte POLY1305 enc | 0.765 ns/B 1247 MiB/s 2.91 c/B POLY1305 dec | 0.749 ns/B 1273 MiB/s 2.85 c/B Benchmark on POWER9 ~3.8Ghz: Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 0.687 ns/B 1389 MiB/s 2.61 c/B STREAM dec | 0.692 ns/B 1379 MiB/s 2.63 c/B POLY1305 enc | 1.08 ns/B 880.9 MiB/s 4.11 c/B POLY1305 dec | 1.07 ns/B 888.0 MiB/s 4.08 c/B POLY1305 auth | 0.459 ns/B 2078 MiB/s 1.74 c/B After (~5% faster): CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte POLY1305 enc | 1.03 ns/B 929.2 MiB/s 3.90 c/B POLY1305 dec | 1.02 ns/B 936.6 MiB/s 3.87 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/chacha20-ppc.c b/cipher/chacha20-ppc.c index 17e2f0902..985f2fcd6 100644 --- a/cipher/chacha20-ppc.c +++ b/cipher/chacha20-ppc.c @@ -469,7 +469,7 @@ _gcry_chacha20_poly1305_ppc8_blocks4(u32 *state, byte *dst, const byte *src, u64 m0, m1, m2; u64 x0_lo, x0_hi, x1_lo, x1_hi; u64 t0_lo, t0_hi, t1_lo, t1_hi; - int i; + unsigned int i, o; /* load poly1305 state */ m2 = 1; @@ -515,19 +515,21 @@ _gcry_chacha20_poly1305_ppc8_blocks4(u32 *state, byte *dst, const byte *src, v12 += counters_0123; v13 -= vec_cmplt(v12, counters_0123); - for (i = 0; i < 16; i += 2) - { - POLY1305_BLOCK_PART1((i + 0) * 16); - QUARTERROUND2(v0, v4, v8, v12, v1, v5, v9, v13) - POLY1305_BLOCK_PART2(); - QUARTERROUND2(v2, v6, v10, v14, v3, v7, v11, v15) - POLY1305_BLOCK_PART1((i + 1) * 16); - QUARTERROUND2(v0, v5, v10, v15, v1, v6, v11, v12) - POLY1305_BLOCK_PART2(); - QUARTERROUND2(v2, v7, v8, v13, v3, v4, v9, v14) - } - for (; i < 20; i += 2) + for (o = 20; o; o -= 10) { + for (i = 8; i; i -= 2) + { + POLY1305_BLOCK_PART1(0 * 16); + QUARTERROUND2(v0, v4, v8, v12, v1, v5, v9, v13) + POLY1305_BLOCK_PART2(); + QUARTERROUND2(v2, v6, v10, v14, v3, v7, v11, v15) + POLY1305_BLOCK_PART1(1 * 16); + poly1305_src += 2 * 16; + QUARTERROUND2(v0, v5, v10, v15, v1, v6, v11, v12) + POLY1305_BLOCK_PART2(); + QUARTERROUND2(v2, v7, v8, v13, v3, v4, v9, v14) + } + QUARTERROUND2(v0, v4, v8, v12, v1, v5, v9, v13) QUARTERROUND2(v2, v6, v10, v14, v3, v7, v11, v15) QUARTERROUND2(v0, v5, v10, v15, v1, v6, v11, v12) @@ -601,7 +603,6 @@ _gcry_chacha20_poly1305_ppc8_blocks4(u32 *state, byte *dst, const byte *src, src += 4*64; dst += 4*64; - poly1305_src += 16*16; nblks -= 4; } From jussi.kivilinna at iki.fi Tue Sep 24 23:24:53 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 25 Sep 2019 00:24:53 +0300 Subject: [PATCH] Add stitched ChaCha20-Poly1305 ARMv8/AArch64 implementation Message-ID: <156936029374.31744.1854933986359849463.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'asm-poly1305-aarch64.h'. * cipher/asm-poly1305-aarch64.h: New. * cipher/chacha20-aarch64.S (ROT8, _, ROTATE2_8): New. (ROTATE2): Add interleave operator. (QUARTERROUND2): Add interleave operators; Use ROTATE2_8. (chacha20_data): Rename to... (_gcry_chacha20_aarch64_blocks4_data_inc_counter): ...to this. (_gcry_chacha20_aarch64_blocks4_data_rot8): New. 
(_gcry_chacha20_aarch64_blocks4): Preload ROT8; Fill empty parameters for QUARTERROUND2 interleave operators. (_gcry_chacha20_poly1305_aarch64_blocks4): New. * cipher/chacha20.c [USE_AARCH64_SIMD] (_gcry_chacha20_poly1305_aarch64_blocks4): New. (_gcry_chacha20_poly1305_encrypt, _gcry_chacha20_poly1305_decrypt) [USE_AARCH64_SIMD]: Use stitched implementation if ctr->use_neon is set. -- Patch also make small tweak for regular ARMv8/AArch64 ChaCha20 implementation for 'rotate by 8' operation. Benchmark on Cortex-A53 @ 1104 Mhz: Before: CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 4.93 ns/B 193.5 MiB/s 5.44 c/B STREAM dec | 4.93 ns/B 193.6 MiB/s 5.44 c/B POLY1305 enc | 7.71 ns/B 123.7 MiB/s 8.51 c/B POLY1305 dec | 7.70 ns/B 123.8 MiB/s 8.50 c/B POLY1305 auth | 2.77 ns/B 343.7 MiB/s 3.06 c/B After (chacha20 ~6% faster, chacha20-poly1305 ~29% faster): CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte STREAM enc | 4.65 ns/B 205.2 MiB/s 5.13 c/B STREAM dec | 4.65 ns/B 205.1 MiB/s 5.13 c/B POLY1305 enc | 5.97 ns/B 159.7 MiB/s 6.59 c/B POLY1305 dec | 5.92 ns/B 161.1 MiB/s 6.54 c/B POLY1305 auth | 2.78 ns/B 343.3 MiB/s 3.07 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index bf13c199a..dc63a736f 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -70,8 +70,9 @@ libcipher_la_SOURCES = \ sha1.h EXTRA_libcipher_la_SOURCES = \ - asm-common-amd64.h \ asm-common-aarch64.h \ + asm-common-amd64.h \ + asm-poly1305-aarch64.h \ asm-poly1305-amd64.h \ arcfour.c arcfour-amd64.S \ blowfish.c blowfish-amd64.S blowfish-arm.S \ diff --git a/cipher/asm-poly1305-aarch64.h b/cipher/asm-poly1305-aarch64.h new file mode 100644 index 000000000..6c342bee7 --- /dev/null +++ b/cipher/asm-poly1305-aarch64.h @@ -0,0 +1,245 @@ +/* asm-common-aarch64.h - Poly1305 macros for ARMv8/AArch64 assembly + * + * Copyright (C) 2019 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#ifndef GCRY_ASM_POLY1305_AARCH64_H +#define GCRY_ASM_POLY1305_AARCH64_H + +#include "asm-common-aarch64.h" + +#ifdef __AARCH64EL__ + #define le_to_host(reg) /*_*/ +#else + #define le_to_host(reg) rev reg, reg; +#endif + +/********************************************************************** + poly1305 for stitched chacha20-poly1305 Aarch64 implementations + **********************************************************************/ + +#define POLY_RSTATE x8 +#define POLY_RSRC x9 + +#define POLY_R_H0 x10 +#define POLY_R_H1 x11 +#define POLY_R_H2 x12 +#define POLY_R_H2d w12 +#define POLY_R_R0 x13 +#define POLY_R_R1 x14 +#define POLY_R_R1_MUL5 x15 +#define POLY_R_X0_HI x16 +#define POLY_R_X0_LO x17 +#define POLY_R_X1_HI x19 +#define POLY_R_X1_LO x20 +#define POLY_R_ONE x21 +#define POLY_R_ONEd w21 + +#define POLY_TMP0 x22 +#define POLY_TMP1 x23 +#define POLY_TMP2 x24 +#define POLY_TMP3 x25 + +#define POLY_CHACHA_ROUND x26 + +#define POLY_S_R0 (4 * 4 + 0 * 8) +#define POLY_S_R1 (4 * 4 + 1 * 8) +#define POLY_S_H0 (4 * 4 + 2 * 8 + 0 * 8) +#define POLY_S_H1 (4 * 4 + 2 * 8 + 1 * 8) +#define POLY_S_H2d (4 * 4 + 2 * 8 + 2 * 8) + +#define POLY1305_PUSH_REGS() \ + stp x19, x20, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + CFI_REG_ON_STACK(19, 0); \ + CFI_REG_ON_STACK(20, 8); \ + stp x21, x22, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + CFI_REG_ON_STACK(21, 0); \ + CFI_REG_ON_STACK(22, 8); \ + stp x23, x24, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + CFI_REG_ON_STACK(23, 0); \ + CFI_REG_ON_STACK(24, 8); \ + stp x25, x26, [sp, #-16]!; \ + CFI_ADJUST_CFA_OFFSET(16); \ + CFI_REG_ON_STACK(25, 0); \ + CFI_REG_ON_STACK(26, 8); + +#define POLY1305_POP_REGS() \ + ldp x25, x26, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + CFI_RESTORE(x25); \ + CFI_RESTORE(x26); \ + ldp x23, x24, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + CFI_RESTORE(x23); \ + CFI_RESTORE(x24); \ + ldp x21, x22, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + CFI_RESTORE(x21); \ + CFI_RESTORE(x22); \ + ldp x19, x20, [sp], #16; \ + CFI_ADJUST_CFA_OFFSET(-16); \ + CFI_RESTORE(x19); \ + CFI_RESTORE(x20); + +#define POLY1305_LOAD_STATE() \ + ldr POLY_R_R1, [POLY_RSTATE, #(POLY_S_R1)]; \ + ldr POLY_R_H0, [POLY_RSTATE, #(POLY_S_H0)]; \ + ldr POLY_R_H1, [POLY_RSTATE, #(POLY_S_H1)]; \ + ldr POLY_R_H2d, [POLY_RSTATE, #(POLY_S_H2d)]; \ + ldr POLY_R_R0, [POLY_RSTATE, #(POLY_S_R0)]; \ + add POLY_R_R1_MUL5, POLY_R_R1, POLY_R_R1, lsr #2; \ + mov POLY_R_ONE, #1; + +#define POLY1305_STORE_STATE() \ + str POLY_R_H0, [POLY_RSTATE, #(POLY_S_H0)]; \ + str POLY_R_H1, [POLY_RSTATE, #(POLY_S_H1)]; \ + str POLY_R_H2d, [POLY_RSTATE, #(POLY_S_H2d)]; + +#define POLY1305_BLOCK_PART1(src_offset) \ + /* a = h + m */ \ + ldr POLY_TMP0, [POLY_RSRC, #((src_offset) + 0 * 8)]; +#define POLY1305_BLOCK_PART2(src_offset) \ + ldr POLY_TMP1, [POLY_RSRC, #((src_offset) + 1 * 8)]; +#define POLY1305_BLOCK_PART3() \ + le_to_host(POLY_TMP0); +#define POLY1305_BLOCK_PART4() \ + le_to_host(POLY_TMP1); +#define POLY1305_BLOCK_PART5() \ + adds POLY_R_H0, POLY_R_H0, POLY_TMP0; +#define POLY1305_BLOCK_PART6() \ + adcs POLY_R_H1, POLY_R_H1, POLY_TMP1; +#define POLY1305_BLOCK_PART7() \ + adc POLY_R_H2d, POLY_R_H2d, POLY_R_ONEd; + +#define POLY1305_BLOCK_PART8() \ + /* h = a * r (partial mod 2^130-5): */ \ + mul POLY_R_X1_LO, POLY_R_H0, POLY_R_R1; /* lo: h0 * r1 */ +#define POLY1305_BLOCK_PART9() \ + mul POLY_TMP0, POLY_R_H1, POLY_R_R0; /* lo: h1 * r0 */ +#define POLY1305_BLOCK_PART10() \ + mul POLY_R_X0_LO, POLY_R_H0, POLY_R_R0; /* lo: h0 * r0 */ +#define 
POLY1305_BLOCK_PART11() \ + umulh POLY_R_X1_HI, POLY_R_H0, POLY_R_R1; /* hi: h0 * r1 */ +#define POLY1305_BLOCK_PART12() \ + adds POLY_R_X1_LO, POLY_R_X1_LO, POLY_TMP0; +#define POLY1305_BLOCK_PART13() \ + umulh POLY_TMP1, POLY_R_H1, POLY_R_R0; /* hi: h1 * r0 */ +#define POLY1305_BLOCK_PART14() \ + mul POLY_TMP2, POLY_R_H1, POLY_R_R1_MUL5; /* lo: h1 * r1 mod 2^130-5 */ +#define POLY1305_BLOCK_PART15() \ + umulh POLY_R_X0_HI, POLY_R_H0, POLY_R_R0; /* hi: h0 * r0 */ +#define POLY1305_BLOCK_PART16() \ + adc POLY_R_X1_HI, POLY_R_X1_HI, POLY_TMP1; +#define POLY1305_BLOCK_PART17() \ + umulh POLY_TMP3, POLY_R_H1, POLY_R_R1_MUL5; /* hi: h1 * r1 mod 2^130-5 */ +#define POLY1305_BLOCK_PART18() \ + adds POLY_R_X0_LO, POLY_R_X0_LO, POLY_TMP2; +#define POLY1305_BLOCK_PART19() \ + mul POLY_R_H1, POLY_R_H2, POLY_R_R1_MUL5; /* h2 * r1 mod 2^130-5 */ +#define POLY1305_BLOCK_PART20() \ + adc POLY_R_X0_HI, POLY_R_X0_HI, POLY_TMP3; +#define POLY1305_BLOCK_PART21() \ + mul POLY_R_H2, POLY_R_H2, POLY_R_R0; /* h2 * r0 */ +#define POLY1305_BLOCK_PART22() \ + adds POLY_R_H1, POLY_R_H1, POLY_R_X1_LO; +#define POLY1305_BLOCK_PART23() \ + adc POLY_R_H0, POLY_R_H2, POLY_R_X1_HI; + +#define POLY1305_BLOCK_PART24() \ + /* carry propagation */ \ + and POLY_R_H2, POLY_R_H0, #3; +#define POLY1305_BLOCK_PART25() \ + mov POLY_R_H0, POLY_R_H0, lsr #2; +#define POLY1305_BLOCK_PART26() \ + add POLY_R_H0, POLY_R_H0, POLY_R_H0, lsl #2; +#define POLY1305_BLOCK_PART27() \ + adds POLY_R_H0, POLY_R_H0, POLY_R_X0_LO; +#define POLY1305_BLOCK_PART28() \ + adcs POLY_R_H1, POLY_R_H1, POLY_R_X0_HI; +#define POLY1305_BLOCK_PART29() \ + adc POLY_R_H2d, POLY_R_H2d, wzr; + +//#define TESTING_POLY1305_ASM +#ifdef TESTING_POLY1305_ASM +/* for testing only. */ +.align 3 +.globl _gcry_poly1305_aarch64_blocks1 +ELF(.type _gcry_poly1305_aarch64_blocks1,%function;) +_gcry_poly1305_aarch64_blocks1: + /* input: + * x0: poly1305-state + * x1: src + * x2: nblks + */ + CFI_STARTPROC() + POLY1305_PUSH_REGS(); + + mov POLY_RSTATE, x0; + mov POLY_RSRC, x1; + + POLY1305_LOAD_STATE(); + +.L_gcry_poly1305_aarch64_loop1: + POLY1305_BLOCK_PART1(0 * 16); + POLY1305_BLOCK_PART2(0 * 16); + add POLY_RSRC, POLY_RSRC, #16; + POLY1305_BLOCK_PART3(); + POLY1305_BLOCK_PART4(); + POLY1305_BLOCK_PART5(); + POLY1305_BLOCK_PART6(); + POLY1305_BLOCK_PART7(); + POLY1305_BLOCK_PART8(); + POLY1305_BLOCK_PART9(); + POLY1305_BLOCK_PART10(); + POLY1305_BLOCK_PART11(); + POLY1305_BLOCK_PART12(); + POLY1305_BLOCK_PART13(); + POLY1305_BLOCK_PART14(); + POLY1305_BLOCK_PART15(); + POLY1305_BLOCK_PART16(); + POLY1305_BLOCK_PART17(); + POLY1305_BLOCK_PART18(); + POLY1305_BLOCK_PART19(); + POLY1305_BLOCK_PART20(); + POLY1305_BLOCK_PART21(); + POLY1305_BLOCK_PART22(); + POLY1305_BLOCK_PART23(); + POLY1305_BLOCK_PART24(); + POLY1305_BLOCK_PART25(); + POLY1305_BLOCK_PART26(); + POLY1305_BLOCK_PART27(); + POLY1305_BLOCK_PART28(); + POLY1305_BLOCK_PART29(); + + subs x2, x2, #1; + b.ne .L_gcry_poly1305_aarch64_loop1; + + POLY1305_STORE_STATE(); + + mov x0, #0; + + POLY1305_POP_REGS(); + ret; + CFI_ENDPROC() +ELF(.size _gcry_poly1305_aarch64_blocks1, .-_gcry_poly1305_aarch64_blocks1;) +#endif + +#endif /* GCRY_ASM_POLY1305_AARCH64_H */ diff --git a/cipher/chacha20-aarch64.S b/cipher/chacha20-aarch64.S index 07b4bb5c0..7ace023fb 100644 --- a/cipher/chacha20-aarch64.S +++ b/cipher/chacha20-aarch64.S @@ -1,6 +1,6 @@ /* chacha20-aarch64.S - ARMv8/AArch64 accelerated chacha20 blocks function * - * Copyright (C) 2017,2018 Jussi Kivilinna + * Copyright (C) 2017-2019 Jussi Kivilinna * * This file is part 
of Libgcrypt. * @@ -38,6 +38,7 @@ .text +#include "asm-poly1305-aarch64.h" /* register macros */ #define INPUT x0 @@ -74,11 +75,14 @@ #define VTMP3 v4 #define X12_TMP v5 #define X13_TMP v6 +#define ROT8 v7 /********************************************************************** helper macros **********************************************************************/ +#define _(...) __VA_ARGS__ + #define vpunpckldq(s1, s2, dst) \ zip1 dst.4s, s2.4s, s1.4s; @@ -112,12 +116,18 @@ 4-way chacha20 **********************************************************************/ -#define ROTATE2(dst1,dst2,c,src1,src2) \ +#define ROTATE2(dst1,dst2,c,src1,src2,iop1) \ shl dst1.4s, src1.4s, #(c); \ shl dst2.4s, src2.4s, #(c); \ + iop1; \ sri dst1.4s, src1.4s, #(32 - (c)); \ sri dst2.4s, src2.4s, #(32 - (c)); +#define ROTATE2_8(dst1,dst2,src1,src2,iop1) \ + tbl dst1.16b, {src1.16b}, ROT8.16b; \ + iop1; \ + tbl dst2.16b, {src2.16b}, ROT8.16b; + #define ROTATE2_16(dst1,dst2,src1,src2) \ rev32 dst1.8h, src1.8h; \ rev32 dst2.8h, src2.8h; @@ -128,21 +138,33 @@ #define PLUS(ds,s) \ add ds.4s, ds.4s, s.4s; -#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,tmp2) \ - PLUS(a1,b1); PLUS(a2,b2); XOR(tmp1,d1,a1); XOR(tmp2,d2,a2); \ - ROTATE2_16(d1, d2, tmp1, tmp2); \ - PLUS(c1,d1); PLUS(c2,d2); XOR(tmp1,b1,c1); XOR(tmp2,b2,c2); \ - ROTATE2(b1, b2, 12, tmp1, tmp2); \ - PLUS(a1,b1); PLUS(a2,b2); XOR(tmp1,d1,a1); XOR(tmp2,d2,a2); \ - ROTATE2(d1, d2, 8, tmp1, tmp2); \ - PLUS(c1,d1); PLUS(c2,d2); XOR(tmp1,b1,c1); XOR(tmp2,b2,c2); \ - ROTATE2(b1, b2, 7, tmp1, tmp2); - -chacha20_data: +#define QUARTERROUND2(a1,b1,c1,d1,a2,b2,c2,d2,ign,tmp1,tmp2,iop1,iop2,iop3,iop4,iop5,iop6,iop7,iop8,iop9,iop10,iop11,iop12,iop13,iop14) \ + PLUS(a1,b1); PLUS(a2,b2); iop1; \ + XOR(tmp1,d1,a1); XOR(tmp2,d2,a2); iop2; \ + ROTATE2_16(d1, d2, tmp1, tmp2); iop3; \ + PLUS(c1,d1); PLUS(c2,d2); iop4; \ + XOR(tmp1,b1,c1); XOR(tmp2,b2,c2); iop5; \ + ROTATE2(b1, b2, 12, tmp1, tmp2, _(iop6)); iop7; \ + PLUS(a1,b1); PLUS(a2,b2); iop8; \ + XOR(tmp1,d1,a1); XOR(tmp2,d2,a2); iop9; \ + ROTATE2_8(d1, d2, tmp1, tmp2, _(iop10)); iop11; \ + PLUS(c1,d1); PLUS(c2,d2); iop12; \ + XOR(tmp1,b1,c1); XOR(tmp2,b2,c2); iop13; \ + ROTATE2(b1, b2, 7, tmp1, tmp2, _(iop14)); + .align 4 -.Linc_counter: +.globl _gcry_chacha20_aarch64_blocks4_data_inc_counter +_gcry_chacha20_aarch64_blocks4_data_inc_counter: .long 0,1,2,3 +.align 4 +.globl _gcry_chacha20_aarch64_blocks4_data_rot8 +_gcry_chacha20_aarch64_blocks4_data_rot8: + .byte 3,0,1,2 + .byte 7,4,5,6 + .byte 11,8,9,10 + .byte 15,12,13,14 + .align 3 .globl _gcry_chacha20_aarch64_blocks4 ELF(.type _gcry_chacha20_aarch64_blocks4,%function;) @@ -156,8 +178,10 @@ _gcry_chacha20_aarch64_blocks4: */ CFI_STARTPROC() - GET_DATA_POINTER(CTR, .Linc_counter); + GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); add INPUT_CTR, INPUT, #(12*4); + ld1 {ROT8.16b}, [CTR]; + GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); mov INPUT_POS, INPUT; ld1 {VCTR.16b}, [CTR]; @@ -195,10 +219,14 @@ _gcry_chacha20_aarch64_blocks4: .Lround2: subs ROUND, ROUND, #2 - QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,VTMP0,VTMP1) - QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,VTMP0,VTMP1) - QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,VTMP0,VTMP1) - QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,VTMP0,VTMP1) + QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,VTMP0,VTMP1, + ,,,,,,,,,,,,,) + QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,VTMP0,VTMP1, + ,,,,,,,,,,,,,) + 
QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,VTMP0,VTMP1, + ,,,,,,,,,,,,,) + QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,VTMP0,VTMP1, + ,,,,,,,,,,,,,) b.ne .Lround2; ld1 {VTMP0.16b, VTMP1.16b}, [INPUT_POS], #32; @@ -304,4 +332,285 @@ _gcry_chacha20_aarch64_blocks4: CFI_ENDPROC() ELF(.size _gcry_chacha20_aarch64_blocks4, .-_gcry_chacha20_aarch64_blocks4;) +/********************************************************************** + 4-way stitched chacha20-poly1305 + **********************************************************************/ + +.align 3 +.globl _gcry_chacha20_poly1305_aarch64_blocks4 +ELF(.type _gcry_chacha20_poly1305_aarch64_blocks4,%function;) + +_gcry_chacha20_poly1305_aarch64_blocks4: + /* input: + * x0: input + * x1: dst + * x2: src + * x3: nblks (multiple of 4) + * x4: poly1305-state + * x5: poly1305-src + */ + CFI_STARTPROC() + POLY1305_PUSH_REGS() + + mov POLY_RSTATE, x4; + mov POLY_RSRC, x5; + + GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_rot8); + add INPUT_CTR, INPUT, #(12*4); + ld1 {ROT8.16b}, [CTR]; + GET_DATA_POINTER(CTR, _gcry_chacha20_aarch64_blocks4_data_inc_counter); + mov INPUT_POS, INPUT; + ld1 {VCTR.16b}, [CTR]; + + POLY1305_LOAD_STATE() + +.Loop_poly4: + /* Construct counter vectors X12 and X13 */ + + ld1 {X15.16b}, [INPUT_CTR]; + ld1 {VTMP1.16b-VTMP3.16b}, [INPUT_POS]; + + dup X12.4s, X15.s[0]; + dup X13.4s, X15.s[1]; + ldr CTR, [INPUT_CTR]; + add X12.4s, X12.4s, VCTR.4s; + dup X0.4s, VTMP1.s[0]; + dup X1.4s, VTMP1.s[1]; + dup X2.4s, VTMP1.s[2]; + dup X3.4s, VTMP1.s[3]; + dup X14.4s, X15.s[2]; + cmhi VTMP0.4s, VCTR.4s, X12.4s; + dup X15.4s, X15.s[3]; + add CTR, CTR, #4; /* Update counter */ + dup X4.4s, VTMP2.s[0]; + dup X5.4s, VTMP2.s[1]; + dup X6.4s, VTMP2.s[2]; + dup X7.4s, VTMP2.s[3]; + sub X13.4s, X13.4s, VTMP0.4s; + dup X8.4s, VTMP3.s[0]; + dup X9.4s, VTMP3.s[1]; + dup X10.4s, VTMP3.s[2]; + dup X11.4s, VTMP3.s[3]; + mov X12_TMP.16b, X12.16b; + mov X13_TMP.16b, X13.16b; + str CTR, [INPUT_CTR]; + + mov ROUND, #20 +.Lround4_with_poly1305_outer: + mov POLY_CHACHA_ROUND, #6; +.Lround4_with_poly1305_inner1: + POLY1305_BLOCK_PART1(0 * 16) + QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,VTMP0,VTMP1, + POLY1305_BLOCK_PART2(0 * 16), + POLY1305_BLOCK_PART3(), + POLY1305_BLOCK_PART4(), + POLY1305_BLOCK_PART5(), + POLY1305_BLOCK_PART6(), + POLY1305_BLOCK_PART7(), + POLY1305_BLOCK_PART8(), + POLY1305_BLOCK_PART9(), + POLY1305_BLOCK_PART10(), + POLY1305_BLOCK_PART11(), + POLY1305_BLOCK_PART12(), + POLY1305_BLOCK_PART13(), + POLY1305_BLOCK_PART14(), + POLY1305_BLOCK_PART15()) + POLY1305_BLOCK_PART16() + QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,VTMP0,VTMP1, + POLY1305_BLOCK_PART17(), + POLY1305_BLOCK_PART18(), + POLY1305_BLOCK_PART19(), + POLY1305_BLOCK_PART20(), + POLY1305_BLOCK_PART21(), + POLY1305_BLOCK_PART22(), + POLY1305_BLOCK_PART23(), + POLY1305_BLOCK_PART24(), + POLY1305_BLOCK_PART25(), + POLY1305_BLOCK_PART26(), + POLY1305_BLOCK_PART27(), + POLY1305_BLOCK_PART28(), + POLY1305_BLOCK_PART29(), + POLY1305_BLOCK_PART1(1 * 16)) + POLY1305_BLOCK_PART2(1 * 16) + QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,VTMP0,VTMP1, + _(add POLY_RSRC, POLY_RSRC, #(2*16)), + POLY1305_BLOCK_PART3(), + POLY1305_BLOCK_PART4(), + POLY1305_BLOCK_PART5(), + POLY1305_BLOCK_PART6(), + POLY1305_BLOCK_PART7(), + POLY1305_BLOCK_PART8(), + POLY1305_BLOCK_PART9(), + POLY1305_BLOCK_PART10(), + POLY1305_BLOCK_PART11(), + POLY1305_BLOCK_PART12(), + POLY1305_BLOCK_PART13(), + POLY1305_BLOCK_PART14(), + POLY1305_BLOCK_PART15()) + 
POLY1305_BLOCK_PART16() + QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,VTMP0,VTMP1, + POLY1305_BLOCK_PART17(), + POLY1305_BLOCK_PART18(), + POLY1305_BLOCK_PART19(), + POLY1305_BLOCK_PART20(), + POLY1305_BLOCK_PART21(), + POLY1305_BLOCK_PART22(), + POLY1305_BLOCK_PART23(), + POLY1305_BLOCK_PART24(), + POLY1305_BLOCK_PART25(), + POLY1305_BLOCK_PART26(), + POLY1305_BLOCK_PART27(), + POLY1305_BLOCK_PART28(), + POLY1305_BLOCK_PART29(), + _(subs POLY_CHACHA_ROUND, POLY_CHACHA_ROUND, #2)); + b.ne .Lround4_with_poly1305_inner1; + + mov POLY_CHACHA_ROUND, #4; +.Lround4_with_poly1305_inner2: + POLY1305_BLOCK_PART1(0 * 16) + QUARTERROUND2(X0, X4, X8, X12, X1, X5, X9, X13, tmp:=,VTMP0,VTMP1,, + POLY1305_BLOCK_PART2(0 * 16),, + _(add POLY_RSRC, POLY_RSRC, #(1*16)),, + POLY1305_BLOCK_PART3(),, + POLY1305_BLOCK_PART4(),, + POLY1305_BLOCK_PART5(),, + POLY1305_BLOCK_PART6(),, + POLY1305_BLOCK_PART7()) + QUARTERROUND2(X2, X6, X10, X14, X3, X7, X11, X15, tmp:=,VTMP0,VTMP1, + POLY1305_BLOCK_PART8(),, + POLY1305_BLOCK_PART9(),, + POLY1305_BLOCK_PART10(),, + POLY1305_BLOCK_PART11(),, + POLY1305_BLOCK_PART12(),, + POLY1305_BLOCK_PART13(),, + POLY1305_BLOCK_PART14(),) + POLY1305_BLOCK_PART15() + QUARTERROUND2(X0, X5, X10, X15, X1, X6, X11, X12, tmp:=,VTMP0,VTMP1,, + POLY1305_BLOCK_PART16(),, + POLY1305_BLOCK_PART17(),, + POLY1305_BLOCK_PART18(),, + POLY1305_BLOCK_PART19(),, + POLY1305_BLOCK_PART20(),, + POLY1305_BLOCK_PART21(),, + POLY1305_BLOCK_PART22()) + QUARTERROUND2(X2, X7, X8, X13, X3, X4, X9, X14, tmp:=,VTMP0,VTMP1, + POLY1305_BLOCK_PART23(),, + POLY1305_BLOCK_PART24(),, + POLY1305_BLOCK_PART25(),, + POLY1305_BLOCK_PART26(),, + POLY1305_BLOCK_PART27(),, + POLY1305_BLOCK_PART28(),, + POLY1305_BLOCK_PART29(), + _(subs POLY_CHACHA_ROUND, POLY_CHACHA_ROUND, #2)) + b.ne .Lround4_with_poly1305_inner2; + + subs ROUND, ROUND, #10 + b.ne .Lround4_with_poly1305_outer; + + ld1 {VTMP0.16b, VTMP1.16b}, [INPUT_POS], #32; + + PLUS(X12, X12_TMP); /* INPUT + 12 * 4 + counter */ + PLUS(X13, X13_TMP); /* INPUT + 13 * 4 + counter */ + + dup VTMP2.4s, VTMP0.s[0]; /* INPUT + 0 * 4 */ + dup VTMP3.4s, VTMP0.s[1]; /* INPUT + 1 * 4 */ + dup X12_TMP.4s, VTMP0.s[2]; /* INPUT + 2 * 4 */ + dup X13_TMP.4s, VTMP0.s[3]; /* INPUT + 3 * 4 */ + PLUS(X0, VTMP2); + PLUS(X1, VTMP3); + PLUS(X2, X12_TMP); + PLUS(X3, X13_TMP); + + dup VTMP2.4s, VTMP1.s[0]; /* INPUT + 4 * 4 */ + dup VTMP3.4s, VTMP1.s[1]; /* INPUT + 5 * 4 */ + dup X12_TMP.4s, VTMP1.s[2]; /* INPUT + 6 * 4 */ + dup X13_TMP.4s, VTMP1.s[3]; /* INPUT + 7 * 4 */ + ld1 {VTMP0.16b, VTMP1.16b}, [INPUT_POS]; + mov INPUT_POS, INPUT; + PLUS(X4, VTMP2); + PLUS(X5, VTMP3); + PLUS(X6, X12_TMP); + PLUS(X7, X13_TMP); + + dup VTMP2.4s, VTMP0.s[0]; /* INPUT + 8 * 4 */ + dup VTMP3.4s, VTMP0.s[1]; /* INPUT + 9 * 4 */ + dup X12_TMP.4s, VTMP0.s[2]; /* INPUT + 10 * 4 */ + dup X13_TMP.4s, VTMP0.s[3]; /* INPUT + 11 * 4 */ + dup VTMP0.4s, VTMP1.s[2]; /* INPUT + 14 * 4 */ + dup VTMP1.4s, VTMP1.s[3]; /* INPUT + 15 * 4 */ + PLUS(X8, VTMP2); + PLUS(X9, VTMP3); + PLUS(X10, X12_TMP); + PLUS(X11, X13_TMP); + PLUS(X14, VTMP0); + PLUS(X15, VTMP1); + + transpose_4x4(X0, X1, X2, X3, VTMP0, VTMP1, VTMP2); + transpose_4x4(X4, X5, X6, X7, VTMP0, VTMP1, VTMP2); + transpose_4x4(X8, X9, X10, X11, VTMP0, VTMP1, VTMP2); + transpose_4x4(X12, X13, X14, X15, VTMP0, VTMP1, VTMP2); + + subs NBLKS, NBLKS, #4; + + ld1 {VTMP0.16b-VTMP3.16b}, [SRC], #64; + ld1 {X12_TMP.16b-X13_TMP.16b}, [SRC], #32; + eor VTMP0.16b, X0.16b, VTMP0.16b; + eor VTMP1.16b, X4.16b, VTMP1.16b; + eor VTMP2.16b, X8.16b, VTMP2.16b; + eor VTMP3.16b, 
X12.16b, VTMP3.16b; + eor X12_TMP.16b, X1.16b, X12_TMP.16b; + eor X13_TMP.16b, X5.16b, X13_TMP.16b; + st1 {VTMP0.16b-VTMP3.16b}, [DST], #64; + ld1 {VTMP0.16b-VTMP3.16b}, [SRC], #64; + st1 {X12_TMP.16b-X13_TMP.16b}, [DST], #32; + ld1 {X12_TMP.16b-X13_TMP.16b}, [SRC], #32; + eor VTMP0.16b, X9.16b, VTMP0.16b; + eor VTMP1.16b, X13.16b, VTMP1.16b; + eor VTMP2.16b, X2.16b, VTMP2.16b; + eor VTMP3.16b, X6.16b, VTMP3.16b; + eor X12_TMP.16b, X10.16b, X12_TMP.16b; + eor X13_TMP.16b, X14.16b, X13_TMP.16b; + st1 {VTMP0.16b-VTMP3.16b}, [DST], #64; + ld1 {VTMP0.16b-VTMP3.16b}, [SRC], #64; + st1 {X12_TMP.16b-X13_TMP.16b}, [DST], #32; + eor VTMP0.16b, X3.16b, VTMP0.16b; + eor VTMP1.16b, X7.16b, VTMP1.16b; + eor VTMP2.16b, X11.16b, VTMP2.16b; + eor VTMP3.16b, X15.16b, VTMP3.16b; + st1 {VTMP0.16b-VTMP3.16b}, [DST], #64; + + b.ne .Loop_poly4; + + POLY1305_STORE_STATE() + + /* clear the used vector registers and stack */ + clear(VTMP0); + clear(VTMP1); + clear(VTMP2); + clear(VTMP3); + clear(X12_TMP); + clear(X13_TMP); + clear(X0); + clear(X1); + clear(X2); + clear(X3); + clear(X4); + clear(X5); + clear(X6); + clear(X7); + clear(X8); + clear(X9); + clear(X10); + clear(X11); + clear(X12); + clear(X13); + clear(X14); + clear(X15); + + eor x0, x0, x0 + POLY1305_POP_REGS() + ret + CFI_ENDPROC() +ELF(.size _gcry_chacha20_poly1305_aarch64_blocks4, .-_gcry_chacha20_poly1305_aarch64_blocks4;) + #endif diff --git a/cipher/chacha20.c b/cipher/chacha20.c index b34d8d197..9d95723ba 100644 --- a/cipher/chacha20.c +++ b/cipher/chacha20.c @@ -185,6 +185,10 @@ unsigned int _gcry_chacha20_armv7_neon_blocks4(u32 *state, byte *dst, unsigned int _gcry_chacha20_aarch64_blocks4(u32 *state, byte *dst, const byte *src, size_t nblks); +unsigned int _gcry_chacha20_poly1305_aarch64_blocks4( + u32 *state, byte *dst, const byte *src, size_t nblks, + void *poly1305_state, const byte *poly1305_src); + #endif /* USE_AARCH64_SIMD */ @@ -688,6 +692,18 @@ _gcry_chacha20_poly1305_encrypt(gcry_cipher_hd_t c, byte *outbuf, inbuf += 1 * CHACHA20_BLOCK_SIZE; } #endif +#ifdef USE_AARCH64_SIMD + else if (ctx->use_neon && length >= CHACHA20_BLOCK_SIZE * 4) + { + nburn = _gcry_chacha20_aarch64_blocks4(ctx->input, outbuf, inbuf, 4); + burn = nburn > burn ? nburn : burn; + + authptr = outbuf; + length -= 4 * CHACHA20_BLOCK_SIZE; + outbuf += 4 * CHACHA20_BLOCK_SIZE; + inbuf += 4 * CHACHA20_BLOCK_SIZE; + } +#endif #ifdef USE_PPC_VEC_POLY1305 else if (ctx->use_ppc && length >= CHACHA20_BLOCK_SIZE * 4) { @@ -763,6 +779,26 @@ _gcry_chacha20_poly1305_encrypt(gcry_cipher_hd_t c, byte *outbuf, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_neon && + length >= 4 * CHACHA20_BLOCK_SIZE && + authoffset >= 4 * CHACHA20_BLOCK_SIZE) + { + size_t nblocks = length / CHACHA20_BLOCK_SIZE; + nblocks -= nblocks % 4; + + nburn = _gcry_chacha20_poly1305_aarch64_blocks4( + ctx->input, outbuf, inbuf, nblocks, + &c->u_mode.poly1305.ctx.state, authptr); + burn = nburn > burn ? 
nburn : burn; + + length -= nblocks * CHACHA20_BLOCK_SIZE; + outbuf += nblocks * CHACHA20_BLOCK_SIZE; + inbuf += nblocks * CHACHA20_BLOCK_SIZE; + authptr += nblocks * CHACHA20_BLOCK_SIZE; + } +#endif + #ifdef USE_PPC_VEC_POLY1305 if (ctx->use_ppc && length >= 4 * CHACHA20_BLOCK_SIZE && @@ -913,6 +949,23 @@ _gcry_chacha20_poly1305_decrypt(gcry_cipher_hd_t c, byte *outbuf, } #endif +#ifdef USE_AARCH64_SIMD + if (ctx->use_neon && length >= 4 * CHACHA20_BLOCK_SIZE) + { + size_t nblocks = length / CHACHA20_BLOCK_SIZE; + nblocks -= nblocks % 4; + + nburn = _gcry_chacha20_poly1305_aarch64_blocks4( + ctx->input, outbuf, inbuf, nblocks, + &c->u_mode.poly1305.ctx.state, inbuf); + burn = nburn > burn ? nburn : burn; + + length -= nblocks * CHACHA20_BLOCK_SIZE; + outbuf += nblocks * CHACHA20_BLOCK_SIZE; + inbuf += nblocks * CHACHA20_BLOCK_SIZE; + } +#endif + #ifdef USE_PPC_VEC_POLY1305 if (ctx->use_ppc && length >= 4 * CHACHA20_BLOCK_SIZE) { From gniibe at fsij.org Wed Sep 25 13:15:48 2019 From: gniibe at fsij.org (NIIBE Yutaka) Date: Wed, 25 Sep 2019 20:15:48 +0900 Subject: [PATCH] ecc: Add Curve448. Message-ID: <87sgoky6cb.fsf@jumper.gniibe.org> Hello, I'm going to push this change to master, so that we can use Curve448 encryption by GnuPG in near future. * cipher/ecc-curves.c (curve_aliases): Add Curve448. (domain_parms): Add domain parameter for Curve448. * cipher/ecc-ecdh.c (_gcry_ecc_mul_point): Support Curve448. * mpi/ec.c (ec_addm_448, ec_subm_448, ec_mulm_448): New. (ec_mul2_448, ec_pow2_448): New. (field_table): Add field specific routines for Curve448. (curve25519_bad_points): It's constants. (curve448_bad_points): New. (bad_points_table): New. (ec_p_init): Initialize by bad_points_table. * tests/Makefile.am (t-cv448): New. * tests/t-cv448.c: New. * tests/curves.c (N_CURVES): Increment. -- This change is to add "Curve448". In libgcrypt and GnuPG, we have an interface with a curve name, which was introduced before X25519/X448 function. While OID 1.3.101.111 is for X448 function on the curve, we use it to refer the curve itself. Signed-off-by: NIIBE Yutaka --- cipher/ecc-curves.c | 20 +- cipher/ecc-ecdh.c | 3 +- mpi/ec.c | 210 +++++++++++++++- tests/Makefile.am | 3 +- tests/curves.c | 2 +- tests/t-cv448.c | 602 ++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 829 insertions(+), 11 deletions(-) create mode 100644 tests/t-cv448.c diff --git a/cipher/ecc-curves.c b/cipher/ecc-curves.c index 85f14eff..74d6b5e5 100644 --- a/cipher/ecc-curves.c +++ b/cipher/ecc-curves.c @@ -54,10 +54,10 @@ static const struct { "Ed448", "1.3.101.113" }, /* rfc8410 */ { "X22519", "1.3.101.110" }, /* rfc8410 */ - - { "X448", "1.3.101.111" }, /* rfc8410 */ #endif + { "Curve448", "1.3.101.111" }, /* X448 in rfc8410 */ + { "NIST P-192", "1.2.840.10045.3.1.1" }, /* X9.62 OID */ { "NIST P-192", "prime192v1" }, /* X9.62 name. */ { "NIST P-192", "secp192r1" }, /* SECP name. 
*/ @@ -157,6 +157,22 @@ static const ecc_domain_parms_t domain_parms[] = "0x5F51E65E475F794B1FE122D388B72EB36DC2B28192839E4DD6163A5D81312C14", "0x08" }, + { + /* (y^2 = x^3 + 156326*x^2 + x) */ + "Curve448", 448, 0, + MPI_EC_MONTGOMERY, ECC_DIALECT_STANDARD, + "0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFE" + "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF", + "0x98A9", + "0x01", + "0x3FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF" + "7CCA23E9C44EDB49AED63690216CC2728DC58F552378C292AB5844F3", + "0x00000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000005", + "0x7D235D1295F5B1F66C98AB6E58326FCECBAE5D34F55545D060F75DC2" + "8DF3F6EDB8027E2346430D211312C4B150677AF76FD7223D457B5B1A", + "0x04" + }, #if 0 /* No real specs yet found. */ { /* x^2 + y^2 = 1 + 3617x^2y^2 mod 2^414 - 17 */ diff --git a/cipher/ecc-ecdh.c b/cipher/ecc-ecdh.c index bfd07d40..405fc142 100644 --- a/cipher/ecc-ecdh.c +++ b/cipher/ecc-ecdh.c @@ -88,8 +88,7 @@ _gcry_ecc_mul_point (int algo, unsigned char *result, else if (algo == GCRY_ECC_CURVE448) { nbits = ECC_CURVE448_BITS; - curve = "X448"; - return gpg_error (GPG_ERR_UNSUPPORTED_ALGORITHM); + curve = "Curve448"; } else return gpg_error (GPG_ERR_UNKNOWN_ALGORITHM); diff --git a/mpi/ec.c b/mpi/ec.c index 97afbfed..abdca997 100644 --- a/mpi/ec.c +++ b/mpi/ec.c @@ -366,7 +366,7 @@ mpih_set_cond (mpi_ptr_t wp, mpi_ptr_t up, mpi_size_t usize, unsigned long set) wp[i] = wp[i] ^ x; } } - + /* Routines for 2^255 - 19. */ #define LIMB_SIZE_25519 ((256+BITS_PER_MPI_LIMB-1)/BITS_PER_MPI_LIMB) @@ -477,7 +477,167 @@ ec_pow2_25519 (gcry_mpi_t w, const gcry_mpi_t b, mpi_ec_t ctx) { ec_mulm_25519 (w, b, b, ctx); } + +/* Routines for 2^448 - 2^224 - 1. */ + +#define LIMB_SIZE_448 ((448+BITS_PER_MPI_LIMB-1)/BITS_PER_MPI_LIMB) +#define LIMB_SIZE_HALF_448 ((LIMB_SIZE_448+1)/2) + +static void +ec_addm_448 (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx) +{ + mpi_ptr_t wp, up, vp; + mpi_size_t wsize = LIMB_SIZE_448; + mpi_limb_t n[LIMB_SIZE_448]; + mpi_limb_t cy; + + if (w->nlimbs != wsize || u->nlimbs != wsize || v->nlimbs != wsize) + log_bug ("addm_448: different sizes\n"); + + memset (n, 0, sizeof n); + up = u->d; + vp = v->d; + wp = w->d; + + cy = _gcry_mpih_add_n (wp, up, vp, wsize); + mpih_set_cond (n, ctx->p->d, wsize, (cy != 0UL)); + _gcry_mpih_sub_n (wp, wp, n, wsize); +} + +static void +ec_subm_448 (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx) +{ + mpi_ptr_t wp, up, vp; + mpi_size_t wsize = LIMB_SIZE_448; + mpi_limb_t n[LIMB_SIZE_448]; + mpi_limb_t borrow; + + if (w->nlimbs != wsize || u->nlimbs != wsize || v->nlimbs != wsize) + log_bug ("subm_448: different sizes\n"); + + memset (n, 0, sizeof n); + up = u->d; + vp = v->d; + wp = w->d; + + borrow = _gcry_mpih_sub_n (wp, up, vp, wsize); + mpih_set_cond (n, ctx->p->d, wsize, (borrow != 0UL)); + _gcry_mpih_add_n (wp, wp, n, wsize); +} + +static void +ec_mulm_448 (gcry_mpi_t w, gcry_mpi_t u, gcry_mpi_t v, mpi_ec_t ctx) +{ + mpi_ptr_t wp, up, vp; + mpi_size_t wsize = LIMB_SIZE_448; + mpi_limb_t n[LIMB_SIZE_448*2]; + mpi_limb_t a2[LIMB_SIZE_HALF_448]; + mpi_limb_t a3[LIMB_SIZE_HALF_448]; + mpi_limb_t b0[LIMB_SIZE_HALF_448]; + mpi_limb_t b1[LIMB_SIZE_HALF_448]; + mpi_limb_t cy; + int i; +#if (LIMB_SIZE_HALF_448 > LIMB_SIZE_448/2) + mpi_limb_t b1_rest, a3_rest; +#endif + + if (w->nlimbs != wsize || u->nlimbs != wsize || v->nlimbs != wsize) + log_bug ("mulm_448: different sizes\n"); + + up = u->d; + vp = v->d; + wp = w->d; + + 
_gcry_mpih_mul_n (n, up, vp, wsize); + + for (i = 0; i < (wsize + 1)/ 2; i++) + { + b0[i] = n[i]; + b1[i] = n[i+wsize/2]; + a2[i] = n[i+wsize]; + a3[i] = n[i+wsize+wsize/2]; + } + +#if (LIMB_SIZE_HALF_448 > LIMB_SIZE_448/2) + b0[LIMB_SIZE_HALF_448-1] &= (1UL<<32)-1; + a2[LIMB_SIZE_HALF_448-1] &= (1UL<<32)-1; + + b1_rest = 0; + a3_rest = 0; + + for (i = (wsize + 1)/ 2 -1; i >= 0; i--) + { + mpi_limb_t b1v, a3v; + b1v = b1[i]; + a3v = a3[i]; + b1[i] = (b1_rest<<32) | (b1v >> 32); + a3[i] = (a3_rest<<32) | (a3v >> 32); + b1_rest = b1v & ((1UL <<32)-1); + a3_rest = a3v & ((1UL <<32)-1); + } +#endif + + cy = _gcry_mpih_add_n (b0, b0, a2, LIMB_SIZE_HALF_448); + cy += _gcry_mpih_add_n (b0, b0, a3, LIMB_SIZE_HALF_448); + for (i = 0; i < (wsize + 1)/ 2; i++) + wp[i] = b0[i]; +#if (LIMB_SIZE_HALF_448 > LIMB_SIZE_448/2) + wp[LIMB_SIZE_HALF_448-1] &= ((1UL <<32)-1); +#endif + +#if (LIMB_SIZE_HALF_448 > LIMB_SIZE_448/2) + cy = b0[LIMB_SIZE_HALF_448-1] >> 32; +#endif + + cy = _gcry_mpih_add_1 (b1, b1, LIMB_SIZE_HALF_448, cy); + cy += _gcry_mpih_add_n (b1, b1, a2, LIMB_SIZE_HALF_448); + cy += _gcry_mpih_add_n (b1, b1, a3, LIMB_SIZE_HALF_448); + cy += _gcry_mpih_add_n (b1, b1, a3, LIMB_SIZE_HALF_448); +#if (LIMB_SIZE_HALF_448 > LIMB_SIZE_448/2) + b1_rest = 0; + for (i = (wsize + 1)/ 2 -1; i >= 0; i--) + { + mpi_limb_t b1v = b1[i]; + b1[i] = (b1_rest<<32) | (b1v >> 32); + b1_rest = b1v & ((1UL <<32)-1); + } + wp[LIMB_SIZE_HALF_448-1] |= (b1_rest << 32); +#endif + for (i = 0; i < wsize / 2; i++) + wp[i+(wsize + 1) / 2] = b1[i]; + +#if (LIMB_SIZE_HALF_448 > LIMB_SIZE_448/2) + cy = b1[LIMB_SIZE_HALF_448-1]; +#endif + + memset (n, 0, wsize * BYTES_PER_MPI_LIMB); + +#if (LIMB_SIZE_HALF_448 > LIMB_SIZE_448/2) + n[LIMB_SIZE_HALF_448-1] = cy << 32; +#else + n[LIMB_SIZE_HALF_448] = cy; +#endif + n[0] = cy; + _gcry_mpih_add_n (wp, wp, n, wsize); + + memset (n, 0, wsize * BYTES_PER_MPI_LIMB); + cy = _gcry_mpih_sub_n (wp, wp, ctx->p->d, wsize); + mpih_set_cond (n, ctx->p->d, wsize, (cy != 0UL)); + _gcry_mpih_add_n (wp, wp, n, wsize); +} +static void +ec_mul2_448 (gcry_mpi_t w, gcry_mpi_t u, mpi_ec_t ctx) +{ + ec_addm_448 (w, u, u, ctx); +} + +static void +ec_pow2_448 (gcry_mpi_t w, const gcry_mpi_t b, mpi_ec_t ctx) +{ + ec_mulm_448 (w, b, b, ctx); +} + struct field_table { const char *p; @@ -498,6 +658,15 @@ static const struct field_table field_table[] = { ec_mul2_25519, ec_pow2_25519 }, + { + "0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFE" + "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF", + ec_addm_448, + ec_subm_448, + ec_mulm_448, + ec_mul2_448, + ec_pow2_448 + }, { NULL, NULL, NULL, NULL, NULL, NULL }, }; @@ -544,17 +713,37 @@ ec_get_two_inv_p (mpi_ec_t ec) } -static const char *curve25519_bad_points[] = { +static const char *const curve25519_bad_points[] = { + "0x7fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffed", "0x0000000000000000000000000000000000000000000000000000000000000000", "0x0000000000000000000000000000000000000000000000000000000000000001", "0x00b8495f16056286fdb1329ceb8d09da6ac49ff1fae35616aeb8413b7c7aebe0", "0x57119fd0dd4e22d8868e1c58c45c44045bef839c55b1d0b1248c50a3bc959c5f", "0x7fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffec", - "0x7fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffed", "0x7fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffee", NULL }; + +static const char *const curve448_bad_points[] = { + "0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffe" + 
"ffffffffffffffffffffffffffffffffffffffffffffffffffffffff", + "0x00000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000000", + "0x00000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000001", + "0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffe" + "fffffffffffffffffffffffffffffffffffffffffffffffffffffffe", + "0xffffffffffffffffffffffffffffffffffffffffffffffffffffffff" + "00000000000000000000000000000000000000000000000000000000", + NULL +}; + +static const char *const *bad_points_table[] = { + curve25519_bad_points, + curve448_bad_points, +}; + static gcry_mpi_t scanval (const char *string) { @@ -607,8 +796,19 @@ ec_p_init (mpi_ec_t ctx, enum gcry_mpi_ec_models model, if (model == MPI_EC_MONTGOMERY) { - for (i=0; i< DIM(ctx->t.scratch) && curve25519_bad_points[i]; i++) - ctx->t.scratch[i] = scanval (curve25519_bad_points[i]); + for (i=0; i< DIM(bad_points_table); i++) + { + gcry_mpi_t p_candidate = scanval (bad_points_table[i][0]); + int match_p = !mpi_cmp (ctx->p, p_candidate); + int j; + + mpi_free (p_candidate); + if (!match_p) + continue; + + for (j=0; i< DIM(ctx->t.scratch) && bad_points_table[i][j]; j++) + ctx->t.scratch[j] = scanval (bad_points_table[i][j]); + } } else { diff --git a/tests/Makefile.am b/tests/Makefile.am index 9e117970..2ae70e54 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -22,7 +22,8 @@ tests_bin = \ version t-secmem mpitests t-sexp t-convert \ t-mpi-bit t-mpi-point curves t-lock \ prime basic keygen pubkey hmac hashtest t-kdf keygrip \ - fips186-dsa aeswrap pkcs1v2 random dsa-rfc6979 t-ed25519 t-cv25519 + fips186-dsa aeswrap pkcs1v2 random dsa-rfc6979 \ + t-ed25519 t-cv25519 t-cv448 tests_bin_last = benchmark bench-slope diff --git a/tests/curves.c b/tests/curves.c index dc5ebe77..bacc1302 100644 --- a/tests/curves.c +++ b/tests/curves.c @@ -33,7 +33,7 @@ #include "t-common.h" /* Number of curves defined in ../cipger/ecc.c */ -#define N_CURVES 22 +#define N_CURVES 23 /* A real world sample public key. */ static char const sample_key_1[] = diff --git a/tests/t-cv448.c b/tests/t-cv448.c new file mode 100644 index 00000000..85fea298 --- /dev/null +++ b/tests/t-cv448.c @@ -0,0 +1,602 @@ +/* t-cv448.c - Check the Curve488 computation + * Copyright (C) 2019 g10 Code GmbH + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public License + * along with this program; if not, see . 
+ * SPDX-License-Identifier: LGPL-2.1+ + */ + +#ifdef HAVE_CONFIG_H +#include +#endif +#include +#include +#include +#include +#include +#include + +#include "stopwatch.h" + +#define PGM "t-cv448" +#include "t-common.h" +#define N_TESTS 9 + + +static void +print_mpi (const char *text, gcry_mpi_t a) +{ + gcry_error_t err; + char *buf; + void *bufaddr = &buf; + + err = gcry_mpi_aprint (GCRYMPI_FMT_HEX, bufaddr, NULL, a); + if (err) + fprintf (stderr, "%s: [error printing number: %s]\n", + text, gpg_strerror (err)); + else + { + fprintf (stderr, "%s: %s\n", text, buf); + gcry_free (buf); + } +} + + +static void +show_note (const char *format, ...) +{ + va_list arg_ptr; + + if (!verbose && getenv ("srcdir")) + fputs (" ", stderr); /* To align above "PASS: ". */ + else + fprintf (stderr, "%s: ", PGM); + va_start (arg_ptr, format); + vfprintf (stderr, format, arg_ptr); + if (*format && format[strlen(format)-1] != '\n') + putc ('\n', stderr); + va_end (arg_ptr); +} + + +/* Convert STRING consisting of hex characters into its binary + representation and return it as an allocated buffer. The valid + length of the buffer is returned at R_LENGTH. The string is + delimited by end of string. The function returns NULL on + error. */ +static void * +hex2buffer (const char *string, size_t *r_length) +{ + const char *s; + unsigned char *buffer; + size_t length; + + buffer = xmalloc (strlen(string)/2+1); + length = 0; + for (s=string; *s; s +=2 ) + { + if (!hexdigitp (s) || !hexdigitp (s+1)) + return NULL; /* Invalid hex digits. */ + ((unsigned char*)buffer)[length++] = xtoi_2 (s); + } + *r_length = length; + return buffer; +} + +static void +reverse_buffer (unsigned char *buffer, unsigned int length) +{ + unsigned int tmp, i; + + for (i=0; i < length/2; i++) + { + tmp = buffer[i]; + buffer[i] = buffer[length-1-i]; + buffer[length-1-i] = tmp; + } +} + + +/* + * Test X448 functionality through higher layer crypto routines. + * + * Input: K (as hex string), U (as hex string), R (as hex string) + * + * where R is expected result of X448 (K, U). + * + */ +static void +test_cv_hl (int testno, const char *k_str, const char *u_str, + const char *result_str) +{ + gpg_error_t err; + void *buffer = NULL; + size_t buflen; + gcry_sexp_t s_pk = NULL; + gcry_mpi_t mpi_k = NULL; + gcry_sexp_t s_data = NULL; + gcry_sexp_t s_result = NULL; + gcry_sexp_t s_tmp = NULL; + unsigned char *res = NULL; + size_t res_len; + + if (verbose > 1) + info ("Running test %d\n", testno); + + if (!(buffer = hex2buffer (k_str, &buflen)) || buflen != 56) + { + fail ("error building s-exp for test %d, %s: %s", + testno, "k", "invalid hex string"); + goto leave; + } + + reverse_buffer (buffer, buflen); + if ((err = gcry_mpi_scan (&mpi_k, GCRYMPI_FMT_USG, buffer, buflen, NULL))) + { + fail ("error converting MPI for test %d: %s", testno, gpg_strerror (err)); + goto leave; + } + + if ((err = gcry_sexp_build (&s_data, NULL, "%m", mpi_k))) + { + fail ("error building s-exp for test %d, %s: %s", + testno, "data", gpg_strerror (err)); + goto leave; + } + + xfree (buffer); + if (!(buffer = hex2buffer (u_str, &buflen)) || buflen != 56) + { + fail ("error building s-exp for test %d, %s: %s", + testno, "u", "invalid hex string"); + goto leave; + } + + /* + * The procedure of decodeUCoordinate will be done internally + * by _gcry_ecc_mont_decodepoint. So, we just put the little-endian + * binary to build S-exp. + * + * We could add the prefix 0x40, but libgcrypt also supports + * format with no prefix. So, it is OK not to put the prefix. 
+ */ + if ((err = gcry_sexp_build (&s_pk, NULL, + "(public-key" + " (ecc" + " (curve \"Curve448\")" + " (flags djb-tweak)" + " (q%b)))", (int)buflen, buffer))) + { + fail ("error building s-exp for test %d, %s: %s", + testno, "pk", gpg_strerror (err)); + goto leave; + } + + xfree (buffer); + buffer = NULL; + + if ((err = gcry_pk_encrypt (&s_result, s_data, s_pk))) + fail ("gcry_pk_encrypt failed for test %d: %s", testno, + gpg_strerror (err)); + + s_tmp = gcry_sexp_find_token (s_result, "s", 0); + if (!s_tmp || !(res = gcry_sexp_nth_buffer (s_tmp, 1, &res_len))) + fail ("gcry_pk_encrypt failed for test %d: %s", testno, "missing value"); + else + { + char *r, *r0; + int i; + + /* To skip the prefix 0x40, for-loop start with i=1 */ + r0 = r = xmalloc (2*(res_len)+1); + if (!r0) + { + fail ("memory allocation for test %d", testno); + goto leave; + } + + for (i=1; i < res_len; i++, r += 2) + snprintf (r, 3, "%02x", res[i]); + if (strcmp (result_str, r0)) + { + fail ("gcry_pk_encrypt failed for test %d: %s", + testno, "wrong value returned"); + info (" expected: '%s'", result_str); + info (" got: '%s'", r0); + } + xfree (r0); + } + + leave: + xfree (res); + gcry_mpi_release (mpi_k); + gcry_sexp_release (s_tmp); + gcry_sexp_release (s_result); + gcry_sexp_release (s_data); + gcry_sexp_release (s_pk); + xfree (buffer); +} + +/* + * Test X448 functionality through the API for X448. + * + * Input: K (as hex string), U (as hex string), R (as hex string) + * + * where R is expected result of X448 (K, U). + * + */ +static void +test_cv_x448 (int testno, const char *k_str, const char *u_str, + const char *result_str) +{ + gpg_error_t err; + void *scalar; + void *point = NULL; + size_t buflen; + unsigned char result[56]; + char result_hex[113]; + int i; + + if (verbose > 1) + info ("Running test %d\n", testno); + + if (!(scalar = hex2buffer (k_str, &buflen)) || buflen != 56) + { + fail ("error building s-exp for test %d, %s: %s", + testno, "k", "invalid hex string"); + goto leave; + } + + if (!(point = hex2buffer (u_str, &buflen)) || buflen != 56) + { + fail ("error building s-exp for test %d, %s: %s", + testno, "u", "invalid hex string"); + goto leave; + } + + if ((err = gcry_ecc_mul_point (GCRY_ECC_CURVE448, result, scalar, point))) + fail ("gcry_ecc_mul_point failed for test %d: %s", testno, + gpg_strerror (err)); + + for (i=0; i < 56; i++) + snprintf (&result_hex[i*2], 3, "%02x", result[i]); + + if (strcmp (result_str, result_hex)) + { + fail ("gcry_ecc_mul_point failed for test %d: %s", + testno, "wrong value returned"); + info (" expected: '%s'", result_str); + info (" got: '%s'", result_hex); + } + + leave: + xfree (scalar); + xfree (point); +} + +static void +test_cv (int testno, const char *k_str, const char *u_str, + const char *result_str) +{ + test_cv_hl (testno, k_str, u_str, result_str); + test_cv_x448 (testno, k_str, u_str, result_str); +} + +/* + * Test iterative X448 computation through lower layer MPI routines. + * + * Input: K (as hex string), ITER, R (as hex string) + * + * where R is expected result of iterating X448 by ITER times. 
+ * + */ +static void +test_it (int testno, const char *k_str, int iter, const char *result_str) +{ + gcry_ctx_t ctx; + gpg_error_t err; + void *buffer = NULL; + size_t buflen; + gcry_mpi_t mpi_k = NULL; + gcry_mpi_t mpi_x = NULL; + gcry_mpi_point_t P = NULL; + gcry_mpi_point_t Q; + int i; + gcry_mpi_t mpi_kk = NULL; + + if (verbose > 1) + info ("Running test %d: iteration=%d\n", testno, iter); + + gcry_mpi_ec_new (&ctx, NULL, "Curve448"); + Q = gcry_mpi_point_new (0); + + if (!(buffer = hex2buffer (k_str, &buflen)) || buflen != 56) + { + fail ("error scanning MPI for test %d, %s: %s", + testno, "k", "invalid hex string"); + goto leave; + } + reverse_buffer (buffer, buflen); + if ((err = gcry_mpi_scan (&mpi_x, GCRYMPI_FMT_USG, buffer, buflen, NULL))) + { + fail ("error scanning MPI for test %d, %s: %s", + testno, "x", gpg_strerror (err)); + goto leave; + } + + xfree (buffer); + buffer = NULL; + + P = gcry_mpi_point_set (NULL, mpi_x, NULL, GCRYMPI_CONST_ONE); + + mpi_k = gcry_mpi_copy (mpi_x); + if (debug) + print_mpi ("k", mpi_k); + + for (i = 0; i < iter; i++) + { + /* + * Another variant of decodeScalar448 thing. + */ + mpi_kk = gcry_mpi_set (mpi_kk, mpi_k); + gcry_mpi_set_bit (mpi_kk, 447); + gcry_mpi_clear_bit (mpi_kk, 0); + gcry_mpi_clear_bit (mpi_kk, 1); + + gcry_mpi_ec_mul (Q, mpi_kk, P, ctx); + + P = gcry_mpi_point_set (P, mpi_k, NULL, GCRYMPI_CONST_ONE); + gcry_mpi_ec_get_affine (mpi_k, NULL, Q, ctx); + + if (debug) + print_mpi ("k", mpi_k); + } + + { + unsigned char res[56]; + char *r, *r0; + + gcry_mpi_print (GCRYMPI_FMT_USG, res, 56, NULL, mpi_k); + reverse_buffer (res, 56); + + r0 = r = xmalloc (113); + if (!r0) + { + fail ("memory allocation for test %d", testno); + goto leave; + } + + for (i=0; i < 56; i++, r += 2) + snprintf (r, 3, "%02x", res[i]); + + if (strcmp (result_str, r0)) + { + fail ("X448 failed for test %d: %s", + testno, "wrong value returned"); + info (" expected: '%s'", result_str); + info (" got: '%s'", r0); + } + xfree (r0); + } + + leave: + gcry_mpi_release (mpi_kk); + gcry_mpi_release (mpi_k); + gcry_mpi_point_release (P); + gcry_mpi_release (mpi_x); + xfree (buffer); + gcry_mpi_point_release (Q); + gcry_ctx_release (ctx); +} + +/* + * X-coordinate of generator of the X448. + */ +#define G_X ("0500000000000000000000000000000000000000000000000000000000000000" \ + "000000000000000000000000000000000000000000000000") + +/* + * Test Diffie-Hellman in RFC-7748. + * + * Note that it's not like the ECDH of OpenPGP, where we use + * ephemeral public key. + */ +static void +test_dh (int testno, const char *a_priv_str, const char *a_pub_str, + const char *b_priv_str, const char *b_pub_str, + const char *result_str) +{ + /* Test A for private key corresponds to public key. */ + test_cv (testno, a_priv_str, G_X, a_pub_str); + /* Test B for private key corresponds to public key. */ + test_cv (testno, b_priv_str, G_X, b_pub_str); + /* Test DH with A's private key and B's public key. */ + test_cv (testno, a_priv_str, b_pub_str, result_str); + /* Test DH with B's private key and A's public key. */ + test_cv (testno, b_priv_str, a_pub_str, result_str); +} + + +static void +check_x448 (void) +{ + int ntests; + + info ("Checking X448.\n"); + + ntests = 0; + + /* + * Values are cited from RFC-7748: 5.2. Test Vectors. + * Following two tests are for the first type test. 
+ */ + test_cv (1, + "3d262fddf9ec8e88495266fea19a34d28882acef045104d0d1aae121" + "700a779c984c24f8cdd78fbff44943eba368f54b29259a4f1c600ad3", + "06fce640fa3487bfda5f6cf2d5263f8aad88334cbd07437f020f08f9" + "814dc031ddbdc38c19c6da2583fa5429db94ada18aa7a7fb4ef8a086", + "ce3e4ff95a60dc6697da1db1d85e6afbdf79b50a2412d7546d5f239f" + "e14fbaadeb445fc66a01b0779d98223961111e21766282f73dd96b6f"); + ntests++; + test_cv (2, + "203d494428b8399352665ddca42f9de8fef600908e0d461cb021f8c5" + "38345dd77c3e4806e25f46d3315c44e0a5b4371282dd2c8d5be3095f", + "0fbcc2f993cd56d3305b0b7d9e55d4c1a8fb5dbb52f8e9a1e9b6201b" + "165d015894e56c4d3570bee52fe205e28a78b91cdfbde71ce8d157db", + "884a02576239ff7a2f2f63b2db6a9ff37047ac13568e1e30fe63c4a7" + "ad1b3ee3a5700df34321d62077e63633c575c1c954514e99da7c179d"); + ntests++; + + /* + * Additional test. Value is from second type test. + */ + test_cv (3, + G_X, + G_X, + "3f482c8a9f19b01e6c46ee9711d9dc14fd4bf67af30765c2ae2b846a" + "4d23a8cd0db897086239492caf350b51f833868b9bc2b3bca9cf4113"); + ntests++; + + /* + * Following two tests are for the second type test, + * with one iteration and 1,000 iterations. (1,000,000 iterations + * takes too long.) + */ + test_it (4, + G_X, + 1, + "3f482c8a9f19b01e6c46ee9711d9dc14fd4bf67af30765c2ae2b846a" + "4d23a8cd0db897086239492caf350b51f833868b9bc2b3bca9cf4113"); + ntests++; + + test_it (5, + G_X, + 1000, + "aa3b4749d55b9daf1e5b00288826c467274ce3ebbdd5c17b975e09d4" + "af6c67cf10d087202db88286e2b79fceea3ec353ef54faa26e219f38"); + ntests++; + + /* + * Last test is from: 6. Diffie-Hellman, 6.2. Curve448 + */ + test_dh (6, + /* Alice's private key, a */ + "9a8f4925d1519f5775cf46b04b5800d4ee9ee8bae8bc5565d498c28d" + "d9c9baf574a9419744897391006382a6f127ab1d9ac2d8c0a598726b", + /* Alice's public key, X448(a, 5) */ + "9b08f7cc31b7e3e67d22d5aea121074a273bd2b83de09c63faa73d2c" + "22c5d9bbc836647241d953d40c5b12da88120d53177f80e532c41fa0", + /* Bob's private key, b */ + "1c306a7ac2a0e2e0990b294470cba339e6453772b075811d8fad0d1d" + "6927c120bb5ee8972b0d3e21374c9c921b09d1b0366f10b65173992d", + /* Bob's public key, X448(b, 5) */ + "3eb7a829b0cd20f5bcfc0b599b6feccf6da4627107bdb0d4f345b430" + "27d8b972fc3e34fb4232a13ca706dcb57aec3dae07bdc1c67bf33609", + /* Their shared secret, K */ + "07fff4181ac6cc95ec1c16a94a0f74d12da232ce40a77552281d282b" + "b60c0b56fd2464c335543936521c24403085d59a449a5037514a879d"); + ntests++; + + /* three tests which results 0. 
*/ + test_cv (7, + "3d262fddf9ec8e88495266fea19a34d28882acef045104d0d1aae121" + "700a779c984c24f8cdd78fbff44943eba368f54b29259a4f1c600ad3", + "00000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000000", + "00000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000000"); + ntests++; + + test_cv (8, + "3d262fddf9ec8e88495266fea19a34d28882acef045104d0d1aae121" + "700a779c984c24f8cdd78fbff44943eba368f54b29259a4f1c600ad3", + "01000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000000", + "00000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000000"); + ntests++; + + test_cv (9, + "3d262fddf9ec8e88495266fea19a34d28882acef045104d0d1aae121" + "700a779c984c24f8cdd78fbff44943eba368f54b29259a4f1c600ad3", + "feffffffffffffffffffffffffffffffffffffffffffffffffffffff" + "feffffffffffffffffffffffffffffffffffffffffffffffffffffff", + "00000000000000000000000000000000000000000000000000000000" + "00000000000000000000000000000000000000000000000000000000"); + ntests++; + + if (ntests != N_TESTS) + fail ("did %d tests but expected %d", ntests, N_TESTS); + else if ((ntests % 256)) + show_note ("%d tests done\n", ntests); +} + + +int +main (int argc, char **argv) +{ + int last_argc = -1; + + if (argc) + { argc--; argv++; } + + while (argc && last_argc != argc ) + { + last_argc = argc; + if (!strcmp (*argv, "--")) + { + argc--; argv++; + break; + } + else if (!strcmp (*argv, "--help")) + { + fputs ("usage: " PGM " [options]\n" + "Options:\n" + " --verbose print timings etc.\n" + " --debug flyswatter\n", + stdout); + exit (0); + } + else if (!strcmp (*argv, "--verbose")) + { + verbose++; + argc--; argv++; + } + else if (!strcmp (*argv, "--debug")) + { + verbose += 2; + debug++; + argc--; argv++; + } + else if (!strncmp (*argv, "--", 2)) + die ("unknown option '%s'", *argv); + } + + xgcry_control ((GCRYCTL_DISABLE_SECMEM, 0)); + if (!gcry_check_version (GCRYPT_VERSION)) + die ("version mismatch\n"); + if (debug) + xgcry_control ((GCRYCTL_SET_DEBUG_FLAGS, 1u , 0)); + xgcry_control ((GCRYCTL_ENABLE_QUICK_RANDOM, 0)); + xgcry_control ((GCRYCTL_INITIALIZATION_FINISHED, 0)); + + start_timer (); + check_x448 (); + stop_timer (); + + info ("All tests completed in %s. Errors: %d\n", + elapsed_time (1), error_count); + return !!error_count; +} -- 2.20.1 From wk at gnupg.org Thu Sep 26 10:02:20 2019 From: wk at gnupg.org (Werner Koch) Date: Thu, 26 Sep 2019 10:02:20 +0200 Subject: [PATCH] ecc: Add Curve448. In-Reply-To: <87sgoky6cb.fsf@jumper.gniibe.org> (NIIBE Yutaka's message of "Wed, 25 Sep 2019 20:15:48 +0900") References: <87sgoky6cb.fsf@jumper.gniibe.org> Message-ID: <87muerwkmr.fsf@wheatstone.g10code.de> On Wed, 25 Sep 2019 20:15, gniibe at fsij.org said: > I'm going to push this change to master, so that we can use Curve448 > encryption by GnuPG in near future. :-) > - { "X448", "1.3.101.111" }, /* rfc8410 */ > #endif > > + { "Curve448", "1.3.101.111" }, /* X448 in rfc8410 */ Would you mind to briefly explain the name change? Shalom-Salam, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: 

From gniibe at fsij.org  Thu Sep 26 11:19:23 2019
From: gniibe at fsij.org (Niibe Yutaka)
Date: Thu, 26 Sep 2019 18:19:23 +0900
Subject: [PATCH] ecc: Add Curve448.
In-Reply-To: <87muerwkmr.fsf@wheatstone.g10code.de>
References: <87sgoky6cb.fsf@jumper.gniibe.org> <87muerwkmr.fsf@wheatstone.g10code.de>
Message-ID: <87zhirs9d0.fsf@jumper.gniibe.org>

Werner Koch wrote:
>> - { "X448", "1.3.101.111" }, /* rfc8410 */
>> #endif
>>
>> + { "Curve448", "1.3.101.111" }, /* X448 in rfc8410 */
>
> Would you mind to briefly explain the name change?

Well, this is an important point.  In this patch, I kind of abuse an
OID which is defined in RFC8410 for the X448 algorithm.  Perhaps,
ideally for OpenPGP, we would need an independent OID for the curve.

For other curves (than Curve448), an OID represents a specific curve.
In RFC8410, (IIUC), an OID is assigned to an algorithm (signature
algorithm, or key-agreement algorithm).  It does not represent
a curve.  That's my understanding.  Please correct if I am wrong.

The name change is needed, I suppose, because the key for ECDH and the
ECDH packet in OpenPGP will be big-endian, if we follow the existing
way with ECC curves (NIST ones, Brainpool ones, and Curve25519).  Note
that for Curve25519, it is not in the native format of X25519.  So, I
intend that it will be the same, that is, it will not be in the native
format of X448.  (How come?  That's because when ECDH with Curve25519
was implemented for OpenPGP, there was no X25519 defined.)

It is also possible to introduce a new thing for X448.  We can define
X448 key and ECDH with X448 in OpenPGP, in the native format of X448.

I can change the patch accordingly, if the latter is preferred.  I'm
open to any changes.
-- 

From dkg at fifthhorseman.net  Mon Sep 30 19:26:36 2019
From: dkg at fifthhorseman.net (Daniel Kahn Gillmor)
Date: Mon, 30 Sep 2019 19:26:36 +0200
Subject: [PATCH] ecc: Add Curve448.
In-Reply-To: <87zhirs9d0.fsf@jumper.gniibe.org>
References: <87sgoky6cb.fsf@jumper.gniibe.org> <87muerwkmr.fsf@wheatstone.g10code.de> <87zhirs9d0.fsf@jumper.gniibe.org>
Message-ID: <87eezxzodv.fsf@fifthhorseman.net>

On Thu 2019-09-26 18:19:23 +0900, Niibe Yutaka wrote:
> In RFC8410, (IIUC), an OID is assigned to an algorithm (signature
> algorithm, or key-agreement algorithm).  It does not represent
> a curve.  That's my understanding.  Please correct if I am wrong.

I believe this is correct.

> The name change is needed, I suppose, because the key for ECDH and the
> ECDH packet in OpenPGP will be big-endian, if we follow the existing
> way with ECC curves (NIST ones, Brainpool ones, and Curve25519).  Note
> that for Curve25519, it is not in the native format of X25519.  So, I
> intend that it will be the same, that is, it will not be in the native
> format of X448.

I'm not convinced that this kind of consistency is the right way to go.
In the world beyond gcrypt, points on goldilocks (curve 448) and curve
25519 have a canonical bytestring representation used by every other
implementation.  Requiring consistency *within* gcrypt between 25519
and 448 means requiring inconsistency between gcrypt and the larger
world for 448.

I'd rather see gcrypt use the standard representations for as many
points as possible.

> It is also possible to introduce a new thing for X448.  We can define
> X448 key and ECDH with X448 in OpenPGP, in the native format of X448.
>
> I can change the patch accordingly, if the latter is preferred.

I lean toward using the "native" format where possible, even if that
leaves 25519 as an outlier.

       --dkg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: 
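
[Editorial note, not part of the thread above: to make the
representation question concrete, here is a minimal sketch of how a
caller might bridge the RFC 7748 little-endian wire encoding and the
big-endian byte order expected by gcry_mpi_scan with GCRYMPI_FMT_USG.
It mirrors the reverse_buffer() step in the patch's tests/t-cv448.c;
the helper name x448_wire_to_mpi is invented for illustration.]

/* Sketch only: convert a 56-byte little-endian X448 wire value
 * (RFC 7748) into an MPI by reversing it into big-endian bytes first,
 * the same conversion the t-cv448.c tests perform before calling
 * gcry_mpi_scan.  */
#include <string.h>
#include <gcrypt.h>

/* Reverse a buffer in place (little-endian <-> big-endian bytes).  */
static void
reverse_buffer (unsigned char *buf, size_t len)
{
  size_t i;

  for (i = 0; i < len / 2; i++)
    {
      unsigned char t = buf[i];

      buf[i] = buf[len - 1 - i];
      buf[len - 1 - i] = t;
    }
}

/* Hypothetical helper; the name is made up for illustration.  */
static gcry_error_t
x448_wire_to_mpi (gcry_mpi_t *r_mpi, const unsigned char *wire56)
{
  unsigned char be[56];

  memcpy (be, wire56, 56);
  reverse_buffer (be, 56);   /* wire (little-endian) -> big-endian */
  return gcry_mpi_scan (r_mpi, GCRYMPI_FMT_USG, be, 56, NULL);
}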