From jussi.kivilinna at iki.fi Mon Aug 12 21:40:49 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Mon, 12 Aug 2019 22:40:49 +0300
Subject: [PATCH 3/5] lib/mpi: Fix for building for MIPS32 with Clang
In-Reply-To: <20190812171448.GA10039@archlinux-threadripper>
References: <20190812033120.43013-1-natechancellor@gmail.com> <20190812033120.43013-4-natechancellor@gmail.com> <20190812171448.GA10039@archlinux-threadripper>
Message-ID: <1ba05172-500b-6b42-00ad-27fb33eff070@iki.fi>

Hello,

On 12.8.2019 20.14, Nathan Chancellor wrote:
> On Mon, Aug 12, 2019 at 10:35:53AM +0300, Jussi Kivilinna wrote:
>> Hello,
>>
>> On 12.8.2019 6.31, Nathan Chancellor wrote:
>>> From: Vladimir Serbinenko
>>>
>>> clang doesn't recognise =l / =h assembly operand specifiers but apparently
>>> handles C version well.
>>>
>>> lib/mpi/generic_mpih-mul1.c:37:24: error: invalid use of a cast in a
>>> inline asm context requiring an l-value: remove the cast or build with
>>> -fheinous-gnu-extensions
>>> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb);
>>> ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> lib/mpi/longlong.h:652:20: note: expanded from macro 'umul_ppmm'
>>> : "=l" ((USItype)(w0)), \
>>> ~~~~~~~~~~^~~
>>> lib/mpi/generic_mpih-mul1.c:37:3: error: invalid output constraint '=h'
>>> in asm
>>> umul_ppmm(prod_high, prod_low, s1_ptr[j], s2_limb);
>>> ^
>>> lib/mpi/longlong.h:653:7: note: expanded from macro 'umul_ppmm'
>>> "=h" ((USItype)(w1)) \
>>> ^
>>> 2 errors generated.
>>>
>>> Fixes: 5ce3e312ec5c ("crypto: GnuPG based MPI lib - header files (part 2)")
>>> Link: https://github.com/ClangBuiltLinux/linux/issues/605
>>> Link: https://github.com/gpg/libgcrypt/commit/1ecbd0bca31d462719a2a6590c1d03244e76ef89
>>> Signed-off-by: Vladimir Serbinenko
>>> [jk: add changelog, rebase on libgcrypt repository, reformat changed
>>> line so it does not go over 80 characters]
>>> Signed-off-by: Jussi Kivilinna
>>
>> This is my signed-off-by for libgcrypt project, not kernel. I do not think
>> signed-offs can be passed from other projects in this way.
>>
>> -Jussi
>
> Hi Jussi,
>
> I am no signoff expert but if I am reading the developer certificate of
> origin in the libgcrypt repo correctly [1], your signoff on this commit
> falls under:
>
> (d) I understand and agree that this project and the contribution
> are public and that a record of the contribution (including all
> personal information I submit with it, including my sign-off) is
> maintained indefinitely and may be redistributed consistent with
> this project or the open source license(s) involved.

There is nothing wrong with the commit in the libgcrypt repo and/or my
libgcrypt DCO sign-off.

>
> This file is maintained under the LGPL because it was taken straight
> from the libgcrypt repo and per (b), I can submit this commit here
> with everything intact.

But you do not have my kernel DCO sign-off for this patch. I have not been
involved with this kernel patch in any way: I have not integrated it into
the kernel, and I have not tested it on the kernel. I do not own it.
However, with this Signed-off-by line you have involved me in the kernel
patch process, in which I am not interested for this patch. So, to be
clear, I retract my kernel DCO sign-off for this kernel patch:

NOT-Signed-off-by: Jussi Kivilinna

Of course you can copy the original libgcrypt commit message to this patch,
but I think it needs to be clearly quoted so that my libgcrypt DCO
Signed-off-by line won't be mixed with the kernel DCO Signed-off-by lines.
>
> However, I don't want to upset you in any way though so if you are not
> comfortable with that, I suppose I can remove it as if Vladimir
> submitted this fix to me directly (as I got permission for his signoff).
> I need to resubmit this fix to an appropriate maintainer so let me know
> what you think.

That's quite a complicated approach. A faster and easier process would be if
you just own the patch yourself. Libgcrypt (and the target file in libgcrypt)
is LGPL v2.1+, so the license is compatible with the kernel and you are good
to go with just your own (kernel DCO) signed-off-by.

-Jussi

>
> [1]: https://github.com/gpg/libgcrypt/blob/3bb858551cd5d84e43b800edfa2b07d1529718a9/doc/DCO
>
> Cheers,
> Nathan
>

From natechancellor at gmail.com Mon Aug 12 21:45:04 2019
From: natechancellor at gmail.com (Nathan Chancellor)
Date: Mon, 12 Aug 2019 12:45:04 -0700
Subject: [PATCH 3/5] lib/mpi: Fix for building for MIPS32 with Clang
In-Reply-To: <1ba05172-500b-6b42-00ad-27fb33eff070@iki.fi>
References: <20190812033120.43013-1-natechancellor@gmail.com> <20190812033120.43013-4-natechancellor@gmail.com> <20190812171448.GA10039@archlinux-threadripper> <1ba05172-500b-6b42-00ad-27fb33eff070@iki.fi>
Message-ID: <20190812194504.GA121197@archlinux-threadripper>

On Mon, Aug 12, 2019 at 10:40:49PM +0300, Jussi Kivilinna wrote:
> That's quite a complicated approach. A faster and easier process would be if
> you just own the patch yourself. Libgcrypt (and the target file in libgcrypt)
> is LGPL v2.1+, so the license is compatible with the kernel and you are good
> to go with just your own (kernel DCO) signed-off-by.
>
> -Jussi

I have gone this route, as another developer pointed out that we can
eliminate all of the inline asm umul_ppmm definitions because the kernel
requires GCC 4.6 and newer, and that is completely different from the
libgcrypt patches.

https://lore.kernel.org/lkml/20190812193256.55103-1-natechancellor at gmail.com/

Thanks for weighing in and cheers!
Nathan

From jussi.kivilinna at iki.fi Tue Aug 20 19:55:42 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 20 Aug 2019 20:55:42 +0300
Subject: [PATCH 1/5] PowerPC optimized routines for AES and SHA2 using PowerISA
In-Reply-To: <45739881563406877@sas2-b4ed770db137.qloud-c.yandex.net>
References: <20190709145813.24903-1-shawn@git.icu> <878st6hqdf.fsf@wheatstone.g10code.de> <45739881563406877@sas2-b4ed770db137.qloud-c.yandex.net>
Message-ID:

Hello,

I've started implementing a PowerPC vector crypto AES implementation without
cryptogams, on top of your patch "re-implement single-block mode, and
implement OCB block cipher". I expect the AES work to be completed by the
end of this month, SHA2 in September and GHASH in Sep~Oct.

-Jussi

On 18.7.2019 2.41, Shawn Landden wrote:
>
>
> 09.07.2019, 16:05, "Werner Koch" :
>> Hi!
>>
>> On Tue, 9 Jul 2019 09:58, shawn at git.icu said:
>>
>>> From CRYPTOGAMS https://www.openssl.org/~appro/cryptogams/
>>
>> I had a quick look at the license and I can't see that this license
>> allows the inclusion into Libgcrypt which is available under the GNU
>> LGPL.
>>
>>   Copyright (c) 2006-2017, CRYPTOGAMS by
>>   All rights reserved.
>>
>> The copyright line does not seem to identify a holder of the copyright.
>> Well, unless "CRYPTOGAMS by " is a legal entity.
>>
>>   Redistribution and use in source and binary forms, with or without
>>   modification, are permitted provided that the following conditions are
>>   met:
>>
>>   * Redistributions of source code must retain copyright notices, this
>>     list of conditions and the following disclaimer.
>>
>>   * Redistributions in binary form must reproduce the above copyright
>>     notice, this list of conditions and the following disclaimer in the
>>     documentation and/or other materials provided with the distribution.
>>
>>   * Neither the name of the CRYPTOGAMS nor the names of its copyright
>>     holder and contributors may be used to endorse or promote products
>>     derived from this software without specific prior written
>>     permission.
>>
>> This looks like a standard BSD license but I didn't check it.

> No, this does not require attribution. It is not the dreaded old OpenSSL license. You are looking right at it,
> and it does not have the line you are imagining, which looks like this:
>
> 3. All advertising materials mentioning features or use of this software
> must display the following acknowledgement:
> This product includes software developed by the Regents of The University of California, Berkeley.

>>
>>   ALTERNATIVELY, provided that this notice is retained in full, this
>>   product may be distributed under the terms of the GNU General Public
>>   License (GPL), in which case the provisions of the GPL apply INSTEAD
>>   OF those given above.
>>
>> Problems with this: It allows only the GNU GPL, does not provide a version
>> of it, and seems to be invalidating itself because it adds a further
>> restriction to the GPL, namely that "this notice is retained in full".
>> Anyway, the GPL is too restrictive for Libgcrypt.
>>
>> Whether we can exceptionally add BSD code needs to be discussed but has
>> the very practical problem that all users of Libgcrypt need to update all
>> their documentation to include the required statements and copyright
>> notices for the BSD license.
>>
>> I am sorry for this bad news and I hope a solution can be found.
>> Either by removing all OpenSSL code or by asking the original author to
>> change to a more usable and more standard license.
>>
>> Shalom-Salam,
>>
>>    Werner
>>
>> --
>> Thoughts are free. Exceptions are regulated by a federal law.
>

From jussi.kivilinna at iki.fi Fri Aug 23 18:52:00 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 23 Aug 2019 19:52:00 +0300 Subject: [PATCH 1/6] hwf: add detection of PowerPC hardware features Message-ID: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> From: Shawn Landden * src/Makefile.am: PowerPC hardware detection. * src/g10lib.h: Likewise. * src/hwf-common.h: Likewise. * src/hwf-ppc.c: Likewise. * src/hwfeatures.c: Likewise. * configure.ac: Likewise. -- [jk: split PowerPC HW features to separate patch, from https://lists.gnupg.org/pipermail/gcrypt-devel/2019-July/004769.html] [jk: disable __builtin_cpu_supports usage for now] Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/configure.ac b/configure.ac index e8c8cd39c..6980f381a 100644 --- a/configure.ac +++ b/configure.ac @@ -669,6 +669,14 @@ AC_ARG_ENABLE(arm-crypto-support, armcryptosupport=$enableval,armcryptosupport=yes) AC_MSG_RESULT($armcryptosupport) +# Implementation of the --disable-ppc-crypto-support switch. 
+AC_MSG_CHECKING([whether PPC crypto support is requested]) +AC_ARG_ENABLE(ppc-crypto-support, + AC_HELP_STRING([--disable-ppc-crypto-support], + [Disable support for the PPC crypto instructions introduced in POWER 8 (PowerISA 2.07)]), + ppccryptosupport=$enableval,ppccryptosupport=yes) +AC_MSG_RESULT($ppccryptosupport) + # Implementation of the --disable-O-flag-munging switch. AC_MSG_CHECKING([whether a -O flag munging is requested]) AC_ARG_ENABLE([O-flag-munging], @@ -1267,6 +1275,9 @@ if test "$mpi_cpu_arch" != "arm" ; then fi fi +if test "$mpi_cpu_arch" != "ppc"; then + ppccryptosupport="n/a" +fi ############################################# #### #### @@ -2107,6 +2118,10 @@ if test x"$armcryptosupport" = xyes ; then AC_DEFINE(ENABLE_ARM_CRYPTO_SUPPORT,1, [Enable support for ARMv8 Crypto Extension instructions.]) fi +if test x"$ppccryptosupport" = xyes ; then + AC_DEFINE(ENABLE_PPC_CRYPTO_SUPPORT,1, + [Enable support for POWER 8 (PowerISA 2.07) crypto extension.]) +fi if test x"$jentsupport" = xyes ; then AC_DEFINE(ENABLE_JENT_SUPPORT, 1, [Enable support for the jitter entropy collector.]) @@ -2687,6 +2702,7 @@ case "$mpi_cpu_arch" in ;; ppc) AC_DEFINE(HAVE_CPU_ARCH_PPC, 1, [Defined for PPC platforms]) + GCRYPT_HWF_MODULES="libgcrypt_la-hwf-ppc.lo" ;; arm) AC_DEFINE(HAVE_CPU_ARCH_ARM, 1, [Defined for ARM platforms]) @@ -2788,6 +2804,7 @@ GCRY_MSG_SHOW([Try using Intel AVX: ],[$avxsupport]) GCRY_MSG_SHOW([Try using Intel AVX2: ],[$avx2support]) GCRY_MSG_SHOW([Try using ARM NEON: ],[$neonsupport]) GCRY_MSG_SHOW([Try using ARMv8 crypto: ],[$armcryptosupport]) +GCRY_MSG_SHOW([Try using PPC crypto: ],[$ppccryptosupport]) GCRY_MSG_SHOW([],[]) if test "x${gpg_config_script_warn}" != x; then diff --git a/src/Makefile.am b/src/Makefile.am index 82d6e8a0e..5d347a2a7 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -66,7 +66,7 @@ libgcrypt_la_SOURCES = \ hmac256.c hmac256.h context.c context.h \ ec-context.h -EXTRA_libgcrypt_la_SOURCES = hwf-x86.c hwf-arm.c +EXTRA_libgcrypt_la_SOURCES = hwf-x86.c hwf-arm.c hwf-ppc.c gcrypt_hwf_modules = @GCRYPT_HWF_MODULES@ diff --git a/src/g10lib.h b/src/g10lib.h index 694c2d83e..41e18c137 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -236,7 +236,7 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_ARM_SHA2 (1 << 20) #define HWF_ARM_PMULL (1 << 21) - +#define HWF_PPC_VCRYPTO (1 << 22) gpg_err_code_t _gcry_disable_hw_feature (const char *name); void _gcry_detect_hw_features (void); diff --git a/src/hwf-common.h b/src/hwf-common.h index 8f156b564..76f346e94 100644 --- a/src/hwf-common.h +++ b/src/hwf-common.h @@ -22,6 +22,6 @@ unsigned int _gcry_hwf_detect_x86 (void); unsigned int _gcry_hwf_detect_arm (void); - +unsigned int _gcry_hwf_detect_ppc (void); #endif /*HWF_COMMON_H*/ diff --git a/src/hwf-ppc.c b/src/hwf-ppc.c new file mode 100644 index 000000000..1bf2edf70 --- /dev/null +++ b/src/hwf-ppc.c @@ -0,0 +1,234 @@ +/* hwf-ppc.c - Detect hardware features - PPC part + * Copyright (C) 2013,2019 Jussi Kivilinna + * Copyright (C) 2019 Shawn Landden + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. 
+ * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include +#include +#include +#include +#include +#include +#include +#if defined(HAVE_SYS_AUXV_H) && (defined(HAVE_GETAUXVAL) || \ + defined(HAVE_ELF_AUX_INFO)) +#include +#endif + +#include "g10lib.h" +#include "hwf-common.h" + +#if !defined (__powerpc__) && !defined (__powerpc64__) +# error Module build for wrong CPU. +#endif + + +#if defined(HAVE_SYS_AUXV_H) && defined(HAVE_ELF_AUX_INFO) && \ + !defined(HAVE_GETAUXVAL) && defined(AT_HWCAP) +#define HAVE_GETAUXVAL +static unsigned long getauxval(unsigned long type) +{ + unsigned long auxval = 0; + int err; + + /* FreeBSD provides 'elf_aux_info' function that does the same as + * 'getauxval' on Linux. */ + + err = elf_aux_info (type, &auxval, sizeof(auxval)); + if (err) + { + errno = err; + auxval = 0; + } + + return auxval; +} +#endif + + +#undef HAS_SYS_AT_HWCAP +#if defined(__linux__) || \ + (defined(HAVE_SYS_AUXV_H) && defined(HAVE_GETAUXVAL)) +#define HAS_SYS_AT_HWCAP 1 + +struct feature_map_s + { + unsigned int hwcap_flag; + unsigned int hwcap2_flag; + const char *feature_match; + unsigned int hwf_flag; + }; + +#if defined(__powerpc__) || defined(__powerpc64__) + +/* Note: These macros have same values on Linux and FreeBSD. */ +#ifndef AT_HWCAP +# define AT_HWCAP 16 +#endif +#ifndef AT_HWCAP2 +# define AT_HWCAP2 26 +#endif + +#ifndef PPC_FEATURE2_VEC_CRYPTO +# define PPC_FEATURE2_VEC_CRYPTO 0x02000000 +#endif + +static const struct feature_map_s ppc_features[] = + { +#ifdef ENABLE_PPC_CRYPTO_SUPPORT + { 0, PPC_FEATURE2_VEC_CRYPTO, " crypto", HWF_PPC_VCRYPTO }, +#endif + }; +#endif + +static int +get_hwcap(unsigned int *hwcap, unsigned int *hwcap2) +{ + struct { unsigned long a_type; unsigned long a_val; } auxv; + FILE *f; + int err = -1; + static int hwcap_initialized = 0; + static unsigned int stored_hwcap = 0; + static unsigned int stored_hwcap2 = 0; + + if (hwcap_initialized) + { + *hwcap = stored_hwcap; + *hwcap2 = stored_hwcap2; + return 0; + } + +#if 0 // TODO: configure.ac detection for __builtin_cpu_supports +#if defined(__GLIBC__) && defined(__GNUC__) +#if __GNUC__ >= 6 + /* Returns 0 if glibc support doesn't exist, so we can + * only trust positive results. This function will need updating + * if we ever need more than one cpu feature. 
+ */ + // TODO: fix, false if ENABLE_PPC_CRYPTO_SUPPORT + if (sizeof(ppc_features)/sizeof(ppc_features[0]) == 0) { + if (__builtin_cpu_supports("vcrypto")) { + stored_hwcap = 0; + stored_hwcap2 = PPC_FEATURE2_VEC_CRYPTO; + hwcap_initialized = 1; + return 0; + } + } +#endif +#endif +#endif + +#if defined(HAVE_SYS_AUXV_H) && defined(HAVE_GETAUXVAL) + errno = 0; + auxv.a_val = getauxval (AT_HWCAP); + if (errno == 0) + { + stored_hwcap |= auxv.a_val; + hwcap_initialized = 1; + } + + if (AT_HWCAP2 >= 0) + { + errno = 0; + auxv.a_val = getauxval (AT_HWCAP2); + if (errno == 0) + { + stored_hwcap2 |= auxv.a_val; + hwcap_initialized = 1; + } + } + + if (hwcap_initialized && (stored_hwcap || stored_hwcap2)) + { + *hwcap = stored_hwcap; + *hwcap2 = stored_hwcap2; + return 0; + } +#endif + + f = fopen("/proc/self/auxv", "r"); + if (!f) + { + *hwcap = stored_hwcap; + *hwcap2 = stored_hwcap2; + return -1; + } + + while (fread(&auxv, sizeof(auxv), 1, f) > 0) + { + if (auxv.a_type == AT_HWCAP) + { + stored_hwcap |= auxv.a_val; + hwcap_initialized = 1; + } + + if (auxv.a_type == AT_HWCAP2) + { + stored_hwcap2 |= auxv.a_val; + hwcap_initialized = 1; + } + } + + if (hwcap_initialized) + err = 0; + + fclose(f); + *hwcap = stored_hwcap; + *hwcap2 = stored_hwcap2; + return err; +} + +static unsigned int +detect_ppc_at_hwcap(void) +{ + unsigned int hwcap; + unsigned int hwcap2; + unsigned int features = 0; + unsigned int i; + + if (get_hwcap(&hwcap, &hwcap2) < 0) + return features; + + for (i = 0; i < DIM(ppc_features); i++) + { + if (hwcap & ppc_features[i].hwcap_flag) + features |= ppc_features[i].hwf_flag; + + if (hwcap2 & ppc_features[i].hwcap2_flag) + features |= ppc_features[i].hwf_flag; + } + + return features; +} + +#endif + +unsigned int +_gcry_hwf_detect_ppc (void) +{ + unsigned int ret = 0; + unsigned int broken_hwfs = 0; + +#if defined (HAS_SYS_AT_HWCAP) + ret |= detect_ppc_at_hwcap (); +#endif + + ret &= ~broken_hwfs; + + return ret; +} diff --git a/src/hwfeatures.c b/src/hwfeatures.c index e08166945..fe5137538 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -42,6 +42,7 @@ static struct const char *desc; } hwflist[] = { +#if defined(HAVE_CPU_ARCH_X86) { HWF_PADLOCK_RNG, "padlock-rng" }, { HWF_PADLOCK_AES, "padlock-aes" }, { HWF_PADLOCK_SHA, "padlock-sha" }, @@ -59,11 +60,15 @@ static struct { HWF_INTEL_FAST_VPGATHER, "intel-fast-vpgather" }, { HWF_INTEL_RDTSC, "intel-rdtsc" }, { HWF_INTEL_SHAEXT, "intel-shaext" }, +#elif defined(HAVE_CPU_ARCH_ARM) { HWF_ARM_NEON, "arm-neon" }, { HWF_ARM_AES, "arm-aes" }, { HWF_ARM_SHA1, "arm-sha1" }, { HWF_ARM_SHA2, "arm-sha2" }, - { HWF_ARM_PMULL, "arm-pmull" } + { HWF_ARM_PMULL, "arm-pmull" }, +#elif defined(HAVE_CPU_ARCH_PPC) + { HWF_PPC_VCRYPTO, "ppc-crypto" }, +#endif }; /* A bit vector with the hardware features which shall not be used. 
@@ -208,12 +213,14 @@ _gcry_detect_hw_features (void) { hw_features = _gcry_hwf_detect_x86 (); } -#endif /* HAVE_CPU_ARCH_X86 */ -#if defined (HAVE_CPU_ARCH_ARM) +#elif defined (HAVE_CPU_ARCH_ARM) { hw_features = _gcry_hwf_detect_arm (); } -#endif /* HAVE_CPU_ARCH_ARM */ - +#elif defined (HAVE_CPU_ARCH_PPC) + { + hw_features = _gcry_hwf_detect_ppc (); + } +#endif hw_features &= ~disabled_hw_features; } From jussi.kivilinna at iki.fi Fri Aug 23 18:52:10 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 23 Aug 2019 19:52:10 +0300 Subject: [PATCH 3/6] rijndael-ppc: add key setup and enable single block PowerPC AES In-Reply-To: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> References: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> Message-ID: <156657913032.2143.7127137457575511285.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'rijndael-ppc.c'. * cipher/rijndael-internal.h (USE_PPC_CRYPTO): New. (RIJNDAEL_context): Add 'use_ppc_crypto'. * cipher/rijndael-ppc.c (backwards, swap_if_le): Remove. (u128_t, ALWAYS_INLINE, NO_INLINE, NO_INSTRUMENT_FUNCTION) (ASM_FUNC_ATTR, ASM_FUNC_ATTR_INLINE, ASM_FUNC_ATTR_NOINLINE) (ALIGNED_LOAD, ALIGNED_STORE, VEC_LOAD_BE, VEC_STORE_BE) (vec_bswap32_const, vec_aligned_ld, vec_load_be_const) (vec_load_be, vec_aligned_st, vec_store_be, _gcry_aes_sbox4_ppc8) (_gcry_aes_ppc8_setkey, _gcry_aes_ppc8_prepare_decryption) (aes_ppc8_encrypt_altivec, aes_ppc8_decrypt_altivec): New. (_gcry_aes_ppc8_encrypt, _gcry_aes_ppc8_decrypt): Rewrite. (_gcry_aes_ppc8_ocb_crypt): Comment out. * cipher/rijndael.c [USE_PPC_CRYPTO] (_gcry_aes_ppc8_setkey) (_gcry_aes_ppc8_prepare_decryption, _gcry_aes_ppc8_encrypt) (_gcry_aes_ppc8_decrypt): New prototypes. (do_setkey) [USE_PPC_CRYPTO]: Add setup for PowerPC AES. (prepare_decryption) [USE_PPC_CRYPTO]: Ditto. * configure.ac: Add 'rijndael-ppc.lo'. (gcry_cv_ppc_altivec, gcry_cv_cc_ppc_altivec_cflags) (gcry_cv_gcc_inline_asm_ppc_altivec) (gcry_cv_gcc_inline_asm_ppc_arch_3_00): New checks. 
-- Benchmark on POWER8 ~3.8Ghz: Before: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 7.27 ns/B 131.2 MiB/s 27.61 c/B ECB dec | 7.70 ns/B 123.8 MiB/s 29.28 c/B CBC enc | 6.38 ns/B 149.5 MiB/s 24.24 c/B CBC dec | 6.17 ns/B 154.5 MiB/s 23.45 c/B CFB enc | 6.45 ns/B 147.9 MiB/s 24.51 c/B CFB dec | 6.20 ns/B 153.8 MiB/s 23.57 c/B OFB enc | 7.36 ns/B 129.6 MiB/s 27.96 c/B OFB dec | 7.36 ns/B 129.6 MiB/s 27.96 c/B CTR enc | 6.22 ns/B 153.2 MiB/s 23.65 c/B CTR dec | 6.22 ns/B 153.3 MiB/s 23.65 c/B XTS enc | 6.67 ns/B 142.9 MiB/s 25.36 c/B XTS dec | 6.70 ns/B 142.3 MiB/s 25.46 c/B CCM enc | 12.61 ns/B 75.60 MiB/s 47.93 c/B CCM dec | 12.62 ns/B 75.56 MiB/s 47.96 c/B CCM auth | 6.41 ns/B 148.8 MiB/s 24.36 c/B EAX enc | 12.62 ns/B 75.55 MiB/s 47.96 c/B EAX dec | 12.62 ns/B 75.55 MiB/s 47.97 c/B EAX auth | 6.39 ns/B 149.2 MiB/s 24.30 c/B GCM enc | 9.81 ns/B 97.24 MiB/s 37.27 c/B GCM dec | 9.81 ns/B 97.20 MiB/s 37.28 c/B GCM auth | 3.59 ns/B 265.8 MiB/s 13.63 c/B OCB enc | 6.39 ns/B 149.3 MiB/s 24.27 c/B OCB dec | 6.38 ns/B 149.5 MiB/s 24.25 c/B OCB auth | 6.35 ns/B 150.2 MiB/s 24.13 c/B After: ECB enc | 1.29 ns/B 737.7 MiB/s 4.91 c/B ECB dec | 1.34 ns/B 711.1 MiB/s 5.10 c/B CBC enc | 2.13 ns/B 448.5 MiB/s 8.08 c/B CBC dec | 1.05 ns/B 908.0 MiB/s 3.99 c/B CFB enc | 2.17 ns/B 439.9 MiB/s 8.24 c/B CFB dec | 2.22 ns/B 429.8 MiB/s 8.43 c/B OFB enc | 1.49 ns/B 640.1 MiB/s 5.66 c/B OFB dec | 1.49 ns/B 640.1 MiB/s 5.66 c/B CTR enc | 2.21 ns/B 432.5 MiB/s 8.38 c/B CTR dec | 2.20 ns/B 432.5 MiB/s 8.38 c/B XTS enc | 2.32 ns/B 410.6 MiB/s 8.83 c/B XTS dec | 2.33 ns/B 409.7 MiB/s 8.85 c/B CCM enc | 4.36 ns/B 218.7 MiB/s 16.57 c/B CCM dec | 4.36 ns/B 218.8 MiB/s 16.56 c/B CCM auth | 2.17 ns/B 440.4 MiB/s 8.23 c/B EAX enc | 4.37 ns/B 218.3 MiB/s 16.60 c/B EAX dec | 4.36 ns/B 218.7 MiB/s 16.57 c/B EAX auth | 2.16 ns/B 440.7 MiB/s 8.22 c/B GCM enc | 5.78 ns/B 165.0 MiB/s 21.96 c/B GCM dec | 5.78 ns/B 165.0 MiB/s 21.96 c/B GCM auth | 3.59 ns/B 265.9 MiB/s 13.63 c/B OCB enc | 2.33 ns/B 410.1 MiB/s 8.84 c/B OCB dec | 2.34 ns/B 407.2 MiB/s 8.90 c/B OCB auth | 2.32 ns/B 411.1 MiB/s 8.82 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 2aae82e27..1f2d8ec97 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -96,6 +96,7 @@ EXTRA_libcipher_la_SOURCES = \ rijndael-ssse3-amd64.c rijndael-ssse3-amd64-asm.S \ rijndael-armv8-ce.c rijndael-armv8-aarch32-ce.S \ rijndael-armv8-aarch64-ce.S rijndael-aarch64.S \ + rijndael-ppc.c \ rmd160.c \ rsa.c \ salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \ @@ -197,3 +198,15 @@ crc-intel-pclmul.o: $(srcdir)/crc-intel-pclmul.c Makefile crc-intel-pclmul.lo: $(srcdir)/crc-intel-pclmul.c Makefile `echo $(LTCOMPILE) -c $< | $(instrumentation_munging) ` + +if ENABLE_PPC_VCRYPTO_EXTRA_CFLAGS +ppc_vcrypto_cflags = -maltivec -mvsx -mcrypto +else +ppc_vcrypto_cflags = +endif + +rijndael-ppc.o: $(srcdir)/rijndael-ppc.c Makefile + `echo $(COMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` + +rijndael-ppc.lo: $(srcdir)/rijndael-ppc.c Makefile + `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h index 78b08e8f8..5150a69d7 100644 --- a/cipher/rijndael-internal.h +++ b/cipher/rijndael-internal.h @@ -75,7 +75,7 @@ # define USE_PADLOCK 1 # endif # endif -#endif /*ENABLE_PADLOCK_SUPPORT*/ +#endif /* ENABLE_PADLOCK_SUPPORT */ /* USE_AESNI inidicates whether to compile with Intel AES-NI code. 
We need the vector-size attribute which seems to be available since @@ -104,6 +104,18 @@ # endif #endif /* ENABLE_ARM_CRYPTO_SUPPORT */ +/* USE_PPC_CRYPTO indicates whether to enable PowerPC vector crypto + * accelerated code. */ +#undef USE_PPC_CRYPTO +#ifdef ENABLE_PPC_CRYPTO_SUPPORT +# if defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) +# if __GNUC__ >= 4 +# define USE_PPC_CRYPTO 1 +# endif +# endif +#endif /* ENABLE_PPC_CRYPTO_SUPPORT */ + struct RIJNDAEL_context_s; typedef unsigned int (*rijndael_cryptfn_t)(const struct RIJNDAEL_context_s *ctx, @@ -154,6 +166,9 @@ typedef struct RIJNDAEL_context_s #ifdef USE_ARM_CE unsigned int use_arm_ce:1; /* ARMv8 CE shall be used. */ #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + unsigned int use_ppc_crypto:1; /* PowerPC crypto shall be used. */ +#endif /*USE_PPC_CRYPTO*/ rijndael_cryptfn_t encrypt_fn; rijndael_cryptfn_t decrypt_fn; rijndael_prefetchfn_t prefetch_enc_fn; diff --git a/cipher/rijndael-ppc.c b/cipher/rijndael-ppc.c index 2e5dd2f89..a7c47a876 100644 --- a/cipher/rijndael-ppc.c +++ b/cipher/rijndael-ppc.c @@ -1,5 +1,6 @@ -/* Rijndael (AES) for GnuPG - PowerPC Vector Crypto AES +/* Rijndael (AES) for GnuPG - PowerPC Vector Crypto AES implementation * Copyright (C) 2019 Shawn Landden + * Copyright (C) 2019 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -24,138 +25,397 @@ #include -/* PPC AES extensions */ -#include #include "rijndael-internal.h" #include "cipher-internal.h" +#include "bufhelp.h" + +#ifdef USE_PPC_CRYPTO + +#include + typedef vector unsigned char block; -static const vector unsigned char backwards = - { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; - -#ifdef __LITTLE_ENDIAN__ -#define swap_if_le(a) \ - vec_perm(a, a, backwards) -#elif __BIG_ENDIAN__ -#define swap_if_le(a) (a) + +typedef union +{ + u32 data32[4]; +} __attribute__((packed, aligned(1), may_alias)) u128_t; + + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) +#define NO_INSTRUMENT_FUNCTION __attribute__((no_instrument_function)) + +#define ASM_FUNC_ATTR NO_INSTRUMENT_FUNCTION +#define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define ASM_FUNC_ATTR_NOINLINE ASM_FUNC_ATTR NO_INLINE + + +#define ALIGNED_LOAD(in_ptr) \ + (vec_aligned_ld (0, (const unsigned char *)(in_ptr))) + +#define ALIGNED_STORE(out_ptr, vec) \ + (vec_aligned_st ((vec), 0, (unsigned char *)(out_ptr))) + +#define VEC_LOAD_BE(in_ptr, bige_const) \ + (vec_load_be (0, (const unsigned char *)(in_ptr), bige_const)) + +#define VEC_STORE_BE(out_ptr, vec, bige_const) \ + (vec_store_be ((vec), 0, (unsigned char *)(out_ptr), bige_const)) + + +static const block vec_bswap32_const = + { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }; + + +static ASM_FUNC_ATTR_INLINE block +vec_aligned_ld(unsigned long offset, const unsigned char *ptr) +{ +#ifndef WORDS_BIGENDIAN + block vec; + __asm__ ("lvx %0,%1,%2\n\t" + : "=v" (vec) + : "r" (offset), "r" ((uintptr_t)ptr) + : "memory"); + return vec; #else -#error "What endianness?" + return vec_vsx_ld (offset, ptr); #endif +} -/* Passes in AltiVec registers (big-endian) - * sadly compilers don't know how to unroll outer loops into - * inner loops with more registers on static functions, - * so that this can be properly optimized for OOO multi-issue - * without having to hand-unroll. 
- */ -static block _gcry_aes_ppc8_encrypt_altivec (const RIJNDAEL_context *ctx, - block a) + +static ASM_FUNC_ATTR_INLINE block +vec_load_be_const(void) +{ +#ifndef WORDS_BIGENDIAN + return ~ALIGNED_LOAD(&vec_bswap32_const); +#else + static const block vec_dummy = { 0 }; + return vec_dummy; +#endif +} + + +static ASM_FUNC_ATTR_INLINE block +vec_load_be(unsigned long offset, const unsigned char *ptr, + block be_bswap_const) +{ +#ifndef WORDS_BIGENDIAN + block vec; + /* GCC vec_vsx_ld is generating two instructions on little-endian. Use + * lxvw4x directly instead. */ + __asm__ ("lxvw4x %x0,%1,%2\n\t" + : "=wa" (vec) + : "r" (offset), "r" ((uintptr_t)ptr) + : "memory"); + __asm__ ("vperm %0,%1,%1,%2\n\t" + : "=v" (vec) + : "v" (vec), "v" (be_bswap_const)); + return vec; +#else + (void)be_bswap_const; + return vec_vsx_ld (offset, ptr); +#endif +} + + +static ASM_FUNC_ATTR_INLINE void +vec_aligned_st(block vec, unsigned long offset, unsigned char *ptr) +{ +#ifndef WORDS_BIGENDIAN + __asm__ ("stvx %0,%1,%2\n\t" + : + : "v" (vec), "r" (offset), "r" ((uintptr_t)ptr) + : "memory"); +#else + vec_vsx_st (vec, offset, ptr); +#endif +} + + +static ASM_FUNC_ATTR_INLINE void +vec_store_be(block vec, unsigned long offset, unsigned char *ptr, + block be_bswap_const) +{ +#ifndef WORDS_BIGENDIAN + /* GCC vec_vsx_st is generating two instructions on little-endian. Use + * stxvw4x directly instead. */ + __asm__ ("vperm %0,%1,%1,%2\n\t" + : "=v" (vec) + : "v" (vec), "v" (be_bswap_const)); + __asm__ ("stxvw4x %x0,%1,%2\n\t" + : + : "wa" (vec), "r" (offset), "r" ((uintptr_t)ptr) + : "memory"); +#else + (void)be_bswap_const; + vec_vsx_st (vec, offset, ptr); +#endif +} + + +static ASM_FUNC_ATTR_INLINE u32 +_gcry_aes_sbox4_ppc8(u32 fourbytes) +{ + union + { + PROPERLY_ALIGNED_TYPE dummy; + block data_vec; + u32 data32[4]; + } u; + + u.data32[0] = fourbytes; + u.data_vec = vec_sbox_be(u.data_vec); + return u.data32[0]; +} + +void +_gcry_aes_ppc8_setkey (RIJNDAEL_context *ctx, const byte *key) +{ + const block bige_const = vec_load_be_const(); + union + { + PROPERLY_ALIGNED_TYPE dummy; + byte data[MAXKC][4]; + u32 data32[MAXKC]; + } tkk[2]; + unsigned int rounds = ctx->rounds; + int KC = rounds - 6; + unsigned int keylen = KC * 4; + u128_t *ekey = (u128_t *)(void *)ctx->keyschenc; + unsigned int i, r, t; + byte rcon = 1; + int j; +#define k tkk[0].data +#define k_u32 tkk[0].data32 +#define tk tkk[1].data +#define tk_u32 tkk[1].data32 +#define W (ctx->keyschenc) +#define W_u32 (ctx->keyschenc32) + + for (i = 0; i < keylen; i++) + { + k[i >> 2][i & 3] = key[i]; + } + + for (j = KC-1; j >= 0; j--) + { + tk_u32[j] = k_u32[j]; + } + r = 0; + t = 0; + /* Copy values into round key array. */ + for (j = 0; (j < KC) && (r < rounds + 1); ) + { + for (; (j < KC) && (t < 4); j++, t++) + { + W_u32[r][t] = le_bswap32(tk_u32[j]); + } + if (t == 4) + { + r++; + t = 0; + } + } + while (r < rounds + 1) + { + tk_u32[0] ^= + le_bswap32( + _gcry_aes_sbox4_ppc8(rol(le_bswap32(tk_u32[KC - 1]), 24)) ^ rcon); + + if (KC != 8) + { + for (j = 1; j < KC; j++) + { + tk_u32[j] ^= tk_u32[j-1]; + } + } + else + { + for (j = 1; j < KC/2; j++) + { + tk_u32[j] ^= tk_u32[j-1]; + } + + tk_u32[KC/2] ^= + le_bswap32(_gcry_aes_sbox4_ppc8(le_bswap32(tk_u32[KC/2 - 1]))); + + for (j = KC/2 + 1; j < KC; j++) + { + tk_u32[j] ^= tk_u32[j-1]; + } + } + + /* Copy values into round key array. 
*/ + for (j = 0; (j < KC) && (r < rounds + 1); ) + { + for (; (j < KC) && (t < 4); j++, t++) + { + W_u32[r][t] = le_bswap32(tk_u32[j]); + } + if (t == 4) + { + r++; + t = 0; + } + } + + rcon = (rcon << 1) ^ ((rcon >> 7) * 0x1b); + } + + /* Store in big-endian order. */ + for (r = 0; r <= rounds; r++) + { +#ifndef WORDS_BIGENDIAN + VEC_STORE_BE(&ekey[r], ALIGNED_LOAD(&ekey[r]), bige_const); +#else + block rvec = ALIGNED_LOAD(&ekey[r]); + ALIGNED_STORE(&ekey[r], + vec_perm(rvec, rvec, vec_bswap32_const)); + (void)bige_const; +#endif + } + +#undef W +#undef tk +#undef k +#undef W_u32 +#undef tk_u32 +#undef k_u32 + wipememory(&tkk, sizeof(tkk)); +} + + +/* Make a decryption key from an encryption key. */ +void +_gcry_aes_ppc8_prepare_decryption (RIJNDAEL_context *ctx) { + u128_t *ekey = (u128_t *)(void *)ctx->keyschenc; + u128_t *dkey = (u128_t *)(void *)ctx->keyschdec; + int rounds = ctx->rounds; + int rr; int r; + + r = 0; + rr = rounds; + for (r = 0, rr = rounds; r <= rounds; r++, rr--) + { + ALIGNED_STORE(&dkey[r], ALIGNED_LOAD(&ekey[rr])); + } +} + + +static ASM_FUNC_ATTR_INLINE block +aes_ppc8_encrypt_altivec (const RIJNDAEL_context *ctx, block a) +{ + u128_t *rk = (u128_t *)ctx->keyschenc; int rounds = ctx->rounds; - block *rk = (block*)ctx->keyschenc; + int r; - a = rk[0] ^ a; - for (r = 1;r < rounds;r++) +#define DO_ROUND(r) (a = vec_cipher_be (a, ALIGNED_LOAD (&rk[r]))) + + a = ALIGNED_LOAD(&rk[0]) ^ a; + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + r = 10; + if (rounds >= 12) { - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (a) - :"v" (rk[r]) - ); + DO_ROUND(10); + DO_ROUND(11); + r = 12; + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + r = 14; + } } - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (a) - :"v" (rk[r]) - ); + a = vec_cipherlast_be(a, ALIGNED_LOAD(&rk[r])); + +#undef DO_ROUND + return a; } -static block _gcry_aes_ppc8_decrypt_altivec (const RIJNDAEL_context *ctx, - block a) +static ASM_FUNC_ATTR_INLINE block +aes_ppc8_decrypt_altivec (const RIJNDAEL_context *ctx, block a) { - int r; + u128_t *rk = (u128_t *)ctx->keyschdec; int rounds = ctx->rounds; - block *rk = (block*)ctx->keyschdec; + int r; - a = rk[0] ^ a; - for (r = 1;r < rounds;r++) +#define DO_ROUND(r) (a = vec_ncipher_be (a, ALIGNED_LOAD (&rk[r]))) + + a = ALIGNED_LOAD(&rk[0]) ^ a; + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + r = 10; + if (rounds >= 12) { - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (a) - :"v" (rk[r]) - ); + DO_ROUND(10); + DO_ROUND(11); + r = 12; + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + r = 14; + } } - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (a) - :"v" (rk[r]) - ); + a = vec_ncipherlast_be(a, ALIGNED_LOAD(&rk[r])); + +#undef DO_ROUND + return a; } + unsigned int _gcry_aes_ppc8_encrypt (const RIJNDAEL_context *ctx, unsigned char *b, const unsigned char *a) { - uintptr_t zero = 0; + const block bige_const = vec_load_be_const(); block sa; - if ((uintptr_t)a % 16 == 0) - { - sa = vec_ld (0, a); - } - else - { - block unalignedprev, unalignedcur; - unalignedprev = vec_ld (0, a); - unalignedcur = vec_ld (16, a); - sa = vec_perm (unalignedprev, unalignedcur, vec_lvsl(0, a)); - } - - sa = swap_if_le(sa); - sa = _gcry_aes_ppc8_encrypt_altivec(ctx, sa); - - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - : - : "wa" (sa), "r" (zero), "r" ((uintptr_t)b)); + sa = 
VEC_LOAD_BE (a, bige_const); + sa = aes_ppc8_encrypt_altivec (ctx, sa); + VEC_STORE_BE (b, sa, bige_const); return 0; /* does not use stack */ } + unsigned int _gcry_aes_ppc8_decrypt (const RIJNDAEL_context *ctx, unsigned char *b, const unsigned char *a) { - uintptr_t zero = 0; - block sa, unalignedprev, unalignedcur; - - if ((uintptr_t)a % 16 == 0) - { - sa = vec_ld(0, a); - } - else - { - unalignedprev = vec_ld (0, a); - unalignedcur = vec_ld (16, a); - sa = vec_perm (unalignedprev, unalignedcur, vec_lvsl(0, a)); - } + const block bige_const = vec_load_be_const(); + block sa; - sa = swap_if_le (sa); - sa = _gcry_aes_ppc8_decrypt_altivec (ctx, sa); + sa = VEC_LOAD_BE (a, bige_const); + sa = aes_ppc8_decrypt_altivec (ctx, sa); + VEC_STORE_BE (b, sa, bige_const); - if ((uintptr_t)b % 16 == 0) - { - vec_vsx_st(swap_if_le(sa), 0, b); - } - else - { - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - : - : "wa" (sa), "r" (zero), "r" ((uintptr_t)b)); - } return 0; /* does not use stack */ } + +#if 0 size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, size_t nblocks, int encrypt) @@ -673,4 +933,6 @@ size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, } return 0; } +#endif +#endif /* USE_PPC_CRYPTO */ diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 2c9aa6733..8a27dfe0b 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -199,6 +199,19 @@ extern void _gcry_aes_armv8_ce_xts_crypt (void *context, unsigned char *tweak, size_t nblocks, int encrypt); #endif /*USE_ARM_ASM*/ +#ifdef USE_PPC_CRYPTO +/* PowerPC Crypto implementations of AES */ +extern void _gcry_aes_ppc8_setkey(RIJNDAEL_context *ctx, const byte *key); +extern void _gcry_aes_ppc8_prepare_decryption(RIJNDAEL_context *ctx); + +extern unsigned int _gcry_aes_ppc8_encrypt(const RIJNDAEL_context *ctx, + unsigned char *dst, + const unsigned char *src); +extern unsigned int _gcry_aes_ppc8_decrypt(const RIJNDAEL_context *ctx, + unsigned char *dst, + const unsigned char *src); +#endif /*USE_PPC_CRYPTO*/ + static unsigned int do_encrypt (const RIJNDAEL_context *ctx, unsigned char *bx, const unsigned char *ax); static unsigned int do_decrypt (const RIJNDAEL_context *ctx, unsigned char *bx, @@ -280,7 +293,7 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, int i,j, r, t, rconpointer = 0; int KC; #if defined(USE_AESNI) || defined(USE_PADLOCK) || defined(USE_SSSE3) \ - || defined(USE_ARM_CE) + || defined(USE_ARM_CE) || defined(USE_PPC_CRYPTO) unsigned int hwfeatures; #endif @@ -324,7 +337,7 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, ctx->rounds = rounds; #if defined(USE_AESNI) || defined(USE_PADLOCK) || defined(USE_SSSE3) \ - || defined(USE_ARM_CE) + || defined(USE_ARM_CE) || defined(USE_PPC_CRYPTO) hwfeatures = _gcry_get_hw_features (); #endif @@ -341,6 +354,9 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, #ifdef USE_ARM_CE ctx->use_arm_ce = 0; #endif +#ifdef USE_PPC_CRYPTO + ctx->use_ppc_crypto = 0; +#endif if (0) { @@ -420,6 +436,19 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, hd->bulk.xts_crypt = _gcry_aes_armv8_ce_xts_crypt; } } +#endif +#ifdef USE_PPC_CRYPTO + else if (hwfeatures & HWF_PPC_VCRYPTO) + { + ctx->encrypt_fn = _gcry_aes_ppc8_encrypt; + ctx->decrypt_fn = _gcry_aes_ppc8_decrypt; + ctx->prefetch_enc_fn = NULL; + ctx->prefetch_dec_fn = NULL; + ctx->use_ppc_crypto = 1; + if (hd) + { + } + } #endif else { @@ -446,6 +475,10 @@ do_setkey 
(RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, #ifdef USE_ARM_CE else if (ctx->use_arm_ce) _gcry_aes_armv8_ce_setkey (ctx, key); +#endif +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + _gcry_aes_ppc8_setkey (ctx, key); #endif else { @@ -584,7 +617,19 @@ prepare_decryption( RIJNDAEL_context *ctx ) { _gcry_aes_armv8_ce_prepare_decryption (ctx); } -#endif /*USE_SSSE3*/ +#endif /*USE_ARM_CE*/ +#ifdef USE_ARM_CE + else if (ctx->use_arm_ce) + { + _gcry_aes_armv8_ce_prepare_decryption (ctx); + } +#endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + _gcry_aes_ppc8_prepare_decryption (ctx); + } +#endif #ifdef USE_PADLOCK else if (ctx->use_padlock) { diff --git a/configure.ac b/configure.ac index 6980f381a..586145aa4 100644 --- a/configure.ac +++ b/configure.ac @@ -1655,6 +1655,7 @@ if test "$gcry_cv_gcc_platform_as_ok_for_intel_syntax" = "yes" ; then [Defined if underlying assembler is compatible with Intel syntax assembly implementations]) fi + # # Check whether compiler is configured for ARMv6 or newer architecture # @@ -1831,6 +1832,112 @@ if test "$gcry_cv_gcc_inline_asm_aarch64_crypto" = "yes" ; then fi +# +# Check whether PowerPC AltiVec/VSX intrinsics +# +AC_CACHE_CHECK([whether compiler supports PowerPC AltiVec/VSX intrinsics], + [gcry_cv_cc_ppc_altivec], + [if test "$mpi_cpu_arch" != "ppc" ; then + gcry_cv_cc_ppc_altivec="n/a" + else + gcry_cv_cc_ppc_altivec=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[#include + typedef vector unsigned char block; + block fn(block in) + { + block t = vec_perm (in, in, vec_vsx_ld (0, (unsigned char*)0)); + return vec_cipher_be (t, in); + } + ]])], + [gcry_cv_cc_ppc_altivec=yes]) + fi]) +if test "$gcry_cv_cc_ppc_altivec" = "yes" ; then + AC_DEFINE(HAVE_COMPATIBLE_CC_PPC_ALTIVEC,1, + [Defined if underlying compiler supports PowerPC AltiVec/VSX/crypto intrinsics]) +fi + +_gcc_cflags_save=$CFLAGS +CFLAGS="$CFLAGS -maltivec -mvsx -mcrypto" + +if test "$gcry_cv_cc_ppc_altivec" = "no" && + test "$mpi_cpu_arch" = "ppc" ; then + AC_CACHE_CHECK([whether compiler supports PowerPC AltiVec/VSX/crypto intrinsics with extra GCC flags], + [gcry_cv_cc_ppc_altivec_cflags], + [gcry_cv_cc_ppc_altivec_cflags=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[#include + typedef vector unsigned char block; + block fn(block in) + { + block t = vec_perm (in, in, vec_vsx_ld (0, (unsigned char*)0)); + return vec_cipher_be (t, in); + }]])], + [gcry_cv_cc_ppc_altivec_cflags=yes])]) + if test "$gcry_cv_cc_ppc_altivec_cflags" = "yes" ; then + AC_DEFINE(HAVE_COMPATIBLE_CC_PPC_ALTIVEC,1, + [Defined if underlying compiler supports PowerPC AltiVec/VSX/crypto intrinsics]) + AC_DEFINE(HAVE_COMPATIBLE_CC_PPC_ALTIVEC_WITH_CFLAGS,1, + [Defined if underlying compiler supports PowerPC AltiVec/VSX/crypto intrinsics with extra GCC flags]) + fi +fi + +AM_CONDITIONAL(ENABLE_PPC_VCRYPTO_EXTRA_CFLAGS, + test "$gcry_cv_cc_ppc_altivec_cflags" = "yes") + +# Restore flags. 
+CFLAGS=$_gcc_cflags_save; + + +# +# Check whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto instructions +# +AC_CACHE_CHECK([whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto instructions], + [gcry_cv_gcc_inline_asm_ppc_altivec], + [if test "$mpi_cpu_arch" != "ppc" ; then + gcry_cv_gcc_inline_asm_ppc_altivec="n/a" + else + gcry_cv_gcc_inline_asm_ppc_altivec=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[__asm__(".globl testfn;\n" + "testfn:\n" + "stvx %v31,%r12,%r0;\n" + "lvx %v20,%r12,%r0;\n" + "vcipher %v0, %v1, %v22;\n" + "lxvw4x %vs32, %r0, %r1;\n" + ); + ]])], + [gcry_cv_gcc_inline_asm_ppc_altivec=yes]) + fi]) +if test "$gcry_cv_gcc_inline_asm_ppc_altivec" = "yes" ; then + AC_DEFINE(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC,1, + [Defined if inline assembler supports PowerPC AltiVec/VSX/crypto instructions]) +fi + + +# +# Check whether GCC inline assembler supports PowerISA 3.00 instructions +# +AC_CACHE_CHECK([whether GCC inline assembler supports PowerISA 3.00 instructions], + [gcry_cv_gcc_inline_asm_ppc_arch_3_00], + [if test "$mpi_cpu_arch" != "ppc" ; then + gcry_cv_gcc_inline_asm_ppc_arch_3_00="n/a" + else + gcry_cv_gcc_inline_asm_ppc_arch_3_00=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[__asm__(".globl testfn;\n" + "testfn:\n" + "stxvb16x %r1,%v12,%v30;\n" + ); + ]])], + [gcry_cv_gcc_inline_asm_ppc_arch_3_00=yes]) + fi]) +if test "$gcry_cv_gcc_inline_asm_ppc_arch_3_00" = "yes" ; then + AC_DEFINE(HAVE_GCC_INLINE_ASM_PPC_ARCH_3_00,1, + [Defined if inline assembler supports PowerISA 3.00 instructions]) +fi + + ####################################### #### Checks for library functions. #### ####################################### @@ -2229,6 +2336,20 @@ if test "$found" = "1" ; then GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-ce.lo" GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-armv8-aarch64-ce.lo" ;; + powerpc64le-*-*) + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-ppc.lo" + ;; + powerpc64-*-*) + # Big-Endian. + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-ppc.lo" + ;; + powerpc-*-*) + # Big-Endian. + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS rijndael-ppc.lo" + ;; esac case "$mpi_cpu_arch" in From jussi.kivilinna at iki.fi Fri Aug 23 18:52:05 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 23 Aug 2019 19:52:05 +0300 Subject: [PATCH 2/6] rijndael/ppc: implement single-block mode, and implement OCB block cipher In-Reply-To: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> References: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> Message-ID: <156657912516.2143.58570588389353535.stgit@localhost.localdomain> From: Shawn Landden * cipher/rijndael-ppc.c: New implementation of single-block mode, and implementation of OCB mode. -- [jk: split rijndael-ppc.c from patch 'rijndael/ppc: reimplement single-block mode, and implement OCB block cipher' for basis of new PowerPC vector crypto implementation of AES: https://lists.gnupg.org/pipermail/gcrypt-devel/2019-July/004765.html] [jk: coding-style fixes] Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/rijndael-ppc.c b/cipher/rijndael-ppc.c new file mode 100644 index 000000000..2e5dd2f89 --- /dev/null +++ b/cipher/rijndael-ppc.c @@ -0,0 +1,676 @@ +/* Rijndael (AES) for GnuPG - PowerPC Vector Crypto AES + * Copyright (C) 2019 Shawn Landden + * + * This file is part of Libgcrypt. 
+ * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + * + * Alternatively, this code may be used in OpenSSL from The OpenSSL Project, + * and Cryptogams by Andy Polyakov, and if made part of a release of either + * or both projects, is thereafter dual-licensed under the license said project + * is released under. + */ + +#include + +/* PPC AES extensions */ +#include +#include "rijndael-internal.h" +#include "cipher-internal.h" + +typedef vector unsigned char block; +static const vector unsigned char backwards = + { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; + +#ifdef __LITTLE_ENDIAN__ +#define swap_if_le(a) \ + vec_perm(a, a, backwards) +#elif __BIG_ENDIAN__ +#define swap_if_le(a) (a) +#else +#error "What endianness?" +#endif + +/* Passes in AltiVec registers (big-endian) + * sadly compilers don't know how to unroll outer loops into + * inner loops with more registers on static functions, + * so that this can be properly optimized for OOO multi-issue + * without having to hand-unroll. + */ +static block _gcry_aes_ppc8_encrypt_altivec (const RIJNDAEL_context *ctx, + block a) +{ + int r; + int rounds = ctx->rounds; + block *rk = (block*)ctx->keyschenc; + + a = rk[0] ^ a; + for (r = 1;r < rounds;r++) + { + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (a) + :"v" (rk[r]) + ); + } + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (a) + :"v" (rk[r]) + ); + return a; +} + + +static block _gcry_aes_ppc8_decrypt_altivec (const RIJNDAEL_context *ctx, + block a) +{ + int r; + int rounds = ctx->rounds; + block *rk = (block*)ctx->keyschdec; + + a = rk[0] ^ a; + for (r = 1;r < rounds;r++) + { + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (a) + :"v" (rk[r]) + ); + } + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (a) + :"v" (rk[r]) + ); + return a; +} + +unsigned int _gcry_aes_ppc8_encrypt (const RIJNDAEL_context *ctx, + unsigned char *b, + const unsigned char *a) +{ + uintptr_t zero = 0; + block sa; + + if ((uintptr_t)a % 16 == 0) + { + sa = vec_ld (0, a); + } + else + { + block unalignedprev, unalignedcur; + unalignedprev = vec_ld (0, a); + unalignedcur = vec_ld (16, a); + sa = vec_perm (unalignedprev, unalignedcur, vec_lvsl(0, a)); + } + + sa = swap_if_le(sa); + sa = _gcry_aes_ppc8_encrypt_altivec(ctx, sa); + + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + : + : "wa" (sa), "r" (zero), "r" ((uintptr_t)b)); + + return 0; /* does not use stack */ +} + +unsigned int _gcry_aes_ppc8_decrypt (const RIJNDAEL_context *ctx, + unsigned char *b, + const unsigned char *a) +{ + uintptr_t zero = 0; + block sa, unalignedprev, unalignedcur; + + if ((uintptr_t)a % 16 == 0) + { + sa = vec_ld(0, a); + } + else + { + unalignedprev = vec_ld (0, a); + unalignedcur = vec_ld (16, a); + sa = vec_perm (unalignedprev, unalignedcur, vec_lvsl(0, a)); + } + + sa = swap_if_le (sa); + sa = _gcry_aes_ppc8_decrypt_altivec (ctx, sa); + + if ((uintptr_t)b % 16 == 0) + { + vec_vsx_st(swap_if_le(sa), 0, b); + } 
+ else + { + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + : + : "wa" (sa), "r" (zero), "r" ((uintptr_t)b)); + } + return 0; /* does not use stack */ +} + +size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks, + int encrypt) +{ + RIJNDAEL_context *ctx = (void *)&c->context.c; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + block *in = (block*)inbuf; + block *out = (block*)outbuf; + uintptr_t zero = 0; + int r; + int rounds = ctx->rounds; + + if (encrypt) + { + const int unroll = 8; + block unalignedprev, ctr, iv; + + if (((uintptr_t)inbuf % 16) != 0) + { + unalignedprev = vec_ld(0, in++); + } + + iv = vec_ld (0, (block*)&c->u_iv.iv); + ctr = vec_ld (0, (block*)&c->u_ctr.ctr); + + for ( ;nblocks >= unroll; nblocks -= unroll) + { + u64 i = c->u_mode.ocb.data_nblocks + 1; + block l0, l1, l2, l3, l4, l5, l6, l7; + block b0, b1, b2, b3, b4, b5, b6, b7; + block iv0, iv1, iv2, iv3, iv4, iv5, iv6, iv7; + const block *rk = (block*)&ctx->keyschenc; + + c->u_mode.ocb.data_nblocks += unroll; + + iv0 = iv; + if ((uintptr_t)inbuf % 16 == 0) + { + b0 = vec_ld (0, in++); + b1 = vec_ld (0, in++); + b2 = vec_ld (0, in++); + b3 = vec_ld (0, in++); + b4 = vec_ld (0, in++); + b5 = vec_ld (0, in++); + b6 = vec_ld (0, in++); + b7 = vec_ld (0, in++); + } + else + { + block unaligned0, unaligned1, unaligned2, + unaligned3, unaligned4, unaligned5, unaligned6; + unaligned0 = vec_ld (0, in++); + unaligned1 = vec_ld (0, in++); + unaligned2 = vec_ld (0, in++); + unaligned3 = vec_ld (0, in++); + unaligned4 = vec_ld (0, in++); + unaligned5 = vec_ld (0, in++); + unaligned6 = vec_ld (0, in++); + b0 = vec_perm (unalignedprev, unaligned0, vec_lvsl (0, inbuf)); + unalignedprev = vec_ld (0, in++); + b1 = vec_perm(unaligned0, unaligned1, vec_lvsl (0, inbuf)); + b2 = vec_perm(unaligned1, unaligned2, vec_lvsl (0, inbuf)); + b3 = vec_perm(unaligned2, unaligned3, vec_lvsl (0, inbuf)); + b4 = vec_perm(unaligned3, unaligned4, vec_lvsl (0, inbuf)); + b5 = vec_perm(unaligned4, unaligned5, vec_lvsl (0, inbuf)); + b6 = vec_perm(unaligned5, unaligned6, vec_lvsl (0, inbuf)); + b7 = vec_perm(unaligned6, unalignedprev, vec_lvsl (0, inbuf)); + } + + l0 = *(block*)ocb_get_l (c, i++); + l1 = *(block*)ocb_get_l (c, i++); + l2 = *(block*)ocb_get_l (c, i++); + l3 = *(block*)ocb_get_l (c, i++); + l4 = *(block*)ocb_get_l (c, i++); + l5 = *(block*)ocb_get_l (c, i++); + l6 = *(block*)ocb_get_l (c, i++); + l7 = *(block*)ocb_get_l (c, i++); + + ctr ^= b0 ^ b1 ^ b2 ^ b3 ^ b4 ^ b5 ^ b6 ^ b7; + + iv0 ^= l0; + b0 ^= iv0; + iv1 = iv0 ^ l1; + b1 ^= iv1; + iv2 = iv1 ^ l2; + b2 ^= iv2; + iv3 = iv2 ^ l3; + b3 ^= iv3; + iv4 = iv3 ^ l4; + b4 ^= iv4; + iv5 = iv4 ^ l5; + b5 ^= iv5; + iv6 = iv5 ^ l6; + b6 ^= iv6; + iv7 = iv6 ^ l7; + b7 ^= iv7; + + b0 = swap_if_le (b0); + b1 = swap_if_le (b1); + b2 = swap_if_le (b2); + b3 = swap_if_le (b3); + b4 = swap_if_le (b4); + b5 = swap_if_le (b5); + b6 = swap_if_le (b6); + b7 = swap_if_le (b7); + + b0 ^= rk[0]; + b1 ^= rk[0]; + b2 ^= rk[0]; + b3 ^= rk[0]; + b4 ^= rk[0]; + b5 ^= rk[0]; + b6 ^= rk[0]; + b7 ^= rk[0]; + + for (r = 1;r < rounds;r++) + { + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (b0) + :"v" (rk[r])); + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (b1) + :"v" (rk[r])); + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (b2) + :"v" (rk[r])); + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (b3) + :"v" (rk[r])); + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (b4) + :"v" (rk[r])); + __asm__ volatile 
("vcipher %0, %0, %1\n\t" + :"+v" (b5) + :"v" (rk[r])); + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (b6) + :"v" (rk[r])); + __asm__ volatile ("vcipher %0, %0, %1\n\t" + :"+v" (b7) + :"v" (rk[r])); + } + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b0) + :"v" (rk[r])); + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b1) + :"v" (rk[r])); + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b2) + :"v" (rk[r])); + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b3) + :"v" (rk[r])); + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b4) + :"v" (rk[r])); + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b5) + :"v" (rk[r])); + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b6) + :"v" (rk[r])); + __asm__ volatile ("vcipherlast %0, %0, %1\n\t" + :"+v" (b7) + :"v" (rk[r])); + + iv = iv7; + + /* The unaligned store stxvb16x writes big-endian, + so in the unaligned case we swap the iv instead of the bytes */ + if ((uintptr_t)outbuf % 16 == 0) + { + vec_vsx_st (swap_if_le (b0) ^ iv0, 0, out++); + vec_vsx_st (swap_if_le (b1) ^ iv1, 0, out++); + vec_vsx_st (swap_if_le (b2) ^ iv2, 0, out++); + vec_vsx_st (swap_if_le (b3) ^ iv3, 0, out++); + vec_vsx_st (swap_if_le (b4) ^ iv4, 0, out++); + vec_vsx_st (swap_if_le (b5) ^ iv5, 0, out++); + vec_vsx_st (swap_if_le (b6) ^ iv6, 0, out++); + vec_vsx_st (swap_if_le (b7) ^ iv7, 0, out++); + } + else + { + b0 ^= swap_if_le (iv0); + b1 ^= swap_if_le (iv1); + b2 ^= swap_if_le (iv2); + b3 ^= swap_if_le (iv3); + b4 ^= swap_if_le (iv4); + b5 ^= swap_if_le (iv5); + b6 ^= swap_if_le (iv6); + b7 ^= swap_if_le (iv7); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b0), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b1), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b2), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b3), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b4), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b5), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b6), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b7), "r" (zero), "r" ((uintptr_t)(out++))); + } + } + + for ( ;nblocks; nblocks-- ) + { + block b; + u64 i = ++c->u_mode.ocb.data_nblocks; + const block l = *(block*)ocb_get_l (c, i); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + iv ^= l; + if ((uintptr_t)in % 16 == 0) + { + b = vec_ld (0, in++); + } + else + { + block unalignedprevprev; + unalignedprevprev = unalignedprev; + unalignedprev = vec_ld (0, in++); + b = vec_perm (unalignedprevprev, unalignedprev, vec_lvsl (0, inbuf)); + } + + /* Checksum_i = Checksum_{i-1} xor P_i */ + ctr ^= b; + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + b ^= iv; + b = swap_if_le (b); + b = _gcry_aes_ppc8_encrypt_altivec (ctx, b); + if ((uintptr_t)out % 16 == 0) + { + vec_vsx_st (swap_if_le (b) ^ iv, 0, out++); + } + else + { + b ^= swap_if_le (iv); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + : + : "wa" (b), "r" (zero), "r" ((uintptr_t)out++)); + } + } + + /* We want to store iv and ctr big-endian and the unaligned + store stxvb16x stores them little endian, so we have to swap them. 
*/ + iv = swap_if_le (iv); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (iv), "r" (zero), "r" ((uintptr_t)&c->u_iv.iv)); + ctr = swap_if_le (ctr); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (ctr), "r" (zero), "r" ((uintptr_t)&c->u_ctr.ctr)); + } + else + { + const int unroll = 8; + block unalignedprev, ctr, iv; + if (((uintptr_t)inbuf % 16) != 0) + { + unalignedprev = vec_ld (0, in++); + } + + iv = vec_ld (0, (block*)&c->u_iv.iv); + ctr = vec_ld (0, (block*)&c->u_ctr.ctr); + + for ( ;nblocks >= unroll; nblocks -= unroll) + { + u64 i = c->u_mode.ocb.data_nblocks + 1; + block l0, l1, l2, l3, l4, l5, l6, l7; + block b0, b1, b2, b3, b4, b5, b6, b7; + block iv0, iv1, iv2, iv3, iv4, iv5, iv6, iv7; + const block *rk = (block*)&ctx->keyschdec; + + c->u_mode.ocb.data_nblocks += unroll; + + iv0 = iv; + if ((uintptr_t)inbuf % 16 == 0) + { + b0 = vec_ld (0, in++); + b1 = vec_ld (0, in++); + b2 = vec_ld (0, in++); + b3 = vec_ld (0, in++); + b4 = vec_ld (0, in++); + b5 = vec_ld (0, in++); + b6 = vec_ld (0, in++); + b7 = vec_ld (0, in++); + } + else + { + block unaligned0, unaligned1, unaligned2, + unaligned3, unaligned4, unaligned5, unaligned6; + unaligned0 = vec_ld (0, in++); + unaligned1 = vec_ld (0, in++); + unaligned2 = vec_ld (0, in++); + unaligned3 = vec_ld (0, in++); + unaligned4 = vec_ld (0, in++); + unaligned5 = vec_ld (0, in++); + unaligned6 = vec_ld (0, in++); + b0 = vec_perm (unalignedprev, unaligned0, vec_lvsl (0, inbuf)); + unalignedprev = vec_ld (0, in++); + b1 = vec_perm (unaligned0, unaligned1, vec_lvsl (0, inbuf)); + b2 = vec_perm (unaligned1, unaligned2, vec_lvsl (0, inbuf)); + b3 = vec_perm (unaligned2, unaligned3, vec_lvsl (0, inbuf)); + b4 = vec_perm (unaligned3, unaligned4, vec_lvsl (0, inbuf)); + b5 = vec_perm (unaligned4, unaligned5, vec_lvsl (0, inbuf)); + b6 = vec_perm (unaligned5, unaligned6, vec_lvsl (0, inbuf)); + b7 = vec_perm (unaligned6, unalignedprev, vec_lvsl (0, inbuf)); + } + + l0 = *(block*)ocb_get_l (c, i++); + l1 = *(block*)ocb_get_l (c, i++); + l2 = *(block*)ocb_get_l (c, i++); + l3 = *(block*)ocb_get_l (c, i++); + l4 = *(block*)ocb_get_l (c, i++); + l5 = *(block*)ocb_get_l (c, i++); + l6 = *(block*)ocb_get_l (c, i++); + l7 = *(block*)ocb_get_l (c, i++); + + iv0 ^= l0; + b0 ^= iv0; + iv1 = iv0 ^ l1; + b1 ^= iv1; + iv2 = iv1 ^ l2; + b2 ^= iv2; + iv3 = iv2 ^ l3; + b3 ^= iv3; + iv4 = iv3 ^ l4; + b4 ^= iv4; + iv5 = iv4 ^ l5; + b5 ^= iv5; + iv6 = iv5 ^ l6; + b6 ^= iv6; + iv7 = iv6 ^ l7; + b7 ^= iv7; + + b0 = swap_if_le (b0); + b1 = swap_if_le (b1); + b2 = swap_if_le (b2); + b3 = swap_if_le (b3); + b4 = swap_if_le (b4); + b5 = swap_if_le (b5); + b6 = swap_if_le (b6); + b7 = swap_if_le (b7); + + b0 ^= rk[0]; + b1 ^= rk[0]; + b2 ^= rk[0]; + b3 ^= rk[0]; + b4 ^= rk[0]; + b5 ^= rk[0]; + b6 ^= rk[0]; + b7 ^= rk[0]; + + for (r = 1;r < rounds;r++) + { + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b0) + :"v" (rk[r])); + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b1) + :"v" (rk[r])); + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b2) + :"v" (rk[r])); + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b3) + :"v" (rk[r])); + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b4) + :"v" (rk[r])); + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b5) + :"v" (rk[r])); + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b6) + :"v" (rk[r])); + __asm__ volatile ("vncipher %0, %0, %1\n\t" + :"+v" (b7) + :"v" (rk[r])); + } + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (b0) + :"v" (rk[r])); + __asm__ volatile 
("vncipherlast %0, %0, %1\n\t" + :"+v" (b1) + :"v" (rk[r])); + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (b2) + :"v" (rk[r])); + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (b3) + :"v" (rk[r])); + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (b4) + :"v" (rk[r])); + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (b5) + :"v" (rk[r])); + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (b6) + :"v" (rk[r])); + __asm__ volatile ("vncipherlast %0, %0, %1\n\t" + :"+v" (b7) + :"v" (rk[r])); + + iv = iv7; + + b0 = swap_if_le (b0) ^ iv0; + b1 = swap_if_le (b1) ^ iv1; + b2 = swap_if_le (b2) ^ iv2; + b3 = swap_if_le (b3) ^ iv3; + b4 = swap_if_le (b4) ^ iv4; + b5 = swap_if_le (b5) ^ iv5; + b6 = swap_if_le (b6) ^ iv6; + b7 = swap_if_le (b7) ^ iv7; + + ctr ^= b0 ^ b1 ^ b2 ^ b3 ^ b4 ^ b5 ^ b6 ^ b7; + + /* The unaligned store stxvb16x writes big-endian */ + if ((uintptr_t)outbuf % 16 == 0) + { + vec_vsx_st (b0, 0, out++); + vec_vsx_st (b1, 0, out++); + vec_vsx_st (b2, 0, out++); + vec_vsx_st (b3, 0, out++); + vec_vsx_st (b4, 0, out++); + vec_vsx_st (b5, 0, out++); + vec_vsx_st (b6, 0, out++); + vec_vsx_st (b7, 0, out++); + } + else + { + b0 = swap_if_le (b0); + b1 = swap_if_le (b1); + b2 = swap_if_le (b2); + b3 = swap_if_le (b3); + b4 = swap_if_le (b4); + b5 = swap_if_le (b5); + b6 = swap_if_le (b6); + b7 = swap_if_le (b7); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b0), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b1), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b2), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b3), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b4), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b5), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b6), "r" (zero), "r" ((uintptr_t)(out++))); + __asm__ ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (b7), "r" (zero), "r" ((uintptr_t)(out++))); + } + } + + for ( ;nblocks; nblocks-- ) + { + block b; + u64 i = ++c->u_mode.ocb.data_nblocks; + const block l = *(block*)ocb_get_l (c, i); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + iv ^= l; + if ((uintptr_t)in % 16 == 0) + { + b = vec_ld (0, in++); + } + else + { + block unalignedprevprev; + unalignedprevprev = unalignedprev; + unalignedprev = vec_ld (0, in++); + b = vec_perm (unalignedprevprev, unalignedprev, vec_lvsl (0, inbuf)); + } + + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + b ^= iv; + b = swap_if_le (b); + b = _gcry_aes_ppc8_decrypt_altivec (ctx, b); + b = swap_if_le (b) ^ iv; + ctr ^= b; + if ((uintptr_t)out % 16 == 0) + { + vec_vsx_st (b, 0, out++); + } + else + { + b = swap_if_le (b); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + : + : "wa" (b), "r" (zero), "r" ((uintptr_t)out++)); + } + } + + /* We want to store iv and ctr big-endian and the unaligned + store stxvb16x stores them little endian, so we have to swap them. 
*/ + iv = swap_if_le (iv); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (iv), "r" (zero), "r" ((uintptr_t)&c->u_iv.iv)); + ctr = swap_if_le(ctr); + __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" + :: "wa" (ctr), "r" (zero), "r" ((uintptr_t)&c->u_ctr.ctr)); + } + return 0; +} + From jussi.kivilinna at iki.fi Fri Aug 23 18:52:15 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 23 Aug 2019 19:52:15 +0300 Subject: [PATCH 4/6] rijndael-ppc: enable PowerPC AES-OCB implemention In-Reply-To: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> References: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> Message-ID: <156657913549.2143.1806037346649065696.stgit@localhost.localdomain> * cipher/rijndael-ppc.c (ROUND_KEY_VARIABLES, PRELOAD_ROUND_KEYS) (AES_ENCRYPT, AES_DECRYPT): New. (_gcry_aes_ppc8_prepare_decryption): Rename to... (aes_ppc8_prepare_decryption): ... this. (_gcry_aes_ppc8_prepare_decryption): New. (aes_ppc8_encrypt_altivec, aes_ppc8_decrypt_altivec): Remove. (_gcry_aes_ppc8_encrypt): Use AES_ENCRYPT macro. (_gcry_aes_ppc8_decrypt): Use AES_DECRYPT macro. (_gcry_aes_ppc8_ocb_crypt): Uncomment; Optimizations for OCB offset calculations, etc; Use new load/store and encryption/decryption macros. * cipher/rijndaelc [USE_PPC_CRYPTO] (_gcry_aes_ppc8_ocb_crypt): New prototype. (do_setkey, _gcry_aes_ocb_crypt) [USE_PPC_CRYPTO]: Add PowerPC AES OCB encryption/decryption. -- Benchmark on POWER8 ~3.8Ghz: Before: AES | nanosecs/byte mebibytes/sec cycles/byte OCB enc | 2.33 ns/B 410.1 MiB/s 8.84 c/B OCB dec | 2.34 ns/B 407.2 MiB/s 8.90 c/B OCB auth | 2.32 ns/B 411.1 MiB/s 8.82 c/B After: OCB enc | 0.250 ns/B 3818 MiB/s 0.949 c/B OCB dec | 0.250 ns/B 3820 MiB/s 0.949 c/B OCB auth | 2.31 ns/B 412.5 MiB/s 8.79 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/rijndael-ppc.c b/cipher/rijndael-ppc.c index a7c47a876..01ff6f503 100644 --- a/cipher/rijndael-ppc.c +++ b/cipher/rijndael-ppc.c @@ -64,6 +64,82 @@ typedef union (vec_store_be ((vec), 0, (unsigned char *)(out_ptr), bige_const)) +#define ROUND_KEY_VARIABLES \ + block rkey0, rkeylast + +#define PRELOAD_ROUND_KEYS(rk_ptr, nrounds) \ + do { \ + rkey0 = ALIGNED_LOAD(&rk_ptr[0]); \ + if (nrounds >= 12) \ + { \ + if (rounds > 12) \ + { \ + rkeylast = ALIGNED_LOAD(&rk_ptr[14]); \ + } \ + else \ + { \ + rkeylast = ALIGNED_LOAD(&rk_ptr[12]); \ + } \ + } \ + else \ + { \ + rkeylast = ALIGNED_LOAD(&rk_ptr[10]); \ + } \ + } while (0) + + +#define AES_ENCRYPT(blk, nrounds) \ + do { \ + blk ^= rkey0; \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[1])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[2])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[3])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[4])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[5])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[6])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[7])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[8])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[9])); \ + if (nrounds >= 12) \ + { \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[10])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[11])); \ + if (rounds > 12) \ + { \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[12])); \ + blk = vec_cipher_be (blk, ALIGNED_LOAD(&rk[13])); \ + } \ + } \ + blk = vec_cipherlast_be (blk, rkeylast); \ + } while (0) + + +#define AES_DECRYPT(blk, nrounds) \ + do { \ + blk ^= rkey0; \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[1])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[2])); 
\ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[3])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[4])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[5])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[6])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[7])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[8])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[9])); \ + if (nrounds >= 12) \ + { \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[10])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[11])); \ + if (rounds > 12) \ + { \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[12])); \ + blk = vec_ncipher_be (blk, ALIGNED_LOAD(&rk[13])); \ + } \ + } \ + blk = vec_ncipherlast_be (blk, rkeylast); \ + } while (0) + + static const block vec_bswap32_const = { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }; @@ -287,8 +363,8 @@ _gcry_aes_ppc8_setkey (RIJNDAEL_context *ctx, const byte *key) /* Make a decryption key from an encryption key. */ -void -_gcry_aes_ppc8_prepare_decryption (RIJNDAEL_context *ctx) +static ASM_FUNC_ATTR_INLINE void +aes_ppc8_prepare_decryption (RIJNDAEL_context *ctx) { u128_t *ekey = (u128_t *)(void *)ctx->keyschenc; u128_t *dkey = (u128_t *)(void *)ctx->keyschdec; @@ -305,634 +381,505 @@ _gcry_aes_ppc8_prepare_decryption (RIJNDAEL_context *ctx) } -static ASM_FUNC_ATTR_INLINE block -aes_ppc8_encrypt_altivec (const RIJNDAEL_context *ctx, block a) +void +_gcry_aes_ppc8_prepare_decryption (RIJNDAEL_context *ctx) { - u128_t *rk = (u128_t *)ctx->keyschenc; - int rounds = ctx->rounds; - int r; - -#define DO_ROUND(r) (a = vec_cipher_be (a, ALIGNED_LOAD (&rk[r]))) - - a = ALIGNED_LOAD(&rk[0]) ^ a; - DO_ROUND(1); - DO_ROUND(2); - DO_ROUND(3); - DO_ROUND(4); - DO_ROUND(5); - DO_ROUND(6); - DO_ROUND(7); - DO_ROUND(8); - DO_ROUND(9); - r = 10; - if (rounds >= 12) - { - DO_ROUND(10); - DO_ROUND(11); - r = 12; - if (rounds > 12) - { - DO_ROUND(12); - DO_ROUND(13); - r = 14; - } - } - a = vec_cipherlast_be(a, ALIGNED_LOAD(&rk[r])); - -#undef DO_ROUND - - return a; + aes_ppc8_prepare_decryption (ctx); } -static ASM_FUNC_ATTR_INLINE block -aes_ppc8_decrypt_altivec (const RIJNDAEL_context *ctx, block a) +unsigned int _gcry_aes_ppc8_encrypt (const RIJNDAEL_context *ctx, + unsigned char *out, + const unsigned char *in) { - u128_t *rk = (u128_t *)ctx->keyschdec; + const block bige_const = vec_load_be_const(); + const u128_t *rk = (u128_t *)&ctx->keyschenc; int rounds = ctx->rounds; - int r; - -#define DO_ROUND(r) (a = vec_ncipher_be (a, ALIGNED_LOAD (&rk[r]))) - - a = ALIGNED_LOAD(&rk[0]) ^ a; - DO_ROUND(1); - DO_ROUND(2); - DO_ROUND(3); - DO_ROUND(4); - DO_ROUND(5); - DO_ROUND(6); - DO_ROUND(7); - DO_ROUND(8); - DO_ROUND(9); - r = 10; - if (rounds >= 12) - { - DO_ROUND(10); - DO_ROUND(11); - r = 12; - if (rounds > 12) - { - DO_ROUND(12); - DO_ROUND(13); - r = 14; - } - } - a = vec_ncipherlast_be(a, ALIGNED_LOAD(&rk[r])); + ROUND_KEY_VARIABLES; + block b; -#undef DO_ROUND + b = VEC_LOAD_BE (in, bige_const); - return a; -} + PRELOAD_ROUND_KEYS (rk, rounds); - -unsigned int _gcry_aes_ppc8_encrypt (const RIJNDAEL_context *ctx, - unsigned char *b, - const unsigned char *a) -{ - const block bige_const = vec_load_be_const(); - block sa; - - sa = VEC_LOAD_BE (a, bige_const); - sa = aes_ppc8_encrypt_altivec (ctx, sa); - VEC_STORE_BE (b, sa, bige_const); + AES_ENCRYPT (b, rounds); + VEC_STORE_BE (out, b, bige_const); return 0; /* does not use stack */ } unsigned int _gcry_aes_ppc8_decrypt (const RIJNDAEL_context *ctx, - unsigned char *b, - const unsigned char *a) + unsigned char *out, + 
const unsigned char *in) { const block bige_const = vec_load_be_const(); - block sa; + const u128_t *rk = (u128_t *)&ctx->keyschdec; + int rounds = ctx->rounds; + ROUND_KEY_VARIABLES; + block b; + + b = VEC_LOAD_BE (in, bige_const); - sa = VEC_LOAD_BE (a, bige_const); - sa = aes_ppc8_decrypt_altivec (ctx, sa); - VEC_STORE_BE (b, sa, bige_const); + PRELOAD_ROUND_KEYS (rk, rounds); + + AES_DECRYPT (b, rounds); + VEC_STORE_BE (out, b, bige_const); return 0; /* does not use stack */ } -#if 0 size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, - const void *inbuf_arg, size_t nblocks, - int encrypt) + const void *inbuf_arg, size_t nblocks, + int encrypt) { + const block bige_const = vec_load_be_const(); RIJNDAEL_context *ctx = (void *)&c->context.c; - unsigned char *outbuf = outbuf_arg; - const unsigned char *inbuf = inbuf_arg; - block *in = (block*)inbuf; - block *out = (block*)outbuf; - uintptr_t zero = 0; - int r; + const u128_t *in = (const u128_t *)inbuf_arg; + u128_t *out = (u128_t *)outbuf_arg; int rounds = ctx->rounds; + u64 data_nblocks = c->u_mode.ocb.data_nblocks; + block l0, l1, l2, l; + block b0, b1, b2, b3, b4, b5, b6, b7, b; + block iv0, iv1, iv2, iv3, iv4, iv5, iv6, iv7; + block rkey; + block ctr, iv; + ROUND_KEY_VARIABLES; + + iv = VEC_LOAD_BE (c->u_iv.iv, bige_const); + ctr = VEC_LOAD_BE (c->u_ctr.ctr, bige_const); + + l0 = VEC_LOAD_BE (c->u_mode.ocb.L[0], bige_const); + l1 = VEC_LOAD_BE (c->u_mode.ocb.L[1], bige_const); + l2 = VEC_LOAD_BE (c->u_mode.ocb.L[2], bige_const); if (encrypt) { - const int unroll = 8; - block unalignedprev, ctr, iv; + const u128_t *rk = (u128_t *)&ctx->keyschenc; - if (((uintptr_t)inbuf % 16) != 0) + PRELOAD_ROUND_KEYS (rk, rounds); + + for (; nblocks >= 8 && data_nblocks % 8; nblocks--) { - unalignedprev = vec_ld(0, in++); - } + l = VEC_LOAD_BE (ocb_get_l (c, ++data_nblocks), bige_const); + b = VEC_LOAD_BE (in, bige_const); - iv = vec_ld (0, (block*)&c->u_iv.iv); - ctr = vec_ld (0, (block*)&c->u_ctr.ctr); + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + iv ^= l; + /* Checksum_i = Checksum_{i-1} xor P_i */ + ctr ^= b; + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + b ^= iv; + AES_ENCRYPT (b, rounds); + b ^= iv; - for ( ;nblocks >= unroll; nblocks -= unroll) - { - u64 i = c->u_mode.ocb.data_nblocks + 1; - block l0, l1, l2, l3, l4, l5, l6, l7; - block b0, b1, b2, b3, b4, b5, b6, b7; - block iv0, iv1, iv2, iv3, iv4, iv5, iv6, iv7; - const block *rk = (block*)&ctx->keyschenc; + VEC_STORE_BE (out, b, bige_const); - c->u_mode.ocb.data_nblocks += unroll; + in += 1; + out += 1; + } - iv0 = iv; - if ((uintptr_t)inbuf % 16 == 0) - { - b0 = vec_ld (0, in++); - b1 = vec_ld (0, in++); - b2 = vec_ld (0, in++); - b3 = vec_ld (0, in++); - b4 = vec_ld (0, in++); - b5 = vec_ld (0, in++); - b6 = vec_ld (0, in++); - b7 = vec_ld (0, in++); - } - else - { - block unaligned0, unaligned1, unaligned2, - unaligned3, unaligned4, unaligned5, unaligned6; - unaligned0 = vec_ld (0, in++); - unaligned1 = vec_ld (0, in++); - unaligned2 = vec_ld (0, in++); - unaligned3 = vec_ld (0, in++); - unaligned4 = vec_ld (0, in++); - unaligned5 = vec_ld (0, in++); - unaligned6 = vec_ld (0, in++); - b0 = vec_perm (unalignedprev, unaligned0, vec_lvsl (0, inbuf)); - unalignedprev = vec_ld (0, in++); - b1 = vec_perm(unaligned0, unaligned1, vec_lvsl (0, inbuf)); - b2 = vec_perm(unaligned1, unaligned2, vec_lvsl (0, inbuf)); - b3 = vec_perm(unaligned2, unaligned3, vec_lvsl (0, inbuf)); - b4 = vec_perm(unaligned3, unaligned4, vec_lvsl (0, inbuf)); - b5 = 
vec_perm(unaligned4, unaligned5, vec_lvsl (0, inbuf)); - b6 = vec_perm(unaligned5, unaligned6, vec_lvsl (0, inbuf)); - b7 = vec_perm(unaligned6, unalignedprev, vec_lvsl (0, inbuf)); - } + for (; nblocks >= 8; nblocks -= 8) + { + b0 = VEC_LOAD_BE (in + 0, bige_const); + b1 = VEC_LOAD_BE (in + 1, bige_const); + b2 = VEC_LOAD_BE (in + 2, bige_const); + b3 = VEC_LOAD_BE (in + 3, bige_const); + b4 = VEC_LOAD_BE (in + 4, bige_const); + b5 = VEC_LOAD_BE (in + 5, bige_const); + b6 = VEC_LOAD_BE (in + 6, bige_const); + b7 = VEC_LOAD_BE (in + 7, bige_const); - l0 = *(block*)ocb_get_l (c, i++); - l1 = *(block*)ocb_get_l (c, i++); - l2 = *(block*)ocb_get_l (c, i++); - l3 = *(block*)ocb_get_l (c, i++); - l4 = *(block*)ocb_get_l (c, i++); - l5 = *(block*)ocb_get_l (c, i++); - l6 = *(block*)ocb_get_l (c, i++); - l7 = *(block*)ocb_get_l (c, i++); + l = VEC_LOAD_BE (ocb_get_l (c, data_nblocks += 8), bige_const); ctr ^= b0 ^ b1 ^ b2 ^ b3 ^ b4 ^ b5 ^ b6 ^ b7; - iv0 ^= l0; + iv ^= rkey0; + + iv0 = iv ^ l0; + iv1 = iv ^ l0 ^ l1; + iv2 = iv ^ l1; + iv3 = iv ^ l1 ^ l2; + iv4 = iv ^ l1 ^ l2 ^ l0; + iv5 = iv ^ l2 ^ l0; + iv6 = iv ^ l2; + iv7 = iv ^ l2 ^ l; + b0 ^= iv0; - iv1 = iv0 ^ l1; b1 ^= iv1; - iv2 = iv1 ^ l2; b2 ^= iv2; - iv3 = iv2 ^ l3; b3 ^= iv3; - iv4 = iv3 ^ l4; b4 ^= iv4; - iv5 = iv4 ^ l5; b5 ^= iv5; - iv6 = iv5 ^ l6; b6 ^= iv6; - iv7 = iv6 ^ l7; b7 ^= iv7; - - b0 = swap_if_le (b0); - b1 = swap_if_le (b1); - b2 = swap_if_le (b2); - b3 = swap_if_le (b3); - b4 = swap_if_le (b4); - b5 = swap_if_le (b5); - b6 = swap_if_le (b6); - b7 = swap_if_le (b7); - - b0 ^= rk[0]; - b1 ^= rk[0]; - b2 ^= rk[0]; - b3 ^= rk[0]; - b4 ^= rk[0]; - b5 ^= rk[0]; - b6 ^= rk[0]; - b7 ^= rk[0]; - - for (r = 1;r < rounds;r++) - { - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b0) - :"v" (rk[r])); - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b1) - :"v" (rk[r])); - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b2) - :"v" (rk[r])); - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b3) - :"v" (rk[r])); - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b4) - :"v" (rk[r])); - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b5) - :"v" (rk[r])); - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b6) - :"v" (rk[r])); - __asm__ volatile ("vcipher %0, %0, %1\n\t" - :"+v" (b7) - :"v" (rk[r])); - } - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b0) - :"v" (rk[r])); - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b1) - :"v" (rk[r])); - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b2) - :"v" (rk[r])); - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b3) - :"v" (rk[r])); - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b4) - :"v" (rk[r])); - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b5) - :"v" (rk[r])); - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b6) - :"v" (rk[r])); - __asm__ volatile ("vcipherlast %0, %0, %1\n\t" - :"+v" (b7) - :"v" (rk[r])); - - iv = iv7; - - /* The unaligned store stxvb16x writes big-endian, - so in the unaligned case we swap the iv instead of the bytes */ - if ((uintptr_t)outbuf % 16 == 0) + iv = iv7 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); \ + b4 = vec_cipher_be (b4, rkey); \ + b5 = vec_cipher_be (b5, rkey); \ + b6 = vec_cipher_be (b6, rkey); \ + b7 = vec_cipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + 
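/* Illustrative note, not part of the patch: the iv0..iv7 values computed
   above are the OCB offsets Offset_i = Offset_{i-1} xor L_{ntz(i)} for
   eight consecutive blocks.  When the running block index n is a multiple
   of 8, ntz(n+1..n+8) is 0,1,0,2,0,1,0,>=3, so after the XOR cancellations
   only L[0], L[1], L[2] and one ocb_get_l() lookup (l) are needed:

     Offset_{n+1} = Offset_n ^ L[0]
     Offset_{n+2} = Offset_n ^ L[0] ^ L[1]
     Offset_{n+3} = Offset_n ^ L[1]
     Offset_{n+4} = Offset_n ^ L[1] ^ L[2]
     Offset_{n+5} = Offset_n ^ L[1] ^ L[2] ^ L[0]
     Offset_{n+6} = Offset_n ^ L[2] ^ L[0]
     Offset_{n+7} = Offset_n ^ L[2]
     Offset_{n+8} = Offset_n ^ L[2] ^ L[ntz(n+8)]

   Folding rkey0 into iv beforehand lets "b ^= ivN" apply the offset and the
   first AddRoundKey in a single XOR; the extra rkey0 is cancelled again in
   the last round by using "rkeylast ^ rkey0" as the vcipherlast key.  */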
DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) { - vec_vsx_st (swap_if_le (b0) ^ iv0, 0, out++); - vec_vsx_st (swap_if_le (b1) ^ iv1, 0, out++); - vec_vsx_st (swap_if_le (b2) ^ iv2, 0, out++); - vec_vsx_st (swap_if_le (b3) ^ iv3, 0, out++); - vec_vsx_st (swap_if_le (b4) ^ iv4, 0, out++); - vec_vsx_st (swap_if_le (b5) ^ iv5, 0, out++); - vec_vsx_st (swap_if_le (b6) ^ iv6, 0, out++); - vec_vsx_st (swap_if_le (b7) ^ iv7, 0, out++); + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } } - else + +#undef DO_ROUND + + rkey = rkeylast ^ rkey0; + b0 = vec_cipherlast_be (b0, rkey ^ iv0); + b1 = vec_cipherlast_be (b1, rkey ^ iv1); + b2 = vec_cipherlast_be (b2, rkey ^ iv2); + b3 = vec_cipherlast_be (b3, rkey ^ iv3); + b4 = vec_cipherlast_be (b4, rkey ^ iv4); + b5 = vec_cipherlast_be (b5, rkey ^ iv5); + b6 = vec_cipherlast_be (b6, rkey ^ iv6); + b7 = vec_cipherlast_be (b7, rkey ^ iv7); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + VEC_STORE_BE (out + 4, b4, bige_const); + VEC_STORE_BE (out + 5, b5, bige_const); + VEC_STORE_BE (out + 6, b6, bige_const); + VEC_STORE_BE (out + 7, b7, bige_const); + + in += 8; + out += 8; + } + + if (nblocks >= 4 && (data_nblocks % 4) == 0) + { + b0 = VEC_LOAD_BE (in + 0, bige_const); + b1 = VEC_LOAD_BE (in + 1, bige_const); + b2 = VEC_LOAD_BE (in + 2, bige_const); + b3 = VEC_LOAD_BE (in + 3, bige_const); + + l = VEC_LOAD_BE (ocb_get_l (c, data_nblocks += 4), bige_const); + + ctr ^= b0 ^ b1 ^ b2 ^ b3; + + iv ^= rkey0; + + iv0 = iv ^ l0; + iv1 = iv ^ l0 ^ l1; + iv2 = iv ^ l1; + iv3 = iv ^ l1 ^ l; + + b0 ^= iv0; + b1 ^= iv1; + b2 ^= iv2; + b3 ^= iv3; + iv = iv3 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) { - b0 ^= swap_if_le (iv0); - b1 ^= swap_if_le (iv1); - b2 ^= swap_if_le (iv2); - b3 ^= swap_if_le (iv3); - b4 ^= swap_if_le (iv4); - b5 ^= swap_if_le (iv5); - b6 ^= swap_if_le (iv6); - b7 ^= swap_if_le (iv7); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b0), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b1), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b2), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b3), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b4), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b5), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b6), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b7), "r" (zero), "r" ((uintptr_t)(out++))); + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } } + +#undef DO_ROUND + + rkey = rkeylast ^ rkey0; + b0 = vec_cipherlast_be (b0, rkey ^ iv0); + b1 = vec_cipherlast_be (b1, rkey ^ iv1); + b2 = vec_cipherlast_be (b2, rkey ^ iv2); + b3 = vec_cipherlast_be (b3, rkey ^ iv3); + + VEC_STORE_BE (out + 0, b0, 
bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + + in += 4; + out += 4; + nblocks -= 4; } - for ( ;nblocks; nblocks-- ) + for (; nblocks; nblocks--) { - block b; - u64 i = ++c->u_mode.ocb.data_nblocks; - const block l = *(block*)ocb_get_l (c, i); + l = VEC_LOAD_BE (ocb_get_l (c, ++data_nblocks), bige_const); + b = VEC_LOAD_BE (in, bige_const); /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ iv ^= l; - if ((uintptr_t)in % 16 == 0) - { - b = vec_ld (0, in++); - } - else - { - block unalignedprevprev; - unalignedprevprev = unalignedprev; - unalignedprev = vec_ld (0, in++); - b = vec_perm (unalignedprevprev, unalignedprev, vec_lvsl (0, inbuf)); - } - /* Checksum_i = Checksum_{i-1} xor P_i */ ctr ^= b; /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ b ^= iv; - b = swap_if_le (b); - b = _gcry_aes_ppc8_encrypt_altivec (ctx, b); - if ((uintptr_t)out % 16 == 0) - { - vec_vsx_st (swap_if_le (b) ^ iv, 0, out++); - } - else - { - b ^= swap_if_le (iv); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - : - : "wa" (b), "r" (zero), "r" ((uintptr_t)out++)); - } - } + AES_ENCRYPT (b, rounds); + b ^= iv; - /* We want to store iv and ctr big-endian and the unaligned - store stxvb16x stores them little endian, so we have to swap them. */ - iv = swap_if_le (iv); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (iv), "r" (zero), "r" ((uintptr_t)&c->u_iv.iv)); - ctr = swap_if_le (ctr); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (ctr), "r" (zero), "r" ((uintptr_t)&c->u_ctr.ctr)); + VEC_STORE_BE (out, b, bige_const); + + in += 1; + out += 1; + } } else { - const int unroll = 8; - block unalignedprev, ctr, iv; - if (((uintptr_t)inbuf % 16) != 0) + const u128_t *rk = (u128_t *)&ctx->keyschdec; + + if (!ctx->decryption_prepared) { - unalignedprev = vec_ld (0, in++); + aes_ppc8_prepare_decryption (ctx); + ctx->decryption_prepared = 1; } - iv = vec_ld (0, (block*)&c->u_iv.iv); - ctr = vec_ld (0, (block*)&c->u_ctr.ctr); + PRELOAD_ROUND_KEYS (rk, rounds); - for ( ;nblocks >= unroll; nblocks -= unroll) + for (; nblocks >= 8 && data_nblocks % 8; nblocks--) { - u64 i = c->u_mode.ocb.data_nblocks + 1; - block l0, l1, l2, l3, l4, l5, l6, l7; - block b0, b1, b2, b3, b4, b5, b6, b7; - block iv0, iv1, iv2, iv3, iv4, iv5, iv6, iv7; - const block *rk = (block*)&ctx->keyschdec; + l = VEC_LOAD_BE (ocb_get_l (c, ++data_nblocks), bige_const); + b = VEC_LOAD_BE (in, bige_const); - c->u_mode.ocb.data_nblocks += unroll; + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + iv ^= l; + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ + b ^= iv; + AES_DECRYPT (b, rounds); + b ^= iv; + /* Checksum_i = Checksum_{i-1} xor P_i */ + ctr ^= b; - iv0 = iv; - if ((uintptr_t)inbuf % 16 == 0) - { - b0 = vec_ld (0, in++); - b1 = vec_ld (0, in++); - b2 = vec_ld (0, in++); - b3 = vec_ld (0, in++); - b4 = vec_ld (0, in++); - b5 = vec_ld (0, in++); - b6 = vec_ld (0, in++); - b7 = vec_ld (0, in++); - } - else - { - block unaligned0, unaligned1, unaligned2, - unaligned3, unaligned4, unaligned5, unaligned6; - unaligned0 = vec_ld (0, in++); - unaligned1 = vec_ld (0, in++); - unaligned2 = vec_ld (0, in++); - unaligned3 = vec_ld (0, in++); - unaligned4 = vec_ld (0, in++); - unaligned5 = vec_ld (0, in++); - unaligned6 = vec_ld (0, in++); - b0 = vec_perm (unalignedprev, unaligned0, vec_lvsl (0, inbuf)); - unalignedprev = vec_ld (0, in++); - b1 = vec_perm (unaligned0, unaligned1, vec_lvsl (0, inbuf)); - b2 = vec_perm (unaligned1, unaligned2, 
vec_lvsl (0, inbuf)); - b3 = vec_perm (unaligned2, unaligned3, vec_lvsl (0, inbuf)); - b4 = vec_perm (unaligned3, unaligned4, vec_lvsl (0, inbuf)); - b5 = vec_perm (unaligned4, unaligned5, vec_lvsl (0, inbuf)); - b6 = vec_perm (unaligned5, unaligned6, vec_lvsl (0, inbuf)); - b7 = vec_perm (unaligned6, unalignedprev, vec_lvsl (0, inbuf)); - } + VEC_STORE_BE (out, b, bige_const); - l0 = *(block*)ocb_get_l (c, i++); - l1 = *(block*)ocb_get_l (c, i++); - l2 = *(block*)ocb_get_l (c, i++); - l3 = *(block*)ocb_get_l (c, i++); - l4 = *(block*)ocb_get_l (c, i++); - l5 = *(block*)ocb_get_l (c, i++); - l6 = *(block*)ocb_get_l (c, i++); - l7 = *(block*)ocb_get_l (c, i++); + in += 1; + out += 1; + } + + for (; nblocks >= 8; nblocks -= 8) + { + b0 = VEC_LOAD_BE (in + 0, bige_const); + b1 = VEC_LOAD_BE (in + 1, bige_const); + b2 = VEC_LOAD_BE (in + 2, bige_const); + b3 = VEC_LOAD_BE (in + 3, bige_const); + b4 = VEC_LOAD_BE (in + 4, bige_const); + b5 = VEC_LOAD_BE (in + 5, bige_const); + b6 = VEC_LOAD_BE (in + 6, bige_const); + b7 = VEC_LOAD_BE (in + 7, bige_const); + + l = VEC_LOAD_BE (ocb_get_l (c, data_nblocks += 8), bige_const); + + iv ^= rkey0; + + iv0 = iv ^ l0; + iv1 = iv ^ l0 ^ l1; + iv2 = iv ^ l1; + iv3 = iv ^ l1 ^ l2; + iv4 = iv ^ l1 ^ l2 ^ l0; + iv5 = iv ^ l2 ^ l0; + iv6 = iv ^ l2; + iv7 = iv ^ l2 ^ l; - iv0 ^= l0; b0 ^= iv0; - iv1 = iv0 ^ l1; b1 ^= iv1; - iv2 = iv1 ^ l2; b2 ^= iv2; - iv3 = iv2 ^ l3; b3 ^= iv3; - iv4 = iv3 ^ l4; b4 ^= iv4; - iv5 = iv4 ^ l5; b5 ^= iv5; - iv6 = iv5 ^ l6; b6 ^= iv6; - iv7 = iv6 ^ l7; b7 ^= iv7; - - b0 = swap_if_le (b0); - b1 = swap_if_le (b1); - b2 = swap_if_le (b2); - b3 = swap_if_le (b3); - b4 = swap_if_le (b4); - b5 = swap_if_le (b5); - b6 = swap_if_le (b6); - b7 = swap_if_le (b7); - - b0 ^= rk[0]; - b1 ^= rk[0]; - b2 ^= rk[0]; - b3 ^= rk[0]; - b4 ^= rk[0]; - b5 ^= rk[0]; - b6 ^= rk[0]; - b7 ^= rk[0]; - - for (r = 1;r < rounds;r++) + iv = iv7 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_ncipher_be (b0, rkey); \ + b1 = vec_ncipher_be (b1, rkey); \ + b2 = vec_ncipher_be (b2, rkey); \ + b3 = vec_ncipher_be (b3, rkey); \ + b4 = vec_ncipher_be (b4, rkey); \ + b5 = vec_ncipher_be (b5, rkey); \ + b6 = vec_ncipher_be (b6, rkey); \ + b7 = vec_ncipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) { - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b0) - :"v" (rk[r])); - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b1) - :"v" (rk[r])); - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b2) - :"v" (rk[r])); - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b3) - :"v" (rk[r])); - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b4) - :"v" (rk[r])); - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b5) - :"v" (rk[r])); - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b6) - :"v" (rk[r])); - __asm__ volatile ("vncipher %0, %0, %1\n\t" - :"+v" (b7) - :"v" (rk[r])); + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } } - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (b0) - :"v" (rk[r])); - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (b1) - :"v" (rk[r])); - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (b2) - :"v" (rk[r])); - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (b3) - :"v" (rk[r])); - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (b4) - :"v" (rk[r])); - __asm__ volatile ("vncipherlast 
%0, %0, %1\n\t" - :"+v" (b5) - :"v" (rk[r])); - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (b6) - :"v" (rk[r])); - __asm__ volatile ("vncipherlast %0, %0, %1\n\t" - :"+v" (b7) - :"v" (rk[r])); - - iv = iv7; - - b0 = swap_if_le (b0) ^ iv0; - b1 = swap_if_le (b1) ^ iv1; - b2 = swap_if_le (b2) ^ iv2; - b3 = swap_if_le (b3) ^ iv3; - b4 = swap_if_le (b4) ^ iv4; - b5 = swap_if_le (b5) ^ iv5; - b6 = swap_if_le (b6) ^ iv6; - b7 = swap_if_le (b7) ^ iv7; + +#undef DO_ROUND + + rkey = rkeylast ^ rkey0; + b0 = vec_ncipherlast_be (b0, rkey ^ iv0); + b1 = vec_ncipherlast_be (b1, rkey ^ iv1); + b2 = vec_ncipherlast_be (b2, rkey ^ iv2); + b3 = vec_ncipherlast_be (b3, rkey ^ iv3); + b4 = vec_ncipherlast_be (b4, rkey ^ iv4); + b5 = vec_ncipherlast_be (b5, rkey ^ iv5); + b6 = vec_ncipherlast_be (b6, rkey ^ iv6); + b7 = vec_ncipherlast_be (b7, rkey ^ iv7); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + VEC_STORE_BE (out + 4, b4, bige_const); + VEC_STORE_BE (out + 5, b5, bige_const); + VEC_STORE_BE (out + 6, b6, bige_const); + VEC_STORE_BE (out + 7, b7, bige_const); ctr ^= b0 ^ b1 ^ b2 ^ b3 ^ b4 ^ b5 ^ b6 ^ b7; - /* The unaligned store stxvb16x writes big-endian */ - if ((uintptr_t)outbuf % 16 == 0) - { - vec_vsx_st (b0, 0, out++); - vec_vsx_st (b1, 0, out++); - vec_vsx_st (b2, 0, out++); - vec_vsx_st (b3, 0, out++); - vec_vsx_st (b4, 0, out++); - vec_vsx_st (b5, 0, out++); - vec_vsx_st (b6, 0, out++); - vec_vsx_st (b7, 0, out++); - } - else + in += 8; + out += 8; + } + + if (nblocks >= 4 && (data_nblocks % 4) == 0) + { + b0 = VEC_LOAD_BE (in + 0, bige_const); + b1 = VEC_LOAD_BE (in + 1, bige_const); + b2 = VEC_LOAD_BE (in + 2, bige_const); + b3 = VEC_LOAD_BE (in + 3, bige_const); + + l = VEC_LOAD_BE (ocb_get_l (c, data_nblocks += 4), bige_const); + + iv ^= rkey0; + + iv0 = iv ^ l0; + iv1 = iv ^ l0 ^ l1; + iv2 = iv ^ l1; + iv3 = iv ^ l1 ^ l; + + b0 ^= iv0; + b1 ^= iv1; + b2 ^= iv2; + b3 ^= iv3; + iv = iv3 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_ncipher_be (b0, rkey); \ + b1 = vec_ncipher_be (b1, rkey); \ + b2 = vec_ncipher_be (b2, rkey); \ + b3 = vec_ncipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) { - b0 = swap_if_le (b0); - b1 = swap_if_le (b1); - b2 = swap_if_le (b2); - b3 = swap_if_le (b3); - b4 = swap_if_le (b4); - b5 = swap_if_le (b5); - b6 = swap_if_le (b6); - b7 = swap_if_le (b7); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b0), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b1), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b2), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b3), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b4), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b5), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b6), "r" (zero), "r" ((uintptr_t)(out++))); - __asm__ ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (b7), "r" (zero), "r" ((uintptr_t)(out++))); + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } } + +#undef DO_ROUND + + rkey = rkeylast ^ rkey0; + b0 = vec_ncipherlast_be (b0, rkey ^ iv0); + b1 = 
vec_ncipherlast_be (b1, rkey ^ iv1); + b2 = vec_ncipherlast_be (b2, rkey ^ iv2); + b3 = vec_ncipherlast_be (b3, rkey ^ iv3); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + + ctr ^= b0 ^ b1 ^ b2 ^ b3; + + in += 4; + out += 4; + nblocks -= 4; } - for ( ;nblocks; nblocks-- ) + for (; nblocks; nblocks--) { - block b; - u64 i = ++c->u_mode.ocb.data_nblocks; - const block l = *(block*)ocb_get_l (c, i); + l = VEC_LOAD_BE (ocb_get_l (c, ++data_nblocks), bige_const); + b = VEC_LOAD_BE (in, bige_const); /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ iv ^= l; - if ((uintptr_t)in % 16 == 0) - { - b = vec_ld (0, in++); - } - else - { - block unalignedprevprev; - unalignedprevprev = unalignedprev; - unalignedprev = vec_ld (0, in++); - b = vec_perm (unalignedprevprev, unalignedprev, vec_lvsl (0, inbuf)); - } - - /* Checksum_i = Checksum_{i-1} xor P_i */ - /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ + b ^= iv; + AES_DECRYPT (b, rounds); b ^= iv; - b = swap_if_le (b); - b = _gcry_aes_ppc8_decrypt_altivec (ctx, b); - b = swap_if_le (b) ^ iv; + /* Checksum_i = Checksum_{i-1} xor P_i */ ctr ^= b; - if ((uintptr_t)out % 16 == 0) - { - vec_vsx_st (b, 0, out++); - } - else - { - b = swap_if_le (b); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - : - : "wa" (b), "r" (zero), "r" ((uintptr_t)out++)); - } - } - /* We want to store iv and ctr big-endian and the unaligned - store stxvb16x stores them little endian, so we have to swap them. */ - iv = swap_if_le (iv); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (iv), "r" (zero), "r" ((uintptr_t)&c->u_iv.iv)); - ctr = swap_if_le(ctr); - __asm__ volatile ("stxvb16x %x0, %1, %2\n\t" - :: "wa" (ctr), "r" (zero), "r" ((uintptr_t)&c->u_ctr.ctr)); + VEC_STORE_BE (out, b, bige_const); + + in += 1; + out += 1; + } } + + VEC_STORE_BE (c->u_iv.iv, iv, bige_const); + VEC_STORE_BE (c->u_ctr.ctr, ctr, bige_const); + c->u_mode.ocb.data_nblocks = data_nblocks; + return 0; } -#endif #endif /* USE_PPC_CRYPTO */ diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 8a27dfe0b..c7bc467cf 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -210,6 +210,9 @@ extern unsigned int _gcry_aes_ppc8_encrypt(const RIJNDAEL_context *ctx, extern unsigned int _gcry_aes_ppc8_decrypt(const RIJNDAEL_context *ctx, unsigned char *dst, const unsigned char *src); +extern size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks, + int encrypt); #endif /*USE_PPC_CRYPTO*/ static unsigned int do_encrypt (const RIJNDAEL_context *ctx, unsigned char *bx, @@ -447,6 +450,7 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, ctx->use_ppc_crypto = 1; if (hd) { + hd->bulk.ocb_crypt = _gcry_aes_ppc8_ocb_crypt; } } #endif @@ -1380,6 +1384,12 @@ _gcry_aes_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, return _gcry_aes_armv8_ce_ocb_crypt (c, outbuf, inbuf, nblocks, encrypt); } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + return _gcry_aes_ppc8_ocb_crypt (c, outbuf, inbuf, nblocks, encrypt); + } +#endif /*USE_PPC_CRYPTO*/ else if (encrypt) { union { unsigned char x1[16] ATTR_ALIGNED_16; u32 x32[4]; } l_tmp; From jussi.kivilinna at iki.fi Fri Aug 23 18:52:20 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 23 Aug 2019 19:52:20 +0300 Subject: [PATCH 5/6] rijndael-ppc: add block mode for 
ocb_auth In-Reply-To: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> References: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> Message-ID: <156657914066.2143.16287955800301820551.stgit@localhost.localdomain> * cipher/rijndael-ppc.c (_gcry_aes_ppc8_ocb_auth): New. * cipher/rijndael.c [USE_PPC_CRYPTO] (_gcry_aes_ppc8_ocb_auth): New prototype. (do_setkey, _gcry_aes_ocb_auth) [USE_PPC_CRYPTO]: Add PowerPC AES ocb_auth. -- Benchmark on POWER8 ~3.8Ghz: Before: AES | nanosecs/byte mebibytes/sec cycles/byte OCB enc | 0.250 ns/B 3818 MiB/s 0.949 c/B OCB dec | 0.250 ns/B 3820 MiB/s 0.949 c/B OCB auth | 2.31 ns/B 412.5 MiB/s 8.79 c/B After: AES | nanosecs/byte mebibytes/sec cycles/byte OCB enc | 0.252 ns/B 3779 MiB/s 0.959 c/B OCB dec | 0.245 ns/B 3891 MiB/s 0.931 c/B OCB auth | 0.223 ns/B 4283 MiB/s 0.846 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/rijndael-ppc.c b/cipher/rijndael-ppc.c index 01ff6f503..018527321 100644 --- a/cipher/rijndael-ppc.c +++ b/cipher/rijndael-ppc.c @@ -882,4 +882,213 @@ size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, return 0; } +size_t _gcry_aes_ppc8_ocb_auth (gcry_cipher_hd_t c, void *abuf_arg, + size_t nblocks) +{ + const block bige_const = vec_load_be_const(); + RIJNDAEL_context *ctx = (void *)&c->context.c; + const u128_t *rk = (u128_t *)&ctx->keyschenc; + const u128_t *abuf = (const u128_t *)abuf_arg; + int rounds = ctx->rounds; + u64 data_nblocks = c->u_mode.ocb.aad_nblocks; + block l0, l1, l2, l; + block b0, b1, b2, b3, b4, b5, b6, b7, b; + block iv0, iv1, iv2, iv3, iv4, iv5, iv6, iv7; + block rkey, frkey; + block ctr, iv; + ROUND_KEY_VARIABLES; + + iv = VEC_LOAD_BE (c->u_mode.ocb.aad_offset, bige_const); + ctr = VEC_LOAD_BE (c->u_mode.ocb.aad_sum, bige_const); + + l0 = VEC_LOAD_BE (c->u_mode.ocb.L[0], bige_const); + l1 = VEC_LOAD_BE (c->u_mode.ocb.L[1], bige_const); + l2 = VEC_LOAD_BE (c->u_mode.ocb.L[2], bige_const); + + PRELOAD_ROUND_KEYS (rk, rounds); + + for (; nblocks >= 8 && data_nblocks % 8; nblocks--) + { + l = VEC_LOAD_BE (ocb_get_l (c, ++data_nblocks), bige_const); + b = VEC_LOAD_BE (abuf, bige_const); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + iv ^= l; + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ + b ^= iv; + AES_ENCRYPT (b, rounds); + ctr ^= b; + + abuf += 1; + } + + for (; nblocks >= 8; nblocks -= 8) + { + b0 = VEC_LOAD_BE (abuf + 0, bige_const); + b1 = VEC_LOAD_BE (abuf + 1, bige_const); + b2 = VEC_LOAD_BE (abuf + 2, bige_const); + b3 = VEC_LOAD_BE (abuf + 3, bige_const); + b4 = VEC_LOAD_BE (abuf + 4, bige_const); + b5 = VEC_LOAD_BE (abuf + 5, bige_const); + b6 = VEC_LOAD_BE (abuf + 6, bige_const); + b7 = VEC_LOAD_BE (abuf + 7, bige_const); + + l = VEC_LOAD_BE (ocb_get_l (c, data_nblocks += 8), bige_const); + + frkey = rkey0; + iv ^= frkey; + + iv0 = iv ^ l0; + iv1 = iv ^ l0 ^ l1; + iv2 = iv ^ l1; + iv3 = iv ^ l1 ^ l2; + iv4 = iv ^ l1 ^ l2 ^ l0; + iv5 = iv ^ l2 ^ l0; + iv6 = iv ^ l2; + iv7 = iv ^ l2 ^ l; + + b0 ^= iv0; + b1 ^= iv1; + b2 ^= iv2; + b3 ^= iv3; + b4 ^= iv4; + b5 ^= iv5; + b6 ^= iv6; + b7 ^= iv7; + iv = iv7 ^ frkey; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); \ + b4 = vec_cipher_be (b4, rkey); \ + b5 = vec_cipher_be (b5, rkey); \ + b6 = vec_cipher_be (b6, rkey); \ + b7 = vec_cipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + 
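/* Illustrative note, not part of the patch: DO_ROUND(1)..DO_ROUND(9) always
   run; rounds 10-11 are added for AES-192/AES-256 (rounds >= 12) and rounds
   12-13 only for AES-256 (rounds > 12).  This mirrors PRELOAD_ROUND_KEYS,
   which selects rk[10], rk[12] or rk[14] as rkeylast for the final
   vcipherlast/vncipherlast step.  */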
DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey); + b1 = vec_cipherlast_be (b1, rkey); + b2 = vec_cipherlast_be (b2, rkey); + b3 = vec_cipherlast_be (b3, rkey); + b4 = vec_cipherlast_be (b4, rkey); + b5 = vec_cipherlast_be (b5, rkey); + b6 = vec_cipherlast_be (b6, rkey); + b7 = vec_cipherlast_be (b7, rkey); + + ctr ^= b0 ^ b1 ^ b2 ^ b3 ^ b4 ^ b5 ^ b6 ^ b7; + + abuf += 8; + } + + if (nblocks >= 4 && (data_nblocks % 4) == 0) + { + b0 = VEC_LOAD_BE (abuf + 0, bige_const); + b1 = VEC_LOAD_BE (abuf + 1, bige_const); + b2 = VEC_LOAD_BE (abuf + 2, bige_const); + b3 = VEC_LOAD_BE (abuf + 3, bige_const); + + l = VEC_LOAD_BE (ocb_get_l (c, data_nblocks += 4), bige_const); + + frkey = rkey0; + iv ^= frkey; + + iv0 = iv ^ l0; + iv1 = iv ^ l0 ^ l1; + iv2 = iv ^ l1; + iv3 = iv ^ l1 ^ l; + + b0 ^= iv0; + b1 ^= iv1; + b2 ^= iv2; + b3 ^= iv3; + iv = iv3 ^ frkey; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey); + b1 = vec_cipherlast_be (b1, rkey); + b2 = vec_cipherlast_be (b2, rkey); + b3 = vec_cipherlast_be (b3, rkey); + + ctr ^= b0 ^ b1 ^ b2 ^ b3; + + abuf += 4; + nblocks -= 4; + } + + for (; nblocks; nblocks--) + { + l = VEC_LOAD_BE (ocb_get_l (c, ++data_nblocks), bige_const); + b = VEC_LOAD_BE (abuf, bige_const); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + iv ^= l; + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ + b ^= iv; + AES_ENCRYPT (b, rounds); + ctr ^= b; + + abuf += 1; + } + + VEC_STORE_BE (c->u_mode.ocb.aad_offset, iv, bige_const); + VEC_STORE_BE (c->u_mode.ocb.aad_sum, ctr, bige_const); + c->u_mode.ocb.aad_nblocks = data_nblocks; + + return 0; +} + #endif /* USE_PPC_CRYPTO */ diff --git a/cipher/rijndael.c b/cipher/rijndael.c index c7bc467cf..f15ac18b1 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -213,6 +213,8 @@ extern unsigned int _gcry_aes_ppc8_decrypt(const RIJNDAEL_context *ctx, extern size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, size_t nblocks, int encrypt); +extern size_t _gcry_aes_ppc8_ocb_auth (gcry_cipher_hd_t c, + const void *abuf_arg, size_t nblocks); #endif /*USE_PPC_CRYPTO*/ static unsigned int do_encrypt (const RIJNDAEL_context *ctx, unsigned char *bx, @@ -451,6 +453,7 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, if (hd) { hd->bulk.ocb_crypt = _gcry_aes_ppc8_ocb_crypt; + hd->bulk.ocb_auth = _gcry_aes_ppc8_ocb_auth; } } #endif @@ -1484,6 +1487,12 @@ _gcry_aes_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) return _gcry_aes_armv8_ce_ocb_auth (c, abuf, nblocks); } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + return _gcry_aes_ppc8_ocb_auth (c, abuf, nblocks); + } +#endif /*USE_PPC_CRYPTO*/ else { union { unsigned char x1[16] ATTR_ALIGNED_16; u32 x32[4]; } l_tmp; From jussi.kivilinna at iki.fi Fri Aug 23 18:52:25 2019 From: 
jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 23 Aug 2019 19:52:25 +0300 Subject: [PATCH 6/6] rijndael-ppc: add block modes for CBC, CFB, CTR and XTS In-Reply-To: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> References: <156657911998.2143.9516236618799878867.stgit@localhost.localdomain> Message-ID: <156657914581.2143.14382637022441845937.stgit@localhost.localdomain> * cipher/rijndael-ppc.c (vec_add_uint128, _gcry_aes_ppc8_cfb_enc) (_gcry_aes_ppc8_cfb_dec, _gcry_aes_ppc8_cbc_enc) (_gcry_aes_ppc8_cbc_dec, _gcry_aes_ppc8_ctr_enc) (_gcry_aes_ppc8_xts_crypt): New. * cipher/rijndael.c [USE_PPC_CRYPTO] (_gcry_aes_ppc8_cfb_enc) (_gcry_aes_ppc8_cfb_dec, _gcry_aes_ppc8_cbc_enc) (_gcry_aes_ppc8_cbc_dec, _gcry_aes_ppc8_ctr_enc) (_gcry_aes_ppc8_xts_crypt): New. (do_setkey, _gcry_aes_cfb_enc, _gcry_aes_cfb_dec, _gcry_aes_cbc_enc) (_gcry_aes_cbc_dec, _gcry_aes_ctr_enc) (_gcry_aes_xts_crypto) [USE_PPC_CRYPTO]: Enable PowerPC AES CFB/CBC/CTR/XTS bulk implementations. * configure.ac (gcry_cv_gcc_inline_asm_ppc_altivec): Add 'vadduwm' instruction. -- Benchmark on POWER8 ~3.8Ghz: Before: AES | nanosecs/byte mebibytes/sec cycles/byte CBC enc | 2.13 ns/B 447.2 MiB/s 8.10 c/B CBC dec | 1.13 ns/B 843.4 MiB/s 4.30 c/B CFB enc | 2.20 ns/B 433.9 MiB/s 8.35 c/B CFB dec | 2.22 ns/B 429.7 MiB/s 8.43 c/B CTR enc | 2.18 ns/B 438.2 MiB/s 8.27 c/B CTR dec | 2.18 ns/B 437.4 MiB/s 8.28 c/B XTS enc | 2.31 ns/B 412.8 MiB/s 8.78 c/B XTS dec | 2.30 ns/B 414.3 MiB/s 8.75 c/B CCM enc | 4.33 ns/B 220.1 MiB/s 16.47 c/B CCM dec | 4.34 ns/B 219.9 MiB/s 16.48 c/B CCM auth | 2.16 ns/B 440.6 MiB/s 8.22 c/B EAX enc | 4.34 ns/B 219.8 MiB/s 16.49 c/B EAX dec | 4.34 ns/B 219.8 MiB/s 16.49 c/B EAX auth | 2.16 ns/B 440.5 MiB/s 8.23 c/B After: AES | nanosecs/byte mebibytes/sec cycles/byte CBC enc | 1.06 ns/B 903.1 MiB/s 4.01 c/B CBC dec | 0.211 ns/B 4511 MiB/s 0.803 c/B CFB enc | 1.06 ns/B 896.7 MiB/s 4.04 c/B CFB dec | 0.209 ns/B 4563 MiB/s 0.794 c/B CTR enc | 0.237 ns/B 4026 MiB/s 0.900 c/B CTR dec | 0.237 ns/B 4029 MiB/s 0.900 c/B XTS enc | 0.496 ns/B 1922 MiB/s 1.89 c/B XTS dec | 0.496 ns/B 1924 MiB/s 1.88 c/B CCM enc | 1.29 ns/B 737.7 MiB/s 4.91 c/B CCM dec | 1.29 ns/B 737.8 MiB/s 4.91 c/B CCM auth | 1.06 ns/B 903.3 MiB/s 4.01 c/B EAX enc | 1.29 ns/B 737.7 MiB/s 4.91 c/B EAX dec | 1.29 ns/B 737.2 MiB/s 4.92 c/B Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/rijndael-ppc.c b/cipher/rijndael-ppc.c index 018527321..5f3c7ee30 100644 --- a/cipher/rijndael-ppc.c +++ b/cipher/rijndael-ppc.c @@ -230,6 +230,22 @@ vec_store_be(block vec, unsigned long offset, unsigned char *ptr, } +static ASM_FUNC_ATTR_INLINE block +vec_add_uint128(block a, block b) +{ +#if 1 + block res; + /* Use assembly as GCC (v8.3) generates slow code for vec_vadduqm. 
*/ + __asm__ ("vadduqm %0,%1,%2\n\t" + : "=v" (res) + : "v" (a), "v" (b)); + return res; +#else + return (block)vec_vadduqm((vector __uint128_t)a, (vector __uint128_t)b); +#endif +} + + static ASM_FUNC_ATTR_INLINE u32 _gcry_aes_sbox4_ppc8(u32 fourbytes) { @@ -419,14 +435,612 @@ unsigned int _gcry_aes_ppc8_decrypt (const RIJNDAEL_context *ctx, ROUND_KEY_VARIABLES; block b; - b = VEC_LOAD_BE (in, bige_const); + b = VEC_LOAD_BE (in, bige_const); + + PRELOAD_ROUND_KEYS (rk, rounds); + + AES_DECRYPT (b, rounds); + VEC_STORE_BE (out, b, bige_const); + + return 0; /* does not use stack */ +} + + +void _gcry_aes_ppc8_cfb_enc (void *context, unsigned char *iv_arg, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + const block bige_const = vec_load_be_const(); + RIJNDAEL_context *ctx = context; + const u128_t *rk = (u128_t *)&ctx->keyschenc; + const u128_t *in = (const u128_t *)inbuf_arg; + u128_t *out = (u128_t *)outbuf_arg; + int rounds = ctx->rounds; + ROUND_KEY_VARIABLES; + block rkeylast_orig; + block iv; + + iv = VEC_LOAD_BE (iv_arg, bige_const); + + PRELOAD_ROUND_KEYS (rk, rounds); + rkeylast_orig = rkeylast; + + for (; nblocks; nblocks--) + { + rkeylast = rkeylast_orig ^ VEC_LOAD_BE (in, bige_const); + + AES_ENCRYPT (iv, rounds); + + VEC_STORE_BE (out, iv, bige_const); + + out++; + in++; + } + + VEC_STORE_BE (iv_arg, iv, bige_const); +} + +void _gcry_aes_ppc8_cfb_dec (void *context, unsigned char *iv_arg, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + const block bige_const = vec_load_be_const(); + RIJNDAEL_context *ctx = context; + const u128_t *rk = (u128_t *)&ctx->keyschenc; + const u128_t *in = (const u128_t *)inbuf_arg; + u128_t *out = (u128_t *)outbuf_arg; + int rounds = ctx->rounds; + ROUND_KEY_VARIABLES; + block rkeylast_orig; + block iv, b, bin; + block in0, in1, in2, in3, in4, in5, in6, in7; + block b0, b1, b2, b3, b4, b5, b6, b7; + block rkey; + + iv = VEC_LOAD_BE (iv_arg, bige_const); + + PRELOAD_ROUND_KEYS (rk, rounds); + rkeylast_orig = rkeylast; + + for (; nblocks >= 8; nblocks -= 8) + { + in0 = iv; + in1 = VEC_LOAD_BE (in + 0, bige_const); + in2 = VEC_LOAD_BE (in + 1, bige_const); + in3 = VEC_LOAD_BE (in + 2, bige_const); + in4 = VEC_LOAD_BE (in + 3, bige_const); + in5 = VEC_LOAD_BE (in + 4, bige_const); + in6 = VEC_LOAD_BE (in + 5, bige_const); + in7 = VEC_LOAD_BE (in + 6, bige_const); + iv = VEC_LOAD_BE (in + 7, bige_const); + + b0 = rkey0 ^ in0; + b1 = rkey0 ^ in1; + b2 = rkey0 ^ in2; + b3 = rkey0 ^ in3; + b4 = rkey0 ^ in4; + b5 = rkey0 ^ in5; + b6 = rkey0 ^ in6; + b7 = rkey0 ^ in7; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD(&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); \ + b4 = vec_cipher_be (b4, rkey); \ + b5 = vec_cipher_be (b5, rkey); \ + b6 = vec_cipher_be (b6, rkey); \ + b7 = vec_cipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey ^ in1); + b1 = vec_cipherlast_be (b1, rkey ^ in2); + b2 = vec_cipherlast_be (b2, rkey ^ in3); + b3 = vec_cipherlast_be (b3, rkey ^ in4); + b4 = vec_cipherlast_be (b4, rkey ^ in5); + b5 = vec_cipherlast_be (b5, rkey ^ in6); + b6 = vec_cipherlast_be (b6, rkey ^ in7); + b7 = vec_cipherlast_be (b7, rkey ^ iv); 
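/* Illustrative note, not part of the patch: in CFB decryption each plaintext
   block is P_i = E_K(C_{i-1}) ^ C_i.  Because vcipherlast finishes by XORing
   its round-key operand into the state, the trailing "^ C_i" comes for free:

     E_K(C_{i-1}) ^ C_i  ==  vec_cipherlast_be (last_state, rkeylast ^ C_i)

   which is why the calls above pass "rkey ^ in1" .. "rkey ^ iv" instead of
   doing a separate XOR pass over the outputs.  The single-block tail loop
   rebuilds rkeylast from rkeylast_orig for the same reason.  */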
+ + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + VEC_STORE_BE (out + 4, b4, bige_const); + VEC_STORE_BE (out + 5, b5, bige_const); + VEC_STORE_BE (out + 6, b6, bige_const); + VEC_STORE_BE (out + 7, b7, bige_const); + + in += 8; + out += 8; + } + + if (nblocks >= 4) + { + in0 = iv; + in1 = VEC_LOAD_BE (in + 0, bige_const); + in2 = VEC_LOAD_BE (in + 1, bige_const); + in3 = VEC_LOAD_BE (in + 2, bige_const); + iv = VEC_LOAD_BE (in + 3, bige_const); + + b0 = rkey0 ^ in0; + b1 = rkey0 ^ in1; + b2 = rkey0 ^ in2; + b3 = rkey0 ^ in3; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD(&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey ^ in1); + b1 = vec_cipherlast_be (b1, rkey ^ in2); + b2 = vec_cipherlast_be (b2, rkey ^ in3); + b3 = vec_cipherlast_be (b3, rkey ^ iv); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + + in += 4; + out += 4; + nblocks -= 4; + } + + for (; nblocks; nblocks--) + { + bin = VEC_LOAD_BE (in, bige_const); + rkeylast = rkeylast_orig ^ bin; + b = iv; + iv = bin; + + AES_ENCRYPT (b, rounds); + + VEC_STORE_BE (out, b, bige_const); + + out++; + in++; + } + + VEC_STORE_BE (iv_arg, iv, bige_const); +} + + +void _gcry_aes_ppc8_cbc_enc (void *context, unsigned char *iv_arg, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks, int cbc_mac) +{ + const block bige_const = vec_load_be_const(); + RIJNDAEL_context *ctx = context; + const u128_t *rk = (u128_t *)&ctx->keyschenc; + const u128_t *in = (const u128_t *)inbuf_arg; + u128_t *out = (u128_t *)outbuf_arg; + int rounds = ctx->rounds; + ROUND_KEY_VARIABLES; + block lastiv, b; + + lastiv = VEC_LOAD_BE (iv_arg, bige_const); + + PRELOAD_ROUND_KEYS (rk, rounds); + + for (; nblocks; nblocks--) + { + b = lastiv ^ VEC_LOAD_BE (in, bige_const); + + AES_ENCRYPT (b, rounds); + + lastiv = b; + VEC_STORE_BE (out, b, bige_const); + + in++; + if (!cbc_mac) + out++; + } + + VEC_STORE_BE (iv_arg, lastiv, bige_const); +} + +void _gcry_aes_ppc8_cbc_dec (void *context, unsigned char *iv_arg, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + const block bige_const = vec_load_be_const(); + RIJNDAEL_context *ctx = context; + const u128_t *rk = (u128_t *)&ctx->keyschdec; + const u128_t *in = (const u128_t *)inbuf_arg; + u128_t *out = (u128_t *)outbuf_arg; + int rounds = ctx->rounds; + ROUND_KEY_VARIABLES; + block rkeylast_orig; + block in0, in1, in2, in3, in4, in5, in6, in7; + block b0, b1, b2, b3, b4, b5, b6, b7; + block rkey; + block iv, b; + + if (!ctx->decryption_prepared) + { + aes_ppc8_prepare_decryption (ctx); + ctx->decryption_prepared = 1; + } + + iv = VEC_LOAD_BE (iv_arg, bige_const); + + PRELOAD_ROUND_KEYS (rk, rounds); + rkeylast_orig = rkeylast; + + for (; nblocks >= 8; nblocks -= 8) + { + in0 = VEC_LOAD_BE (in + 0, bige_const); + in1 = VEC_LOAD_BE (in + 1, bige_const); + in2 = VEC_LOAD_BE (in + 2, bige_const); + in3 = VEC_LOAD_BE (in + 3, bige_const); + in4 = 
VEC_LOAD_BE (in + 4, bige_const); + in5 = VEC_LOAD_BE (in + 5, bige_const); + in6 = VEC_LOAD_BE (in + 6, bige_const); + in7 = VEC_LOAD_BE (in + 7, bige_const); + + b0 = rkey0 ^ in0; + b1 = rkey0 ^ in1; + b2 = rkey0 ^ in2; + b3 = rkey0 ^ in3; + b4 = rkey0 ^ in4; + b5 = rkey0 ^ in5; + b6 = rkey0 ^ in6; + b7 = rkey0 ^ in7; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD(&rk[r]); \ + b0 = vec_ncipher_be (b0, rkey); \ + b1 = vec_ncipher_be (b1, rkey); \ + b2 = vec_ncipher_be (b2, rkey); \ + b3 = vec_ncipher_be (b3, rkey); \ + b4 = vec_ncipher_be (b4, rkey); \ + b5 = vec_ncipher_be (b5, rkey); \ + b6 = vec_ncipher_be (b6, rkey); \ + b7 = vec_ncipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_ncipherlast_be (b0, rkey ^ iv); + b1 = vec_ncipherlast_be (b1, rkey ^ in0); + b2 = vec_ncipherlast_be (b2, rkey ^ in1); + b3 = vec_ncipherlast_be (b3, rkey ^ in2); + b4 = vec_ncipherlast_be (b4, rkey ^ in3); + b5 = vec_ncipherlast_be (b5, rkey ^ in4); + b6 = vec_ncipherlast_be (b6, rkey ^ in5); + b7 = vec_ncipherlast_be (b7, rkey ^ in6); + iv = in7; + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + VEC_STORE_BE (out + 4, b4, bige_const); + VEC_STORE_BE (out + 5, b5, bige_const); + VEC_STORE_BE (out + 6, b6, bige_const); + VEC_STORE_BE (out + 7, b7, bige_const); + + in += 8; + out += 8; + } + + if (nblocks >= 4) + { + in0 = VEC_LOAD_BE (in + 0, bige_const); + in1 = VEC_LOAD_BE (in + 1, bige_const); + in2 = VEC_LOAD_BE (in + 2, bige_const); + in3 = VEC_LOAD_BE (in + 3, bige_const); + + b0 = rkey0 ^ in0; + b1 = rkey0 ^ in1; + b2 = rkey0 ^ in2; + b3 = rkey0 ^ in3; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD(&rk[r]); \ + b0 = vec_ncipher_be (b0, rkey); \ + b1 = vec_ncipher_be (b1, rkey); \ + b2 = vec_ncipher_be (b2, rkey); \ + b3 = vec_ncipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_ncipherlast_be (b0, rkey ^ iv); + b1 = vec_ncipherlast_be (b1, rkey ^ in0); + b2 = vec_ncipherlast_be (b2, rkey ^ in1); + b3 = vec_ncipherlast_be (b3, rkey ^ in2); + iv = in3; + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + + in += 4; + out += 4; + nblocks -= 4; + } + + for (; nblocks; nblocks--) + { + rkeylast = rkeylast_orig ^ iv; + + iv = VEC_LOAD_BE (in, bige_const); + b = iv; + AES_DECRYPT (b, rounds); + + VEC_STORE_BE (out, b, bige_const); + + in++; + out++; + } + + VEC_STORE_BE (iv_arg, iv, bige_const); +} + + +void _gcry_aes_ppc8_ctr_enc (void *context, unsigned char *ctr_arg, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + static const unsigned char vec_one_const[16] = + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 }; + const block bige_const = vec_load_be_const(); + RIJNDAEL_context *ctx = context; + const u128_t *rk = (u128_t *)&ctx->keyschenc; + const u128_t *in = (const u128_t *)inbuf_arg; + u128_t *out = 
(u128_t *)outbuf_arg; + int rounds = ctx->rounds; + ROUND_KEY_VARIABLES; + block rkeylast_orig; + block ctr, b, one; + + ctr = VEC_LOAD_BE (ctr_arg, bige_const); + one = VEC_LOAD_BE (&vec_one_const, bige_const); + + PRELOAD_ROUND_KEYS (rk, rounds); + rkeylast_orig = rkeylast; + + if (nblocks >= 4) + { + block b0, b1, b2, b3, b4, b5, b6, b7; + block two, three, four; + block ctr4; + block rkey; + + two = vec_add_uint128 (one, one); + three = vec_add_uint128 (two, one); + four = vec_add_uint128 (two, two); + + for (; nblocks >= 8; nblocks -= 8) + { + ctr4 = vec_add_uint128 (ctr, four); + b0 = rkey0 ^ ctr; + b1 = rkey0 ^ vec_add_uint128 (ctr, one); + b2 = rkey0 ^ vec_add_uint128 (ctr, two); + b3 = rkey0 ^ vec_add_uint128 (ctr, three); + b4 = rkey0 ^ ctr4; + b5 = rkey0 ^ vec_add_uint128 (ctr4, one); + b6 = rkey0 ^ vec_add_uint128 (ctr4, two); + b7 = rkey0 ^ vec_add_uint128 (ctr4, three); + ctr = vec_add_uint128 (ctr4, four); + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD(&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); \ + b4 = vec_cipher_be (b4, rkey); \ + b5 = vec_cipher_be (b5, rkey); \ + b6 = vec_cipher_be (b6, rkey); \ + b7 = vec_cipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey ^ VEC_LOAD_BE (in + 0, bige_const)); + b1 = vec_cipherlast_be (b1, rkey ^ VEC_LOAD_BE (in + 1, bige_const)); + b2 = vec_cipherlast_be (b2, rkey ^ VEC_LOAD_BE (in + 2, bige_const)); + b3 = vec_cipherlast_be (b3, rkey ^ VEC_LOAD_BE (in + 3, bige_const)); + b4 = vec_cipherlast_be (b4, rkey ^ VEC_LOAD_BE (in + 4, bige_const)); + b5 = vec_cipherlast_be (b5, rkey ^ VEC_LOAD_BE (in + 5, bige_const)); + b6 = vec_cipherlast_be (b6, rkey ^ VEC_LOAD_BE (in + 6, bige_const)); + b7 = vec_cipherlast_be (b7, rkey ^ VEC_LOAD_BE (in + 7, bige_const)); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + VEC_STORE_BE (out + 4, b4, bige_const); + VEC_STORE_BE (out + 5, b5, bige_const); + VEC_STORE_BE (out + 6, b6, bige_const); + VEC_STORE_BE (out + 7, b7, bige_const); + + in += 8; + out += 8; + } + + if (nblocks >= 4) + { + b0 = rkey0 ^ ctr; + b1 = rkey0 ^ vec_add_uint128 (ctr, one); + b2 = rkey0 ^ vec_add_uint128 (ctr, two); + b3 = rkey0 ^ vec_add_uint128 (ctr, three); + ctr = vec_add_uint128 (ctr, four); + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD(&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey ^ VEC_LOAD_BE (in + 0, bige_const)); + b1 = vec_cipherlast_be (b1, rkey ^ VEC_LOAD_BE (in + 1, bige_const)); + b2 = vec_cipherlast_be (b2, rkey ^ VEC_LOAD_BE (in + 2, bige_const)); + b3 = vec_cipherlast_be (b3, rkey ^ VEC_LOAD_BE (in + 3, bige_const)); + + VEC_STORE_BE (out + 0, b0, bige_const); + 
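/* Illustrative note, not part of the patch: vec_add_uint128() (vadduqm)
   adds the whole 16-byte counter as one 128-bit integer, so carries ripple
   across all bytes of the big-endian counter block.  A rough scalar sketch
   of the same update (the function name is illustrative only):

     static void ctr128_add (unsigned char ctr[16], unsigned int n)
     {
       unsigned int carry = n;
       int i;
       for (i = 15; i >= 0 && carry; i--)
         {
           carry += ctr[i];
           ctr[i] = carry & 0xff;
           carry >>= 8;
         }
     }
 */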
VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + in += 4; + out += 4; + nblocks -= 4; + } + } + + for (; nblocks; nblocks--) + { + b = ctr; + ctr = vec_add_uint128 (ctr, one); + rkeylast = rkeylast_orig ^ VEC_LOAD_BE (in, bige_const); - PRELOAD_ROUND_KEYS (rk, rounds); + AES_ENCRYPT (b, rounds); - AES_DECRYPT (b, rounds); - VEC_STORE_BE (out, b, bige_const); + VEC_STORE_BE (out, b, bige_const); - return 0; /* does not use stack */ + out++; + in++; + } + + VEC_STORE_BE (ctr_arg, ctr, bige_const); } @@ -1091,4 +1705,400 @@ size_t _gcry_aes_ppc8_ocb_auth (gcry_cipher_hd_t c, void *abuf_arg, return 0; } + +void _gcry_aes_ppc8_xts_crypt (void *context, unsigned char *tweak_arg, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks, int encrypt) +{ + static const block vec_bswap64_const = + { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }; + static const block vec_bswap128_const = + { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; + static const unsigned char vec_tweak_const[16] = + { 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0x87 }; + static const vector unsigned long long vec_shift63_const = + { 63, 63 }; + static const vector unsigned long long vec_shift1_const = + { 1, 1 }; + const block bige_const = vec_load_be_const(); + RIJNDAEL_context *ctx = context; + const u128_t *in = (const u128_t *)inbuf_arg; + u128_t *out = (u128_t *)outbuf_arg; + int rounds = ctx->rounds; + block tweak_tmp, tweak_next, tweak; + block b0, b1, b2, b3, b4, b5, b6, b7, b, rkey; + block tweak0, tweak1, tweak2, tweak3, tweak4, tweak5, tweak6, tweak7; + block tweak_const, bswap64_const, bswap128_const; + vector unsigned long long shift63_const, shift1_const; + ROUND_KEY_VARIABLES; + + tweak_const = VEC_LOAD_BE (&vec_tweak_const, bige_const); + bswap64_const = ALIGNED_LOAD (&vec_bswap64_const); + bswap128_const = ALIGNED_LOAD (&vec_bswap128_const); + shift63_const = (vector unsigned long long)ALIGNED_LOAD (&vec_shift63_const); + shift1_const = (vector unsigned long long)ALIGNED_LOAD (&vec_shift1_const); + + tweak_next = VEC_LOAD_BE (tweak_arg, bige_const); + +#define GEN_TWEAK(tweak, tmp) /* Generate next tweak. 
*/ \ + tmp = vec_vperm(tweak, tweak, bswap64_const); \ + tweak = vec_vperm(tweak, tweak, bswap128_const); \ + tmp = (block)(vec_sra((vector unsigned long long)tmp, shift63_const)) & \ + tweak_const; \ + tweak = (block)vec_sl((vector unsigned long long)tweak, shift1_const); \ + tweak = tweak ^ tmp; \ + tweak = vec_vperm(tweak, tweak, bswap128_const); + + if (encrypt) + { + const u128_t *rk = (u128_t *)&ctx->keyschenc; + + PRELOAD_ROUND_KEYS (rk, rounds); + + for (; nblocks >= 8; nblocks -= 8) + { + tweak0 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak1 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak2 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak3 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak4 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak5 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak6 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak7 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + + b0 = VEC_LOAD_BE (in + 0, bige_const) ^ tweak0 ^ rkey0; + b1 = VEC_LOAD_BE (in + 1, bige_const) ^ tweak1 ^ rkey0; + b2 = VEC_LOAD_BE (in + 2, bige_const) ^ tweak2 ^ rkey0; + b3 = VEC_LOAD_BE (in + 3, bige_const) ^ tweak3 ^ rkey0; + b4 = VEC_LOAD_BE (in + 4, bige_const) ^ tweak4 ^ rkey0; + b5 = VEC_LOAD_BE (in + 5, bige_const) ^ tweak5 ^ rkey0; + b6 = VEC_LOAD_BE (in + 6, bige_const) ^ tweak6 ^ rkey0; + b7 = VEC_LOAD_BE (in + 7, bige_const) ^ tweak7 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); \ + b4 = vec_cipher_be (b4, rkey); \ + b5 = vec_cipher_be (b5, rkey); \ + b6 = vec_cipher_be (b6, rkey); \ + b7 = vec_cipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey ^ tweak0); + b1 = vec_cipherlast_be (b1, rkey ^ tweak1); + b2 = vec_cipherlast_be (b2, rkey ^ tweak2); + b3 = vec_cipherlast_be (b3, rkey ^ tweak3); + b4 = vec_cipherlast_be (b4, rkey ^ tweak4); + b5 = vec_cipherlast_be (b5, rkey ^ tweak5); + b6 = vec_cipherlast_be (b6, rkey ^ tweak6); + b7 = vec_cipherlast_be (b7, rkey ^ tweak7); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + VEC_STORE_BE (out + 4, b4, bige_const); + VEC_STORE_BE (out + 5, b5, bige_const); + VEC_STORE_BE (out + 6, b6, bige_const); + VEC_STORE_BE (out + 7, b7, bige_const); + + in += 8; + out += 8; + } + + if (nblocks >= 4) + { + tweak0 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak1 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak2 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak3 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + + b0 = VEC_LOAD_BE (in + 0, bige_const) ^ tweak0 ^ rkey0; + b1 = VEC_LOAD_BE (in + 1, bige_const) ^ tweak1 ^ rkey0; + b2 = VEC_LOAD_BE (in + 2, bige_const) ^ tweak2 ^ rkey0; + b3 = VEC_LOAD_BE (in + 3, bige_const) ^ tweak3 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_cipher_be (b0, rkey); \ + b1 = vec_cipher_be (b1, rkey); \ + b2 = vec_cipher_be (b2, rkey); \ + b3 = vec_cipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + 
DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_cipherlast_be (b0, rkey ^ tweak0); + b1 = vec_cipherlast_be (b1, rkey ^ tweak1); + b2 = vec_cipherlast_be (b2, rkey ^ tweak2); + b3 = vec_cipherlast_be (b3, rkey ^ tweak3); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + + in += 4; + out += 4; + nblocks -= 4; + } + + for (; nblocks; nblocks--) + { + tweak = tweak_next; + + /* Xor-Encrypt/Decrypt-Xor block. */ + b = VEC_LOAD_BE (in, bige_const) ^ tweak; + + /* Generate next tweak. */ + GEN_TWEAK (tweak_next, tweak_tmp); + + AES_ENCRYPT (b, rounds); + + b ^= tweak; + VEC_STORE_BE (out, b, bige_const); + + in++; + out++; + } + } + else + { + const u128_t *rk = (u128_t *)&ctx->keyschdec; + + if (!ctx->decryption_prepared) + { + aes_ppc8_prepare_decryption (ctx); + ctx->decryption_prepared = 1; + } + + PRELOAD_ROUND_KEYS (rk, rounds); + + for (; nblocks >= 8; nblocks -= 8) + { + tweak0 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak1 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak2 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak3 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak4 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak5 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak6 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak7 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + + b0 = VEC_LOAD_BE (in + 0, bige_const) ^ tweak0 ^ rkey0; + b1 = VEC_LOAD_BE (in + 1, bige_const) ^ tweak1 ^ rkey0; + b2 = VEC_LOAD_BE (in + 2, bige_const) ^ tweak2 ^ rkey0; + b3 = VEC_LOAD_BE (in + 3, bige_const) ^ tweak3 ^ rkey0; + b4 = VEC_LOAD_BE (in + 4, bige_const) ^ tweak4 ^ rkey0; + b5 = VEC_LOAD_BE (in + 5, bige_const) ^ tweak5 ^ rkey0; + b6 = VEC_LOAD_BE (in + 6, bige_const) ^ tweak6 ^ rkey0; + b7 = VEC_LOAD_BE (in + 7, bige_const) ^ tweak7 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_ncipher_be (b0, rkey); \ + b1 = vec_ncipher_be (b1, rkey); \ + b2 = vec_ncipher_be (b2, rkey); \ + b3 = vec_ncipher_be (b3, rkey); \ + b4 = vec_ncipher_be (b4, rkey); \ + b5 = vec_ncipher_be (b5, rkey); \ + b6 = vec_ncipher_be (b6, rkey); \ + b7 = vec_ncipher_be (b7, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_ncipherlast_be (b0, rkey ^ tweak0); + b1 = vec_ncipherlast_be (b1, rkey ^ tweak1); + b2 = vec_ncipherlast_be (b2, rkey ^ tweak2); + b3 = vec_ncipherlast_be (b3, rkey ^ tweak3); + b4 = vec_ncipherlast_be (b4, rkey ^ tweak4); + b5 = vec_ncipherlast_be (b5, rkey ^ tweak5); + b6 = vec_ncipherlast_be (b6, rkey ^ tweak6); + b7 = vec_ncipherlast_be (b7, rkey ^ tweak7); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + VEC_STORE_BE (out + 4, b4, bige_const); + VEC_STORE_BE (out + 5, b5, bige_const); + VEC_STORE_BE (out + 6, b6, bige_const); + VEC_STORE_BE (out + 7, b7, bige_const); + + in += 8; + out += 
8; + } + + if (nblocks >= 4) + { + tweak0 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak1 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak2 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + tweak3 = tweak_next; + GEN_TWEAK (tweak_next, tweak_tmp); + + b0 = VEC_LOAD_BE (in + 0, bige_const) ^ tweak0 ^ rkey0; + b1 = VEC_LOAD_BE (in + 1, bige_const) ^ tweak1 ^ rkey0; + b2 = VEC_LOAD_BE (in + 2, bige_const) ^ tweak2 ^ rkey0; + b3 = VEC_LOAD_BE (in + 3, bige_const) ^ tweak3 ^ rkey0; + +#define DO_ROUND(r) \ + rkey = ALIGNED_LOAD (&rk[r]); \ + b0 = vec_ncipher_be (b0, rkey); \ + b1 = vec_ncipher_be (b1, rkey); \ + b2 = vec_ncipher_be (b2, rkey); \ + b3 = vec_ncipher_be (b3, rkey); + + DO_ROUND(1); + DO_ROUND(2); + DO_ROUND(3); + DO_ROUND(4); + DO_ROUND(5); + DO_ROUND(6); + DO_ROUND(7); + DO_ROUND(8); + DO_ROUND(9); + if (rounds >= 12) + { + DO_ROUND(10); + DO_ROUND(11); + if (rounds > 12) + { + DO_ROUND(12); + DO_ROUND(13); + } + } + +#undef DO_ROUND + + rkey = rkeylast; + b0 = vec_ncipherlast_be (b0, rkey ^ tweak0); + b1 = vec_ncipherlast_be (b1, rkey ^ tweak1); + b2 = vec_ncipherlast_be (b2, rkey ^ tweak2); + b3 = vec_ncipherlast_be (b3, rkey ^ tweak3); + + VEC_STORE_BE (out + 0, b0, bige_const); + VEC_STORE_BE (out + 1, b1, bige_const); + VEC_STORE_BE (out + 2, b2, bige_const); + VEC_STORE_BE (out + 3, b3, bige_const); + + in += 4; + out += 4; + nblocks -= 4; + } + + for (; nblocks; nblocks--) + { + tweak = tweak_next; + + /* Xor-Encrypt/Decrypt-Xor block. */ + b = VEC_LOAD_BE (in, bige_const) ^ tweak; + + /* Generate next tweak. */ + GEN_TWEAK (tweak_next, tweak_tmp); + + AES_DECRYPT (b, rounds); + + b ^= tweak; + VEC_STORE_BE (out, b, bige_const); + + in++; + out++; + } + } + + VEC_STORE_BE (tweak_arg, tweak_next, bige_const); + +#undef GEN_TWEAK +} + #endif /* USE_PPC_CRYPTO */ diff --git a/cipher/rijndael.c b/cipher/rijndael.c index f15ac18b1..ebd1a11a5 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -210,11 +210,33 @@ extern unsigned int _gcry_aes_ppc8_encrypt(const RIJNDAEL_context *ctx, extern unsigned int _gcry_aes_ppc8_decrypt(const RIJNDAEL_context *ctx, unsigned char *dst, const unsigned char *src); + +extern void _gcry_aes_ppc8_cfb_enc (void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks); +extern void _gcry_aes_ppc8_cbc_enc (void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks, int cbc_mac); +extern void _gcry_aes_ppc8_ctr_enc (void *context, unsigned char *ctr, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks); +extern void _gcry_aes_ppc8_cfb_dec (void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks); +extern void _gcry_aes_ppc8_cbc_dec (void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks); + extern size_t _gcry_aes_ppc8_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, size_t nblocks, int encrypt); extern size_t _gcry_aes_ppc8_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks); + +extern void _gcry_aes_ppc8_xts_crypt (void *context, unsigned char *tweak, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, int encrypt); #endif /*USE_PPC_CRYPTO*/ static unsigned int do_encrypt (const RIJNDAEL_context *ctx, unsigned char *bx, @@ -452,8 +474,14 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, ctx->use_ppc_crypto = 1; if (hd) { + hd->bulk.cfb_enc = _gcry_aes_ppc8_cfb_enc; + hd->bulk.cfb_dec = 
_gcry_aes_ppc8_cfb_dec; + hd->bulk.cbc_enc = _gcry_aes_ppc8_cbc_enc; + hd->bulk.cbc_dec = _gcry_aes_ppc8_cbc_dec; + hd->bulk.ctr_enc = _gcry_aes_ppc8_ctr_enc; hd->bulk.ocb_crypt = _gcry_aes_ppc8_ocb_crypt; hd->bulk.ocb_auth = _gcry_aes_ppc8_ocb_auth; + hd->bulk.xts_crypt = _gcry_aes_ppc8_xts_crypt; } } #endif @@ -896,6 +924,13 @@ _gcry_aes_cfb_enc (void *context, unsigned char *iv, return; } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + _gcry_aes_ppc8_cfb_enc (ctx, iv, outbuf, inbuf, nblocks); + return; + } +#endif /*USE_PPC_CRYPTO*/ else { rijndael_cryptfn_t encrypt_fn = ctx->encrypt_fn; @@ -957,6 +992,13 @@ _gcry_aes_cbc_enc (void *context, unsigned char *iv, return; } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + _gcry_aes_ppc8_cbc_enc (ctx, iv, outbuf, inbuf, nblocks, cbc_mac); + return; + } +#endif /*USE_PPC_CRYPTO*/ else { rijndael_cryptfn_t encrypt_fn = ctx->encrypt_fn; @@ -1025,6 +1067,13 @@ _gcry_aes_ctr_enc (void *context, unsigned char *ctr, return; } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + _gcry_aes_ppc8_ctr_enc (ctx, ctr, outbuf, inbuf, nblocks); + return; + } +#endif /*USE_PPC_CRYPTO*/ else { union { unsigned char x1[16] ATTR_ALIGNED_16; u32 x32[4]; } tmp; @@ -1268,6 +1317,13 @@ _gcry_aes_cfb_dec (void *context, unsigned char *iv, return; } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + _gcry_aes_ppc8_cfb_dec (ctx, iv, outbuf, inbuf, nblocks); + return; + } +#endif /*USE_PPC_CRYPTO*/ else { rijndael_cryptfn_t encrypt_fn = ctx->encrypt_fn; @@ -1326,6 +1382,13 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, return; } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + _gcry_aes_ppc8_cbc_dec (ctx, iv, outbuf, inbuf, nblocks); + return; + } +#endif /*USE_PPC_CRYPTO*/ else { unsigned char savebuf[BLOCKSIZE] ATTR_ALIGNED_16; @@ -1556,6 +1619,13 @@ _gcry_aes_xts_crypt (void *context, unsigned char *tweak, return; } #endif /*USE_ARM_CE*/ +#ifdef USE_PPC_CRYPTO + else if (ctx->use_ppc_crypto) + { + _gcry_aes_ppc8_xts_crypt (ctx, tweak, outbuf, inbuf, nblocks, encrypt); + return; + } +#endif /*USE_PPC_CRYPTO*/ else { if (encrypt) diff --git a/configure.ac b/configure.ac index 586145aa4..d7725b553 100644 --- a/configure.ac +++ b/configure.ac @@ -1905,6 +1905,7 @@ AC_CACHE_CHECK([whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto "lvx %v20,%r12,%r0;\n" "vcipher %v0, %v1, %v22;\n" "lxvw4x %vs32, %r0, %r1;\n" + "vadduwm %v0, %v1, %v22;\n" ); ]])], [gcry_cv_gcc_inline_asm_ppc_altivec=yes]) From jussi.kivilinna at iki.fi Fri Aug 23 19:14:41 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Fri, 23 Aug 2019 20:14:41 +0300 Subject: Developer Certificate of Origin In-Reply-To: <1569101562675872@iva1-44bdf084ee9e.qloud-c.yandex.net> References: <1569101562675872@iva1-44bdf084ee9e.qloud-c.yandex.net> Message-ID: <43edd318-bdc0-76a0-a461-817c7637fbf8@iki.fi> Hello, Where I can find you public key so I have verify the signature? 
-Jussi

> Developer's Certificate of Origin 1.1
>
> By making a contribution to this project (libgcrypt), I certify that:
>
> (a) The contribution was created in whole or in part by me and I
>     have the right to submit it under the open source license
>     indicated in the file; or
>
> (b) The contribution is based upon previous work that, to the best
>     of my knowledge, is covered under an appropriate open source
>     license and I have the right under that license to submit that
>     work with modifications, whether created in whole or in part
>     by me, under the same open source license (unless I am
>     permitted to submit under a different license), as indicated
>     in the file; or
>
> (c) The contribution was provided directly to me by some other
>     person who certified (a), (b) or (c) and I have not modified
>     it.
>
> (d) I understand and agree that this project and the contribution
>     are public and that a record of the contribution (including all
>     personal information I submit with it, including my sign-off) is
>     maintained indefinitely and may be redistributed consistent with
>     this project or the open source license(s) involved.
>
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> The patches I just sent are here https://github.com/shawnl/libgcrypt/tree/ppc
> cd09b29154daad1954f26cb5a21f6221505346f6
> and I sign off on them.
> -----BEGIN PGP SIGNATURE-----
>
> iHUEARYIAB0WIQTxRbjHptuAaSiIXRy+5B/NgxVAaAUCXSSKkAAKCRC+5B/NgxVA
> aEOaAQDNlTwqHjspFaZEUyvvaWojzFl7/GDM2UT7BnqAr84rUAEA9ZNfjLZwMpIg
> fpPgTYzb6N/iK5jG5TrVzhblHb3LtgQ=
> =/pmU
> -----END PGP SIGNATURE-----
>
>
>
> --
> Shawn Landden
>
>
> _______________________________________________
> Gcrypt-devel mailing list
> Gcrypt-devel at gnupg.org
> http://lists.gnupg.org/mailman/listinfo/gcrypt-devel
>

From shawn at git.icu Mon Aug 26 13:19:01 2019
From: shawn at git.icu (Shawn Landden)
Date: Mon, 26 Aug 2019 06:19:01 -0500
Subject: Developer Certificate of Origin
In-Reply-To: <43edd318-bdc0-76a0-a461-817c7637fbf8@iki.fi>
References: <1569101562675872@iva1-44bdf084ee9e.qloud-c.yandex.net>
 <43edd318-bdc0-76a0-a461-817c7637fbf8@iki.fi>
Message-ID: <1875081566818341@vla1-c477e3898c96.qloud-c.yandex.net>

https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x30782e2be4eb9fd5b293d3da6d100bf096b8a005

It does not have signatures, but Marco Villegas (who also has no
signatures) will sign it when he figures out how.

23.08.2019, 12:14, "Jussi Kivilinna" :
> Hello,
>
> Where I can find you public key so I have verify the signature?
>
> -Jussi
>
>> Developer's Certificate of Origin 1.1
>>
>> By making a contribution to this project (libgcrypt), I certify that:
>>
>> (a) The contribution was created in whole or in part by me and I
>>     have the right to submit it under the open source license
>>     indicated in the file; or
>>
>> (b) The contribution is based upon previous work that, to the best
>>     of my knowledge, is covered under an appropriate open source
>>     license and I have the right under that license to submit that
>>     work with modifications, whether created in whole or in part
>>     by me, under the same open source license (unless I am
>>     permitted to submit under a different license), as indicated
>>     in the file; or
>>
>> (c) The contribution was provided directly to me by some other
>>     person who certified (a), (b) or (c) and I have not modified
>>     it.
>>
>> (d) I understand and agree that this project and the contribution
>>     are public and that a record of the contribution (including all
>>     personal information I submit with it, including my sign-off) is
>>     maintained indefinitely and may be redistributed consistent with
>>     this project or the open source license(s) involved.
>>
>> The patches I just sent are here https://github.com/shawnl/libgcrypt/tree/ppc
>> cd09b29154daad1954f26cb5a21f6221505346f6
>> and I sign off on them.
>>
>>> -----BEGIN PGP SIGNATURE-----
>>>
>>> iHUEARYIAB0WIQTxRbjHptuAaSiIXRy+5B/NgxVAaAUCXSSKkAAKCRC+5B/NgxVA
>>> aEOaAQDNlTwqHjspFaZEUyvvaWojzFl7/GDM2UT7BnqAr84rUAEA9ZNfjLZwMpIg
>>> fpPgTYzb6N/iK5jG5TrVzhblHb3LtgQ=
>>> =/pmU
>>> -----END PGP SIGNATURE-----
>>
>> --
>> Shawn Landden
>>
>> _______________________________________________
>> Gcrypt-devel mailing list
>> Gcrypt-devel at gnupg.org
>> http://lists.gnupg.org/mailman/listinfo/gcrypt-devel

--
Shawn Landden

From jussi.kivilinna at iki.fi Mon Aug 26 17:43:20 2019
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Mon, 26 Aug 2019 18:43:20 +0300
Subject: Developer Certificate of Origin
In-Reply-To: <1875081566818341@vla1-c477e3898c96.qloud-c.yandex.net>
References: <1569101562675872@iva1-44bdf084ee9e.qloud-c.yandex.net>
 <43edd318-bdc0-76a0-a461-817c7637fbf8@iki.fi>
 <1875081566818341@vla1-c477e3898c96.qloud-c.yandex.net>
Message-ID: <1054fd64-0f59-1d82-6672-6f1cf71a7a8c@iki.fi>

On 26.8.2019 14.19, Shawn Landden wrote:
> https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x30782e2be4eb9fd5b293d3da6d100bf096b8a005
>
> It does not have signatures, but Marco Villegas (who also has no
> signatures) will sign it when he figures out how.

The public key is enough, thanks. I meant verifying the signature of your DCO.

-Jussi

>
> 23.08.2019, 12:14, "Jussi Kivilinna" :
>> Hello,
>>
>> Where I can find you public key so I have verify the signature?
>>
>> -Jussi
>>
>>> Developer's Certificate of Origin 1.1
>>>
>>> By making a contribution to this project (libgcrypt), I certify that:
>>>
>>> (a) The contribution was created in whole or in part by me and I
>>>     have the right to submit it under the open source license
>>>     indicated in the file; or
>>>
>>> (b) The contribution is based upon previous work that, to the best
>>>     of my knowledge, is covered under an appropriate open source
>>>     license and I have the right under that license to submit that
>>>     work with modifications, whether created in whole or in part
>>>     by me, under the same open source license (unless I am
>>>     permitted to submit under a different license), as indicated
>>>     in the file; or
>>>
>>> (c) The contribution was provided directly to me by some other
>>>     person who certified (a), (b) or (c) and I have not modified
>>>     it.
>>> >>> ?The patches I just sent are here https://github.com/shawnl/libgcrypt/tree/ppc >>> ?cd09b29154daad1954f26cb5a21f6221505346f6 >>> ?and I sign off on them. >>> >>>> -----BEGIN PGP SIGNATURE----- >>>> >>>> ?iHUEARYIAB0WIQTxRbjHptuAaSiIXRy+5B/NgxVAaAUCXSSKkAAKCRC+5B/NgxVA >>>> ?aEOaAQDNlTwqHjspFaZEUyvvaWojzFl7/GDM2UT7BnqAr84rUAEA9ZNfjLZwMpIg >>>> ?fpPgTYzb6N/iK5jG5TrVzhblHb3LtgQ= >>>> ?=/pmU >>>> ?-----END PGP SIGNATURE----- >>> >>> ?-- >>> ?Shawn Landden >>> >>> ?_______________________________________________ >>> ?Gcrypt-devel mailing list >>> ?Gcrypt-devel at gnupg.org >>> ?http://lists.gnupg.org/mailman/listinfo/gcrypt-devel > From ametzler at bebt.de Fri Aug 30 19:13:07 2019 From: ametzler at bebt.de (Andreas Metzler) Date: Fri, 30 Aug 2019 19:13:07 +0200 Subject: [Patch] use Requires.private do avoid unnecessary linkage Message-ID: <20190830171307.GC14673@argenau.bebt.de> Hello, find attached a one-line patch to avoid unnecessary linkage when using pkg-config. For dynamic linking it is not necessary to link libgpg-error when linking against libgcrypt (unless functions from libgpg-error are used directly, obviously). With this patch this works. (sid)ametzler at argenau:$ pkg-config --libs libgcrypt -lgcrypt (sid)ametzler at argenau:$ pkg-config --libs --static libgcrypt -lgcrypt -lgpg-error TIA, cu Andreas -- `What a good friend you are to him, Dr. Maturin. His other friends are so grateful to you.' `I sew his ears on from time to time, sure' -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-pkgconfig-Stop-unnecessary-linkage.patch Type: text/x-diff Size: 867 bytes Desc: not available URL: From jussi.kivilinna at iki.fi Sat Aug 31 01:49:37 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 31 Aug 2019 02:49:37 +0300 Subject: [PATCH 1/3] hwf-ppc: add detection for PowerISA 3.00 Message-ID: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> * src/g10lib.h (HWF_PPC_ARCH_3_00): New. * src/hwf-ppc.c (feature_map_s): Remove unused 'feature_match'. (PPC_FEATURE2_ARCH_3_00): New. (ppc_features, get_hwcap): Add PowerISA 3.00. * src/hwfeatures.c (hwflist): Rename "ppc-crypto" to "ppc-vcrypto"; Add "ppc-arch_3_00". 
-- Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/src/g10lib.h b/src/g10lib.h index 41e18c137..bbdaf58be 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -237,6 +237,7 @@ char **_gcry_strtokenize (const char *string, const char *delim); #define HWF_ARM_PMULL (1 << 21) #define HWF_PPC_VCRYPTO (1 << 22) +#define HWF_PPC_ARCH_3_00 (1 << 23) gpg_err_code_t _gcry_disable_hw_feature (const char *name); void _gcry_detect_hw_features (void); diff --git a/src/hwf-ppc.c b/src/hwf-ppc.c index 1bf2edf70..2ed60c0f1 100644 --- a/src/hwf-ppc.c +++ b/src/hwf-ppc.c @@ -70,7 +70,6 @@ struct feature_map_s { unsigned int hwcap_flag; unsigned int hwcap2_flag; - const char *feature_match; unsigned int hwf_flag; }; @@ -87,12 +86,16 @@ struct feature_map_s #ifndef PPC_FEATURE2_VEC_CRYPTO # define PPC_FEATURE2_VEC_CRYPTO 0x02000000 #endif +#ifndef PPC_FEATURE2_ARCH_3_00 +# define PPC_FEATURE2_ARCH_3_00 0x00800000 +#endif static const struct feature_map_s ppc_features[] = { #ifdef ENABLE_PPC_CRYPTO_SUPPORT - { 0, PPC_FEATURE2_VEC_CRYPTO, " crypto", HWF_PPC_VCRYPTO }, + { 0, PPC_FEATURE2_VEC_CRYPTO, HWF_PPC_VCRYPTO }, #endif + { 0, PPC_FEATURE2_ARCH_3_00, HWF_PPC_ARCH_3_00 }, }; #endif @@ -114,22 +117,23 @@ get_hwcap(unsigned int *hwcap, unsigned int *hwcap2) } #if 0 // TODO: configure.ac detection for __builtin_cpu_supports -#if defined(__GLIBC__) && defined(__GNUC__) -#if __GNUC__ >= 6 - /* Returns 0 if glibc support doesn't exist, so we can - * only trust positive results. This function will need updating - * if we ever need more than one cpu feature. - */ - // TODO: fix, false if ENABLE_PPC_CRYPTO_SUPPORT - if (sizeof(ppc_features)/sizeof(ppc_features[0]) == 0) { - if (__builtin_cpu_supports("vcrypto")) { - stored_hwcap = 0; - stored_hwcap2 = PPC_FEATURE2_VEC_CRYPTO; - hwcap_initialized = 1; - return 0; + // TODO: move to 'detect_ppc_builtin_cpu_supports' +#if defined(__GLIBC__) && defined(__GNUC__) && __GNUC__ >= 6 + /* __builtin_cpu_supports returns 0 if glibc support doesn't exist, so + * we can only trust positive results. */ +#ifdef ENABLE_PPC_CRYPTO_SUPPORT + if (__builtin_cpu_supports("vcrypto")) /* TODO: Configure.ac */ + { + stored_hwcap2 |= PPC_FEATURE2_VEC_CRYPTO; + hwcap_initialized = 1; } - } #endif + + if (__builtin_cpu_supports("arch_3_00")) /* TODO: Configure.ac */ + { + stored_hwcap2 |= PPC_FEATURE2_ARCH_3_00; + hwcap_initialized = 1; + } #endif #endif @@ -188,6 +192,7 @@ get_hwcap(unsigned int *hwcap, unsigned int *hwcap2) err = 0; fclose(f); + *hwcap = stored_hwcap; *hwcap2 = stored_hwcap2; return err; diff --git a/src/hwfeatures.c b/src/hwfeatures.c index fe5137538..1021bd3b1 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -67,7 +67,8 @@ static struct { HWF_ARM_SHA2, "arm-sha2" }, { HWF_ARM_PMULL, "arm-pmull" }, #elif defined(HAVE_CPU_ARCH_PPC) - { HWF_PPC_VCRYPTO, "ppc-crypto" }, + { HWF_PPC_VCRYPTO, "ppc-vcrypto" }, + { HWF_PPC_ARCH_3_00, "ppc-arch_3_00" }, #endif }; From jussi.kivilinna at iki.fi Sat Aug 31 01:49:48 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 31 Aug 2019 02:49:48 +0300 Subject: [PATCH 3/3] Add SHA-512 implementations for POWER8 and POWER9 In-Reply-To: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> References: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> Message-ID: <156720898792.9538.1268253086120742170.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'sha512-ppc.c'; Add extra CFLAG handling for 'sha512-ppc.c'. * cipher/sha512-ppc.c: New. 
* cipher/sha512.c (USE_PPC_CRYPTO, _gcry_sha512_transform_ppc8) (_gcry_sha512_transform_ppc9, do_sha512_transform_ppc8) (do_sha512_transform_ppc9): New. (sha512_init_common): Add PowerPC HW feature detection and implemention selection. * configure.ac: Add 'vshasigmad' instruction to PowerPC assembly support check; Add 'sha512-ppc.lo'. -- Benchmark on POWER8 ~3.8Ghz: Before: | nanosecs/byte mebibytes/sec cycles/byte SHA512 | 3.47 ns/B 274.6 MiB/s 13.20 c/B After (~2.08x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA512 | 1.66 ns/B 573.1 MiB/s 6.32 c/B For comparison, OpenSSL 1.1.1b (~1.6% faster): | nanosecs/byte mebibytes/sec cycles/byte SHA512 | 1.64 ns/B 582.2 MiB/s 6.22 c/B Benchmark on POWER9 ~3.8Ghz: Before: | nanosecs/byte mebibytes/sec cycles/byte SHA512 | 2.65 ns/B 359.6 MiB/s 10.08 c/B After (~1.33x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA512 | 1.99 ns/B 479.2 MiB/s 7.56 c/B For comparison, OpenSSL 1.1.1b (~9.4% faster): | nanosecs/byte mebibytes/sec cycles/byte SHA512 | 1.82 ns/B 524.4 MiB/s 6.91 c/B GnuPG-bug-id: T4530 Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index dcb2e8f6f..4a5110d65 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -114,6 +114,7 @@ EXTRA_libcipher_la_SOURCES = \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S \ sha512-avx2-bmi2-amd64.S \ sha512-armv7-neon.S sha512-arm.S \ + sha512-ppc.c \ sm3.c \ keccak.c keccak_permute_32.h keccak_permute_64.h keccak-armv7-neon.S \ stribog.c \ @@ -216,3 +217,9 @@ sha256-ppc.o: $(srcdir)/sha256-ppc.c Makefile sha256-ppc.lo: $(srcdir)/sha256-ppc.c Makefile `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` + +sha512-ppc.o: $(srcdir)/sha512-ppc.c Makefile + `echo $(COMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` + +sha512-ppc.lo: $(srcdir)/sha512-ppc.c Makefile + `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` diff --git a/cipher/sha512-ppc.c b/cipher/sha512-ppc.c new file mode 100644 index 000000000..af9156ccd --- /dev/null +++ b/cipher/sha512-ppc.c @@ -0,0 +1,910 @@ +/* sha512-ppc.c - PowerPC vcrypto implementation of SHA-512 transform + * Copyright (C) 2019 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#if defined(ENABLE_PPC_CRYPTO_SUPPORT) && \ + defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) && \ + defined(USE_SHA512) && \ + __GNUC__ >= 4 + +#include +#include "bufhelp.h" + + +typedef vector unsigned char vector16x_u8; +typedef vector unsigned long long vector2x_u64; + + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) +#define NO_INSTRUMENT_FUNCTION __attribute__((no_instrument_function)) + +#define ASM_FUNC_ATTR NO_INSTRUMENT_FUNCTION +#define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define ASM_FUNC_ATTR_NOINLINE ASM_FUNC_ATTR NO_INLINE + + +static const u64 K[80] = + { + U64_C(0x428a2f98d728ae22), U64_C(0x7137449123ef65cd), + U64_C(0xb5c0fbcfec4d3b2f), U64_C(0xe9b5dba58189dbbc), + U64_C(0x3956c25bf348b538), U64_C(0x59f111f1b605d019), + U64_C(0x923f82a4af194f9b), U64_C(0xab1c5ed5da6d8118), + U64_C(0xd807aa98a3030242), U64_C(0x12835b0145706fbe), + U64_C(0x243185be4ee4b28c), U64_C(0x550c7dc3d5ffb4e2), + U64_C(0x72be5d74f27b896f), U64_C(0x80deb1fe3b1696b1), + U64_C(0x9bdc06a725c71235), U64_C(0xc19bf174cf692694), + U64_C(0xe49b69c19ef14ad2), U64_C(0xefbe4786384f25e3), + U64_C(0x0fc19dc68b8cd5b5), U64_C(0x240ca1cc77ac9c65), + U64_C(0x2de92c6f592b0275), U64_C(0x4a7484aa6ea6e483), + U64_C(0x5cb0a9dcbd41fbd4), U64_C(0x76f988da831153b5), + U64_C(0x983e5152ee66dfab), U64_C(0xa831c66d2db43210), + U64_C(0xb00327c898fb213f), U64_C(0xbf597fc7beef0ee4), + U64_C(0xc6e00bf33da88fc2), U64_C(0xd5a79147930aa725), + U64_C(0x06ca6351e003826f), U64_C(0x142929670a0e6e70), + U64_C(0x27b70a8546d22ffc), U64_C(0x2e1b21385c26c926), + U64_C(0x4d2c6dfc5ac42aed), U64_C(0x53380d139d95b3df), + U64_C(0x650a73548baf63de), U64_C(0x766a0abb3c77b2a8), + U64_C(0x81c2c92e47edaee6), U64_C(0x92722c851482353b), + U64_C(0xa2bfe8a14cf10364), U64_C(0xa81a664bbc423001), + U64_C(0xc24b8b70d0f89791), U64_C(0xc76c51a30654be30), + U64_C(0xd192e819d6ef5218), U64_C(0xd69906245565a910), + U64_C(0xf40e35855771202a), U64_C(0x106aa07032bbd1b8), + U64_C(0x19a4c116b8d2d0c8), U64_C(0x1e376c085141ab53), + U64_C(0x2748774cdf8eeb99), U64_C(0x34b0bcb5e19b48a8), + U64_C(0x391c0cb3c5c95a63), U64_C(0x4ed8aa4ae3418acb), + U64_C(0x5b9cca4f7763e373), U64_C(0x682e6ff3d6b2b8a3), + U64_C(0x748f82ee5defb2fc), U64_C(0x78a5636f43172f60), + U64_C(0x84c87814a1f0ab72), U64_C(0x8cc702081a6439ec), + U64_C(0x90befffa23631e28), U64_C(0xa4506cebde82bde9), + U64_C(0xbef9a3f7b2c67915), U64_C(0xc67178f2e372532b), + U64_C(0xca273eceea26619c), U64_C(0xd186b8c721c0c207), + U64_C(0xeada7dd6cde0eb1e), U64_C(0xf57d4f7fee6ed178), + U64_C(0x06f067aa72176fba), U64_C(0x0a637dc5a2c898a6), + U64_C(0x113f9804bef90dae), U64_C(0x1b710b35131c471b), + U64_C(0x28db77f523047d84), U64_C(0x32caab7b40c72493), + U64_C(0x3c9ebe0a15c9bebc), U64_C(0x431d67c49c100d4c), + U64_C(0x4cc5d4becb3e42b6), U64_C(0x597f299cfc657e2a), + U64_C(0x5fcb6fab3ad6faec), U64_C(0x6c44198c4a475817) + }; + + +static ASM_FUNC_ATTR_INLINE u64 +ror64 (u64 v, u64 shift) +{ + return (v >> (shift & 63)) ^ (v << ((64 - shift) & 63)); +} + + +static ASM_FUNC_ATTR_INLINE vector2x_u64 +vec_rol_elems(vector2x_u64 v, unsigned int idx) +{ +#ifndef WORDS_BIGENDIAN + return vec_sld (v, v, (16 - (8 * idx)) & 15); +#else + return vec_sld (v, v, (8 * idx) & 15); +#endif +} + + +static ASM_FUNC_ATTR_INLINE vector2x_u64 +vec_merge_idx0_elems(vector2x_u64 v0, vector2x_u64 v1) +{ + return vec_mergeh (v0, v1); +} + + +static ASM_FUNC_ATTR_INLINE vector2x_u64 +vec_vshasigma_u64(vector2x_u64 v, unsigned int a, 
unsigned int b) +{ + asm ("vshasigmad %0,%1,%2,%3" + : "=v" (v) + : "v" (v), "g" (a), "g" (b) + : "memory"); + return v; +} + + +/* SHA2 round in vector registers */ +#define R(a,b,c,d,e,f,g,h,k,w) do \ + { \ + t1 = (h); \ + t1 += ((k) + (w)); \ + t1 += Cho((e),(f),(g)); \ + t1 += Sum1((e)); \ + t2 = Sum0((a)); \ + t2 += Maj((a),(b),(c)); \ + d += t1; \ + h = t1 + t2; \ + } while (0) + +#define Cho(b, c, d) (vec_sel(d, c, b)) + +#define Maj(c, d, b) (vec_sel(c, b, c ^ d)) + +#define Sum0(x) (vec_vshasigma_u64(x, 1, 0)) + +#define Sum1(x) (vec_vshasigma_u64(x, 1, 15)) + + +/* Message expansion on general purpose registers */ +#define S0(x) (ror64 ((x), 1) ^ ror64 ((x), 8) ^ ((x) >> 7)) +#define S1(x) (ror64 ((x), 19) ^ ror64 ((x), 61) ^ ((x) >> 6)) + +#define I(i) ( w[i] = buf_get_be64(data + i * 8) ) +#define W(i) ({ w[i&0x0f] += w[(i-7) &0x0f]; \ + w[i&0x0f] += S0(w[(i-15)&0x0f]); \ + w[i&0x0f] += S1(w[(i-2) &0x0f]); \ + w[i&0x0f]; }) + +#define I2(i) ( w2[i] = buf_get_be64(128 + data + i * 8), I(i) ) +#define W2(i) ({ w2[i] = w2[i-7]; \ + w2[i] += S1(w2[i-2]); \ + w2[i] += S0(w2[i-15]); \ + w2[i] += w2[i-16]; \ + W(i); }) +#define R2(i) ( w2[i] ) + + +unsigned int ASM_FUNC_ATTR +_gcry_sha512_transform_ppc8(u64 state[8], + const unsigned char *data, size_t nblks) +{ + /* GPRs used for message expansion as vector intrinsics based generates + * slower code. */ + vector2x_u64 h0, h1, h2, h3, h4, h5, h6, h7; + vector2x_u64 a, b, c, d, e, f, g, h, t1, t2; + u64 w[16]; + + h0 = vec_vsx_ld (8 * 0, (unsigned long long *)state); + h1 = vec_rol_elems (h0, 1); + h2 = vec_vsx_ld (8 * 2, (unsigned long long *)state); + h3 = vec_rol_elems (h2, 1); + h4 = vec_vsx_ld (8 * 4, (unsigned long long *)state); + h5 = vec_rol_elems (h4, 1); + h6 = vec_vsx_ld (8 * 6, (unsigned long long *)state); + h7 = vec_rol_elems (h6, 1); + + while (nblks >= 2) + { + nblks -= 2; + + a = h0; + b = h1; + c = h2; + d = h3; + e = h4; + f = h5; + g = h6; + h = h7; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 128; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + 
R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + R(a, b, c, d, e, f, g, h, K[64], W(64)); + R(h, a, b, c, d, e, f, g, K[65], W(65)); + R(g, h, a, b, c, d, e, f, K[66], W(66)); + R(f, g, h, a, b, c, d, e, K[67], W(67)); + R(e, f, g, h, a, b, c, d, K[68], W(68)); + R(d, e, f, g, h, a, b, c, K[69], W(69)); + R(c, d, e, f, g, h, a, b, K[70], W(70)); + R(b, c, d, e, f, g, h, a, K[71], W(71)); + R(a, b, c, d, e, f, g, h, K[72], W(72)); + R(h, a, b, c, d, e, f, g, K[73], W(73)); + R(g, h, a, b, c, d, e, f, K[74], W(74)); + R(f, g, h, a, b, c, d, e, K[75], W(75)); + R(e, f, g, h, a, b, c, d, K[76], W(76)); + R(d, e, f, g, h, a, b, c, K[77], W(77)); + R(c, d, e, f, g, h, a, b, K[78], W(78)); + R(b, c, d, e, f, g, h, a, K[79], W(79)); + + h0 += a; + h1 += b; + h2 += c; + h3 += d; + h4 += e; + h5 += f; + h6 += g; + h7 += h; + a = h0; + b = h1; + c = h2; + d = h3; + e = h4; + f = h5; + g = h6; + h = h7; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 128; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, 
K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + R(a, b, c, d, e, f, g, h, K[64], W(64)); + R(h, a, b, c, d, e, f, g, K[65], W(65)); + R(g, h, a, b, c, d, e, f, K[66], W(66)); + R(f, g, h, a, b, c, d, e, K[67], W(67)); + R(e, f, g, h, a, b, c, d, K[68], W(68)); + R(d, e, f, g, h, a, b, c, K[69], W(69)); + R(c, d, e, f, g, h, a, b, K[70], W(70)); + R(b, c, d, e, f, g, h, a, K[71], W(71)); + R(a, b, c, d, e, f, g, h, K[72], W(72)); + R(h, a, b, c, d, e, f, g, K[73], W(73)); + R(g, h, a, b, c, d, e, f, K[74], W(74)); + R(f, g, h, a, b, c, d, e, K[75], W(75)); + R(e, f, g, h, a, b, c, d, K[76], W(76)); + R(d, e, f, g, h, a, b, c, K[77], W(77)); + R(c, d, e, f, g, h, a, b, K[78], W(78)); + R(b, c, d, e, f, g, h, a, K[79], W(79)); + + h0 += a; + h1 += b; + h2 += c; + h3 += d; + h4 += e; + h5 += f; + h6 += g; + h7 += h; + } + + while (nblks) + { + nblks--; + + a = h0; + b = h1; + c = h2; + d = h3; + e = h4; + f = h5; + g = h6; + h = h7; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 128; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, 
c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + R(a, b, c, d, e, f, g, h, K[64], W(64)); + R(h, a, b, c, d, e, f, g, K[65], W(65)); + R(g, h, a, b, c, d, e, f, K[66], W(66)); + R(f, g, h, a, b, c, d, e, K[67], W(67)); + R(e, f, g, h, a, b, c, d, K[68], W(68)); + R(d, e, f, g, h, a, b, c, K[69], W(69)); + R(c, d, e, f, g, h, a, b, K[70], W(70)); + R(b, c, d, e, f, g, h, a, K[71], W(71)); + R(a, b, c, d, e, f, g, h, K[72], W(72)); + R(h, a, b, c, d, e, f, g, K[73], W(73)); + R(g, h, a, b, c, d, e, f, K[74], W(74)); + R(f, g, h, a, b, c, d, e, K[75], W(75)); + R(e, f, g, h, a, b, c, d, K[76], W(76)); + R(d, e, f, g, h, a, b, c, K[77], W(77)); + R(c, d, e, f, g, h, a, b, K[78], W(78)); + R(b, c, d, e, f, g, h, a, K[79], W(79)); + + h0 += a; + h1 += b; + h2 += c; + h3 += d; + h4 += e; + h5 += f; + h6 += g; + h7 += h; + } + + h0 = vec_merge_idx0_elems (h0, h1); + h2 = vec_merge_idx0_elems (h2, h3); + h4 = vec_merge_idx0_elems (h4, h5); + h6 = vec_merge_idx0_elems (h6, h7); + vec_vsx_st (h0, 8 * 0, (unsigned long long *)state); + vec_vsx_st (h2, 8 * 2, (unsigned long long *)state); + vec_vsx_st (h4, 8 * 4, (unsigned long long *)state); + vec_vsx_st (h6, 8 * 6, (unsigned long long *)state); + + return sizeof(w); +} +#undef R +#undef Cho +#undef Maj +#undef Sum0 +#undef Sum1 +#undef S0 +#undef S1 +#undef I +#undef W +#undef I2 +#undef W2 +#undef R2 + + +/* SHA2 round in general purpose registers */ +#define R(a,b,c,d,e,f,g,h,k,w) do \ + { \ + t1 = (h) + Sum1((e)) + Cho((e),(f),(g)) + ((k) + (w));\ + t2 = Sum0((a)) + Maj((a),(b),(c)); \ + d += t1; \ + h = t1 + t2; \ + } while (0) + +#define Cho(x, y, z) ((x & y) + (~x & z)) + +#define Maj(z, x, y) ((x & y) + (z & (x ^ y))) + +#define Sum0(x) (ror64(x, 28) ^ ror64(x ^ ror64(x, 39-34), 34)) + +#define Sum1(x) (ror64(x, 14) ^ ror64(x, 18) ^ ror64(x, 41)) + + +/* Message expansion on general purpose registers */ +#define S0(x) (ror64 ((x), 1) ^ ror64 ((x), 8) ^ ((x) >> 7)) +#define S1(x) (ror64 ((x), 19) ^ ror64 ((x), 61) ^ ((x) >> 6)) + +#define I(i) ( w[i] = buf_get_be64(data + i * 8) ) +#define W(i) ({ w[i&0x0f] += w[(i-7) &0x0f]; \ + w[i&0x0f] += S0(w[(i-15)&0x0f]); \ + w[i&0x0f] += S1(w[(i-2) &0x0f]); \ + w[i&0x0f]; }) + + +unsigned int ASM_FUNC_ATTR +_gcry_sha512_transform_ppc9(u64 state[8], 
const unsigned char *data, + size_t nblks) +{ + /* GPRs used for round function and message expansion as vector intrinsics + * based generates slower code for POWER9. */ + u64 a, b, c, d, e, f, g, h, t1, t2; + u64 w[16]; + + a = state[0]; + b = state[1]; + c = state[2]; + d = state[3]; + e = state[4]; + f = state[5]; + g = state[6]; + h = state[7]; + + while (nblks >= 2) + { + nblks -= 2; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 128; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + R(a, b, c, d, e, f, g, h, K[64], W(64)); + R(h, a, b, c, d, e, f, g, K[65], W(65)); + R(g, h, a, b, c, d, e, f, K[66], W(66)); + R(f, g, h, a, b, c, d, e, K[67], W(67)); + R(e, f, g, h, a, b, c, d, K[68], W(68)); + R(d, e, f, g, h, a, b, c, K[69], W(69)); + R(c, d, e, f, g, h, a, b, K[70], W(70)); + R(b, c, d, e, f, g, h, a, K[71], W(71)); + R(a, b, c, d, e, f, g, h, K[72], W(72)); + R(h, a, b, c, d, 
e, f, g, K[73], W(73)); + R(g, h, a, b, c, d, e, f, K[74], W(74)); + R(f, g, h, a, b, c, d, e, K[75], W(75)); + R(e, f, g, h, a, b, c, d, K[76], W(76)); + R(d, e, f, g, h, a, b, c, K[77], W(77)); + R(c, d, e, f, g, h, a, b, K[78], W(78)); + R(b, c, d, e, f, g, h, a, K[79], W(79)); + + a += state[0]; + b += state[1]; + c += state[2]; + d += state[3]; + e += state[4]; + f += state[5]; + g += state[6]; + h += state[7]; + state[0] = a; + state[1] = b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 128; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + R(a, b, c, d, e, f, g, h, K[64], W(64)); + R(h, a, b, c, d, e, f, g, K[65], W(65)); + R(g, h, a, b, c, d, e, f, K[66], W(66)); + R(f, g, h, a, b, c, d, e, K[67], W(67)); + R(e, f, g, h, a, b, c, d, K[68], W(68)); + R(d, e, f, g, h, a, b, c, K[69], 
W(69)); + R(c, d, e, f, g, h, a, b, K[70], W(70)); + R(b, c, d, e, f, g, h, a, K[71], W(71)); + R(a, b, c, d, e, f, g, h, K[72], W(72)); + R(h, a, b, c, d, e, f, g, K[73], W(73)); + R(g, h, a, b, c, d, e, f, K[74], W(74)); + R(f, g, h, a, b, c, d, e, K[75], W(75)); + R(e, f, g, h, a, b, c, d, K[76], W(76)); + R(d, e, f, g, h, a, b, c, K[77], W(77)); + R(c, d, e, f, g, h, a, b, K[78], W(78)); + R(b, c, d, e, f, g, h, a, K[79], W(79)); + + a += state[0]; + b += state[1]; + c += state[2]; + d += state[3]; + e += state[4]; + f += state[5]; + g += state[6]; + h += state[7]; + state[0] = a; + state[1] = b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + } + + while (nblks) + { + nblks--; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 128; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + R(a, b, c, d, e, f, g, h, K[64], W(64)); + R(h, a, b, 
c, d, e, f, g, K[65], W(65)); + R(g, h, a, b, c, d, e, f, K[66], W(66)); + R(f, g, h, a, b, c, d, e, K[67], W(67)); + R(e, f, g, h, a, b, c, d, K[68], W(68)); + R(d, e, f, g, h, a, b, c, K[69], W(69)); + R(c, d, e, f, g, h, a, b, K[70], W(70)); + R(b, c, d, e, f, g, h, a, K[71], W(71)); + R(a, b, c, d, e, f, g, h, K[72], W(72)); + R(h, a, b, c, d, e, f, g, K[73], W(73)); + R(g, h, a, b, c, d, e, f, K[74], W(74)); + R(f, g, h, a, b, c, d, e, K[75], W(75)); + R(e, f, g, h, a, b, c, d, K[76], W(76)); + R(d, e, f, g, h, a, b, c, K[77], W(77)); + R(c, d, e, f, g, h, a, b, K[78], W(78)); + R(b, c, d, e, f, g, h, a, K[79], W(79)); + + a += state[0]; + b += state[1]; + c += state[2]; + d += state[3]; + e += state[4]; + f += state[5]; + g += state[6]; + h += state[7]; + state[0] = a; + state[1] = b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + } + + state[0] = a; + state[1] = b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + + return sizeof(w); +} + +#endif /* ENABLE_PPC_CRYPTO_SUPPORT */ diff --git a/cipher/sha512.c b/cipher/sha512.c index 1a808f884..b05157aa9 100644 --- a/cipher/sha512.c +++ b/cipher/sha512.c @@ -104,6 +104,19 @@ #endif +/* USE_PPC_CRYPTO indicates whether to enable PowerPC vector crypto + * accelerated code. */ +#undef USE_PPC_CRYPTO +#ifdef ENABLE_PPC_CRYPTO_SUPPORT +# if defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) +# if __GNUC__ >= 4 +# define USE_PPC_CRYPTO 1 +# endif +# endif +#endif + + typedef struct { u64 h0, h1, h2, h3, h4, h5, h6, h7; @@ -253,6 +266,31 @@ do_transform_generic (void *context, const unsigned char *data, size_t nblks); #endif +#ifdef USE_PPC_CRYPTO +unsigned int _gcry_sha512_transform_ppc8(u64 state[8], + const unsigned char *input_data, + size_t num_blks); + +unsigned int _gcry_sha512_transform_ppc9(u64 state[8], + const unsigned char *input_data, + size_t num_blks); + +static unsigned int +do_sha512_transform_ppc8(void *ctx, const unsigned char *data, size_t nblks) +{ + SHA512_CONTEXT *hd = ctx; + return _gcry_sha512_transform_ppc8 (&hd->state.h0, data, nblks); +} + +static unsigned int +do_sha512_transform_ppc9(void *ctx, const unsigned char *data, size_t nblks) +{ + SHA512_CONTEXT *hd = ctx; + return _gcry_sha512_transform_ppc9 (&hd->state.h0, data, nblks); +} +#endif + + static void sha512_init_common (SHA512_CONTEXT *ctx, unsigned int flags) { @@ -285,6 +323,12 @@ sha512_init_common (SHA512_CONTEXT *ctx, unsigned int flags) #ifdef USE_AVX2 if ((features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2)) ctx->bctx.bwrite = do_sha512_transform_amd64_avx2; +#endif +#ifdef USE_PPC_CRYPTO + if ((features & HWF_PPC_VCRYPTO) != 0) + ctx->bctx.bwrite = do_sha512_transform_ppc8; + if ((features & HWF_PPC_VCRYPTO) != 0 && (features & HWF_PPC_ARCH_3_00) != 0) + ctx->bctx.bwrite = do_sha512_transform_ppc9; #endif (void)features; } diff --git a/configure.ac b/configure.ac index fb7b40874..7b8f6cf41 100644 --- a/configure.ac +++ b/configure.ac @@ -1907,6 +1907,7 @@ AC_CACHE_CHECK([whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto "lxvw4x %vs32, %r0, %r1;\n" "vadduwm %v0, %v1, %v22;\n" "vshasigmaw %v0, %v1, 0, 15;\n" + "vshasigmad %v0, %v1, 0, 15;\n" ); ]])], [gcry_cv_gcc_inline_asm_ppc_altivec=yes]) @@ -2653,6 +2654,19 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha512-arm.lo" ;; + powerpc64le-*-*) + # Build with the crypto extension 
implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sha512-ppc.lo" + ;; + powerpc64-*-*) + # Big-Endian. + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sha512-ppc.lo" + ;; + powerpc-*-*) + # Big-Endian. + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sha512-ppc.lo" esac if test x"$neonsupport" = xyes ; then From jussi.kivilinna at iki.fi Sat Aug 31 01:49:42 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 31 Aug 2019 02:49:42 +0300 Subject: [PATCH 2/3] Add SHA-256 implementations for POWER8 and POWER9 In-Reply-To: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> References: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> Message-ID: <156720898275.9538.8488379651041799512.stgit@localhost.localdomain> * cipher/Makefile.am: Add 'sha256-ppc.c'; Add extra CFLAG handling for 'sha256-ppc.c'. * cipher/sha256-ppc.c: New. * cipher/sha256.c (USE_PPC_CRYPTO, _gcry_sha256_transform_ppc8) (_gcry_sha256_transform_ppc9, do_sha256_transform_ppc8) (do_sha256_transform_ppc9): New. (sha256_init, sha224_init): Split common part to new function named... (sha256_common_init): ...this; Add PowerPC HW feature detection and implemention selection. * configure.ac: Add 'vshasigmaw' instruction to PowerPC assembly support check; Add 'sha256-ppc.lo'. -- Benchmark on POWER8 ~3.8Ghz: Before: | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 4.17 ns/B 228.6 MiB/s 15.85 c/B After (~1.63x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 2.55 ns/B 373.9 MiB/s 9.69 c/B For comparison, OpenSSL 1.1.1b (~2.4% slower): | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 2.61 ns/B 364.8 MiB/s 9.93 c/B Benchmark on POWER9 ~3.8Ghz: Before: | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 3.23 ns/B 295.6 MiB/s 12.26 c/B After (~1.03x faster): | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 3.11 ns/B 306.8 MiB/s 11.81 c/B For comparison, OpenSSL 1.1.1b (~6.6% faster): | nanosecs/byte mebibytes/sec cycles/byte SHA256 | 2.91 ns/B 327.5 MiB/s 11.07 c/B GnuPG-bug-id: T4530 Signed-off-by: Jussi Kivilinna --- 0 files changed diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 1f2d8ec97..dcb2e8f6f 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -110,7 +110,7 @@ EXTRA_libcipher_la_SOURCES = \ sha256.c sha256-ssse3-amd64.S sha256-avx-amd64.S \ sha256-avx2-bmi2-amd64.S \ sha256-armv8-aarch32-ce.S sha256-armv8-aarch64-ce.S \ - sha256-intel-shaext.c \ + sha256-intel-shaext.c sha256-ppc.c \ sha512.c sha512-ssse3-amd64.S sha512-avx-amd64.S \ sha512-avx2-bmi2-amd64.S \ sha512-armv7-neon.S sha512-arm.S \ @@ -210,3 +210,9 @@ rijndael-ppc.o: $(srcdir)/rijndael-ppc.c Makefile rijndael-ppc.lo: $(srcdir)/rijndael-ppc.c Makefile `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` + +sha256-ppc.o: $(srcdir)/sha256-ppc.c Makefile + `echo $(COMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` + +sha256-ppc.lo: $(srcdir)/sha256-ppc.c Makefile + `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` diff --git a/cipher/sha256-ppc.c b/cipher/sha256-ppc.c new file mode 100644 index 000000000..1dcec6b42 --- /dev/null +++ b/cipher/sha256-ppc.c @@ -0,0 +1,790 @@ +/* sha256-ppc.c - PowerPC vcrypto implementation of SHA-256 transform + * Copyright (C) 2019 Jussi Kivilinna + * + * This file is part of Libgcrypt. 
+ * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include + +#if defined(ENABLE_PPC_CRYPTO_SUPPORT) && \ + defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) && \ + defined(USE_SHA256) && \ + __GNUC__ >= 4 + +#include +#include "bufhelp.h" + + +typedef vector unsigned char vector16x_u8; +typedef vector unsigned int vector4x_u32; +typedef vector unsigned long long vector2x_u64; + + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) +#define NO_INSTRUMENT_FUNCTION __attribute__((no_instrument_function)) + +#define ASM_FUNC_ATTR NO_INSTRUMENT_FUNCTION +#define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define ASM_FUNC_ATTR_NOINLINE ASM_FUNC_ATTR NO_INLINE + + +static const u32 K[64] = + { +#define TBL(v) v + TBL(0x428a2f98), TBL(0x71374491), TBL(0xb5c0fbcf), TBL(0xe9b5dba5), + TBL(0x3956c25b), TBL(0x59f111f1), TBL(0x923f82a4), TBL(0xab1c5ed5), + TBL(0xd807aa98), TBL(0x12835b01), TBL(0x243185be), TBL(0x550c7dc3), + TBL(0x72be5d74), TBL(0x80deb1fe), TBL(0x9bdc06a7), TBL(0xc19bf174), + TBL(0xe49b69c1), TBL(0xefbe4786), TBL(0x0fc19dc6), TBL(0x240ca1cc), + TBL(0x2de92c6f), TBL(0x4a7484aa), TBL(0x5cb0a9dc), TBL(0x76f988da), + TBL(0x983e5152), TBL(0xa831c66d), TBL(0xb00327c8), TBL(0xbf597fc7), + TBL(0xc6e00bf3), TBL(0xd5a79147), TBL(0x06ca6351), TBL(0x14292967), + TBL(0x27b70a85), TBL(0x2e1b2138), TBL(0x4d2c6dfc), TBL(0x53380d13), + TBL(0x650a7354), TBL(0x766a0abb), TBL(0x81c2c92e), TBL(0x92722c85), + TBL(0xa2bfe8a1), TBL(0xa81a664b), TBL(0xc24b8b70), TBL(0xc76c51a3), + TBL(0xd192e819), TBL(0xd6990624), TBL(0xf40e3585), TBL(0x106aa070), + TBL(0x19a4c116), TBL(0x1e376c08), TBL(0x2748774c), TBL(0x34b0bcb5), + TBL(0x391c0cb3), TBL(0x4ed8aa4a), TBL(0x5b9cca4f), TBL(0x682e6ff3), + TBL(0x748f82ee), TBL(0x78a5636f), TBL(0x84c87814), TBL(0x8cc70208), + TBL(0x90befffa), TBL(0xa4506ceb), TBL(0xbef9a3f7), TBL(0xc67178f2) +#undef TBL + }; + + +static ASM_FUNC_ATTR_INLINE vector4x_u32 +vec_rol_elems(vector4x_u32 v, unsigned int idx) +{ +#ifndef WORDS_BIGENDIAN + return vec_sld (v, v, (16 - (4 * idx)) & 15); +#else + return vec_sld (v, v, (4 * idx) & 15); +#endif +} + + +static ASM_FUNC_ATTR_INLINE vector4x_u32 +vec_merge_idx0_elems(vector4x_u32 v0, vector4x_u32 v1, + vector4x_u32 v2, vector4x_u32 v3) +{ + return (vector4x_u32)vec_mergeh ((vector2x_u64) vec_mergeh(v0, v1), + (vector2x_u64) vec_mergeh(v2, v3)); +} + + +static ASM_FUNC_ATTR_INLINE vector4x_u32 +vec_ror_u32(vector4x_u32 v, unsigned int shift) +{ + return (v >> (shift & 31)) ^ (v << ((32 - shift) & 31)); +} + + +static ASM_FUNC_ATTR_INLINE vector4x_u32 +vec_vshasigma_u32(vector4x_u32 v, unsigned int a, unsigned int b) +{ + asm ("vshasigmaw %0,%1,%2,%3" + : "=v" (v) + : "v" (v), "g" (a), "g" (b) + : "memory"); + return v; +} + + +/* SHA2 round in vector registers */ +#define R(a,b,c,d,e,f,g,h,k,w) do \ + { \ + t1 = (h); \ + t1 += ((k) + (w)); \ + t1 += 
Cho((e),(f),(g)); \ + t1 += Sum1((e)); \ + t2 = Sum0((a)); \ + t2 += Maj((a),(b),(c)); \ + d += t1; \ + h = t1 + t2; \ + } while (0) + +#define Cho(b, c, d) (vec_sel(d, c, b)) + +#define Maj(c, d, b) (vec_sel(c, b, c ^ d)) + +#define Sum0(x) (vec_vshasigma_u32(x, 1, 0)) + +#define Sum1(x) (vec_vshasigma_u32(x, 1, 15)) + + +/* Message expansion on general purpose registers */ +#define S0(x) (ror ((x), 7) ^ ror ((x), 18) ^ ((x) >> 3)) +#define S1(x) (ror ((x), 17) ^ ror ((x), 19) ^ ((x) >> 10)) + +#define I(i) ( w[i] = buf_get_be32(data + i * 4) ) +#define W(i) ({ w[i&0x0f] += w[(i-7) &0x0f]; \ + w[i&0x0f] += S0(w[(i-15)&0x0f]); \ + w[i&0x0f] += S1(w[(i-2) &0x0f]); \ + w[i&0x0f]; }) + +#define I2(i) ( w2[i] = buf_get_be32(64 + data + i * 4), I(i) ) +#define W2(i) ({ w2[i] = w2[i-7]; \ + w2[i] += S1(w2[i-2]); \ + w2[i] += S0(w2[i-15]); \ + w2[i] += w2[i-16]; \ + W(i); }) +#define R2(i) ( w2[i] ) + + +unsigned int ASM_FUNC_ATTR +_gcry_sha256_transform_ppc8(u32 state[8], const unsigned char *data, + size_t nblks) +{ + /* GPRs used for message expansion as vector intrinsics based generates + * slower code. */ + vector4x_u32 h0, h1, h2, h3, h4, h5, h6, h7; + vector4x_u32 h0_h3, h4_h7; + vector4x_u32 a, b, c, d, e, f, g, h, t1, t2; + u32 w[16]; + u32 w2[64]; + + h0_h3 = vec_vsx_ld (4 * 0, state); + h4_h7 = vec_vsx_ld (4 * 4, state); + + h0 = h0_h3; + h1 = vec_rol_elems (h0_h3, 1); + h2 = vec_rol_elems (h0_h3, 2); + h3 = vec_rol_elems (h0_h3, 3); + h4 = h4_h7; + h5 = vec_rol_elems (h4_h7, 1); + h6 = vec_rol_elems (h4_h7, 2); + h7 = vec_rol_elems (h4_h7, 3); + + while (nblks >= 2) + { + nblks -= 2; + + a = h0; + b = h1; + c = h2; + d = h3; + e = h4; + f = h5; + g = h6; + h = h7; + + R(a, b, c, d, e, f, g, h, K[0], I2(0)); + R(h, a, b, c, d, e, f, g, K[1], I2(1)); + R(g, h, a, b, c, d, e, f, K[2], I2(2)); + R(f, g, h, a, b, c, d, e, K[3], I2(3)); + R(e, f, g, h, a, b, c, d, K[4], I2(4)); + R(d, e, f, g, h, a, b, c, K[5], I2(5)); + R(c, d, e, f, g, h, a, b, K[6], I2(6)); + R(b, c, d, e, f, g, h, a, K[7], I2(7)); + R(a, b, c, d, e, f, g, h, K[8], I2(8)); + R(h, a, b, c, d, e, f, g, K[9], I2(9)); + R(g, h, a, b, c, d, e, f, K[10], I2(10)); + R(f, g, h, a, b, c, d, e, K[11], I2(11)); + R(e, f, g, h, a, b, c, d, K[12], I2(12)); + R(d, e, f, g, h, a, b, c, K[13], I2(13)); + R(c, d, e, f, g, h, a, b, K[14], I2(14)); + R(b, c, d, e, f, g, h, a, K[15], I2(15)); + data += 64 * 2; + + R(a, b, c, d, e, f, g, h, K[16], W2(16)); + R(h, a, b, c, d, e, f, g, K[17], W2(17)); + R(g, h, a, b, c, d, e, f, K[18], W2(18)); + R(f, g, h, a, b, c, d, e, K[19], W2(19)); + R(e, f, g, h, a, b, c, d, K[20], W2(20)); + R(d, e, f, g, h, a, b, c, K[21], W2(21)); + R(c, d, e, f, g, h, a, b, K[22], W2(22)); + R(b, c, d, e, f, g, h, a, K[23], W2(23)); + R(a, b, c, d, e, f, g, h, K[24], W2(24)); + R(h, a, b, c, d, e, f, g, K[25], W2(25)); + R(g, h, a, b, c, d, e, f, K[26], W2(26)); + R(f, g, h, a, b, c, d, e, K[27], W2(27)); + R(e, f, g, h, a, b, c, d, K[28], W2(28)); + R(d, e, f, g, h, a, b, c, K[29], W2(29)); + R(c, d, e, f, g, h, a, b, K[30], W2(30)); + R(b, c, d, e, f, g, h, a, K[31], W2(31)); + + R(a, b, c, d, e, f, g, h, K[32], W2(32)); + R(h, a, b, c, d, e, f, g, K[33], W2(33)); + R(g, h, a, b, c, d, e, f, K[34], W2(34)); + R(f, g, h, a, b, c, d, e, K[35], W2(35)); + R(e, f, g, h, a, b, c, d, K[36], W2(36)); + R(d, e, f, g, h, a, b, c, K[37], W2(37)); + R(c, d, e, f, g, h, a, b, K[38], W2(38)); + R(b, c, d, e, f, g, h, a, K[39], W2(39)); + R(a, b, c, d, e, f, g, h, K[40], W2(40)); + R(h, a, b, c, d, e, f, g, K[41], W2(41)); + 
R(g, h, a, b, c, d, e, f, K[42], W2(42)); + R(f, g, h, a, b, c, d, e, K[43], W2(43)); + R(e, f, g, h, a, b, c, d, K[44], W2(44)); + R(d, e, f, g, h, a, b, c, K[45], W2(45)); + R(c, d, e, f, g, h, a, b, K[46], W2(46)); + R(b, c, d, e, f, g, h, a, K[47], W2(47)); + + R(a, b, c, d, e, f, g, h, K[48], W2(48)); + R(h, a, b, c, d, e, f, g, K[49], W2(49)); + R(g, h, a, b, c, d, e, f, K[50], W2(50)); + R(f, g, h, a, b, c, d, e, K[51], W2(51)); + R(e, f, g, h, a, b, c, d, K[52], W2(52)); + R(d, e, f, g, h, a, b, c, K[53], W2(53)); + R(c, d, e, f, g, h, a, b, K[54], W2(54)); + R(b, c, d, e, f, g, h, a, K[55], W2(55)); + R(a, b, c, d, e, f, g, h, K[56], W2(56)); + R(h, a, b, c, d, e, f, g, K[57], W2(57)); + R(g, h, a, b, c, d, e, f, K[58], W2(58)); + R(f, g, h, a, b, c, d, e, K[59], W2(59)); + R(e, f, g, h, a, b, c, d, K[60], W2(60)); + R(d, e, f, g, h, a, b, c, K[61], W2(61)); + R(c, d, e, f, g, h, a, b, K[62], W2(62)); + R(b, c, d, e, f, g, h, a, K[63], W2(63)); + + h0 += a; + h1 += b; + h2 += c; + h3 += d; + h4 += e; + h5 += f; + h6 += g; + h7 += h; + + a = h0; + b = h1; + c = h2; + d = h3; + e = h4; + f = h5; + g = h6; + h = h7; + + R(a, b, c, d, e, f, g, h, K[0], R2(0)); + R(h, a, b, c, d, e, f, g, K[1], R2(1)); + R(g, h, a, b, c, d, e, f, K[2], R2(2)); + R(f, g, h, a, b, c, d, e, K[3], R2(3)); + R(e, f, g, h, a, b, c, d, K[4], R2(4)); + R(d, e, f, g, h, a, b, c, K[5], R2(5)); + R(c, d, e, f, g, h, a, b, K[6], R2(6)); + R(b, c, d, e, f, g, h, a, K[7], R2(7)); + R(a, b, c, d, e, f, g, h, K[8], R2(8)); + R(h, a, b, c, d, e, f, g, K[9], R2(9)); + R(g, h, a, b, c, d, e, f, K[10], R2(10)); + R(f, g, h, a, b, c, d, e, K[11], R2(11)); + R(e, f, g, h, a, b, c, d, K[12], R2(12)); + R(d, e, f, g, h, a, b, c, K[13], R2(13)); + R(c, d, e, f, g, h, a, b, K[14], R2(14)); + R(b, c, d, e, f, g, h, a, K[15], R2(15)); + + R(a, b, c, d, e, f, g, h, K[16], R2(16)); + R(h, a, b, c, d, e, f, g, K[17], R2(17)); + R(g, h, a, b, c, d, e, f, K[18], R2(18)); + R(f, g, h, a, b, c, d, e, K[19], R2(19)); + R(e, f, g, h, a, b, c, d, K[20], R2(20)); + R(d, e, f, g, h, a, b, c, K[21], R2(21)); + R(c, d, e, f, g, h, a, b, K[22], R2(22)); + R(b, c, d, e, f, g, h, a, K[23], R2(23)); + R(a, b, c, d, e, f, g, h, K[24], R2(24)); + R(h, a, b, c, d, e, f, g, K[25], R2(25)); + R(g, h, a, b, c, d, e, f, K[26], R2(26)); + R(f, g, h, a, b, c, d, e, K[27], R2(27)); + R(e, f, g, h, a, b, c, d, K[28], R2(28)); + R(d, e, f, g, h, a, b, c, K[29], R2(29)); + R(c, d, e, f, g, h, a, b, K[30], R2(30)); + R(b, c, d, e, f, g, h, a, K[31], R2(31)); + + R(a, b, c, d, e, f, g, h, K[32], R2(32)); + R(h, a, b, c, d, e, f, g, K[33], R2(33)); + R(g, h, a, b, c, d, e, f, K[34], R2(34)); + R(f, g, h, a, b, c, d, e, K[35], R2(35)); + R(e, f, g, h, a, b, c, d, K[36], R2(36)); + R(d, e, f, g, h, a, b, c, K[37], R2(37)); + R(c, d, e, f, g, h, a, b, K[38], R2(38)); + R(b, c, d, e, f, g, h, a, K[39], R2(39)); + R(a, b, c, d, e, f, g, h, K[40], R2(40)); + R(h, a, b, c, d, e, f, g, K[41], R2(41)); + R(g, h, a, b, c, d, e, f, K[42], R2(42)); + R(f, g, h, a, b, c, d, e, K[43], R2(43)); + R(e, f, g, h, a, b, c, d, K[44], R2(44)); + R(d, e, f, g, h, a, b, c, K[45], R2(45)); + R(c, d, e, f, g, h, a, b, K[46], R2(46)); + R(b, c, d, e, f, g, h, a, K[47], R2(47)); + + R(a, b, c, d, e, f, g, h, K[48], R2(48)); + R(h, a, b, c, d, e, f, g, K[49], R2(49)); + R(g, h, a, b, c, d, e, f, K[50], R2(50)); + R(f, g, h, a, b, c, d, e, K[51], R2(51)); + R(e, f, g, h, a, b, c, d, K[52], R2(52)); + R(d, e, f, g, h, a, b, c, K[53], R2(53)); + R(c, d, e, f, g, h, a, b, K[54], R2(54)); + R(b, 
c, d, e, f, g, h, a, K[55], R2(55)); + R(a, b, c, d, e, f, g, h, K[56], R2(56)); + R(h, a, b, c, d, e, f, g, K[57], R2(57)); + R(g, h, a, b, c, d, e, f, K[58], R2(58)); + R(f, g, h, a, b, c, d, e, K[59], R2(59)); + R(e, f, g, h, a, b, c, d, K[60], R2(60)); + R(d, e, f, g, h, a, b, c, K[61], R2(61)); + R(c, d, e, f, g, h, a, b, K[62], R2(62)); + R(b, c, d, e, f, g, h, a, K[63], R2(63)); + + h0 += a; + h1 += b; + h2 += c; + h3 += d; + h4 += e; + h5 += f; + h6 += g; + h7 += h; + } + + while (nblks) + { + nblks--; + + a = h0; + b = h1; + c = h2; + d = h3; + e = h4; + f = h5; + g = h6; + h = h7; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 64; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + h0 += a; + h1 += b; + h2 += c; + h3 += d; + h4 += e; + h5 += f; + h6 += g; + h7 += h; + } + + h0_h3 = vec_merge_idx0_elems (h0, h1, h2, h3); + h4_h7 = vec_merge_idx0_elems (h4, h5, h6, h7); + 
vec_vsx_st (h0_h3, 4 * 0, state); + vec_vsx_st (h4_h7, 4 * 4, state); + + return sizeof(w2) + sizeof(w); +} +#undef R +#undef Cho +#undef Maj +#undef Sum0 +#undef Sum1 +#undef S0 +#undef S1 +#undef I +#undef W +#undef I2 +#undef W2 +#undef R2 + + +/* SHA2 round in general purpose registers */ +#define R(a,b,c,d,e,f,g,h,k,w) do \ + { \ + t1 = (h) + Sum1((e)) + Cho((e),(f),(g)) + ((k) + (w));\ + t2 = Sum0((a)) + Maj((a),(b),(c)); \ + d += t1; \ + h = t1 + t2; \ + } while (0) + +#define Cho(x, y, z) ((x & y) + (~x & z)) + +#define Maj(z, x, y) ((x & y) + (z & (x ^ y))) + +#define Sum0(x) (ror (x, 2) ^ ror (x ^ ror (x, 22-13), 13)) + +#define Sum1(x) (ror (x, 6) ^ ror (x, 11) ^ ror (x, 25)) + + +/* Message expansion on general purpose registers */ +#define S0(x) (ror ((x), 7) ^ ror ((x), 18) ^ ((x) >> 3)) +#define S1(x) (ror ((x), 17) ^ ror ((x), 19) ^ ((x) >> 10)) + +#define I(i) ( w[i] = buf_get_be32(data + i * 4) ) +#define W(i) ({ w[i&0x0f] += w[(i-7) &0x0f]; \ + w[i&0x0f] += S0(w[(i-15)&0x0f]); \ + w[i&0x0f] += S1(w[(i-2) &0x0f]); \ + w[i&0x0f]; }) + + +unsigned int ASM_FUNC_ATTR +_gcry_sha256_transform_ppc9(u32 state[8], const unsigned char *data, + size_t nblks) +{ + /* GPRs used for round function and message expansion as vector intrinsics + * based generates slower code for POWER9. */ + u32 a, b, c, d, e, f, g, h, t1, t2; + u32 w[16]; + + a = state[0]; + b = state[1]; + c = state[2]; + d = state[3]; + e = state[4]; + f = state[5]; + g = state[6]; + h = state[7]; + + while (nblks >= 2) + { + nblks -= 2; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 64; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); 
+ R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + a += state[0]; + b += state[1]; + c += state[2]; + d += state[3]; + e += state[4]; + f += state[5]; + g += state[6]; + h += state[7]; + state[0] = a; + state[1] = b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 64; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, 
b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + a += state[0]; + b += state[1]; + c += state[2]; + d += state[3]; + e += state[4]; + f += state[5]; + g += state[6]; + h += state[7]; + state[0] = a; + state[1] = b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + } + + while (nblks) + { + nblks--; + + R(a, b, c, d, e, f, g, h, K[0], I(0)); + R(h, a, b, c, d, e, f, g, K[1], I(1)); + R(g, h, a, b, c, d, e, f, K[2], I(2)); + R(f, g, h, a, b, c, d, e, K[3], I(3)); + R(e, f, g, h, a, b, c, d, K[4], I(4)); + R(d, e, f, g, h, a, b, c, K[5], I(5)); + R(c, d, e, f, g, h, a, b, K[6], I(6)); + R(b, c, d, e, f, g, h, a, K[7], I(7)); + R(a, b, c, d, e, f, g, h, K[8], I(8)); + R(h, a, b, c, d, e, f, g, K[9], I(9)); + R(g, h, a, b, c, d, e, f, K[10], I(10)); + R(f, g, h, a, b, c, d, e, K[11], I(11)); + R(e, f, g, h, a, b, c, d, K[12], I(12)); + R(d, e, f, g, h, a, b, c, K[13], I(13)); + R(c, d, e, f, g, h, a, b, K[14], I(14)); + R(b, c, d, e, f, g, h, a, K[15], I(15)); + data += 64; + + R(a, b, c, d, e, f, g, h, K[16], W(16)); + R(h, a, b, c, d, e, f, g, K[17], W(17)); + R(g, h, a, b, c, d, e, f, K[18], W(18)); + R(f, g, h, a, b, c, d, e, K[19], W(19)); + R(e, f, g, h, a, b, c, d, K[20], W(20)); + R(d, e, f, g, h, a, b, c, K[21], W(21)); + R(c, d, e, f, g, h, a, b, K[22], W(22)); + R(b, c, d, e, f, g, h, a, K[23], W(23)); + R(a, b, c, d, e, f, g, h, K[24], W(24)); + R(h, a, b, c, d, e, f, g, K[25], W(25)); + R(g, h, a, b, c, d, e, f, K[26], W(26)); + R(f, g, h, a, b, c, d, e, K[27], W(27)); + R(e, f, g, h, a, b, c, d, K[28], W(28)); + R(d, e, f, g, h, a, b, c, K[29], W(29)); + R(c, d, e, f, g, h, a, b, K[30], W(30)); + R(b, c, d, e, f, g, h, a, K[31], W(31)); + + R(a, b, c, d, e, f, g, h, K[32], W(32)); + R(h, a, b, c, d, e, f, g, K[33], W(33)); + R(g, h, a, b, c, d, e, f, K[34], W(34)); + R(f, g, h, a, b, c, d, e, K[35], W(35)); + R(e, f, g, h, a, b, c, d, K[36], W(36)); + R(d, e, f, g, h, a, b, c, K[37], W(37)); + R(c, d, e, f, g, h, a, b, K[38], W(38)); + R(b, c, d, e, f, g, h, a, K[39], W(39)); + R(a, b, c, d, e, f, g, h, K[40], W(40)); + R(h, a, b, c, d, e, f, g, K[41], W(41)); + R(g, h, a, b, c, d, e, f, K[42], W(42)); + R(f, g, h, a, b, c, d, e, K[43], W(43)); + R(e, f, g, h, a, b, c, d, K[44], W(44)); + R(d, e, f, g, h, a, b, c, K[45], W(45)); + R(c, d, e, f, g, h, a, b, K[46], W(46)); + R(b, c, d, e, f, g, h, a, K[47], W(47)); + + R(a, b, c, d, e, f, g, h, K[48], W(48)); + R(h, a, b, c, d, e, f, g, K[49], W(49)); + R(g, h, a, b, c, d, e, f, K[50], W(50)); + R(f, g, h, a, b, c, d, e, K[51], W(51)); + R(e, f, g, h, a, b, c, d, K[52], W(52)); + R(d, e, f, g, h, a, b, c, K[53], W(53)); + R(c, d, e, f, g, h, a, b, K[54], W(54)); + R(b, c, d, e, f, g, h, a, K[55], W(55)); + R(a, b, c, d, e, f, g, h, K[56], W(56)); + R(h, a, b, c, d, e, f, g, K[57], W(57)); + R(g, h, a, b, c, d, e, f, K[58], W(58)); + R(f, g, h, a, b, c, d, e, K[59], W(59)); + R(e, f, g, h, a, b, c, d, K[60], W(60)); + R(d, e, f, g, h, a, b, c, K[61], W(61)); + R(c, d, e, f, g, h, a, b, K[62], W(62)); + R(b, c, d, e, f, g, h, a, K[63], W(63)); + + a += state[0]; + b += state[1]; + c += state[2]; + d += state[3]; + e += state[4]; + f += state[5]; + g += state[6]; + h += state[7]; + state[0] = a; + state[1] = b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + } + + state[0] = a; + state[1] 
= b; + state[2] = c; + state[3] = d; + state[4] = e; + state[5] = f; + state[6] = g; + state[7] = h; + + return sizeof(w); +} + +#endif /* ENABLE_PPC_CRYPTO_SUPPORT */ diff --git a/cipher/sha256.c b/cipher/sha256.c index 6c6833482..562dee9af 100644 --- a/cipher/sha256.c +++ b/cipher/sha256.c @@ -98,6 +98,18 @@ # endif #endif +/* USE_PPC_CRYPTO indicates whether to enable PowerPC vector crypto + * accelerated code. */ +#undef USE_PPC_CRYPTO +#ifdef ENABLE_PPC_CRYPTO_SUPPORT +# if defined(HAVE_COMPATIBLE_CC_PPC_ALTIVEC) && \ + defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) +# if __GNUC__ >= 4 +# define USE_PPC_CRYPTO 1 +# endif +# endif +#endif + typedef struct { gcry_md_block_ctx_t bctx; @@ -196,28 +208,41 @@ do_sha256_transform_armv8_ce(void *ctx, const unsigned char *data, } #endif +#ifdef USE_PPC_CRYPTO +unsigned int _gcry_sha256_transform_ppc8(u32 state[8], + const unsigned char *input_data, + size_t num_blks); + +unsigned int _gcry_sha256_transform_ppc9(u32 state[8], + const unsigned char *input_data, + size_t num_blks); + +static unsigned int +do_sha256_transform_ppc8(void *ctx, const unsigned char *data, size_t nblks) +{ + SHA256_CONTEXT *hd = ctx; + return _gcry_sha256_transform_ppc8 (&hd->h0, data, nblks); +} + +static unsigned int +do_sha256_transform_ppc9(void *ctx, const unsigned char *data, size_t nblks) +{ + SHA256_CONTEXT *hd = ctx; + return _gcry_sha256_transform_ppc9 (&hd->h0, data, nblks); +} +#endif + + static unsigned int do_transform_generic (void *ctx, const unsigned char *data, size_t nblks); static void -sha256_init (void *context, unsigned int flags) +sha256_common_init (SHA256_CONTEXT *hd) { - SHA256_CONTEXT *hd = context; unsigned int features = _gcry_get_hw_features (); - (void)flags; - - hd->h0 = 0x6a09e667; - hd->h1 = 0xbb67ae85; - hd->h2 = 0x3c6ef372; - hd->h3 = 0xa54ff53a; - hd->h4 = 0x510e527f; - hd->h5 = 0x9b05688c; - hd->h6 = 0x1f83d9ab; - hd->h7 = 0x5be0cd19; - hd->bctx.nblocks = 0; hd->bctx.nblocks_high = 0; hd->bctx.count = 0; @@ -248,16 +273,41 @@ sha256_init (void *context, unsigned int flags) #ifdef USE_ARM_CE if ((features & HWF_ARM_SHA2) != 0) hd->bctx.bwrite = do_sha256_transform_armv8_ce; +#endif +#ifdef USE_PPC_CRYPTO + if ((features & HWF_PPC_VCRYPTO) != 0) + hd->bctx.bwrite = do_sha256_transform_ppc8; + if ((features & HWF_PPC_VCRYPTO) != 0 && (features & HWF_PPC_ARCH_3_00) != 0) + hd->bctx.bwrite = do_sha256_transform_ppc9; #endif (void)features; } +static void +sha256_init (void *context, unsigned int flags) +{ + SHA256_CONTEXT *hd = context; + + (void)flags; + + hd->h0 = 0x6a09e667; + hd->h1 = 0xbb67ae85; + hd->h2 = 0x3c6ef372; + hd->h3 = 0xa54ff53a; + hd->h4 = 0x510e527f; + hd->h5 = 0x9b05688c; + hd->h6 = 0x1f83d9ab; + hd->h7 = 0x5be0cd19; + + sha256_common_init (hd); +} + + static void sha224_init (void *context, unsigned int flags) { SHA256_CONTEXT *hd = context; - unsigned int features = _gcry_get_hw_features (); (void)flags; @@ -270,38 +320,7 @@ sha224_init (void *context, unsigned int flags) hd->h6 = 0x64f98fa7; hd->h7 = 0xbefa4fa4; - hd->bctx.nblocks = 0; - hd->bctx.nblocks_high = 0; - hd->bctx.count = 0; - hd->bctx.blocksize = 64; - - /* Order of feature checks is important here; last match will be - * selected. Keep slower implementations at the top and faster at - * the bottom. */ - hd->bctx.bwrite = do_transform_generic; -#ifdef USE_SSSE3 - if ((features & HWF_INTEL_SSSE3) != 0) - hd->bctx.bwrite = do_sha256_transform_amd64_ssse3; -#endif -#ifdef USE_AVX - /* AVX implementation uses SHLD which is known to be slow on non-Intel CPUs. 
- * Therefore use this implementation on Intel CPUs only. */ - if ((features & HWF_INTEL_AVX) && (features & HWF_INTEL_FAST_SHLD)) - hd->bctx.bwrite = do_sha256_transform_amd64_avx; -#endif -#ifdef USE_AVX2 - if ((features & HWF_INTEL_AVX2) && (features & HWF_INTEL_BMI2)) - hd->bctx.bwrite = do_sha256_transform_amd64_avx2; -#endif -#ifdef USE_SHAEXT - if ((features & HWF_INTEL_SHAEXT) && (features & HWF_INTEL_SSE4_1)) - hd->bctx.bwrite = do_sha256_transform_intel_shaext; -#endif -#ifdef USE_ARM_CE - if ((features & HWF_ARM_SHA2) != 0) - hd->bctx.bwrite = do_sha256_transform_armv8_ce; -#endif - (void)features; + sha256_common_init (hd); } diff --git a/configure.ac b/configure.ac index d7725b553..fb7b40874 100644 --- a/configure.ac +++ b/configure.ac @@ -1906,6 +1906,7 @@ AC_CACHE_CHECK([whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto "vcipher %v0, %v1, %v22;\n" "lxvw4x %vs32, %r0, %r1;\n" "vadduwm %v0, %v1, %v22;\n" + "vshasigmaw %v0, %v1, 0, 15;\n" ); ]])], [gcry_cv_gcc_inline_asm_ppc_altivec=yes]) @@ -2613,6 +2614,19 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_DIGESTS="$GCRYPT_DIGESTS sha256-armv8-aarch64-ce.lo" ;; + powerpc64le-*-*) + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sha256-ppc.lo" + ;; + powerpc64-*-*) + # Big-Endian. + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sha256-ppc.lo" + ;; + powerpc-*-*) + # Big-Endian. + # Build with the crypto extension implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sha256-ppc.lo" esac case "$mpi_cpu_arch" in From shawn at git.icu Sat Aug 31 04:10:26 2019 From: shawn at git.icu (Shawn Landden) Date: Fri, 30 Aug 2019 21:10:26 -0500 Subject: [PATCH 1/3] hwf-ppc: add detection for PowerISA 3.00 In-Reply-To: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> References: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> Message-ID: <10635571567217426@myt4-eb6256e01f8b.qloud-c.yandex.net> 30.08.2019, 18:50, "Jussi Kivilinna via Gcrypt-devel" : > * src/g10lib.h (HWF_PPC_ARCH_3_00): New. > * src/hwf-ppc.c (feature_map_s): Remove unused 'feature_match'. > (PPC_FEATURE2_ARCH_3_00): New. > (ppc_features, get_hwcap): Add PowerISA 3.00. Ahh, I see that Power 9 has different performance characteristics. I was wondering what this was for. I am working on GHASH now. > * src/hwfeatures.c (hwflist): Rename "ppc-crypto" to "ppc-vcrypto"; Add > "ppc-arch_3_00". 
> -- > > Signed-off-by: Jussi Kivilinna > --- > 0 files changed > > diff --git a/src/g10lib.h b/src/g10lib.h > index 41e18c137..bbdaf58be 100644 > --- a/src/g10lib.h > +++ b/src/g10lib.h > @@ -237,6 +237,7 @@ char **_gcry_strtokenize (const char *string, const char *delim); > #define HWF_ARM_PMULL (1 << 21) > > #define HWF_PPC_VCRYPTO (1 << 22) > +#define HWF_PPC_ARCH_3_00 (1 << 23) > > gpg_err_code_t _gcry_disable_hw_feature (const char *name); > void _gcry_detect_hw_features (void); > diff --git a/src/hwf-ppc.c b/src/hwf-ppc.c > index 1bf2edf70..2ed60c0f1 100644 > --- a/src/hwf-ppc.c > +++ b/src/hwf-ppc.c > @@ -70,7 +70,6 @@ struct feature_map_s > { > unsigned int hwcap_flag; > unsigned int hwcap2_flag; > - const char *feature_match; > unsigned int hwf_flag; > }; > > @@ -87,12 +86,16 @@ struct feature_map_s > #ifndef PPC_FEATURE2_VEC_CRYPTO > # define PPC_FEATURE2_VEC_CRYPTO 0x02000000 > #endif > +#ifndef PPC_FEATURE2_ARCH_3_00 > +# define PPC_FEATURE2_ARCH_3_00 0x00800000 > +#endif > > static const struct feature_map_s ppc_features[] = > { > #ifdef ENABLE_PPC_CRYPTO_SUPPORT > - { 0, PPC_FEATURE2_VEC_CRYPTO, " crypto", HWF_PPC_VCRYPTO }, > + { 0, PPC_FEATURE2_VEC_CRYPTO, HWF_PPC_VCRYPTO }, > #endif > + { 0, PPC_FEATURE2_ARCH_3_00, HWF_PPC_ARCH_3_00 }, > }; > #endif > > @@ -114,22 +117,23 @@ get_hwcap(unsigned int *hwcap, unsigned int *hwcap2) > } > > #if 0 // TODO: configure.ac detection for __builtin_cpu_supports > -#if defined(__GLIBC__) && defined(__GNUC__) > -#if __GNUC__ >= 6 > - /* Returns 0 if glibc support doesn't exist, so we can > - * only trust positive results. This function will need updating > - * if we ever need more than one cpu feature. > - */ > - // TODO: fix, false if ENABLE_PPC_CRYPTO_SUPPORT > - if (sizeof(ppc_features)/sizeof(ppc_features[0]) == 0) { > - if (__builtin_cpu_supports("vcrypto")) { > - stored_hwcap = 0; > - stored_hwcap2 = PPC_FEATURE2_VEC_CRYPTO; > - hwcap_initialized = 1; > - return 0; > + // TODO: move to 'detect_ppc_builtin_cpu_supports' > +#if defined(__GLIBC__) && defined(__GNUC__) && __GNUC__ >= 6 > + /* __builtin_cpu_supports returns 0 if glibc support doesn't exist, so > + * we can only trust positive results. 
*/ > +#ifdef ENABLE_PPC_CRYPTO_SUPPORT > + if (__builtin_cpu_supports("vcrypto")) /* TODO: Configure.ac */ > + { > + stored_hwcap2 |= PPC_FEATURE2_VEC_CRYPTO; > + hwcap_initialized = 1; > } > - } > #endif > + > + if (__builtin_cpu_supports("arch_3_00")) /* TODO: Configure.ac */ > + { > + stored_hwcap2 |= PPC_FEATURE2_ARCH_3_00; > + hwcap_initialized = 1; > + } > #endif > #endif > > @@ -188,6 +192,7 @@ get_hwcap(unsigned int *hwcap, unsigned int *hwcap2) > err = 0; > > fclose(f); > + > *hwcap = stored_hwcap; > *hwcap2 = stored_hwcap2; > return err; > diff --git a/src/hwfeatures.c b/src/hwfeatures.c > index fe5137538..1021bd3b1 100644 > --- a/src/hwfeatures.c > +++ b/src/hwfeatures.c > @@ -67,7 +67,8 @@ static struct > { HWF_ARM_SHA2, "arm-sha2" }, > { HWF_ARM_PMULL, "arm-pmull" }, > #elif defined(HAVE_CPU_ARCH_PPC) > - { HWF_PPC_VCRYPTO, "ppc-crypto" }, > + { HWF_PPC_VCRYPTO, "ppc-vcrypto" }, > + { HWF_PPC_ARCH_3_00, "ppc-arch_3_00" }, > #endif > }; > > _______________________________________________ > Gcrypt-devel mailing list > Gcrypt-devel at gnupg.org > http://lists.gnupg.org/mailman/listinfo/gcrypt-devel -- Shawn Landden From jussi.kivilinna at iki.fi Sat Aug 31 11:11:43 2019 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 31 Aug 2019 12:11:43 +0300 Subject: [PATCH 1/3] hwf-ppc: add detection for PowerISA 3.00 In-Reply-To: <10635571567217426@myt4-eb6256e01f8b.qloud-c.yandex.net> References: <156720897756.9538.1166473599154419488.stgit@localhost.localdomain> <10635571567217426@myt4-eb6256e01f8b.qloud-c.yandex.net> Message-ID: <41949991-687b-80d2-dd7a-dd97f0056e68@iki.fi> On 31.8.2019 5.10, Shawn Landden wrote: > > > 30.08.2019, 18:50, "Jussi Kivilinna via Gcrypt-devel" : >> * src/g10lib.h (HWF_PPC_ARCH_3_00): New. >> * src/hwf-ppc.c (feature_map_s): Remove unused 'feature_match'. >> (PPC_FEATURE2_ARCH_3_00): New. >> (ppc_features, get_hwcap): Add PowerISA 3.00. > > Ahh, I see that Power 9 has different performance characteristics. I was wondering what this was for. I am working on GHASH now. Right, PowerISA 3.00 itself is not actually used; this is only for detecting Power 9. Later, if Power 10 has better vector SHA2 performance, we'll need a new HW feature flag to detect that processor and select the best implementation for it. -Jussi
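
To make the selection logic discussed in this thread concrete, below is a minimal sketch of the "last match wins" dispatch used by the sha256/sha512 init code in the patches above. The HWF_PPC_VCRYPTO and HWF_PPC_ARCH_3_00 values come from the quoted g10lib.h hunk; the stub transforms, select_transform() and main() are invented purely for illustration and are not libgcrypt code.

/* Hypothetical sketch of feature-flag based transform selection.
 * Flag values match the g10lib.h hunk quoted above; everything else
 * is stand-in scaffolding for the example. */
#include <stddef.h>
#include <stdio.h>

#define HWF_PPC_VCRYPTO    (1 << 22)
#define HWF_PPC_ARCH_3_00  (1 << 23)

typedef unsigned int (*transform_fn)(unsigned int state[8],
                                     const unsigned char *data,
                                     size_t nblks);

/* Stubs standing in for the generic transform and the POWER8/POWER9
 * vcrypto transforms added by this series. */
static unsigned int
transform_generic (unsigned int state[8], const unsigned char *data,
                   size_t nblks)
{ (void)state; (void)data; (void)nblks; return 0; }

static unsigned int
transform_ppc8 (unsigned int state[8], const unsigned char *data,
                size_t nblks)
{ (void)state; (void)data; (void)nblks; return 0; }

static unsigned int
transform_ppc9 (unsigned int state[8], const unsigned char *data,
                size_t nblks)
{ (void)state; (void)data; (void)nblks; return 0; }

/* "Last match wins": slower implementations are assigned first and
 * faster ones later, so the final assignment that matches the hardware
 * is the one that gets used. */
static transform_fn
select_transform (unsigned int features)
{
  transform_fn bwrite = transform_generic;

  if ((features & HWF_PPC_VCRYPTO) != 0)
    bwrite = transform_ppc8;                 /* POWER8 vcrypto path */
  if ((features & HWF_PPC_VCRYPTO) != 0
      && (features & HWF_PPC_ARCH_3_00) != 0)
    bwrite = transform_ppc9;                 /* POWER9 (ISA 3.00) path */

  return bwrite;
}

int
main (void)
{
  unsigned int features = HWF_PPC_VCRYPTO | HWF_PPC_ARCH_3_00;
  transform_fn fn = select_transform (features);

  printf ("selected: %s\n",
          fn == transform_ppc9 ? "ppc9" :
          fn == transform_ppc8 ? "ppc8" : "generic");
  return 0;
}

A faster path for a later processor would slot in the same way: detect it with an additional feature flag and append one more, later match.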