From gniibe at fsij.org  Fri Nov  1 08:43:01 2024
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Fri, 01 Nov 2024 16:43:01 +0900
Subject: FIPS 140 service indicator revamp
In-Reply-To: <002642d8-4401-4fdb-975e-a5350293db95@atsec.com>
References: <87sesmffmg.fsf@akagi.fsij.org>
 <6497406.YeSz9MTgJO@tauon.atsec.com>
 <002642d8-4401-4fdb-975e-a5350293db95@atsec.com>
Message-ID: <875xp7mmbe.fsf@akagi.fsij.org>

David Sugar wrote:
> Sure, I've attached a diff.

Thank you.  While I understand the requirement that the check by
_gcry_fips_check_kdf_compliant happens after the actual computation,
I'm a bit confused about your suggested change.  Here are two comments,
for now.

(1) There are two different failures: a failure before the computation
and a failure after the computation.  In your patch, the
fips_not_compliant macro returns GPG_ERR_FORBIDDEN for both cases.
I wonder whether these two cases should be distinguished.

(2) gcry_kdf_derive does not allocate memory, but let us consider a
function which allocates memory on successful computation.  In that
case an application needs to release the memory after a successful
call.  Considering this situation...  IIUC, your change implies
changing the success path for all applications (even for an
application that does not care about FIPS).  That's because "success"
may actually be a failure (detected by examining ERRNO), and the
memory should not be released in that case.

I wonder if it's possible instead to ask FIPS-conscious applications
to change their error path, so that applications which do not care
about FIPS can avoid any changes.

--

From jussi.kivilinna at iki.fi  Sun Nov  3 17:06:27 2024
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 3 Nov 2024 18:06:27 +0200
Subject: [PATCH 2/2] Add GHASH AArch64/SIMD intrinsics implementation
In-Reply-To: <20241103160629.2591775-1-jussi.kivilinna@iki.fi>
References: <20241103160629.2591775-1-jussi.kivilinna@iki.fi>
Message-ID: <20241103160629.2591775-2-jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add 'cipher-gcm-aarch64-simd.c'.
* cipher/cipher-gcm-aarch64-simd.c: New.
* cipher/cipher-gcm.c [GCM_USE_AARCH64]: Add function prototypes for
AArch64/SIMD implementation.
(setupM) [GCM_USE_AARCH64]: Add setup for AArch64/SIMD implementation.
* cipher/cipher-internal.h (GCM_USE_AARCH64): New.
* configure.ac: Add 'cipher-gcm-aarch64-simd.c'.
--

Patch adds GHASH/GCM intrinsics implementation for AArch64. This is for
CPUs without crypto extensions instruction set support.
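For illustration (not part of the patch): the GMAC_AES figures below are
measured through the public libgcrypt MAC API, which is the path that ends
up calling the new GHASH code.  A minimal sketch of that usage, with made-up
all-zero key, nonce and message and with error handling omitted:

#include <stdio.h>
#include <gcrypt.h>

int
main (void)
{
  static const unsigned char key[16] = { 0 };   /* sample key (all zero) */
  static const unsigned char iv[12]  = { 0 };   /* sample GMAC nonce */
  static const unsigned char data[64] = { 0 };  /* sample message */
  unsigned char tag[16];
  size_t taglen = sizeof tag;
  gcry_mac_hd_t hd;
  size_t i;

  gcry_check_version (NULL);                    /* initialize the library */

  gcry_mac_open (&hd, GCRY_MAC_GMAC_AES, 0, NULL);
  gcry_mac_setkey (hd, key, sizeof key);
  gcry_mac_setiv (hd, iv, sizeof iv);           /* GMAC requires a nonce */
  gcry_mac_write (hd, data, sizeof data);       /* exercises the GHASH path */
  gcry_mac_read (hd, tag, &taglen);             /* fetch authentication tag */
  gcry_mac_close (hd);

  for (i = 0; i < taglen; i++)
    printf ("%02x", tag[i]);
  printf ("\n");
  return 0;
}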
Benchmark on Cortex-A53 (1152 Mhz): Before: | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 12.22 ns/B 78.07 MiB/s 14.07 c/B After: | nanosecs/byte mebibytes/sec cycles/byte GMAC_AES | 7.38 ns/B 129.2 MiB/s 8.50 c/B Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 9 +- cipher/cipher-gcm-aarch64-simd.c | 320 +++++++++++++++++++++++++++++++ cipher/cipher-gcm.c | 14 ++ cipher/cipher-internal.h | 6 + configure.ac | 1 + 5 files changed, 349 insertions(+), 1 deletion(-) create mode 100644 cipher/cipher-gcm-aarch64-simd.c diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 2528bc39..633c53ed 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -89,7 +89,8 @@ EXTRA_libcipher_la_SOURCES = \ chacha20-amd64-avx512.S chacha20-armv7-neon.S chacha20-aarch64.S \ chacha20-ppc.c chacha20-s390x.S \ chacha20-p10le-8x.s \ - cipher-gcm-ppc.c cipher-gcm-intel-pclmul.c cipher-gcm-armv7-neon.S \ + cipher-gcm-ppc.c cipher-gcm-intel-pclmul.c \ + cipher-gcm-aarch64-simd.c cipher-gcm-armv7-neon.S \ cipher-gcm-armv8-aarch32-ce.S cipher-gcm-armv8-aarch64-ce.S \ crc.c crc-intel-pclmul.c crc-armv8-ce.c \ crc-armv8-aarch64-ce.S \ @@ -325,6 +326,12 @@ camellia-aarch64-ce.o: $(srcdir)/camellia-aarch64-ce.c Makefile camellia-aarch64-ce.lo: $(srcdir)/camellia-aarch64-ce.c Makefile `echo $(LTCOMPILE) $(aarch64_crypto_cflags) -c $< | $(instrumentation_munging) ` +cipher-gcm-aarch64-simd.o: $(srcdir)/cipher-gcm-aarch64-simd.c Makefile + `echo $(COMPILE) $(aarch64_simd_cflags) -c $< | $(instrumentation_munging) ` + +cipher-gcm-aarch64-simd.lo: $(srcdir)/cipher-gcm-aarch64-simd.c Makefile + `echo $(LTCOMPILE) $(aarch64_simd_cflags) -c $< | $(instrumentation_munging) ` + rijndael-vp-aarch64.o: $(srcdir)/rijndael-vp-aarch64.c Makefile `echo $(COMPILE) $(aarch64_simd_cflags) -c $< | $(instrumentation_munging) ` diff --git a/cipher/cipher-gcm-aarch64-simd.c b/cipher/cipher-gcm-aarch64-simd.c new file mode 100644 index 00000000..ecb55a9f --- /dev/null +++ b/cipher/cipher-gcm-aarch64-simd.c @@ -0,0 +1,320 @@ +/* cipher-gcm-aarch64-simd.c - ARM/NEON accelerated GHASH + * Copyright (C) 2019-2024 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include + +#include "types.h" +#include "g10lib.h" +#include "cipher.h" +#include "bufhelp.h" +#include "./cipher-internal.h" + +#ifdef GCM_USE_AARCH64 + +#include "simd-common-aarch64.h" +#include + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) +#define NO_INSTRUMENT_FUNCTION __attribute__((no_instrument_function)) + +#define ASM_FUNC_ATTR NO_INSTRUMENT_FUNCTION +#define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define ASM_FUNC_ATTR_NOINLINE ASM_FUNC_ATTR NO_INLINE + +static ASM_FUNC_ATTR_INLINE uint64x2_t +byteswap_u64x2(uint64x2_t vec) +{ + vec = (uint64x2_t)vrev64q_u8((uint8x16_t)vec); + vec = (uint64x2_t)vextq_u8((uint8x16_t)vec, (uint8x16_t)vec, 8); + return vec; +} + +static ASM_FUNC_ATTR_INLINE uint64x2_t +veor_u64x2(uint64x2_t va, uint64x2_t vb) +{ + return (uint64x2_t)veorq_u8((uint8x16_t)va, (uint8x16_t)vb); +} + +static ASM_FUNC_ATTR_INLINE uint64x1_t +veor_u64x1(uint64x1_t va, uint64x1_t vb) +{ + return (uint64x1_t)veor_u8((uint8x8_t)va, (uint8x8_t)vb); +} + +static ASM_FUNC_ATTR_INLINE uint64x1_t +vand_u64x1(uint64x1_t va, uint64x1_t vb) +{ + return (uint64x1_t)vand_u8((uint8x8_t)va, (uint8x8_t)vb); +} + +static ASM_FUNC_ATTR_INLINE uint64x1_t +vorr_u64x1(uint64x1_t va, uint64x1_t vb) +{ + return (uint64x1_t)vorr_u8((uint8x8_t)va, (uint8x8_t)vb); +} + +/* 64x64=>128 carry-less multiplication using vmull.p8 instruction. + * + * From "C?mara, D.; Gouv?a, C. P. L.; L?pez, J. & Dahab, R. Fast Software + * Polynomial Multiplication on ARM Processors using the NEON Engine. The + * Second International Workshop on Modern Cryptography and Security + * Engineering ? MoCrySEn, 2013". */ +static ASM_FUNC_ATTR_INLINE uint64x2_t +emulate_vmull_p64(uint64x1_t ad, uint64x1_t bd) +{ + static const uint64x1_t k0 = { 0 }; + static const uint64x1_t k16 = { U64_C(0xffff) }; + static const uint64x1_t k32 = { U64_C(0xffffffff) }; + static const uint64x1_t k48 = { U64_C(0xffffffffffff) }; + uint64x1_t rl; + uint64x2_t rq; + uint64x1_t t0l; + uint64x1_t t0h; + uint64x2_t t0q; + uint64x1_t t1l; + uint64x1_t t1h; + uint64x2_t t1q; + uint64x1_t t2l; + uint64x1_t t2h; + uint64x2_t t2q; + uint64x1_t t3l; + uint64x1_t t3h; + uint64x2_t t3q; + + t0l = (uint64x1_t)vext_u8((uint8x8_t)ad, (uint8x8_t)ad, 1); + t0q = (uint64x2_t)vmull_p8((poly8x8_t)t0l, (poly8x8_t)bd); + + rl = (uint64x1_t)vext_u8((uint8x8_t)bd, (uint8x8_t)bd, 1); + rq = (uint64x2_t)vmull_p8((poly8x8_t)ad, (poly8x8_t)rl); + + t1l = (uint64x1_t)vext_u8((uint8x8_t)ad, (uint8x8_t)ad, 2); + t1q = (uint64x2_t)vmull_p8((poly8x8_t)t1l, (poly8x8_t)bd); + + t3l = (uint64x1_t)vext_u8((uint8x8_t)bd, (uint8x8_t)bd, 2); + t3q = (uint64x2_t)vmull_p8((poly8x8_t)ad, (poly8x8_t)t3l); + + t2l = (uint64x1_t)vext_u8((uint8x8_t)ad, (uint8x8_t)ad, 3); + t2q = (uint64x2_t)vmull_p8((poly8x8_t)t2l, (poly8x8_t)bd); + + t0q = veor_u64x2(t0q, rq); + t0l = vget_low_u64(t0q); + t0h = vget_high_u64(t0q); + + rl = (uint64x1_t)vext_u8((uint8x8_t)bd, (uint8x8_t)bd, 3); + rq = (uint64x2_t)vmull_p8((poly8x8_t)ad, (poly8x8_t)rl); + + t1q = veor_u64x2(t1q, t3q); + t1l = vget_low_u64(t1q); + t1h = vget_high_u64(t1q); + + t3l = (uint64x1_t)vext_u8((uint8x8_t)bd, (uint8x8_t)bd, 4); + t3q = (uint64x2_t)vmull_p8((poly8x8_t)ad, (poly8x8_t)t3l); + t3l = vget_low_u64(t3q); + t3h = vget_high_u64(t3q); + + t0l = veor_u64x1(t0l, t0h); + t0h = vand_u64x1(t0h, k48); + t1l = veor_u64x1(t1l, t1h); + t1h = vand_u64x1(t1h, k32); + t2q = veor_u64x2(t2q, rq); + t2l = vget_low_u64(t2q); + t2h = vget_high_u64(t2q); + t0l = 
veor_u64x1(t0l, t0h); + t1l = veor_u64x1(t1l, t1h); + t2l = veor_u64x1(t2l, t2h); + t2h = vand_u64x1(t2h, k16); + t3l = veor_u64x1(t3l, t3h); + t3h = k0; + t0q = vcombine_u64(t0l, t0h); + t0q = (uint64x2_t)vextq_u8((uint8x16_t)t0q, (uint8x16_t)t0q, 15); + t2l = veor_u64x1(t2l, t2h); + t1q = vcombine_u64(t1l, t1h); + t1q = (uint64x2_t)vextq_u8((uint8x16_t)t1q, (uint8x16_t)t1q, 14); + rq = (uint64x2_t)vmull_p8((poly8x8_t)ad, (poly8x8_t)bd); + t2q = vcombine_u64(t2l, t2h); + t2q = (uint64x2_t)vextq_u8((uint8x16_t)t2q, (uint8x16_t)t2q, 13); + t3q = vcombine_u64(t3l, t3h); + t3q = (uint64x2_t)vextq_u8((uint8x16_t)t3q, (uint8x16_t)t3q, 12); + t0q = veor_u64x2(t0q, t1q); + t2q = veor_u64x2(t2q, t3q); + rq = veor_u64x2(rq, t0q); + rq = veor_u64x2(rq, t2q); + return rq; +} + +/* GHASH functions. + * + * See "Gouv?a, C. P. L. & L?pez, J. Implementing GCM on ARMv8. Topics in + * Cryptology ? CT-RSA 2015" for details. + */ +static ASM_FUNC_ATTR_INLINE uint64x2x2_t +pmul_128x128(uint64x2_t a, uint64x2_t b) +{ + uint64x1_t a_l = vget_low_u64(a); + uint64x1_t a_h = vget_high_u64(a); + uint64x1_t b_l = vget_low_u64(b); + uint64x1_t b_h = vget_high_u64(b); + uint64x1_t t1_h = veor_u64x1(b_l, b_h); + uint64x1_t t1_l = veor_u64x1(a_l, a_h); + uint64x2_t r0 = emulate_vmull_p64(a_l, b_l); + uint64x2_t r1 = emulate_vmull_p64(a_h, b_h); + uint64x2_t t2 = emulate_vmull_p64(t1_h, t1_l); + uint64x1_t t2_l, t2_h; + uint64x1_t r0_l, r0_h; + uint64x1_t r1_l, r1_h; + + t2 = veor_u64x2(t2, r0); + t2 = veor_u64x2(t2, r1); + + r0_l = vget_low_u64(r0); + r0_h = vget_high_u64(r0); + r1_l = vget_low_u64(r1); + r1_h = vget_high_u64(r1); + t2_l = vget_low_u64(t2); + t2_h = vget_high_u64(t2); + + r0_h = veor_u64x1(r0_h, t2_l); + r1_l = veor_u64x1(r1_l, t2_h); + + r0 = vcombine_u64(r0_l, r0_h); + r1 = vcombine_u64(r1_l, r1_h); + + return (const uint64x2x2_t){ .val = { r0, r1 } }; +} + +/* Reduction using Xor and Shift. + * + * See "Shay Gueron, Michael E. Kounavis. Intel Carry-Less Multiplication + * Instruction and its Usage for Computing the GCM Mode" for details. 
+ */ +static ASM_FUNC_ATTR_INLINE uint64x2_t +reduction(uint64x2x2_t r0r1) +{ + static const uint64x2_t k0 = { U64_C(0), U64_C(0) }; + uint64x2_t r0 = r0r1.val[0]; + uint64x2_t r1 = r0r1.val[1]; + uint64x2_t t0q; + uint64x2_t t1q; + uint64x2_t t2q; + uint64x2_t t; + + t0q = (uint64x2_t)vshlq_n_u32((uint32x4_t)r0, 31); + t1q = (uint64x2_t)vshlq_n_u32((uint32x4_t)r0, 30); + t2q = (uint64x2_t)vshlq_n_u32((uint32x4_t)r0, 25); + t0q = veor_u64x2(t0q, t1q); + t0q = veor_u64x2(t0q, t2q); + t = (uint64x2_t)vextq_u8((uint8x16_t)t0q, (uint8x16_t)k0, 4); + t0q = (uint64x2_t)vextq_u8((uint8x16_t)k0, (uint8x16_t)t0q, 16 - 12); + r0 = veor_u64x2(r0, t0q); + t0q = (uint64x2_t)vshrq_n_u32((uint32x4_t)r0, 1); + t1q = (uint64x2_t)vshrq_n_u32((uint32x4_t)r0, 2); + t2q = (uint64x2_t)vshrq_n_u32((uint32x4_t)r0, 7); + t0q = veor_u64x2(t0q, t1q); + t0q = veor_u64x2(t0q, t2q); + t0q = veor_u64x2(t0q, t); + r0 = veor_u64x2(r0, t0q); + return veor_u64x2(r0, r1); +} + +ASM_FUNC_ATTR_NOINLINE unsigned int +_gcry_ghash_aarch64_simd(gcry_cipher_hd_t c, byte *result, const byte *buf, + size_t nblocks) +{ + uint64x2_t rhash; + uint64x2_t rh1; + uint64x2_t rbuf; + uint64x2x2_t rr0rr1; + + if (nblocks == 0) + return 0; + + rhash = vld1q_u64((const void *)result); + rh1 = vld1q_u64((const void *)c->u_mode.gcm.u_ghash_key.key); + + rhash = byteswap_u64x2(rhash); + + rbuf = vld1q_u64((const void *)buf); + buf += 16; + nblocks--; + + rbuf = byteswap_u64x2(rbuf); + + rhash = veor_u64x2(rhash, rbuf); + + while (nblocks) + { + rbuf = vld1q_u64((const void *)buf); + buf += 16; + nblocks--; + + rr0rr1 = pmul_128x128(rhash, rh1); + + rbuf = byteswap_u64x2(rbuf); + + rhash = reduction(rr0rr1); + + rhash = veor_u64x2(rhash, rbuf); + } + + rr0rr1 = pmul_128x128(rhash, rh1); + rhash = reduction(rr0rr1); + + rhash = byteswap_u64x2(rhash); + + vst1q_u64((void *)result, rhash); + + clear_vec_regs(); + + return 0; +} + +static ASM_FUNC_ATTR_INLINE void +gcm_lsh_1(void *r_out, uint64x2_t i) +{ + static const uint64x1_t const_d = { U64_C(0xc200000000000000) }; + uint64x1_t ia = vget_low_u64(i); + uint64x1_t ib = vget_high_u64(i); + uint64x1_t oa, ob, ma; + + ma = (uint64x1_t)vshr_n_s64((int64x1_t)ib, 63); + oa = vshr_n_u64(ib, 63); + ob = vshr_n_u64(ia, 63); + ma = vand_u64x1(ma, const_d); + ib = vshl_n_u64(ib, 1); + ia = vshl_n_u64(ia, 1); + ob = vorr_u64x1(ob, ib); + oa = vorr_u64x1(oa, ia); + ob = veor_u64x1(ob, ma); + vst2_u64(r_out, (const uint64x1x2_t){ .val = { oa, ob } }); +} + +ASM_FUNC_ATTR_NOINLINE void +_gcry_ghash_setup_aarch64_simd(gcry_cipher_hd_t c) +{ + uint64x2_t rhash = vld1q_u64((const void *)c->u_mode.gcm.u_ghash_key.key); + + rhash = byteswap_u64x2(rhash); + + gcm_lsh_1(c->u_mode.gcm.u_ghash_key.key, rhash); + + clear_vec_regs(); +} + +#endif /* GCM_USE_AARCH64 */ diff --git a/cipher/cipher-gcm.c b/cipher/cipher-gcm.c index d3c04d58..9fbdb02e 100644 --- a/cipher/cipher-gcm.c +++ b/cipher/cipher-gcm.c @@ -102,6 +102,13 @@ ghash_armv7_neon (gcry_cipher_hd_t c, byte *result, const byte *buf, } #endif /* GCM_USE_ARM_NEON */ +#ifdef GCM_USE_AARCH64 +extern void _gcry_ghash_setup_aarch64_simd(gcry_cipher_hd_t c); + +extern unsigned int _gcry_ghash_aarch64_simd(gcry_cipher_hd_t c, byte *result, + const byte *buf, size_t nblocks); +#endif /* GCM_USE_AARCH64 */ + #ifdef GCM_USE_S390X_CRYPTO #include "asm-inline-s390x.h" @@ -607,6 +614,13 @@ setupM (gcry_cipher_hd_t c) ghash_setup_armv7_neon (c); } #endif +#ifdef GCM_USE_AARCH64 + else if (features & HWF_ARM_NEON) + { + c->u_mode.gcm.ghash_fn = _gcry_ghash_aarch64_simd; + 
_gcry_ghash_setup_aarch64_simd (c); + } +#endif #ifdef GCM_USE_PPC_VPMSUM else if (features & HWF_PPC_VCRYPTO) { diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h index cd8ff788..ddf8fbb5 100644 --- a/cipher/cipher-internal.h +++ b/cipher/cipher-internal.h @@ -112,6 +112,12 @@ #endif #endif /* GCM_USE_ARM_NEON */ +/* GCM_USE_AARCH64 indicates whether to compile GCM with AArch64 SIMD code. */ +#undef GCM_USE_AARCH64 +#if defined(__AARCH64EL__) && defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) +# define GCM_USE_AARCH64 1 +#endif + /* GCM_USE_S390X_CRYPTO indicates whether to enable zSeries code. */ #undef GCM_USE_S390X_CRYPTO #if defined(HAVE_GCC_INLINE_ASM_S390X) diff --git a/configure.ac b/configure.ac index 6347ea25..a7f922b1 100644 --- a/configure.ac +++ b/configure.ac @@ -3644,6 +3644,7 @@ case "${host}" in GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS cipher-gcm-armv8-aarch32-ce.lo" ;; aarch64-*-*) + GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS cipher-gcm-aarch64-simd.lo" GCRYPT_ASM_DIGESTS="$GCRYPT_ASM_DIGESTS cipher-gcm-armv8-aarch64-ce.lo" ;; powerpc64le-*-* | powerpc64-*-* | powerpc-*-*) -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 17:06:26 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 18:06:26 +0200 Subject: [PATCH 1/2] Add AES Vector Permute intrinsics implementation for AArch64 Message-ID: <20241103160629.2591775-1-jussi.kivilinna@iki.fi> * cipher/Makefile: Add 'rijndael-vp-aarch64.c', 'rijndael-vp-simd128.h' and 'simd-common-aarch64.h'. * cipher/rijndael-internal.h (USE_VP_AARCH64): New. * cipher/rijndael-vp-aarch64.c: New. * cipher/rijndael-vp-simd128.h: New. * cipher/rijndael.c [USE_VP_AARCH64]: Add function prototypes for AArch64 vector permutation implementation. (do_setkey) [USE_VP_AARCH64]: Setup function pointers for AArch64 vector permutation implementation. * cipher/simd-common-aarch64.h: New. * configure.ac: Add 'rijndael-vp-aarch64.lo'. -- Patch adds AES Vector Permute intrinsics implementation for AArch64. This is for CPUs without crypto extensions instruction set support. 
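For illustration (not part of the patch): the vector-permute approach
replaces data-dependent table lookups in memory with in-register byte
shuffles.  The basic building block is a 16-entry table lookup applied to a
nibble of every byte, done with the AArch64 TBL intrinsic vqtbl1q_u8 (the
same intrinsic the patch's pshufb128 macro wraps); the real code combines
several such nibble lookups to realize the AES S-box and MixColumns.  A
minimal, hypothetical sketch of that building block (the helper name and
sample table are made up for the example):

#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

/* Look up a 16-entry table for the low nibble of every byte of 'in'.
 * Constant-time: TBL performs no data-dependent memory accesses. */
static uint8x16_t
lookup_low_nibbles (uint8x16_t table16, uint8x16_t in)
{
  uint8x16_t lo = vandq_u8 (in, vdupq_n_u8 (0x0f));  /* isolate low nibbles */
  return vqtbl1q_u8 (table16, lo);                   /* per-byte table lookup */
}

int
main (void)
{
  /* Sample table mapping each low nibble n to 15 - n. */
  static const uint8_t tab[16] = { 15, 14, 13, 12, 11, 10, 9, 8,
                                   7, 6, 5, 4, 3, 2, 1, 0 };
  uint8_t in[16], out[16];
  int i;

  for (i = 0; i < 16; i++)
    in[i] = (uint8_t)(i * 17);   /* bytes 0x00, 0x11, ..., 0xff */

  vst1q_u8 (out, lookup_low_nibbles (vld1q_u8 (tab), vld1q_u8 (in)));

  for (i = 0; i < 16; i++)
    printf ("%02x ", out[i]);
  printf ("\n");
  return 0;
}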
Benchmark on Cortex-A53 (1152 Mhz): Before: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 22.31 ns/B 42.75 MiB/s 25.70 c/B ECB dec | 22.79 ns/B 41.84 MiB/s 26.26 c/B CBC enc | 18.61 ns/B 51.24 MiB/s 21.44 c/B CBC dec | 18.56 ns/B 51.37 MiB/s 21.39 c/B CFB enc | 18.56 ns/B 51.37 MiB/s 21.39 c/B CFB dec | 18.56 ns/B 51.38 MiB/s 21.38 c/B OFB enc | 22.63 ns/B 42.13 MiB/s 26.07 c/B OFB dec | 22.63 ns/B 42.13 MiB/s 26.07 c/B CTR enc | 19.05 ns/B 50.05 MiB/s 21.95 c/B CTR dec | 19.05 ns/B 50.05 MiB/s 21.95 c/B XTS enc | 19.27 ns/B 49.50 MiB/s 22.19 c/B XTS dec | 19.38 ns/B 49.22 MiB/s 22.32 c/B CCM enc | 37.71 ns/B 25.29 MiB/s 43.45 c/B After: AES | nanosecs/byte mebibytes/sec cycles/byte ECB enc | 16.10 ns/B 59.23 MiB/s 18.55 c/B ECB dec | 18.35 ns/B 51.98 MiB/s 21.14 c/B CBC enc | 18.47 ns/B 51.62 MiB/s 21.28 c/B CBC dec | 18.49 ns/B 51.58 MiB/s 21.30 c/B CFB enc | 18.35 ns/B 51.98 MiB/s 21.13 c/B CFB dec | 16.24 ns/B 58.72 MiB/s 18.71 c/B OFB enc | 22.58 ns/B 42.24 MiB/s 26.01 c/B OFB dec | 22.58 ns/B 42.24 MiB/s 26.01 c/B CTR enc | 16.27 ns/B 58.61 MiB/s 18.75 c/B CTR dec | 16.27 ns/B 58.61 MiB/s 18.75 c/B XTS enc | 16.56 ns/B 57.60 MiB/s 19.07 c/B XTS dec | 18.92 ns/B 50.41 MiB/s 21.79 c/B Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 33 +- cipher/rijndael-internal.h | 6 + cipher/rijndael-vp-aarch64.c | 78 ++ cipher/rijndael-vp-simd128.h | 2371 ++++++++++++++++++++++++++++++++++ cipher/rijndael.c | 77 ++ cipher/simd-common-aarch64.h | 62 + configure.ac | 3 + 7 files changed, 2618 insertions(+), 12 deletions(-) create mode 100644 cipher/rijndael-vp-aarch64.c create mode 100644 cipher/rijndael-vp-simd128.h create mode 100644 cipher/simd-common-aarch64.h diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 149c9f21..2528bc39 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -118,6 +118,7 @@ EXTRA_libcipher_la_SOURCES = \ rijndael-p10le.c rijndael-gcm-p10le.s \ rijndael-ppc-common.h rijndael-ppc-functions.h \ rijndael-s390x.c \ + rijndael-vp-aarch64.c rijndael-vp-simd128.h \ rmd160.c \ rsa.c \ salsa20.c salsa20-amd64.S salsa20-armv7-neon.S \ @@ -125,6 +126,7 @@ EXTRA_libcipher_la_SOURCES = \ seed.c \ serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S \ serpent-avx512-x86.c serpent-armv7-neon.S \ + simd-common-aarch64.h \ sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ sm4-gfni-avx2-amd64.S sm4-gfni-avx512-amd64.S \ sm4-aarch64.S sm4-armv8-aarch64-ce.S sm4-armv9-aarch64-sve-ce.S \ @@ -243,12 +245,6 @@ else ppc_vcrypto_cflags = endif -if ENABLE_AARCH64_NEON_INTRINSICS_EXTRA_CFLAGS -aarch64_neon_cflags = -O2 -march=armv8-a+crypto -else -aarch64_neon_cflags = -endif - rijndael-ppc.o: $(srcdir)/rijndael-ppc.c Makefile `echo $(COMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` @@ -309,18 +305,31 @@ camellia-ppc9le.o: $(srcdir)/camellia-ppc9le.c Makefile camellia-ppc9le.lo: $(srcdir)/camellia-ppc9le.c Makefile `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` -camellia-aarch64-ce.o: $(srcdir)/camellia-aarch64-ce.c Makefile - `echo $(COMPILE) $(aarch64_neon_cflags) -c $< | $(instrumentation_munging) ` - -camellia-aarch64-ce.lo: $(srcdir)/camellia-aarch64-ce.c Makefile - `echo $(LTCOMPILE) $(aarch64_neon_cflags) -c $< | $(instrumentation_munging) ` - sm4-ppc.o: $(srcdir)/sm4-ppc.c Makefile `echo $(COMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` sm4-ppc.lo: $(srcdir)/sm4-ppc.c Makefile `echo $(LTCOMPILE) $(ppc_vcrypto_cflags) -c $< | $(instrumentation_munging) ` +if 
ENABLE_AARCH64_NEON_INTRINSICS_EXTRA_CFLAGS +aarch64_crypto_cflags = -O2 -march=armv8-a+simd+crypto +aarch64_simd_cflags = -O2 -march=armv8-a+simd +else +aarch64_crypto_cflags = +aarch64_simd_cflags = +endif + +camellia-aarch64-ce.o: $(srcdir)/camellia-aarch64-ce.c Makefile + `echo $(COMPILE) $(aarch64_crypto_cflags) -c $< | $(instrumentation_munging) ` + +camellia-aarch64-ce.lo: $(srcdir)/camellia-aarch64-ce.c Makefile + `echo $(LTCOMPILE) $(aarch64_crypto_cflags) -c $< | $(instrumentation_munging) ` + +rijndael-vp-aarch64.o: $(srcdir)/rijndael-vp-aarch64.c Makefile + `echo $(COMPILE) $(aarch64_simd_cflags) -c $< | $(instrumentation_munging) ` + +rijndael-vp-aarch64.lo: $(srcdir)/rijndael-vp-aarch64.c Makefile + `echo $(LTCOMPILE) $(aarch64_simd_cflags) -c $< | $(instrumentation_munging) ` if ENABLE_X86_AVX512_INTRINSICS_EXTRA_CFLAGS avx512f_cflags = -mavx512f diff --git a/cipher/rijndael-internal.h b/cipher/rijndael-internal.h index 166f2415..69ef86af 100644 --- a/cipher/rijndael-internal.h +++ b/cipher/rijndael-internal.h @@ -124,6 +124,12 @@ # endif #endif /* ENABLE_ARM_CRYPTO_SUPPORT */ +/* USE_ARM_CE indicates whether to enable vector permute AArch64 SIMD code. */ +#undef USE_VP_AARCH64 +#if defined(__AARCH64EL__) && defined(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS) +# define USE_VP_AARCH64 1 +#endif + /* USE_PPC_CRYPTO indicates whether to enable PowerPC vector crypto * accelerated code. USE_PPC_CRYPTO_WITH_PPC9LE indicates whether to * enable POWER9 optimized variant. */ diff --git a/cipher/rijndael-vp-aarch64.c b/cipher/rijndael-vp-aarch64.c new file mode 100644 index 00000000..0532c421 --- /dev/null +++ b/cipher/rijndael-vp-aarch64.c @@ -0,0 +1,78 @@ +/* SSSE3 vector permutation AES for Libgcrypt + * Copyright (C) 2014-2017 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + * + * + * The code is based on the public domain library libvpaes version 0.5 + * available at http://crypto.stanford.edu/vpaes/ and which carries + * this notice: + * + * libvpaes: constant-time SSSE3 AES encryption and decryption. + * version 0.5 + * + * By Mike Hamburg, Stanford University, 2009. Public domain. + * I wrote essentially all of this code. I did not write the test + * vectors; they are the NIST known answer tests. I hereby release all + * the code and documentation here that I wrote into the public domain. 
+ * + * This is an implementation of AES following my paper, + * "Accelerating AES with Vector Permute Instructions" + * CHES 2009; http://shiftleft.org/papers/vector_aes/ + */ + +#include +#include +#include +#include /* for memcmp() */ + +#include "types.h" /* for byte and u32 typedefs */ +#include "g10lib.h" +#include "cipher.h" +#include "bufhelp.h" +#include "rijndael-internal.h" +#include "./cipher-internal.h" + + +#ifdef USE_VP_AARCH64 + + +#ifdef HAVE_GCC_ATTRIBUTE_OPTIMIZE +# define FUNC_ATTR_OPT __attribute__((optimize("-O2"))) +#else +# define FUNC_ATTR_OPT +#endif + +#define SIMD128_OPT_ATTR FUNC_ATTR_OPT + +#define FUNC_ENCRYPT _gcry_aes_vp_aarch64_encrypt +#define FUNC_DECRYPT _gcry_aes_vp_aarch64_decrypt +#define FUNC_CFB_ENC _gcry_aes_vp_aarch64_cfb_enc +#define FUNC_CFB_DEC _gcry_aes_vp_aarch64_cfb_dec +#define FUNC_CBC_ENC _gcry_aes_vp_aarch64_cbc_enc +#define FUNC_CBC_DEC _gcry_aes_vp_aarch64_cbc_dec +#define FUNC_CTR_ENC _gcry_aes_vp_aarch64_ctr_enc +#define FUNC_CTR32LE_ENC _gcry_aes_vp_aarch64_ctr32le_enc +#define FUNC_OCB_CRYPT _gcry_aes_vp_aarch64_ocb_crypt +#define FUNC_OCB_AUTH _gcry_aes_vp_aarch64_ocb_auth +#define FUNC_ECB_CRYPT _gcry_aes_vp_aarch64_ecb_crypt +#define FUNC_XTS_CRYPT _gcry_aes_vp_aarch64_xts_crypt +#define FUNC_SETKEY _gcry_aes_vp_aarch64_do_setkey +#define FUNC_PREPARE_DEC _gcry_aes_vp_aarch64_prepare_decryption + +#include "rijndael-vp-simd128.h" + +#endif /* USE_VP_AARCH64 */ diff --git a/cipher/rijndael-vp-simd128.h b/cipher/rijndael-vp-simd128.h new file mode 100644 index 00000000..0d53c62e --- /dev/null +++ b/cipher/rijndael-vp-simd128.h @@ -0,0 +1,2371 @@ +/* SIMD128 intrinsics implementation vector permutation AES for Libgcrypt + * Copyright (C) 2024 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + * + * + * The code is based on the public domain library libvpaes version 0.5 + * available at http://crypto.stanford.edu/vpaes/ and which carries + * this notice: + * + * libvpaes: constant-time SSSE3 AES encryption and decryption. + * version 0.5 + * + * By Mike Hamburg, Stanford University, 2009. Public domain. + * I wrote essentially all of this code. I did not write the test + * vectors; they are the NIST known answer tests. I hereby release all + * the code and documentation here that I wrote into the public domain. 
+ * + * This is an implementation of AES following my paper, + * "Accelerating AES with Vector Permute Instructions" + * CHES 2009; http://shiftleft.org/papers/vector_aes/ + */ + +#include +#include "types.h" +#include "bufhelp.h" + +#define ALWAYS_INLINE inline __attribute__((always_inline)) +#define NO_INLINE __attribute__((noinline)) +#define NO_INSTRUMENT_FUNCTION __attribute__((no_instrument_function)) + +#define ASM_FUNC_ATTR NO_INSTRUMENT_FUNCTION +#define ASM_FUNC_ATTR_INLINE ASM_FUNC_ATTR ALWAYS_INLINE +#define ASM_FUNC_ATTR_NOINLINE ASM_FUNC_ATTR NO_INLINE SIMD128_OPT_ATTR + +/********************************************************************** + helper macros + **********************************************************************/ + +#define SWAP_LE64(x) (x) + +#define M128I_BYTE(a0, a1, a2, a3, a4, a5, a6, a7, b0, b1, b2, b3, b4, b5, b6, b7) \ + { \ + SWAP_LE64((((a0) & 0xffULL) << 0) | \ + (((a1) & 0xffULL) << 8) | \ + (((a2) & 0xffULL) << 16) | \ + (((a3) & 0xffULL) << 24) | \ + (((a4) & 0xffULL) << 32) | \ + (((a5) & 0xffULL) << 40) | \ + (((a6) & 0xffULL) << 48) | \ + (((a7) & 0xffULL) << 56)), \ + SWAP_LE64((((b0) & 0xffULL) << 0) | \ + (((b1) & 0xffULL) << 8) | \ + (((b2) & 0xffULL) << 16) | \ + (((b3) & 0xffULL) << 24) | \ + (((b4) & 0xffULL) << 32) | \ + (((b5) & 0xffULL) << 40) | \ + (((b6) & 0xffULL) << 48) | \ + (((b7) & 0xffULL) << 56)) \ + } + +#define PSHUFD_MASK_TO_PSHUFB_MASK(m32) \ + M128I_BYTE(((((m32) >> 0) & 0x03) * 4) + 0, \ + ((((m32) >> 0) & 0x03) * 4) + 1, \ + ((((m32) >> 0) & 0x03) * 4) + 2, \ + ((((m32) >> 0) & 0x03) * 4) + 3, \ + ((((m32) >> 2) & 0x03) * 4) + 0, \ + ((((m32) >> 2) & 0x03) * 4) + 1, \ + ((((m32) >> 2) & 0x03) * 4) + 2, \ + ((((m32) >> 2) & 0x03) * 4) + 3, \ + ((((m32) >> 4) & 0x03) * 4) + 0, \ + ((((m32) >> 4) & 0x03) * 4) + 1, \ + ((((m32) >> 4) & 0x03) * 4) + 2, \ + ((((m32) >> 4) & 0x03) * 4) + 3, \ + ((((m32) >> 6) & 0x03) * 4) + 0, \ + ((((m32) >> 6) & 0x03) * 4) + 1, \ + ((((m32) >> 6) & 0x03) * 4) + 2, \ + ((((m32) >> 6) & 0x03) * 4) + 3) + +#define M128I_U64(a0, a1) { a0, a1 } + +#ifdef __ARM_NEON + +/********************************************************************** + AT&T x86 asm to intrinsics conversion macros (ARM) + **********************************************************************/ + +#include "simd-common-aarch64.h" +#include + +#define __m128i uint64x2_t + +#define pand128(a, o) (o = vandq_u64(o, a)) +#define pandn128(a, o) (o = vbicq_u64(a, o)) +#define pxor128(a, o) (o = veorq_u64(o, a)) +#define paddq128(a, o) (o = vaddq_u64(o, a)) +#define paddd128(a, o) (o = (__m128i)vaddq_u32((uint32x4_t)o, (uint32x4_t)a)) +#define paddb128(a, o) (o = (__m128i)vaddq_u8((uint8x16_t)o, (uint8x16_t)a)) + +#define psrld128(s, o) (o = (__m128i)vshrq_n_u32((uint32x4_t)o, s)) +#define psraq128(s, o) (o = (__m128i)vshrq_n_s64((int64x2_t)o, s)) +#define psrldq128(s, o) ({ uint64x2_t __tmp = { 0, 0 }; \ + o = (__m128i)vextq_u8((uint8x16_t)o, \ + (uint8x16_t)__tmp, (s) & 15);}) +#define pslldq128(s, o) ({ uint64x2_t __tmp = { 0, 0 }; \ + o = (__m128i)vextq_u8((uint8x16_t)__tmp, \ + (uint8x16_t)o, (16 - (s)) & 15);}) + +#define pshufb128(m8, o) (o = (__m128i)vqtbl1q_u8((uint8x16_t)o, (uint8x16_t)m8)) +#define pshufd128(m32, a, o) ({ static const __m128i __tmp = PSHUFD_MASK_TO_PSHUFB_MASK(m32); \ + movdqa128(a, o); \ + pshufb128(__tmp, o); }) +#define pshufd128_0x93(a, o) (o = (__m128i)vextq_u8((uint8x16_t)a, (uint8x16_t)a, 12)) +#define pshufd128_0xFF(a, o) (o = (__m128i)vdupq_laneq_u32((uint32x4_t)a, 3)) +#define pshufd128_0xFE(a, 
o) pshufd128(0xFE, a, o) +#define pshufd128_0x4E(a, o) (o = (__m128i)vextq_u8((uint8x16_t)a, (uint8x16_t)a, 8)) + +#define palignr128(s, a, o) (o = (__m128i)vextq_u8((uint8x16_t)a, (uint8x16_t)o, s)) + +#define movdqa128(a, o) (o = a) + +#define pxor128_amemld(m, o) pxor128(*(const __m128i *)(m), o) + +/* Following operations may have unaligned memory input */ +#define movdqu128_memld(a, o) (o = (__m128i)vld1q_u8((const uint8_t *)(a))) + +/* Following operations may have unaligned memory output */ +#define movdqu128_memst(a, o) vst1q_u8((uint8_t *)(o), (uint8x16_t)a) + +#endif /* __ARM_NEON */ + +#if defined(__x86_64__) || defined(__i386__) + +/********************************************************************** + AT&T x86 asm to intrinsics conversion macros + **********************************************************************/ + +#include + +#define pand128(a, o) (o = _mm_and_si128(o, a)) +#define pandn128(a, o) (o = _mm_andnot_si128(o, a)) +#define pxor128(a, o) (o = _mm_xor_si128(o, a)) +#define paddq128(a, o) (o = _mm_add_epi64(o, a)) +#define vpaddd128(a, o) (o = _mm_add_epi32(o, a)) +#define vpaddb128(a, o) (o = _mm_add_epi8(o, a)) + +#define psrld128(s, o) (o = _mm_srli_epi32(o, s)) +#define psraq128(s, o) (o = _mm_srai_epi64(o, s)) +#define psrldq128(s, o) (o = _mm_srli_si128(o, s)) +#define pslldq128(s, o) (o = _mm_slli_si128(o, s)) + +#define pshufb128(m8, o) (o = _mm_shuffle_epi8(o, m8)) +#define pshufd128(m32, a, o) (o = _mm_shuffle_epi32(a, m32)) +#define pshufd128_0x93(a, o) pshufd128(0x93, a, o) +#define pshufd128_0xFF(a, o) pshufd128(0xFF, a, o) +#define pshufd128_0xFE(a, o) pshufd128(0xFE, a, o) +#define pshufd128_0x4E(a, o) pshufd128(0x4E, a, o) + +#define palignr128(s, a, o) (o = _mm_alignr_epi8(o, a, s)) + +#define movdqa128(a, o) (o = a) + +#define pxor128_amemld(m, o) pxor128(*(const __m128i *)(m), o) + +/* Following operations may have unaligned memory input */ +#define movdqu128_memld(a, o) (o = _mm_loadu_si128((const __m128i *)(a))) + +/* Following operations may have unaligned memory output */ +#define movdqu128_memst(a, o) _mm_storeu_si128((__m128i *)(o), a) + +#define memory_barrier_with_vec(a) __asm__("" : "+x"(a) :: "memory") + +#ifdef __WIN64__ +#define clear_vec_regs() __asm__ volatile("pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + /* xmm6-xmm15 are ABI callee \ + * saved and get cleared by \ + * function epilog when used. 
*/ \ + ::: "memory", "xmm0", "xmm1", \ + "xmm2", "xmm3", "xmm4", "xmm5") +#else +#define clear_vec_regs() __asm__ volatile("pxor %%xmm0, %%xmm0\n" \ + "pxor %%xmm1, %%xmm1\n" \ + "pxor %%xmm2, %%xmm2\n" \ + "pxor %%xmm3, %%xmm3\n" \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n" \ + "pxor %%xmm6, %%xmm6\n" \ + "pxor %%xmm7, %%xmm7\n" \ + "pxor %%xmm8, %%xmm8\n" \ + "pxor %%xmm9, %%xmm9\n" \ + "pxor %%xmm10, %%xmm10\n" \ + "pxor %%xmm11, %%xmm11\n" \ + "pxor %%xmm12, %%xmm12\n" \ + "pxor %%xmm13, %%xmm13\n" \ + "pxor %%xmm14, %%xmm14\n" \ + "pxor %%xmm15, %%xmm15\n" \ + ::: "memory", "xmm0", "xmm1", \ + "xmm2", "xmm3", "xmm4", "xmm5", \ + "xmm6", "xmm7", "xmm8", "xmm9", \ + "xmm10", "xmm11", "xmm12", \ + "xmm13", "xmm14", "xmm15") +#endif + +#endif /* x86 */ + +/********************************************************************** + constant vectors + **********************************************************************/ + +static const __m128i k_s0F = + M128I_U64( + 0x0F0F0F0F0F0F0F0F, + 0x0F0F0F0F0F0F0F0F + ); + +static const __m128i k_iptlo = + M128I_U64( + 0xC2B2E8985A2A7000, + 0xCABAE09052227808 + ); + +static const __m128i k_ipthi = + M128I_U64( + 0x4C01307D317C4D00, + 0xCD80B1FCB0FDCC81 + ); + +static const __m128i k_inv = + M128I_U64( + 0x0E05060F0D080180, + 0x040703090A0B0C02 + ); + +static const __m128i k_inva = + M128I_U64( + 0x01040A060F0B0780, + 0x030D0E0C02050809 + ); + +static const __m128i k_sb1u = + M128I_U64( + 0xB19BE18FCB503E00, + 0xA5DF7A6E142AF544 + ); + +static const __m128i k_sb1t = + M128I_U64( + 0x3618D415FAE22300, + 0x3BF7CCC10D2ED9EF + ); + +static const __m128i k_sb2u = + M128I_U64( + 0xE27A93C60B712400, + 0x5EB7E955BC982FCD + ); + +static const __m128i k_sb2t = + M128I_U64( + 0x69EB88400AE12900, + 0xC2A163C8AB82234A + ); + +static const __m128i k_sbou = + M128I_U64( + 0xD0D26D176FBDC700, + 0x15AABF7AC502A878 + ); + +static const __m128i k_sbot = + M128I_U64( + 0xCFE474A55FBB6A00, + 0x8E1E90D1412B35FA + ); + +static const __m128i k_mc_forward[4] = +{ + M128I_U64( + 0x0407060500030201, + 0x0C0F0E0D080B0A09 + ), + M128I_U64( + 0x080B0A0904070605, + 0x000302010C0F0E0D + ), + M128I_U64( + 0x0C0F0E0D080B0A09, + 0x0407060500030201 + ), + M128I_U64( + 0x000302010C0F0E0D, + 0x080B0A0904070605 + ) +}; + +static const __m128i k_mc_backward[4] = +{ + M128I_U64( + 0x0605040702010003, + 0x0E0D0C0F0A09080B + ), + M128I_U64( + 0x020100030E0D0C0F, + 0x0A09080B06050407 + ), + M128I_U64( + 0x0E0D0C0F0A09080B, + 0x0605040702010003 + ), + M128I_U64( + 0x0A09080B06050407, + 0x020100030E0D0C0F + ) +}; + +static const __m128i k_sr[4] = +{ + M128I_U64( + 0x0706050403020100, + 0x0F0E0D0C0B0A0908 + ), + M128I_U64( + 0x030E09040F0A0500, + 0x0B06010C07020D08 + ), + M128I_U64( + 0x0F060D040B020900, + 0x070E050C030A0108 + ), + M128I_U64( + 0x0B0E0104070A0D00, + 0x0306090C0F020508 + ) +}; + +static const __m128i k_rcon = + M128I_U64( + 0x1F8391B9AF9DEEB6, + 0x702A98084D7C7D81 + ); + +static const __m128i k_s63 = + M128I_U64( + 0x5B5B5B5B5B5B5B5B, + 0x5B5B5B5B5B5B5B5B + ); + +static const __m128i k_opt[2] = +{ + M128I_U64( + 0xFF9F4929D6B66000, + 0xF7974121DEBE6808 + ), + M128I_U64( + 0x01EDBD5150BCEC00, + 0xE10D5DB1B05C0CE0 + ) +}; + +static const __m128i k_deskew[2] = +{ + M128I_U64( + 0x07E4A34047A4E300, + 0x1DFEB95A5DBEF91A + ), + M128I_U64( + 0x5F36B5DC83EA6900, + 0x2841C2ABF49D1E77 + ) +}; + +static const __m128i k_dks_1[2] = +{ + M128I_U64( + 0xB6116FC87ED9A700, + 0x4AED933482255BFC + ), + M128I_U64( + 0x4576516227143300, + 0x8BB89FACE9DAFDCE + ) +}; + +static const __m128i 
k_dks_2[2] = +{ + M128I_U64( + 0x27438FEBCCA86400, + 0x4622EE8AADC90561 + ), + M128I_U64( + 0x815C13CE4F92DD00, + 0x73AEE13CBD602FF2 + ) +}; + +static const __m128i k_dks_3[2] = +{ + M128I_U64( + 0x03C4C50201C6C700, + 0xF83F3EF9FA3D3CFB + ), + M128I_U64( + 0xEE1921D638CFF700, + 0xA5526A9D7384BC4B + ) +}; + +static const __m128i k_dks_4[2] = +{ + M128I_U64( + 0xE3C390B053732000, + 0xA080D3F310306343 + ), + M128I_U64( + 0xA0CA214B036982E8, + 0x2F45AEC48CE60D67 + ) +}; + +static const __m128i k_dipt[2] = +{ + M128I_U64( + 0x0F505B040B545F00, + 0x154A411E114E451A + ), + M128I_U64( + 0x86E383E660056500, + 0x12771772F491F194 + ) +}; + +static const __m128i k_dsb9[2] = +{ + M128I_U64( + 0x851C03539A86D600, + 0xCAD51F504F994CC9 + ), + M128I_U64( + 0xC03B1789ECD74900, + 0x725E2C9EB2FBA565 + ) +}; + +static const __m128i k_dsbd[2] = +{ + M128I_U64( + 0x7D57CCDFE6B1A200, + 0xF56E9B13882A4439 + ), + M128I_U64( + 0x3CE2FAF724C6CB00, + 0x2931180D15DEEFD3 + ) +}; + +static const __m128i k_dsbb[2] = +{ + M128I_U64( + 0xD022649296B44200, + 0x602646F6B0F2D404 + ), + M128I_U64( + 0xC19498A6CD596700, + 0xF3FF0C3E3255AA6B + ) +}; + +static const __m128i k_dsbe[2] = +{ + M128I_U64( + 0x46F2929626D4D000, + 0x2242600464B4F6B0 + ), + M128I_U64( + 0x0C55A6CDFFAAC100, + 0x9467F36B98593E32 + ) +}; + +static const __m128i k_dsbo[2] = +{ + M128I_U64( + 0x1387EA537EF94000, + 0xC7AA6DB9D4943E2D + ), + M128I_U64( + 0x12D7560F93441D00, + 0xCA4B8159D8C58E9C + ) +}; + +/********************************************************************** + vector permutate AES + **********************************************************************/ + +struct vp_aes_config_s +{ + union + { + const byte *sched_keys; + byte *keysched; + }; + unsigned int nround; +}; + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_round(__m128i *pxmm0, __m128i *pxmm7, __m128i *pxmm8, + __m128i xmm9, __m128i xmm10, __m128i xmm11, + int low_round_only) +{ + /* aes_schedule_round + * + * Runs one main round of the key schedule on %xmm0, %xmm7 + * + * Specifically, runs subbytes on the high dword of %xmm0 + * then rotates it by one byte and xors into the low dword of + * %xmm7. + * + * Adds rcon from low byte of %xmm8, then rotates %xmm8 for + * next rcon. + * + * Smears the dwords of %xmm7 by xoring the low into the + * second low, result into third, result into highest. + * + * Returns results in %xmm7 = %xmm0. 
+ */ + + __m128i xmm1, xmm2, xmm3, xmm4; + __m128i xmm0 = *pxmm0; + __m128i xmm7 = *pxmm7; + __m128i xmm8 = *pxmm8; + + if (!low_round_only) + { + /* extract rcon from xmm8 */ + pxor128(xmm1, xmm1); + palignr128(15, xmm8, xmm1); + palignr128(15, xmm8, xmm8); + pxor128(xmm1, xmm7); + + /* rotate */ + pshufd128_0xFF(xmm0, xmm0); + palignr128(1, xmm0, xmm0); + } + + /* smear xmm7 */ + movdqa128(xmm7, xmm1); + pslldq128(4, xmm7); + pxor128(xmm1, xmm7); + movdqa128(xmm7, xmm1); + pslldq128(8, xmm7); + pxor128(xmm1, xmm7); + pxor128(k_s63, xmm7); + + /* subbytes */ + movdqa128(xmm9, xmm1); + pandn128(xmm0, xmm1); + psrld128(4, xmm1); /* 1 = i */ + pand128(xmm9, xmm0); /* 0 = k */ + movdqa128(xmm11, xmm2); /* 2 : a/k */ + pshufb128(xmm0, xmm2); /* 2 = a/k */ + pxor128(xmm1, xmm0); /* 0 = j */ + movdqa128(xmm10, xmm3); /* 3 : 1/i */ + pshufb128(xmm1, xmm3); /* 3 = 1/i */ + pxor128(xmm2, xmm3); /* 3 = iak = 1/i + a/k */ + movdqa128(xmm10, xmm4); /* 4 : 1/j */ + pshufb128(xmm0, xmm4); /* 4 = 1/j */ + pxor128(xmm2, xmm4); /* 4 = jak = 1/j + a/k */ + movdqa128(xmm10, xmm2); /* 2 : 1/iak */ + pshufb128(xmm3, xmm2); /* 2 = 1/iak */ + pxor128(xmm0, xmm2); /* 2 = io */ + movdqa128(xmm10, xmm3); /* 3 : 1/jak */ + pshufb128(xmm4, xmm3); /* 3 = 1/jak */ + pxor128(xmm1, xmm3); /* 3 = jo */ + movdqa128(k_sb1u, xmm4); /* 4 : sbou */ + pshufb128(xmm2, xmm4); /* 4 = sbou */ + movdqa128(k_sb1t, xmm0); /* 0 : sbot */ + pshufb128(xmm3, xmm0); /* 0 = sb1t */ + pxor128(xmm4, xmm0); /* 0 = sbox output */ + + /* add in smeared stuff */ + pxor128(xmm7, xmm0); + movdqa128(xmm0, xmm7); + + *pxmm0 = xmm0; + *pxmm7 = xmm7; + *pxmm8 = xmm8; +} + +static ASM_FUNC_ATTR_INLINE __m128i +aes_schedule_transform(__m128i xmm0, const __m128i xmm9, + const __m128i tablelo, const __m128i tablehi) +{ + /* aes_schedule_transform + * + * Linear-transform %xmm0 according to tablelo:tablehi + * + * Requires that %xmm9 = 0x0F0F... as in preheat + * Output in %xmm0 + */ + + __m128i xmm1, xmm2; + + movdqa128(xmm9, xmm1); + pandn128(xmm0, xmm1); + psrld128(4, xmm1); + pand128(xmm9, xmm0); + movdqa128(tablelo, xmm2); + pshufb128(xmm0, xmm2); + movdqa128(tablehi, xmm0); + pshufb128(xmm1, xmm0); + pxor128(xmm2, xmm0); + + return xmm0; +} + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_mangle(__m128i xmm0, struct vp_aes_config_s *pconfig, int decrypt, + unsigned int *protoffs, __m128i xmm9) +{ + /* aes_schedule_mangle + * + * Mangle xmm0 from (basis-transformed) standard version + * to our version. 
+ * + * On encrypt, + * xor with 0x63 + * multiply by circulant 0,1,1,1 + * apply shiftrows transform + * + * On decrypt, + * xor with 0x63 + * multiply by 'inverse mixcolumns' circulant E,B,D,9 + * deskew + * apply shiftrows transform + * + * Writes out to (keysched), and increments or decrements it + * Keeps track of round number mod 4 in (rotoffs) + */ + __m128i xmm3, xmm4, xmm5; + struct vp_aes_config_s config = *pconfig; + byte *keysched = config.keysched; + unsigned int rotoffs = *protoffs; + + movdqa128(xmm0, xmm4); + movdqa128(k_mc_forward[0], xmm5); + + if (!decrypt) + { + keysched += 16; + pxor128(k_s63, xmm4); + pshufb128(xmm5, xmm4); + movdqa128(xmm4, xmm3); + pshufb128(xmm5, xmm4); + pxor128(xmm4, xmm3); + pshufb128(xmm5, xmm4); + pxor128(xmm4, xmm3); + } + else + { + /* first table: *9 */ + xmm0 = aes_schedule_transform(xmm0, xmm9, k_dks_1[0], k_dks_1[1]); + movdqa128(xmm0, xmm3); + pshufb128(xmm5, xmm3); + + /* next table: *B */ + xmm0 = aes_schedule_transform(xmm0, xmm9, k_dks_2[0], k_dks_2[1]); + pxor128(xmm0, xmm3); + pshufb128(xmm5, xmm3); + + /* next table: *D */ + xmm0 = aes_schedule_transform(xmm0, xmm9, k_dks_3[0], k_dks_3[1]); + pxor128(xmm0, xmm3); + pshufb128(xmm5, xmm3); + + /* next table: *E */ + xmm0 = aes_schedule_transform(xmm0, xmm9, k_dks_4[0], k_dks_4[1]); + pxor128(xmm0, xmm3); + pshufb128(xmm5, xmm3); + + keysched -= 16; + } + + pshufb128(k_sr[rotoffs], xmm3); + rotoffs -= 16 / 16; + rotoffs &= 48 / 16; + movdqu128_memst(xmm3, keysched); + + config.keysched = keysched; + *pconfig = config; + *protoffs = rotoffs; +} + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_mangle_last(__m128i xmm0, struct vp_aes_config_s config, + int decrypt, unsigned int rotoffs, __m128i xmm9) +{ + /* aes_schedule_mangle_last + * + * Mangler for last round of key schedule + * + * Mangles %xmm0 + * when encrypting, outputs out(%xmm0) ^ 63 + * when decrypting, outputs unskew(%xmm0) + */ + + if (!decrypt) + { + pshufb128(k_sr[rotoffs], xmm0); /* output permute */ + config.keysched += 16; + pxor128(k_s63, xmm0); + xmm0 = aes_schedule_transform(xmm0, xmm9, k_opt[0], k_opt[1]); + } + else + { + config.keysched -= 16; + pxor128(k_s63, xmm0); + xmm0 = aes_schedule_transform(xmm0, xmm9, k_deskew[0], k_deskew[1]); + } + + movdqu128_memst(xmm0, config.keysched); /* save last key */ +} + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_128(struct vp_aes_config_s config, int decrypt, + unsigned int rotoffs, __m128i xmm0, __m128i xmm7, + __m128i xmm8, __m128i xmm9, __m128i xmm10, __m128i xmm11) +{ + /* aes_schedule_128 + * + * 128-bit specific part of key schedule. + * + * This schedule is really simple, because all its parts + * are accomplished by the subroutines. + */ + + int r = 10; + + while (1) + { + aes_schedule_round(&xmm0, &xmm7, &xmm8, xmm9, xmm10, xmm11, 0); + + if (--r == 0) + break; + + aes_schedule_mangle(xmm0, &config, decrypt, &rotoffs, xmm9); + } + + aes_schedule_mangle_last(xmm0, config, decrypt, rotoffs, xmm9); +} + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_192_smear(__m128i *pxmm0, __m128i *pxmm6, __m128i xmm7) +{ + /* + * aes_schedule_192_smear + * + * Smear the short, low side in the 192-bit key schedule. 
+ * + * Inputs: + * %xmm7: high side, b a x y + * %xmm6: low side, d c 0 0 + * + * Outputs: + * %xmm6: b+c+d b+c 0 0 + * %xmm0: b+c+d b+c b a + */ + + __m128i xmm0 = *pxmm0; + __m128i xmm6 = *pxmm6; + + movdqa128(xmm6, xmm0); + pslldq128(4, xmm0); /* d c 0 0 -> c 0 0 0 */ + pxor128(xmm0, xmm6); /* -> c+d c 0 0 */ + pshufd128_0xFE(xmm7, xmm0); /* b a _ _ -> b b b a */ + pxor128(xmm6, xmm0); /* -> b+c+d b+c b a */ + movdqa128(xmm0, xmm6); + psrldq128(8, xmm6); + pslldq128(8, xmm6); /* clobber low side with zeros */ + + *pxmm0 = xmm0; + *pxmm6 = xmm6; +} + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_192(const byte *key, struct vp_aes_config_s config, int decrypt, + unsigned int rotoffs, __m128i xmm0, __m128i xmm7, + __m128i xmm8, __m128i xmm9, __m128i xmm10, __m128i xmm11) +{ + /* aes_schedule_192 + * + * 192-bit specific part of key schedule. + * + * The main body of this schedule is the same as the 128-bit + * schedule, but with more smearing. The long, high side is + * stored in %xmm7 as before, and the short, low side is in + * the high bits of %xmm6. + * + * This schedule is somewhat nastier, however, because each + * round produces 192 bits of key material, or 1.5 round keys. + * Therefore, on each cycle we do 2 rounds and produce 3 round + * keys. + */ + + __m128i xmm6; + int r = 4; + + movdqu128_memld(key + 8, xmm0); /* load key part 2 (very unaligned) */ + xmm0 = aes_schedule_transform(xmm0, xmm9, k_iptlo, k_ipthi); /* input transform */ + movdqa128(xmm0, xmm6); + psrldq128(8, xmm6); + pslldq128(8, xmm6); /* clobber low side with zeros */ + + while (1) + { + aes_schedule_round(&xmm0, &xmm7, &xmm8, xmm9, xmm10, xmm11, 0); + palignr128(8, xmm6, xmm0); + aes_schedule_mangle(xmm0, &config, decrypt, &rotoffs, xmm9); /* save key n */ + aes_schedule_192_smear(&xmm0, &xmm6, xmm7); + aes_schedule_mangle(xmm0, &config, decrypt, &rotoffs, xmm9); /* save key n+1 */ + aes_schedule_round(&xmm0, &xmm7, &xmm8, xmm9, xmm10, xmm11, 0); + if (--r == 0) + break; + aes_schedule_mangle(xmm0, &config, decrypt, &rotoffs, xmm9); /* save key n+2 */ + aes_schedule_192_smear(&xmm0, &xmm6, xmm7); + } + + aes_schedule_mangle_last(xmm0, config, decrypt, rotoffs, xmm9); +} + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_256(const byte *key, struct vp_aes_config_s config, int decrypt, + unsigned int rotoffs, __m128i xmm0, __m128i xmm7, + __m128i xmm8, __m128i xmm9, __m128i xmm10, __m128i xmm11) +{ + /* aes_schedule_256 + * + * 256-bit specific part of key schedule. + * + * The structure here is very similar to the 128-bit + * schedule, but with an additional 'low side' in + * %xmm6. The low side's rounds are the same as the + * high side's, except no rcon and no rotation. + */ + + __m128i xmm5, xmm6; + + int r = 7; + + movdqu128_memld(key + 16, xmm0); /* load key part 2 (unaligned) */ + xmm0 = aes_schedule_transform(xmm0, xmm9, k_iptlo, k_ipthi); /* input transform */ + + while (1) + { + aes_schedule_mangle(xmm0, &config, decrypt, &rotoffs, xmm9); /* output low result */ + movdqa128(xmm0, xmm6); /* save cur_lo in xmm6 */ + + /* high round */ + aes_schedule_round(&xmm0, &xmm7, &xmm8, xmm9, xmm10, xmm11, 0); + + if (--r == 0) + break; + + aes_schedule_mangle(xmm0, &config, decrypt, &rotoffs, xmm9); + + /* low round. 
swap xmm7 and xmm6 */ + pshufd128_0xFF(xmm0, xmm0); + movdqa128(xmm7, xmm5); + movdqa128(xmm6, xmm7); + aes_schedule_round(&xmm0, &xmm7, &xmm8, xmm9, xmm10, xmm11, 1); + movdqa128(xmm5, xmm7); + } + + aes_schedule_mangle_last(xmm0, config, decrypt, rotoffs, xmm9); +} + +static ASM_FUNC_ATTR_INLINE void +aes_schedule_core(const byte *key, struct vp_aes_config_s config, + int decrypt, unsigned int rotoffs) +{ + unsigned int keybits = (config.nround - 10) * 32 + 128; + __m128i xmm0, xmm3, xmm7, xmm8, xmm9, xmm10, xmm11; + + movdqa128(k_s0F, xmm9); + movdqa128(k_inv, xmm10); + movdqa128(k_inva, xmm11); + movdqa128(k_rcon, xmm8); + + movdqu128_memld(key, xmm0); + + /* input transform */ + movdqa128(xmm0, xmm3); + xmm0 = aes_schedule_transform(xmm0, xmm9, k_iptlo, k_ipthi); + movdqa128(xmm0, xmm7); + + if (!decrypt) + { + /* encrypting, output zeroth round key after transform */ + movdqu128_memst(xmm0, config.keysched); + } + else + { + /* decrypting, output zeroth round key after shiftrows */ + pshufb128(k_sr[rotoffs], xmm3); + movdqu128_memst(xmm3, config.keysched); + rotoffs ^= 48 / 16; + } + + if (keybits < 192) + { + aes_schedule_128(config, decrypt, rotoffs, xmm0, xmm7, xmm8, xmm9, + xmm10, xmm11); + } + else if (keybits == 192) + { + aes_schedule_192(key, config, decrypt, rotoffs, xmm0, xmm7, xmm8, xmm9, + xmm10, xmm11); + } + else + { + aes_schedule_256(key, config, decrypt, rotoffs, xmm0, xmm7, xmm8, xmm9, + xmm10, xmm11); + } +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_SETKEY (RIJNDAEL_context *ctx, const byte *key) +{ + unsigned int keybits = (ctx->rounds - 10) * 32 + 128; + struct vp_aes_config_s config; + __m128i xmm0, xmm1; + + config.nround = ctx->rounds; + config.keysched = (byte *)&ctx->keyschenc32[0][0]; + + aes_schedule_core(key, config, 0, 48 / 16); + + /* Save key for setting up decryption. */ + switch (keybits) + { + default: + case 128: + movdqu128_memld(key, xmm0); + movdqu128_memst(xmm0, ((byte *)&ctx->keyschdec32[0][0])); + break; + + case 192: + movdqu128_memld(key, xmm0); + movdqu128_memld(key + 8, xmm1); + movdqu128_memst(xmm0, ((byte *)&ctx->keyschdec32[0][0])); + movdqu128_memst(xmm1, ((byte *)&ctx->keyschdec32[0][0]) + 8); + break; + + case 256: + movdqu128_memld(key, xmm0); + movdqu128_memld(key + 16, xmm1); + movdqu128_memst(xmm0, ((byte *)&ctx->keyschdec32[0][0])); + movdqu128_memst(xmm1, ((byte *)&ctx->keyschdec32[0][0]) + 16); + break; + } + + clear_vec_regs(); +} + + +ASM_FUNC_ATTR_NOINLINE void +FUNC_PREPARE_DEC (RIJNDAEL_context *ctx) +{ + unsigned int keybits = (ctx->rounds - 10) * 32 + 128; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.keysched = (byte *)&ctx->keyschdec32[ctx->rounds][0]; + + aes_schedule_core((byte *)&ctx->keyschdec32[0][0], config, 1, + ((keybits == 192) ? 
0 : 32) / 16); + + clear_vec_regs(); +} + +#define enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15) \ + movdqa128(k_s0F, xmm9); \ + movdqa128(k_inv, xmm10); \ + movdqa128(k_inva, xmm11); \ + movdqa128(k_sb1u, xmm13); \ + movdqa128(k_sb1t, xmm12); \ + movdqa128(k_sb2u, xmm15); \ + movdqa128(k_sb2t, xmm14); + +#define dec_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8) \ + movdqa128(k_s0F, xmm9); \ + movdqa128(k_inv, xmm10); \ + movdqa128(k_inva, xmm11); \ + movdqa128(k_dsb9[0], xmm13); \ + movdqa128(k_dsb9[1], xmm12); \ + movdqa128(k_dsbd[0], xmm15); \ + movdqa128(k_dsbb[0], xmm14); \ + movdqa128(k_dsbe[0], xmm8); + +static ASM_FUNC_ATTR_INLINE __m128i +aes_encrypt_core(__m128i xmm0, struct vp_aes_config_s config, + __m128i xmm9, __m128i xmm10, __m128i xmm11, __m128i xmm12, + __m128i xmm13, __m128i xmm14, __m128i xmm15) +{ + __m128i xmm1, xmm2, xmm3, xmm4; + const byte *end_keys = config.sched_keys + 16 * config.nround; + unsigned int mc_pos = 1; + + movdqa128(k_iptlo, xmm2); + movdqa128(xmm9, xmm1); + pandn128(xmm0, xmm1); + psrld128(4, xmm1); + pand128(xmm9, xmm0); + pshufb128(xmm0, xmm2); + movdqa128(k_ipthi, xmm0); + + pshufb128(xmm1, xmm0); + pxor128_amemld(config.sched_keys, xmm2); + pxor128(xmm2, xmm0); + + config.sched_keys += 16; + + while (1) + { + /* top of round */ + movdqa128(xmm9, xmm1); /* 1 : i */ + pandn128(xmm0, xmm1); /* 1 = i<<4 */ + psrld128(4, xmm1); /* 1 = i */ + pand128(xmm9, xmm0); /* 0 = k */ + movdqa128(xmm11, xmm2); /* 2 : a/k */ + pshufb128(xmm0, xmm2); /* 2 = a/k */ + pxor128(xmm1, xmm0); /* 0 = j */ + movdqa128(xmm10, xmm3); /* 3 : 1/i */ + pshufb128(xmm1, xmm3); /* 3 = 1/i */ + pxor128(xmm2, xmm3); /* 3 = iak = 1/i + a/k */ + movdqa128(xmm10, xmm4); /* 4 : 1/j */ + pshufb128(xmm0, xmm4); /* 4 = 1/j */ + pxor128(xmm2, xmm4); /* 4 = jak = 1/j + a/k */ + movdqa128(xmm10, xmm2); /* 2 : 1/iak */ + pshufb128(xmm3, xmm2); /* 2 = 1/iak */ + pxor128(xmm0, xmm2); /* 2 = io */ + movdqa128(xmm10, xmm3); /* 3 : 1/jak */ + pshufb128(xmm4, xmm3); /* 3 = 1/jak */ + pxor128(xmm1, xmm3); /* 3 = jo */ + + if (config.sched_keys == end_keys) + break; + + /* middle of middle round */ + movdqa128(xmm13, xmm4); /* 4 : sb1u */ + pshufb128(xmm2, xmm4); /* 4 = sb1u */ + pxor128_amemld(config.sched_keys, xmm4); /* 4 = sb1u + k */ + movdqa128(xmm12, xmm0); /* 0 : sb1t */ + pshufb128(xmm3, xmm0); /* 0 = sb1t */ + pxor128(xmm4, xmm0); /* 0 = A */ + movdqa128(xmm15, xmm4); /* 4 : sb2u */ + pshufb128(xmm2, xmm4); /* 4 = sb2u */ + movdqa128(k_mc_forward[mc_pos], xmm1); + movdqa128(xmm14, xmm2); /* 2 : sb2t */ + pshufb128(xmm3, xmm2); /* 2 = sb2t */ + pxor128(xmm4, xmm2); /* 2 = 2A */ + movdqa128(xmm0, xmm3); /* 3 = A */ + pshufb128(xmm1, xmm0); /* 0 = B */ + pxor128(xmm2, xmm0); /* 0 = 2A+B */ + pshufb128(k_mc_backward[mc_pos], xmm3); /* 3 = D */ + pxor128(xmm0, xmm3); /* 3 = 2A+B+D */ + pshufb128(xmm1, xmm0); /* 0 = 2B+C */ + pxor128(xmm3, xmm0); /* 0 = 2A+3B+C+D */ + + config.sched_keys += 16; + mc_pos = (mc_pos + 1) % 4; /* next mc mod 4 */ + } + + /* middle of last round */ + movdqa128(k_sbou, xmm4); /* 3 : sbou */ + pshufb128(xmm2, xmm4); /* 4 = sbou */ + pxor128_amemld(config.sched_keys, xmm4); /* 4 = sb1u + k */ + movdqa128(k_sbot, xmm0); /* 0 : sbot */ + pshufb128(xmm3, xmm0); /* 0 = sb1t */ + pxor128(xmm4, xmm0); /* 0 = A */ + pshufb128(k_sr[mc_pos], xmm0); + + return xmm0; +} + +static ASM_FUNC_ATTR_INLINE void +aes_encrypt_core_2blks(__m128i *pxmm0_a, __m128i *pxmm0_b, + struct vp_aes_config_s config, + __m128i xmm9, __m128i xmm10, __m128i xmm11, + __m128i 
xmm12, __m128i xmm13, __m128i xmm14, + __m128i xmm15) +{ + __m128i xmm0_a, xmm0_b; + __m128i xmm1_a, xmm2_a, xmm3_a, xmm4_a; + __m128i xmm1_b, xmm2_b, xmm3_b, xmm4_b; + __m128i xmm5; + const byte *end_keys = config.sched_keys + 16 * config.nround; + unsigned int mc_pos = 1; + + xmm0_a = *pxmm0_a; + xmm0_b = *pxmm0_b; + + movdqa128(k_iptlo, xmm2_a); movdqa128(k_iptlo, xmm2_b); + movdqa128(xmm9, xmm1_a); movdqa128(xmm9, xmm1_b); + pandn128(xmm0_a, xmm1_a); pandn128(xmm0_b, xmm1_b); + psrld128(4, xmm1_a); psrld128(4, xmm1_b); + pand128(xmm9, xmm0_a); pand128(xmm9, xmm0_b); + pshufb128(xmm0_a, xmm2_a); pshufb128(xmm0_b, xmm2_b); + movdqa128(k_ipthi, xmm0_a); movdqa128(k_ipthi, xmm0_b); + + pshufb128(xmm1_a, xmm0_a); pshufb128(xmm1_b, xmm0_b); + movdqu128_memld(config.sched_keys, xmm5); + pxor128(xmm5, xmm2_a); pxor128(xmm5, xmm2_b); + pxor128(xmm2_a, xmm0_a); pxor128(xmm2_b, xmm0_b); + + config.sched_keys += 16; + + while (1) + { + /* top of round */ + movdqa128(xmm9, xmm1_a); movdqa128(xmm9, xmm1_b); + pandn128(xmm0_a, xmm1_a); pandn128(xmm0_b, xmm1_b); + psrld128(4, xmm1_a); psrld128(4, xmm1_b); + pand128(xmm9, xmm0_a); pand128(xmm9, xmm0_b); + movdqa128(xmm11, xmm2_a); movdqa128(xmm11, xmm2_b); + pshufb128(xmm0_a, xmm2_a); pshufb128(xmm0_b, xmm2_b); + pxor128(xmm1_a, xmm0_a); pxor128(xmm1_b, xmm0_b); + movdqa128(xmm10, xmm3_a); movdqa128(xmm10, xmm3_b); + pshufb128(xmm1_a, xmm3_a); pshufb128(xmm1_b, xmm3_b); + pxor128(xmm2_a, xmm3_a); pxor128(xmm2_b, xmm3_b); + movdqa128(xmm10, xmm4_a); movdqa128(xmm10, xmm4_b); + pshufb128(xmm0_a, xmm4_a); pshufb128(xmm0_b, xmm4_b); + pxor128(xmm2_a, xmm4_a); pxor128(xmm2_b, xmm4_b); + movdqa128(xmm10, xmm2_a); movdqa128(xmm10, xmm2_b); + pshufb128(xmm3_a, xmm2_a); pshufb128(xmm3_b, xmm2_b); + pxor128(xmm0_a, xmm2_a); pxor128(xmm0_b, xmm2_b); + movdqa128(xmm10, xmm3_a); movdqa128(xmm10, xmm3_b); + pshufb128(xmm4_a, xmm3_a); pshufb128(xmm4_b, xmm3_b); + pxor128(xmm1_a, xmm3_a); pxor128(xmm1_b, xmm3_b); + + if (config.sched_keys == end_keys) + break; + + /* middle of middle round */ + movdqa128(xmm13, xmm4_a); movdqa128(xmm13, xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + movdqu128_memld(config.sched_keys, xmm5); + pxor128(xmm5, xmm4_a); pxor128(xmm5, xmm4_b); + movdqa128(xmm12, xmm0_a); movdqa128(xmm12, xmm0_b); + pshufb128(xmm3_a, xmm0_a); pshufb128(xmm3_b, xmm0_b); + pxor128(xmm4_a, xmm0_a); pxor128(xmm4_b, xmm0_b); + movdqa128(xmm15, xmm4_a); movdqa128(xmm15, xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + movdqa128(k_mc_forward[mc_pos], xmm1_a); + movdqa128(k_mc_forward[mc_pos], xmm1_b); + movdqa128(xmm14, xmm2_a); movdqa128(xmm14, xmm2_b); + pshufb128(xmm3_a, xmm2_a); pshufb128(xmm3_b, xmm2_b); + pxor128(xmm4_a, xmm2_a); pxor128(xmm4_b, xmm2_b); + movdqa128(xmm0_a, xmm3_a); movdqa128(xmm0_b, xmm3_b); + pshufb128(xmm1_a, xmm0_a); pshufb128(xmm1_b, xmm0_b); + pxor128(xmm2_a, xmm0_a); pxor128(xmm2_b, xmm0_b); + pshufb128(k_mc_backward[mc_pos], xmm3_a); + pshufb128(k_mc_backward[mc_pos], xmm3_b); + pxor128(xmm0_a, xmm3_a); pxor128(xmm0_b, xmm3_b); + pshufb128(xmm1_a, xmm0_a); pshufb128(xmm1_b, xmm0_b); + pxor128(xmm3_a, xmm0_a); pxor128(xmm3_b, xmm0_b); + + config.sched_keys += 16; + mc_pos = (mc_pos + 1) % 4; /* next mc mod 4 */ + } + + /* middle of last round */ + movdqa128(k_sbou, xmm4_a); movdqa128(k_sbou, xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + movdqu128_memld(config.sched_keys, xmm5); + pxor128(xmm5, xmm4_a); pxor128(xmm5, xmm4_b); + movdqa128(k_sbot, xmm0_a); movdqa128(k_sbot, 
xmm0_b); + pshufb128(xmm3_a, xmm0_a); pshufb128(xmm3_b, xmm0_b); + pxor128(xmm4_a, xmm0_a); pxor128(xmm4_b, xmm0_b); + pshufb128(k_sr[mc_pos], xmm0_a); + pshufb128(k_sr[mc_pos], xmm0_b); + + *pxmm0_a = xmm0_a; + *pxmm0_b = xmm0_b; +} + +static ASM_FUNC_ATTR_INLINE __m128i +aes_decrypt_core(__m128i xmm0, struct vp_aes_config_s config, + __m128i xmm9, __m128i xmm10, __m128i xmm11, __m128i xmm12, + __m128i xmm13, __m128i xmm14, __m128i xmm15, __m128i xmm8) +{ + __m128i xmm1, xmm2, xmm3, xmm4, xmm5; + const byte *end_keys = config.sched_keys + 16 * config.nround; + unsigned int mc_pos = config.nround % 4; + + movdqa128(k_dipt[0], xmm2); + movdqa128(xmm9, xmm1); + pandn128(xmm0, xmm1); + psrld128(4, xmm1); + pand128(xmm9, xmm0); + pshufb128(xmm0, xmm2); + movdqa128(k_dipt[1], xmm0); + pshufb128(xmm1, xmm0); + pxor128_amemld(config.sched_keys, xmm2); + pxor128(xmm2, xmm0); + movdqa128(k_mc_forward[3], xmm5); + + config.sched_keys += 16; + + while (1) + { + /* top of round */ + movdqa128(xmm9, xmm1); /* 1 : i */ + pandn128(xmm0, xmm1); /* 1 = i<<4 */ + psrld128(4, xmm1); /* 1 = i */ + pand128(xmm9, xmm0); /* 0 = k */ + movdqa128(xmm11, xmm2); /* 2 : a/k */ + pshufb128(xmm0, xmm2); /* 2 = a/k */ + pxor128(xmm1, xmm0); /* 0 = j */ + movdqa128(xmm10, xmm3); /* 3 : 1/i */ + pshufb128(xmm1, xmm3); /* 3 = 1/i */ + pxor128(xmm2, xmm3); /* 3 = iak = 1/i + a/k */ + movdqa128(xmm10, xmm4); /* 4 : 1/j */ + pshufb128(xmm0, xmm4); /* 4 = 1/j */ + pxor128(xmm2, xmm4); /* 4 = jak = 1/j + a/k */ + movdqa128(xmm10, xmm2); /* 2 : 1/iak */ + pshufb128(xmm3, xmm2); /* 2 = 1/iak */ + pxor128(xmm0, xmm2); /* 2 = io */ + movdqa128(xmm10, xmm3); /* 3 : 1/jak */ + pshufb128(xmm4, xmm3); /* 3 = 1/jak */ + pxor128(xmm1, xmm3); /* 3 = jo */ + + if (config.sched_keys == end_keys) + break; + + /* Inverse mix columns */ + movdqa128(xmm13, xmm4); /* 4 : sb9u */ + pshufb128(xmm2, xmm4); /* 4 = sb9u */ + pxor128_amemld(config.sched_keys, xmm4); + movdqa128(xmm12, xmm0); /* 0 : sb9t */ + pshufb128(xmm3, xmm0); /* 0 = sb9t */ + movdqa128(k_dsbd[1], xmm1); /* 1 : sbdt */ + pxor128(xmm4, xmm0); /* 0 = ch */ + + pshufb128(xmm5, xmm0); /* MC ch */ + movdqa128(xmm15, xmm4); /* 4 : sbdu */ + pshufb128(xmm2, xmm4); /* 4 = sbdu */ + pxor128(xmm0, xmm4); /* 4 = ch */ + pshufb128(xmm3, xmm1); /* 1 = sbdt */ + pxor128(xmm4, xmm1); /* 1 = ch */ + + pshufb128(xmm5, xmm1); /* MC ch */ + movdqa128(xmm14, xmm4); /* 4 : sbbu */ + pshufb128(xmm2, xmm4); /* 4 = sbbu */ + pxor128(xmm1, xmm4); /* 4 = ch */ + movdqa128(k_dsbb[1], xmm0); /* 0 : sbbt */ + pshufb128(xmm3, xmm0); /* 0 = sbbt */ + pxor128(xmm4, xmm0); /* 0 = ch */ + + pshufb128(xmm5, xmm0); /* MC ch */ + movdqa128(xmm8, xmm4); /* 4 : sbeu */ + pshufb128(xmm2, xmm4); /* 4 = sbeu */ + pshufd128_0x93(xmm5, xmm5); + pxor128(xmm0, xmm4); /* 4 = ch */ + movdqa128(k_dsbe[1], xmm0); /* 0 : sbet */ + pshufb128(xmm3, xmm0); /* 0 = sbet */ + pxor128(xmm4, xmm0); /* 0 = ch */ + + config.sched_keys += 16; + } + + /* middle of last round */ + movdqa128(k_dsbo[0], xmm4); /* 3 : sbou */ + pshufb128(xmm2, xmm4); /* 4 = sbou */ + pxor128_amemld(config.sched_keys, xmm4); /* 4 = sb1u + k */ + movdqa128(k_dsbo[1], xmm0); /* 0 : sbot */ + pshufb128(xmm3, xmm0); /* 0 = sb1t */ + pxor128(xmm4, xmm0); /* 0 = A */ + pshufb128(k_sr[mc_pos], xmm0); + + return xmm0; +} + +static ASM_FUNC_ATTR_INLINE void +aes_decrypt_core_2blks(__m128i *pxmm0_a, __m128i *pxmm0_b, + struct vp_aes_config_s config, + __m128i xmm9, __m128i xmm10, __m128i xmm11, + __m128i xmm12, __m128i xmm13, __m128i xmm14, + __m128i xmm15, __m128i xmm8) +{ 
+ __m128i xmm0_a, xmm0_b; + __m128i xmm1_a, xmm2_a, xmm3_a, xmm4_a; + __m128i xmm1_b, xmm2_b, xmm3_b, xmm4_b; + __m128i xmm5, xmm6; + const byte *end_keys = config.sched_keys + 16 * config.nround; + unsigned int mc_pos = config.nround % 4; + + xmm0_a = *pxmm0_a; + xmm0_b = *pxmm0_b; + + movdqa128(k_dipt[0], xmm2_a); movdqa128(k_dipt[0], xmm2_b); + movdqa128(xmm9, xmm1_a); movdqa128(xmm9, xmm1_b); + pandn128(xmm0_a, xmm1_a); pandn128(xmm0_b, xmm1_b); + psrld128(4, xmm1_a); psrld128(4, xmm1_b); + pand128(xmm9, xmm0_a); pand128(xmm9, xmm0_b); + pshufb128(xmm0_a, xmm2_a); pshufb128(xmm0_b, xmm2_b); + movdqa128(k_dipt[1], xmm0_a); movdqa128(k_dipt[1], xmm0_b); + pshufb128(xmm1_a, xmm0_a); pshufb128(xmm1_b, xmm0_b); + movdqu128_memld(config.sched_keys, xmm6); + pxor128(xmm6, xmm2_a); pxor128(xmm6, xmm2_b); + pxor128(xmm2_a, xmm0_a); pxor128(xmm2_b, xmm0_b); + movdqa128(k_mc_forward[3], xmm5); + + config.sched_keys += 16; + + while (1) + { + /* top of round */ + movdqa128(xmm9, xmm1_a); movdqa128(xmm9, xmm1_b); + pandn128(xmm0_a, xmm1_a); pandn128(xmm0_b, xmm1_b); + psrld128(4, xmm1_a); psrld128(4, xmm1_b); + pand128(xmm9, xmm0_a); pand128(xmm9, xmm0_b); + movdqa128(xmm11, xmm2_a); movdqa128(xmm11, xmm2_b); + pshufb128(xmm0_a, xmm2_a); pshufb128(xmm0_b, xmm2_b); + pxor128(xmm1_a, xmm0_a); pxor128(xmm1_b, xmm0_b); + movdqa128(xmm10, xmm3_a); movdqa128(xmm10, xmm3_b); + pshufb128(xmm1_a, xmm3_a); pshufb128(xmm1_b, xmm3_b); + pxor128(xmm2_a, xmm3_a); pxor128(xmm2_b, xmm3_b); + movdqa128(xmm10, xmm4_a); movdqa128(xmm10, xmm4_b); + pshufb128(xmm0_a, xmm4_a); pshufb128(xmm0_b, xmm4_b); + pxor128(xmm2_a, xmm4_a); pxor128(xmm2_b, xmm4_b); + movdqa128(xmm10, xmm2_a); movdqa128(xmm10, xmm2_b); + pshufb128(xmm3_a, xmm2_a); pshufb128(xmm3_b, xmm2_b); + pxor128(xmm0_a, xmm2_a); pxor128(xmm0_b, xmm2_b); + movdqa128(xmm10, xmm3_a); movdqa128(xmm10, xmm3_b); + pshufb128(xmm4_a, xmm3_a); pshufb128(xmm4_b, xmm3_b); + pxor128(xmm1_a, xmm3_a); pxor128(xmm1_b, xmm3_b); + + if (config.sched_keys == end_keys) + break; + + /* Inverse mix columns */ + movdqa128(xmm13, xmm4_a); movdqa128(xmm13, xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + movdqu128_memld(config.sched_keys, xmm6); + pxor128(xmm6, xmm4_a); pxor128(xmm6, xmm4_b); + movdqa128(xmm12, xmm0_a); movdqa128(xmm12, xmm0_b); + pshufb128(xmm3_a, xmm0_a); pshufb128(xmm3_b, xmm0_b); + movdqa128(k_dsbd[1], xmm1_a); movdqa128(k_dsbd[1], xmm1_b); + pxor128(xmm4_a, xmm0_a); pxor128(xmm4_b, xmm0_b); + + pshufb128(xmm5, xmm0_a); pshufb128(xmm5, xmm0_b); + movdqa128(xmm15, xmm4_a); movdqa128(xmm15, xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + pxor128(xmm0_a, xmm4_a); pxor128(xmm0_b, xmm4_b); + pshufb128(xmm3_a, xmm1_a); pshufb128(xmm3_b, xmm1_b); + pxor128(xmm4_a, xmm1_a); pxor128(xmm4_b, xmm1_b); + + pshufb128(xmm5, xmm1_a); pshufb128(xmm5, xmm1_b); + movdqa128(xmm14, xmm4_a); movdqa128(xmm14, xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + pxor128(xmm1_a, xmm4_a); pxor128(xmm1_b, xmm4_b); + movdqa128(k_dsbb[1], xmm0_a); movdqa128(k_dsbb[1], xmm0_b); + pshufb128(xmm3_a, xmm0_a); pshufb128(xmm3_b, xmm0_b); + pxor128(xmm4_a, xmm0_a); pxor128(xmm4_b, xmm0_b); + + pshufb128(xmm5, xmm0_a); pshufb128(xmm5, xmm0_b); + movdqa128(xmm8, xmm4_a); movdqa128(xmm8, xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + pshufd128_0x93(xmm5, xmm5); + pxor128(xmm0_a, xmm4_a); pxor128(xmm0_b, xmm4_b); + movdqa128(k_dsbe[1], xmm0_a); movdqa128(k_dsbe[1], xmm0_b); + pshufb128(xmm3_a, xmm0_a); pshufb128(xmm3_b, xmm0_b); + 
pxor128(xmm4_a, xmm0_a); pxor128(xmm4_b, xmm0_b); + + config.sched_keys += 16; + } + + /* middle of last round */ + movdqa128(k_dsbo[0], xmm4_a); movdqa128(k_dsbo[0], xmm4_b); + pshufb128(xmm2_a, xmm4_a); pshufb128(xmm2_b, xmm4_b); + movdqu128_memld(config.sched_keys, xmm6); + pxor128(xmm6, xmm4_a); pxor128(xmm6, xmm4_b); + movdqa128(k_dsbo[1], xmm0_a); movdqa128(k_dsbo[1], xmm0_b); + pshufb128(xmm3_a, xmm0_a); pshufb128(xmm3_b, xmm0_b); + pxor128(xmm4_a, xmm0_a); pxor128(xmm4_b, xmm0_b); + pshufb128(k_sr[mc_pos], xmm0_a); + pshufb128(k_sr[mc_pos], xmm0_b); + + *pxmm0_a = xmm0_a; + *pxmm0_b = xmm0_b; +} + +ASM_FUNC_ATTR_NOINLINE unsigned int +FUNC_ENCRYPT (const RIJNDAEL_context *ctx, unsigned char *dst, + const unsigned char *src) +{ + __m128i xmm0, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(src, xmm0); + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memst(xmm0, dst); + + clear_vec_regs(); + + return 0; +} + +ASM_FUNC_ATTR_NOINLINE unsigned int +FUNC_DECRYPT (const RIJNDAEL_context *ctx, unsigned char *dst, + const unsigned char *src) +{ + __m128i xmm0, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschdec[0][0]; + + dec_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8); + + movdqu128_memld(src, xmm0); + + xmm0 = aes_decrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8); + + movdqu128_memst(xmm0, dst); + + clear_vec_regs(); + + return 0; +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_CFB_ENC (RIJNDAEL_context *ctx, unsigned char *iv, + unsigned char *outbuf, const unsigned char *inbuf, + size_t nblocks) +{ + __m128i xmm0, xmm1, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(iv, xmm0); + + for (; nblocks; nblocks--) + { + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(inbuf, xmm1); + pxor128(xmm1, xmm0); + movdqu128_memst(xmm0, outbuf); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + movdqu128_memst(xmm0, iv); + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_CBC_ENC (RIJNDAEL_context *ctx, unsigned char *iv, + unsigned char *outbuf, const unsigned char *inbuf, + size_t nblocks, int cbc_mac) +{ + __m128i xmm0, xmm7, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + struct vp_aes_config_s config; + size_t outbuf_add = (!cbc_mac) * BLOCKSIZE; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(iv, xmm7); + + for (; nblocks; nblocks--) + { + movdqu128_memld(inbuf, xmm0); + pxor128(xmm7, xmm0); + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqa128(xmm0, xmm7); + movdqu128_memst(xmm0, outbuf); + + inbuf += BLOCKSIZE; + outbuf += outbuf_add; + } + + movdqu128_memst(xmm7, iv); + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_CTR_ENC (RIJNDAEL_context *ctx, unsigned char *ctr, + unsigned char *outbuf, const unsigned char *inbuf, + size_t nblocks) +{ + __m128i xmm0, 
xmm1, xmm2, xmm3, xmm6, xmm7, xmm8; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + static const __m128i be_mask = + M128I_BYTE(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0); + static const __m128i bigendian_add = + M128I_BYTE(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1); + static const __m128i carry_add = M128I_U64(1, 1); + static const __m128i nocarry_add = M128I_U64(1, 0); + u64 ctrlow = buf_get_be64(ctr + 8); + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqa128(bigendian_add, xmm8); /* Preload byte add */ + movdqu128_memld(ctr, xmm7); /* Preload CTR */ + movdqa128(be_mask, xmm6); /* Preload mask */ + + for (; nblocks >= 2; nblocks -= 2) + { + movdqa128(xmm7, xmm0); + + /* detect if 8-bit carry handling is needed */ + if (UNLIKELY(((ctrlow += 2) & 0xff) <= 1)) + { + pshufb128(xmm6, xmm7); + + /* detect if 64-bit carry handling is needed */ + if (UNLIKELY(ctrlow == 1)) + { + paddq128(carry_add, xmm7); + movdqa128(xmm7, xmm1); + pshufb128(xmm6, xmm1); + paddq128(nocarry_add, xmm7); + } + else if (UNLIKELY(ctrlow == 0)) + { + paddq128(nocarry_add, xmm7); + movdqa128(xmm7, xmm1); + pshufb128(xmm6, xmm1); + paddq128(carry_add, xmm7); + } + else + { + paddq128(nocarry_add, xmm7); + movdqa128(xmm7, xmm1); + pshufb128(xmm6, xmm1); + paddq128(nocarry_add, xmm7); + } + + pshufb128(xmm6, xmm7); + } + else + { + paddb128(xmm8, xmm7); + movdqa128(xmm7, xmm1); + paddb128(xmm8, xmm7); + } + + aes_encrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(inbuf, xmm2); + movdqu128_memld(inbuf + BLOCKSIZE, xmm3); + pxor128(xmm2, xmm0); + pxor128(xmm3, xmm1); + movdqu128_memst(xmm0, outbuf); + movdqu128_memst(xmm1, outbuf + BLOCKSIZE); + + outbuf += 2 * BLOCKSIZE; + inbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + movdqa128(xmm7, xmm0); + + /* detect if 8-bit carry handling is needed */ + if (UNLIKELY((++ctrlow & 0xff) == 0)) + { + pshufb128(xmm6, xmm7); + + /* detect if 64-bit carry handling is needed */ + paddq128(UNLIKELY(ctrlow == 0) ? 
carry_add : nocarry_add, xmm7); + + pshufb128(xmm6, xmm7); + } + else + { + paddb128(xmm8, xmm7); + } + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(inbuf, xmm1); + pxor128(xmm1, xmm0); + movdqu128_memst(xmm0, outbuf); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + movdqu128_memst(xmm7, ctr); + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_CTR32LE_ENC (RIJNDAEL_context *ctx, unsigned char *ctr, + unsigned char *outbuf, const unsigned char *inbuf, + size_t nblocks) +{ + __m128i xmm0, xmm1, xmm2, xmm3, xmm7, xmm8; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + static const __m128i add_one = M128I_U64(1, 0); + static const __m128i add_two = M128I_U64(2, 0); + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqa128(add_one, xmm8); /* Preload byte add */ + movdqu128_memld(ctr, xmm7); /* Preload CTR */ + + for (; nblocks >= 2; nblocks -= 2) + { + movdqa128(xmm7, xmm0); + movdqa128(xmm7, xmm1); + paddd128(xmm8, xmm1); + paddd128(add_two, xmm7); + + aes_encrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(inbuf, xmm2); + movdqu128_memld(inbuf + BLOCKSIZE, xmm3); + pxor128(xmm2, xmm0); + pxor128(xmm3, xmm1); + movdqu128_memst(xmm0, outbuf); + movdqu128_memst(xmm1, outbuf + BLOCKSIZE); + + outbuf += 2 * BLOCKSIZE; + inbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + movdqa128(xmm7, xmm0); + paddd128(xmm8, xmm7); + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(inbuf, xmm1); + pxor128(xmm1, xmm0); + movdqu128_memst(xmm0, outbuf); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + movdqu128_memst(xmm7, ctr); + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_CFB_DEC (RIJNDAEL_context *ctx, unsigned char *iv, + unsigned char *outbuf, const unsigned char *inbuf, + size_t nblocks) +{ + __m128i xmm0, xmm1, xmm2, xmm6, xmm9; + __m128i xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(iv, xmm0); + + for (; nblocks >= 2; nblocks -= 2) + { + movdqa128(xmm0, xmm1); + movdqu128_memld(inbuf, xmm2); + movdqu128_memld(inbuf + BLOCKSIZE, xmm0); + movdqa128(xmm2, xmm6); + + aes_encrypt_core_2blks(&xmm1, &xmm2, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + pxor128(xmm6, xmm1); + pxor128(xmm0, xmm2); + movdqu128_memst(xmm1, outbuf); + movdqu128_memst(xmm2, outbuf + BLOCKSIZE); + + outbuf += 2 * BLOCKSIZE; + inbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqa128(xmm0, xmm6); + movdqu128_memld(inbuf, xmm0); + pxor128(xmm0, xmm6); + movdqu128_memst(xmm6, outbuf); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + movdqu128_memst(xmm0, iv); + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_CBC_DEC (RIJNDAEL_context *ctx, unsigned char *iv, + unsigned char *outbuf, const unsigned char *inbuf, + size_t nblocks) +{ + __m128i xmm0, xmm1, xmm5, xmm6, xmm7; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8; + struct vp_aes_config_s config; + + if (!ctx->decryption_prepared) + { + FUNC_PREPARE_DEC (ctx); + 
ctx->decryption_prepared = 1; + } + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschdec[0][0]; + + dec_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8); + + movdqu128_memld(iv, xmm7); + + for (; nblocks >= 2; nblocks -= 2) + { + movdqu128_memld(inbuf, xmm0); + movdqu128_memld(inbuf + BLOCKSIZE, xmm1); + movdqa128(xmm0, xmm5); + movdqa128(xmm1, xmm6); + + aes_decrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15, xmm8); + + pxor128(xmm7, xmm0); + pxor128(xmm5, xmm1); + movdqu128_memst(xmm0, outbuf); + movdqu128_memst(xmm1, outbuf + BLOCKSIZE); + movdqa128(xmm6, xmm7); + + outbuf += 2 * BLOCKSIZE; + inbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + movdqu128_memld(inbuf, xmm0); + movdqa128(xmm0, xmm6); + + xmm0 = aes_decrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, + xmm8); + + pxor128(xmm7, xmm0); + movdqu128_memst(xmm0, outbuf); + movdqa128(xmm6, xmm7); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + movdqu128_memst(xmm7, iv); + + clear_vec_regs(); +} + +static ASM_FUNC_ATTR_NOINLINE size_t +aes_simd128_ocb_enc (gcry_cipher_hd_t c, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks) +{ + __m128i xmm0, xmm1, xmm2, xmm3, xmm6, xmm7; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + RIJNDAEL_context *ctx = (void *)&c->context.c; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + u64 n = c->u_mode.ocb.data_nblocks; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + /* Preload Offset and Checksum */ + movdqu128_memld(c->u_iv.iv, xmm7); + movdqu128_memld(c->u_ctr.ctr, xmm6); + + for (; nblocks >= 2; nblocks -= 2) + { + const unsigned char *l; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm2); + movdqu128_memld(inbuf, xmm0); + movdqu128_memld(inbuf + BLOCKSIZE, xmm1); + movdqa128(xmm7, xmm3); + pxor128(xmm2, xmm3); + pxor128(xmm0, xmm6); + pxor128(xmm3, xmm0); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm2); + movdqa128(xmm3, xmm7); + pxor128(xmm2, xmm7); + pxor128(xmm1, xmm6); + pxor128(xmm7, xmm1); + + aes_encrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + pxor128(xmm3, xmm0); + pxor128(xmm7, xmm1); + movdqu128_memst(xmm0, outbuf); + movdqu128_memst(xmm1, outbuf + BLOCKSIZE); + + inbuf += 2 * BLOCKSIZE; + outbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + const unsigned char *l; + + l = ocb_get_l(c, ++n); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + movdqu128_memld(l, xmm1); + movdqu128_memld(inbuf, xmm0); + pxor128(xmm1, xmm7); + pxor128(xmm0, xmm6); + pxor128(xmm7, xmm0); + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + pxor128(xmm7, xmm0); + movdqu128_memst(xmm0, outbuf); + + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } + + c->u_mode.ocb.data_nblocks = n; + movdqu128_memst(xmm7, c->u_iv.iv); + movdqu128_memst(xmm6, c->u_ctr.ctr); + + clear_vec_regs(); + + return 0; +} + +static 
ASM_FUNC_ATTR_NOINLINE size_t +aes_simd128_ocb_dec (gcry_cipher_hd_t c, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks) +{ + __m128i xmm0, xmm1, xmm2, xmm3, xmm6, xmm7; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8; + RIJNDAEL_context *ctx = (void *)&c->context.c; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + u64 n = c->u_mode.ocb.data_nblocks; + struct vp_aes_config_s config; + + if (!ctx->decryption_prepared) + { + FUNC_PREPARE_DEC (ctx); + ctx->decryption_prepared = 1; + } + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschdec[0][0]; + + dec_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8); + + /* Preload Offset and Checksum */ + movdqu128_memld(c->u_iv.iv, xmm7); + movdqu128_memld(c->u_ctr.ctr, xmm6); + + for (; nblocks >= 2; nblocks -= 2) + { + const unsigned char *l; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm2); + movdqu128_memld(inbuf, xmm0); + movdqu128_memld(inbuf + BLOCKSIZE, xmm1); + movdqa128(xmm7, xmm3); + pxor128(xmm2, xmm3); + pxor128(xmm3, xmm0); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm2); + movdqa128(xmm3, xmm7); + pxor128(xmm2, xmm7); + pxor128(xmm7, xmm1); + + aes_decrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15, xmm8); + + pxor128(xmm3, xmm0); + pxor128(xmm7, xmm1); + pxor128(xmm0, xmm6); + pxor128(xmm1, xmm6); + movdqu128_memst(xmm0, outbuf); + movdqu128_memst(xmm1, outbuf + BLOCKSIZE); + + inbuf += 2 * BLOCKSIZE; + outbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + const unsigned char *l; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm1); + movdqu128_memld(inbuf, xmm0); + pxor128(xmm1, xmm7); + pxor128(xmm7, xmm0); + + xmm0 = aes_decrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, + xmm8); + + pxor128(xmm7, xmm0); + pxor128(xmm0, xmm6); + movdqu128_memst(xmm0, outbuf); + + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } + + c->u_mode.ocb.data_nblocks = n; + movdqu128_memst(xmm7, c->u_iv.iv); + movdqu128_memst(xmm6, c->u_ctr.ctr); + + clear_vec_regs(); + + return 0; +} + +ASM_FUNC_ATTR_NOINLINE size_t +FUNC_OCB_CRYPT(gcry_cipher_hd_t c, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks, int encrypt) +{ + if (encrypt) + return aes_simd128_ocb_enc(c, outbuf_arg, inbuf_arg, nblocks); + else + return aes_simd128_ocb_dec(c, outbuf_arg, inbuf_arg, nblocks); +} + +ASM_FUNC_ATTR_NOINLINE size_t +FUNC_OCB_AUTH(gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) +{ + __m128i xmm0, xmm1, xmm2, xmm6, xmm7; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + RIJNDAEL_context *ctx = (void *)&c->context.c; + const unsigned char *abuf = abuf_arg; + u64 n = c->u_mode.ocb.aad_nblocks; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + /* Preload Offset and Sum */ + movdqu128_memld(c->u_mode.ocb.aad_offset, xmm7); + movdqu128_memld(c->u_mode.ocb.aad_sum, xmm6); + + for (; nblocks >= 2; nblocks -= 2) + { + const 
unsigned char *l; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm2); + movdqu128_memld(abuf, xmm0); + movdqu128_memld(abuf + BLOCKSIZE, xmm1); + pxor128(xmm2, xmm7); + pxor128(xmm7, xmm0); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm2); + pxor128(xmm2, xmm7); + pxor128(xmm7, xmm1); + + aes_encrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + pxor128(xmm0, xmm6); + pxor128(xmm1, xmm6); + + abuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + const unsigned char *l; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ + l = ocb_get_l(c, ++n); + movdqu128_memld(l, xmm1); + movdqu128_memld(abuf, xmm0); + pxor128(xmm1, xmm7); + pxor128(xmm7, xmm0); + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + pxor128(xmm0, xmm6); + + abuf += BLOCKSIZE; + } + + c->u_mode.ocb.aad_nblocks = n; + movdqu128_memst(xmm7, c->u_mode.ocb.aad_offset); + movdqu128_memst(xmm6, c->u_mode.ocb.aad_sum); + + clear_vec_regs(); + + return 0; +} + +ASM_FUNC_ATTR_NOINLINE void +aes_simd128_ecb_enc (void *context, void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + __m128i xmm0, xmm1, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + RIJNDAEL_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + for (; nblocks >= 2; nblocks -= 2) + { + movdqu128_memld(inbuf + 0 * BLOCKSIZE, xmm0); + movdqu128_memld(inbuf + 1 * BLOCKSIZE, xmm1); + + aes_encrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15); + + movdqu128_memst(xmm0, outbuf + 0 * BLOCKSIZE); + movdqu128_memst(xmm1, outbuf + 1 * BLOCKSIZE); + + inbuf += 2 * BLOCKSIZE; + outbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + movdqu128_memld(inbuf, xmm0); + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15); + + movdqu128_memst(xmm0, outbuf); + + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +aes_simd128_ecb_dec (void *context, void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + __m128i xmm0, xmm1, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + RIJNDAEL_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + struct vp_aes_config_s config; + + if (!ctx->decryption_prepared) + { + FUNC_PREPARE_DEC (ctx); + ctx->decryption_prepared = 1; + } + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschdec[0][0]; + + dec_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8); + + for (; nblocks >= 2; nblocks -= 2) + { + movdqu128_memld(inbuf + 0 * BLOCKSIZE, xmm0); + movdqu128_memld(inbuf + 1 * BLOCKSIZE, xmm1); + + aes_decrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15, xmm8); + + movdqu128_memst(xmm0, outbuf + 0 * BLOCKSIZE); + movdqu128_memst(xmm1, outbuf + 1 * BLOCKSIZE); + + inbuf += 2 * BLOCKSIZE; + outbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + movdqu128_memld(inbuf, xmm0); + + xmm0 = 
aes_decrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15, xmm8); + + movdqu128_memst(xmm0, outbuf); + + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_ECB_CRYPT (void *context, void *outbuf_arg, const void *inbuf_arg, + size_t nblocks, int encrypt) +{ + if (encrypt) + aes_simd128_ecb_enc(context, outbuf_arg, inbuf_arg, nblocks); + else + aes_simd128_ecb_dec(context, outbuf_arg, inbuf_arg, nblocks); +} + +static ASM_FUNC_ATTR_INLINE __m128i xts_gfmul_byA (__m128i xmm5) +{ + static const __m128i xts_gfmul_const = M128I_U64(0x87, 0x01); + __m128i xmm1; + + pshufd128_0x4E(xmm5, xmm1); + psraq128(63, xmm1); + paddq128(xmm5, xmm5); + pand128(xts_gfmul_const, xmm1); + pxor128(xmm1, xmm5); + + return xmm5; +} + +ASM_FUNC_ATTR_NOINLINE void +aes_simd128_xts_enc (void *context, unsigned char *tweak, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks) +{ + __m128i xmm0, xmm1, xmm2, xmm3, xmm7; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + RIJNDAEL_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + struct vp_aes_config_s config; + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschenc[0][0]; + + enc_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + movdqu128_memld(tweak, xmm7); /* Preload tweak */ + + for (; nblocks >= 2; nblocks -= 2) + { + movdqu128_memld(inbuf, xmm0); + movdqu128_memld(inbuf + BLOCKSIZE, xmm1); + pxor128(xmm7, xmm0); + movdqa128(xmm7, xmm2); + xmm3 = xts_gfmul_byA(xmm7); + pxor128(xmm3, xmm1); + xmm7 = xts_gfmul_byA(xmm3); + + aes_encrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15); + + pxor128(xmm2, xmm0); + pxor128(xmm3, xmm1); + movdqu128_memst(xmm0, outbuf); + movdqu128_memst(xmm1, outbuf + BLOCKSIZE); + + outbuf += 2 * BLOCKSIZE; + inbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + movdqu128_memld(inbuf, xmm0); + pxor128(xmm7, xmm0); + movdqa128(xmm7, xmm2); + xmm7 = xts_gfmul_byA(xmm7); + + xmm0 = aes_encrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15); + + pxor128(xmm2, xmm0); + movdqu128_memst(xmm0, outbuf); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + movdqu128_memst(xmm7, tweak); + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +aes_simd128_xts_dec (void *context, unsigned char *tweak, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks) +{ + __m128i xmm0, xmm1, xmm2, xmm3, xmm7, xmm8; + __m128i xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15; + RIJNDAEL_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + struct vp_aes_config_s config; + + if (!ctx->decryption_prepared) + { + FUNC_PREPARE_DEC (ctx); + ctx->decryption_prepared = 1; + } + + config.nround = ctx->rounds; + config.sched_keys = ctx->keyschdec[0][0]; + + dec_preload(xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm8); + + movdqu128_memld(tweak, xmm7); /* Preload tweak */ + + for (; nblocks >= 2; nblocks -= 2) + { + movdqu128_memld(inbuf, xmm0); + movdqu128_memld(inbuf + BLOCKSIZE, xmm1); + pxor128(xmm7, xmm0); + movdqa128(xmm7, xmm2); + xmm3 = xts_gfmul_byA(xmm7); + pxor128(xmm3, xmm1); + xmm7 = xts_gfmul_byA(xmm3); + + aes_decrypt_core_2blks(&xmm0, &xmm1, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15, xmm8); + + pxor128(xmm2, xmm0); + pxor128(xmm3, xmm1); + movdqu128_memst(xmm0, outbuf); + movdqu128_memst(xmm1, outbuf + BLOCKSIZE); + + outbuf += 2 * BLOCKSIZE; 
+ inbuf += 2 * BLOCKSIZE; + } + + for (; nblocks; nblocks--) + { + movdqu128_memld(inbuf, xmm0); + pxor128(xmm7, xmm0); + movdqa128(xmm7, xmm2); + xmm7 = xts_gfmul_byA(xmm7); + + xmm0 = aes_decrypt_core(xmm0, config, + xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, + xmm15, xmm8); + + pxor128(xmm2, xmm0); + movdqu128_memst(xmm0, outbuf); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + movdqu128_memst(xmm7, tweak); + + clear_vec_regs(); +} + +ASM_FUNC_ATTR_NOINLINE void +FUNC_XTS_CRYPT (void *context, unsigned char *tweak, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks, int encrypt) +{ + if (encrypt) + aes_simd128_xts_enc(context, tweak, outbuf_arg, inbuf_arg, nblocks); + else + aes_simd128_xts_dec(context, tweak, outbuf_arg, inbuf_arg, nblocks); +} diff --git a/cipher/rijndael.c b/cipher/rijndael.c index f1683007..12c27319 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -170,6 +170,60 @@ extern size_t _gcry_aes_ssse3_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg size_t nblocks); #endif +#ifdef USE_VP_AARCH64 +/* AArch64 vector permutation implementation of AES */ +extern void _gcry_aes_vp_aarch64_do_setkey(RIJNDAEL_context *ctx, + const byte *key); +extern void _gcry_aes_vp_aarch64_prepare_decryption(RIJNDAEL_context *ctx); + +extern unsigned int _gcry_aes_vp_aarch64_encrypt (const RIJNDAEL_context *ctx, + unsigned char *dst, + const unsigned char *src); +extern unsigned int _gcry_aes_vp_aarch64_decrypt (const RIJNDAEL_context *ctx, + unsigned char *dst, + const unsigned char *src); +extern void _gcry_aes_vp_aarch64_cfb_enc (void *context, unsigned char *iv, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks); +extern void _gcry_aes_vp_aarch64_cbc_enc (void *context, unsigned char *iv, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + int cbc_mac); +extern void _gcry_aes_vp_aarch64_ctr_enc (void *context, unsigned char *ctr, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks); +extern void _gcry_aes_vp_aarch64_ctr32le_enc (void *context, unsigned char *ctr, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks); +extern void _gcry_aes_vp_aarch64_cfb_dec (void *context, unsigned char *iv, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks); +extern void _gcry_aes_vp_aarch64_cbc_dec (void *context, unsigned char *iv, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks); +extern size_t _gcry_aes_vp_aarch64_ocb_crypt (gcry_cipher_hd_t c, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, + int encrypt); +extern size_t _gcry_aes_vp_aarch64_ocb_auth (gcry_cipher_hd_t c, + const void *abuf_arg, + size_t nblocks); +extern void _gcry_aes_vp_aarch64_ecb_crypt (void *context, void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, int encrypt); +extern void _gcry_aes_vp_aarch64_xts_crypt (void *context, unsigned char *tweak, + void *outbuf_arg, + const void *inbuf_arg, + size_t nblocks, int encrypt); +#endif + #ifdef USE_PADLOCK extern unsigned int _gcry_aes_padlock_encrypt (const RIJNDAEL_context *ctx, unsigned char *bx, @@ -641,6 +695,29 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen, bulk_ops->ecb_crypt = _gcry_aes_armv8_ce_ecb_crypt; } #endif +#ifdef USE_VP_AARCH64 + else if (hwfeatures & HWF_ARM_NEON) + { + hw_setkey = _gcry_aes_vp_aarch64_do_setkey; + ctx->encrypt_fn = _gcry_aes_vp_aarch64_encrypt; + ctx->decrypt_fn = _gcry_aes_vp_aarch64_decrypt; + ctx->prefetch_enc_fn = NULL; + ctx->prefetch_dec_fn = NULL; + ctx->prepare_decryption = 
_gcry_aes_vp_aarch64_prepare_decryption; + + /* Setup vector permute AArch64 bulk encryption routines. */ + bulk_ops->cfb_enc = _gcry_aes_vp_aarch64_cfb_enc; + bulk_ops->cfb_dec = _gcry_aes_vp_aarch64_cfb_dec; + bulk_ops->cbc_enc = _gcry_aes_vp_aarch64_cbc_enc; + bulk_ops->cbc_dec = _gcry_aes_vp_aarch64_cbc_dec; + bulk_ops->ctr_enc = _gcry_aes_vp_aarch64_ctr_enc; + bulk_ops->ctr32le_enc = _gcry_aes_vp_aarch64_ctr32le_enc; + bulk_ops->ocb_crypt = _gcry_aes_vp_aarch64_ocb_crypt; + bulk_ops->ocb_auth = _gcry_aes_vp_aarch64_ocb_auth; + bulk_ops->ecb_crypt = _gcry_aes_vp_aarch64_ecb_crypt; + bulk_ops->xts_crypt = _gcry_aes_vp_aarch64_xts_crypt; + } +#endif #ifdef USE_PPC_CRYPTO_WITH_PPC9LE else if ((hwfeatures & HWF_PPC_VCRYPTO) && (hwfeatures & HWF_PPC_ARCH_3_00)) { diff --git a/cipher/simd-common-aarch64.h b/cipher/simd-common-aarch64.h new file mode 100644 index 00000000..72e1b099 --- /dev/null +++ b/cipher/simd-common-aarch64.h @@ -0,0 +1,62 @@ +/* simd-common-aarch64.h - Common macros for AArch64 SIMD code + * + * Copyright (C) 2024 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#ifndef GCRY_SIMD_COMMON_AARCH64_H +#define GCRY_SIMD_COMMON_AARCH64_H + +#include + +#define memory_barrier_with_vec(a) __asm__("" : "+w"(a) :: "memory") + +#define clear_vec_regs() __asm__ volatile("movi v0.16b, #0;\n" \ + "movi v1.16b, #0;\n" \ + "movi v2.16b, #0;\n" \ + "movi v3.16b, #0;\n" \ + "movi v4.16b, #0;\n" \ + "movi v5.16b, #0;\n" \ + "movi v6.16b, #0;\n" \ + "movi v7.16b, #0;\n" \ + /* v8-v15 are ABI callee saved and \ + * get cleared by function \ + * epilog when used. 
*/ \ + "movi v16.16b, #0;\n" \ + "movi v17.16b, #0;\n" \ + "movi v18.16b, #0;\n" \ + "movi v19.16b, #0;\n" \ + "movi v20.16b, #0;\n" \ + "movi v21.16b, #0;\n" \ + "movi v22.16b, #0;\n" \ + "movi v23.16b, #0;\n" \ + "movi v24.16b, #0;\n" \ + "movi v25.16b, #0;\n" \ + "movi v26.16b, #0;\n" \ + "movi v27.16b, #0;\n" \ + "movi v28.16b, #0;\n" \ + "movi v29.16b, #0;\n" \ + "movi v30.16b, #0;\n" \ + "movi v31.16b, #0;\n" \ + ::: "memory", "v0", "v1", "v2", \ + "v3", "v4", "v5", "v6", "v7", \ + "v16", "v17", "v18", "v19", \ + "v20", "v21", "v22", "v23", \ + "v24", "v25", "v26", "v27", \ + "v28", "v29", "v30", "v31") + +#endif /* GCRY_SIMD_COMMON_AARCH64_H */ diff --git a/configure.ac b/configure.ac index 1a5dd20a..6347ea25 100644 --- a/configure.ac +++ b/configure.ac @@ -3054,6 +3054,9 @@ if test "$found" = "1" ; then # Build with the assembly implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-aarch64.lo" + # Build with the vector permute SIMD128 implementation + GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-vp-aarch64.lo" + # Build with the ARMv8/AArch64 CE implementation GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-armv8-ce.lo" GCRYPT_ASM_CIPHERS="$GCRYPT_ASM_CIPHERS rijndael-armv8-aarch64-ce.lo" -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:45 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:45 +0200 Subject: [PATCH 01/11] sm4-aarch64-sve: add missing .text section Message-ID: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> * cipher/sm4-armv9-aarch64-sve-ce.S: Add missing '.text'. -- Signed-off-by: Jussi Kivilinna --- cipher/sm4-armv9-aarch64-sve-ce.S | 1 + 1 file changed, 1 insertion(+) diff --git a/cipher/sm4-armv9-aarch64-sve-ce.S b/cipher/sm4-armv9-aarch64-sve-ce.S index f01a41bf..cdf20719 100644 --- a/cipher/sm4-armv9-aarch64-sve-ce.S +++ b/cipher/sm4-armv9-aarch64-sve-ce.S @@ -350,6 +350,7 @@ ELF(.size _gcry_sm4_armv9_svesm4_consts,.-_gcry_sm4_armv9_svesm4_consts) ext b0.16b, b0.16b, b0.16b, #8; \ rev32 b0.16b, b0.16b; +.text .align 4 .global _gcry_sm4_armv9_sve_ce_crypt -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:48 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:48 +0200 Subject: [PATCH 04/11] sm4-aarch64-ce: clear volatile vector registers In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-4-jussi.kivilinna@iki.fi> * cipher/sm4-armv8-aarch64-ce.S (_gcry_sm4_armv8_ce_expand_key) (_gcry_sm4_armv8_ce_crypt_blk1_8, _gcry_sm4_armv8_ce_crypt) (_gcry_sm4_armv8_ce_cbc_dec, _gcry_sm4_armv8_ce_cfb_dec) (_gcry_sm4_armv8_ce_ctr_enc, _gcry_sm4_armv8_ce_xts_crypt): Add CLEAR_ALL_REGS. 
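
As elsewhere in this series, the point of CLEAR_ALL_REGS is that the SM4 key
schedule and block data should not be left behind in caller-saved NEON
registers once these entry points return; v8-v15 are skipped because the
AArch64 procedure call standard makes their low halves callee-saved, so the
normal epilogue restores them.  For readers following along, here is a rough
C-level sketch of the same pattern.  It is not part of the patch: the function
name is invented and only a handful of registers are wiped, with inline asm
and explicit clobbers doing the work because plain C zeroing could be
optimised away or leave copies in other registers.

#include <arm_neon.h>
#include <stdint.h>

/* Illustration only: XOR one 16-byte keystream block into a buffer
 * with NEON, then wipe a few caller-saved vector registers before
 * returning.  The real CLEAR_ALL_REGS macro covers v0-v7 and v16-v31. */
void xor_block_and_wipe (uint8_t *buf, const uint8_t *keystream)
{
  uint8x16_t k = vld1q_u8 (keystream);
  uint8x16_t d = vld1q_u8 (buf);

  vst1q_u8 (buf, veorq_u8 (d, k));

  /* Inline asm with register clobbers guarantees the zeroing is not
   * dropped as dead code and really hits the architectural registers. */
  __asm__ volatile ("movi v0.16b, #0\n\t"
                    "movi v1.16b, #0\n\t"
                    "movi v2.16b, #0\n\t"
                    "movi v3.16b, #0\n\t"
                    ::: "memory", "v0", "v1", "v2", "v3");
}
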
-- Signed-off-by: Jussi Kivilinna --- cipher/sm4-armv8-aarch64-ce.S | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/cipher/sm4-armv8-aarch64-ce.S b/cipher/sm4-armv8-aarch64-ce.S index eea56cdf..01f3df92 100644 --- a/cipher/sm4-armv8-aarch64-ce.S +++ b/cipher/sm4-armv8-aarch64-ce.S @@ -290,6 +290,7 @@ _gcry_sm4_armv8_ce_expand_key: st1 {v1.16b}, [x2], #16; st1 {v0.16b}, [x2]; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv8_ce_expand_key,.-_gcry_sm4_armv8_ce_expand_key;) @@ -383,6 +384,7 @@ _gcry_sm4_armv8_ce_crypt_blk1_8: st1 {v7.16b}, [x1]; .Lblk8_store_output_done: + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv8_ce_crypt_blk1_8,.-_gcry_sm4_armv8_ce_crypt_blk1_8;) @@ -416,6 +418,7 @@ _gcry_sm4_armv8_ce_crypt: b .Lcrypt_loop_blk; .Lcrypt_end: + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv8_ce_crypt,.-_gcry_sm4_armv8_ce_crypt;) @@ -468,6 +471,7 @@ _gcry_sm4_armv8_ce_cbc_dec: /* store new IV */ st1 {RIV.16b}, [x3]; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv8_ce_cbc_dec,.-_gcry_sm4_armv8_ce_cbc_dec;) @@ -520,6 +524,7 @@ _gcry_sm4_armv8_ce_cfb_dec: /* store new IV */ st1 {v0.16b}, [x3]; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv8_ce_cfb_dec,.-_gcry_sm4_armv8_ce_cfb_dec;) @@ -588,6 +593,7 @@ _gcry_sm4_armv8_ce_ctr_enc: rev x8, x8; stp x7, x8, [x3]; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv8_ce_ctr_enc,.-_gcry_sm4_armv8_ce_ctr_enc;) @@ -713,17 +719,8 @@ _gcry_sm4_armv8_ce_xts_crypt: /* store new tweak */ st1 {v8.16b}, [x3] - CLEAR_REG(v8) - CLEAR_REG(v9) - CLEAR_REG(v10) - CLEAR_REG(v11) - CLEAR_REG(v12) - CLEAR_REG(v13) - CLEAR_REG(v14) - CLEAR_REG(v15) - CLEAR_REG(RIV) - VPOP_ABI + CLEAR_ALL_REGS(); ret_spec_stop CFI_ENDPROC() ELF(.size _gcry_sm4_armv8_ce_xts_crypt,.-_gcry_sm4_armv8_ce_xts_crypt;) -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:50 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:50 +0200 Subject: [PATCH 06/11] gcm-aarch64-ce: clear volatile vector registers at setup function In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-6-jussi.kivilinna@iki.fi> * cipher/cipher-gcm-armv8-aarch64-ce.S (_gcry_ghash_setup_armv8_ce_pmull): Clear used vectors registers before function exit. -- Signed-off-by: Jussi Kivilinna --- cipher/cipher-gcm-armv8-aarch64-ce.S | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/cipher/cipher-gcm-armv8-aarch64-ce.S b/cipher/cipher-gcm-armv8-aarch64-ce.S index 0c31a563..4cb63212 100644 --- a/cipher/cipher-gcm-armv8-aarch64-ce.S +++ b/cipher/cipher-gcm-armv8-aarch64-ce.S @@ -610,6 +610,7 @@ _gcry_ghash_setup_armv8_ce_pmull: /* H? */ PMUL_128x128(rr0, rr1, rh2, rh1, t0, t1, __) REDUCTION(rh3, rr0, rr1, rrconst, t0, t1, __, __, __) + CLEAR_REG(rh1) /* H? */ PMUL_128x128(rr0, rr1, rh2, rh2, t0, t1, __) @@ -622,9 +623,18 @@ _gcry_ghash_setup_armv8_ce_pmull: /* H? 
*/ PMUL_128x128(rr0, rr1, rh3, rh3, t0, t1, __) REDUCTION(rh6, rr0, rr1, rrconst, t0, t1, __, __, __) + CLEAR_REG(rr0) + CLEAR_REG(rr1) + CLEAR_REG(t0) + CLEAR_REG(t1) st1 {rh2.16b-rh4.16b}, [x1], #(3*16) + CLEAR_REG(rh2) + CLEAR_REG(rh3) + CLEAR_REG(rh4) st1 {rh5.16b-rh6.16b}, [x1] + CLEAR_REG(rh5) + CLEAR_REG(rh6) ret_spec_stop CFI_ENDPROC() -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:51 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:51 +0200 Subject: [PATCH 07/11] camellia-aarch64-ce: clear volatile vectors registers In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-7-jussi.kivilinna@iki.fi> * cipher/camellia-simd128.h [__powerpc__] (clear_vec_regs): New. [__ARM_NEON]: Include 'simd-common-aarch64.h'. [__ARM_NEON] (memory_barrier_with_vec): Remove. [__x86_64__] (clear_vec_regs): New. (FUNC_ENC_BLK16, FUNC_DEC_BLK16, FUNC_KEY_SETUP): Add clear_vec_regs. -- Signed-off-by: Jussi Kivilinna --- cipher/camellia-simd128.h | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/cipher/camellia-simd128.h b/cipher/camellia-simd128.h index ed26afb7..120fbe5a 100644 --- a/cipher/camellia-simd128.h +++ b/cipher/camellia-simd128.h @@ -152,6 +152,7 @@ static const uint8x16_t shift_row = #define if_not_aes_subbytes(...) /*_*/ #define memory_barrier_with_vec(a) __asm__("" : "+wa"(a) :: "memory") +#define clear_vec_regs() ((void)0) #endif /* __powerpc__ */ @@ -160,6 +161,7 @@ static const uint8x16_t shift_row = /********************************************************************** AT&T x86 asm to intrinsics conversion macros (ARMv8-CE) **********************************************************************/ +#include "simd-common-aarch64.h" #include #define __m128i uint64x2_t @@ -232,8 +234,6 @@ static const uint8x16_t shift_row = #define if_aes_subbytes(...) /*_*/ #define if_not_aes_subbytes(...) __VA_ARGS__ -#define memory_barrier_with_vec(a) __asm__("" : "+w"(a) :: "memory") - #endif /* __ARM_NEON */ #if defined(__x86_64__) || defined(__i386__) @@ -307,6 +307,7 @@ static const uint8x16_t shift_row = #define if_not_aes_subbytes(...) __VA_ARGS__ #define memory_barrier_with_vec(a) __asm__("" : "+x"(a) :: "memory") +#define clear_vec_regs() ((void)0) #endif /* defined(__x86_64__) || defined(__i386__) */ @@ -1123,6 +1124,8 @@ FUNC_ENC_BLK16(const void *key_table, void *vout, const void *vin, write_output(x7, x6, x5, x4, x3, x2, x1, x0, x15, x14, x13, x12, x11, x10, x9, x8, out); + + clear_vec_regs(); } /* Decrypts 16 input block from IN and writes result to OUT. 
IN and OUT may @@ -1193,6 +1196,8 @@ FUNC_DEC_BLK16(const void *key_table, void *vout, const void *vin, write_output(x7, x6, x5, x4, x3, x2, x1, x0, x15, x14, x13, x12, x11, x10, x9, x8, out); + + clear_vec_regs(); } /********* Key setup **********************************************************/ @@ -2232,4 +2237,6 @@ FUNC_KEY_SETUP(void *key_table, const void *vkey, unsigned int keylen) } camellia_setup256(key_table, x0, x1); + + clear_vec_regs(); } -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:55 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:55 +0200 Subject: [PATCH 11/11] Add vector register clearing for PowerPC implementations In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-11-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Add 'simd-common-ppc.h'. * cipher/camellia-simd128.h [HAVE_GCC_INLINE_ASM_PPC_ALTIVEC]: Include "simd-common-ppc.h". [HAVE_GCC_INLINE_ASM_PPC_ALTIVEC] (memory_barrier_with_vec) (clear_vec_regs): Remove. * cipher/chacha20-p10le-8x.s (clear_vec_regs): New. (_gcry_chacha20_p10le_8x): Add clear_vec_regs. * cipher/chacha20-ppc.c: Include "simd-common-ppc.h". (chacha20_ppc_blocks1, chacha20_ppc_blocks4) (chacha20_poly1305_ppc_blocks4): Add clear_vec_regs. * cipher/cipher-gcm-ppc.c: Include "simd-common-ppc.h". (_gcry_ghash_setup_ppc_vpmsum, _gcry_ghash_ppc_vpmsum): Add clear_vec_regs. * cipher/poly1305-p10le.s (clear_vec_regs): New. (gcry_poly1305_p10le_4blocks): Add clear_vec_regs. * cipher/rijndael-p10le.c: Include "simd-common-ppc.h". (_gcry_aes_p10le_gcm_crypt): Add clear_vec_regs. * cipher/rijndael-ppc-common.h: Include "simd-common-ppc.h". * cipher/rijndael-ppc-functions.h (ENCRYPT_BLOCK_FUNC): (DECRYPT_BLOCK_FUNC, CFB_ENC_FUNC, ECB_CRYPT_FUNC, CFB_DEC_FUNC) (CBC_ENC_FUNC, CBC_DEC_FUNC, CTR_ENC_FUNC, OCB_CRYPT_FUNC) (OCB_AUTH_FUNC, XTS_CRYPT_FUNC, CTR32LE_ENC_FUNC): Add clear_vec_regs. * cipher/rijndael-ppc.c (_gcry_aes_ppc8_setkey) (_gcry_aes_ppc8_prepare_decryption): Add clear_vec_regs. * cipher/sha256-ppc.c: Include "simd-common-ppc.h". (sha256_transform_ppc): Add clear_vec_regs. * cipher/sha512-ppc.c: Include "simd-common-ppc.h". (sha512_transform_ppc): Add clear_vec_regs. * cipher/simd-common-ppc.h: New. * cipher/sm4-ppc.c: Include "simd-common-ppc.h". (sm4_ppc_crypt_blk1_16): Add clear_vec_regs. 
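
All of these PowerPC changes follow one shape: do the vector computation,
store the result, then invoke clear_vec_regs() so the volatile VSX registers
(vs0-vs13 and vs32-vs51) do not keep key or state material after return,
while the callee-saved ranges are left to the normal prologue/epilogue
handling.  A self-contained sketch of that shape follows; it is illustrative
only, the function name is invented, and it wipes just a few registers where
the real macro wipes the whole volatile set.

#include <altivec.h>
#include <stdint.h>

/* Illustration only: XOR one 16-byte block with a key block using
 * VSX loads/stores, then zero a few volatile vector registers in the
 * spirit of clear_vec_regs(). */
void xor_block_and_wipe_ppc (uint8_t *buf, const uint8_t *key)
{
  vector unsigned char k = vec_xl (0, (const unsigned char *)key);
  vector unsigned char d = vec_xl (0, (const unsigned char *)buf);

  vec_xst (vec_xor (d, k), 0, (unsigned char *)buf);

  /* vs32..vs35 alias v0..v3, which are caller-saved in the ELFv2 ABI. */
  __asm__ volatile ("xxlxor 32, 32, 32\n\t"
                    "xxlxor 33, 33, 33\n\t"
                    "xxlxor 34, 34, 34\n\t"
                    "xxlxor 35, 35, 35\n\t"
                    ::: "memory", "vs32", "vs33", "vs34", "vs35");
}
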
-- Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 2 +- cipher/camellia-simd128.h | 4 +- cipher/chacha20-p10le-8x.s | 41 ++++++++++++++++++ cipher/chacha20-ppc.c | 7 +++ cipher/cipher-gcm-ppc.c | 5 +++ cipher/poly1305-p10le.s | 41 ++++++++++++++++++ cipher/rijndael-p10le.c | 5 +++ cipher/rijndael-ppc-common.h | 1 + cipher/rijndael-ppc-functions.h | 24 ++++++++++ cipher/rijndael-ppc.c | 4 ++ cipher/sha256-ppc.c | 3 ++ cipher/sha512-ppc.c | 3 ++ cipher/simd-common-ppc.h | 77 +++++++++++++++++++++++++++++++++ cipher/sm4-ppc.c | 34 ++++++++------- 14 files changed, 232 insertions(+), 19 deletions(-) create mode 100644 cipher/simd-common-ppc.h diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 633c53ed..90415d83 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -127,7 +127,7 @@ EXTRA_libcipher_la_SOURCES = \ seed.c \ serpent.c serpent-sse2-amd64.S serpent-avx2-amd64.S \ serpent-avx512-x86.c serpent-armv7-neon.S \ - simd-common-aarch64.h \ + simd-common-aarch64.h simd-common-ppc.h \ sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ sm4-gfni-avx2-amd64.S sm4-gfni-avx512-amd64.S \ sm4-aarch64.S sm4-armv8-aarch64-ce.S sm4-armv9-aarch64-sve-ce.S \ diff --git a/cipher/camellia-simd128.h b/cipher/camellia-simd128.h index 120fbe5a..df36a1a2 100644 --- a/cipher/camellia-simd128.h +++ b/cipher/camellia-simd128.h @@ -47,6 +47,7 @@ /********************************************************************** AT&T x86 asm to intrinsics conversion macros (PowerPC VSX+crypto) **********************************************************************/ +#include "simd-common-ppc.h" #include typedef vector signed char int8x16_t; @@ -151,9 +152,6 @@ static const uint8x16_t shift_row = #define if_aes_subbytes(...) __VA_ARGS__ #define if_not_aes_subbytes(...) /*_*/ -#define memory_barrier_with_vec(a) __asm__("" : "+wa"(a) :: "memory") -#define clear_vec_regs() ((void)0) - #endif /* __powerpc__ */ #ifdef __ARM_NEON diff --git a/cipher/chacha20-p10le-8x.s b/cipher/chacha20-p10le-8x.s index ff68c9ef..f75ffb12 100644 --- a/cipher/chacha20-p10le-8x.s +++ b/cipher/chacha20-p10le-8x.s @@ -61,6 +61,45 @@ # .text +.macro clear_vec_regs + xxlxor 0, 0, 0 + xxlxor 1, 1, 1 + xxlxor 2, 2, 2 + xxlxor 3, 3, 3 + xxlxor 4, 4, 4 + xxlxor 5, 5, 5 + xxlxor 6, 6, 6 + xxlxor 7, 7, 7 + xxlxor 8, 8, 8 + xxlxor 9, 9, 9 + xxlxor 10, 10, 10 + xxlxor 11, 11, 11 + xxlxor 12, 12, 12 + xxlxor 13, 13, 13 + # vs14-vs31 (f14-f31) are ABI callee saved. + xxlxor 32, 32, 32 + xxlxor 33, 33, 33 + xxlxor 34, 34, 34 + xxlxor 35, 35, 35 + xxlxor 36, 36, 36 + xxlxor 37, 37, 37 + xxlxor 38, 38, 38 + xxlxor 39, 39, 39 + xxlxor 40, 40, 40 + xxlxor 41, 41, 41 + xxlxor 42, 42, 42 + xxlxor 43, 43, 43 + xxlxor 44, 44, 44 + xxlxor 45, 45, 45 + xxlxor 46, 46, 46 + xxlxor 47, 47, 47 + xxlxor 48, 48, 48 + xxlxor 49, 49, 49 + xxlxor 50, 50, 50 + xxlxor 51, 51, 51 + # vs52-vs63 (v20-v31) are ABI callee saved. 
+.endm + .macro QT_loop_8x # QR(v0, v4, v8, v12, v1, v5, v9, v13, v2, v6, v10, v14, v3, v7, v11, v15) xxlor 0, 32+25, 32+25 @@ -782,6 +821,8 @@ Out_loop: lvx 30, 26, 9 lvx 31, 27, 9 + clear_vec_regs + add 9, 9, 27 addi 14, 17, 16 lxvx 14, 14, 9 diff --git a/cipher/chacha20-ppc.c b/cipher/chacha20-ppc.c index e640010a..376d0642 100644 --- a/cipher/chacha20-ppc.c +++ b/cipher/chacha20-ppc.c @@ -25,6 +25,7 @@ defined(USE_CHACHA20) && \ __GNUC__ >= 4 +#include "simd-common-ppc.h" #include #include "bufhelp.h" #include "poly1305-internal.h" @@ -252,6 +253,8 @@ chacha20_ppc_blocks1(u32 *state, byte *dst, const byte *src, size_t nblks) vec_vsx_st(state3, 3 * 16, state); /* store counter */ + clear_vec_regs(); + return 0; } @@ -414,6 +417,8 @@ chacha20_ppc_blocks4(u32 *state, byte *dst, const byte *src, size_t nblks) vec_vsx_st(state3, 3 * 16, state); /* store counter */ + clear_vec_regs(); + return 0; } @@ -636,6 +641,8 @@ chacha20_poly1305_ppc_blocks4(u32 *state, byte *dst, const byte *src, st->h[3] = h1 >> 32; st->h[4] = h2; + clear_vec_regs(); + return 0; } diff --git a/cipher/cipher-gcm-ppc.c b/cipher/cipher-gcm-ppc.c index 648d1598..486295af 100644 --- a/cipher/cipher-gcm-ppc.c +++ b/cipher/cipher-gcm-ppc.c @@ -80,6 +80,7 @@ #ifdef GCM_USE_PPC_VPMSUM +#include "simd-common-ppc.h" #include #define ALWAYS_INLINE inline __attribute__((always_inline)) @@ -370,6 +371,8 @@ _gcry_ghash_setup_ppc_vpmsum (void *gcm_table_arg, void *gcm_key) STORE_TABLE (gcm_table, 10, H4l); STORE_TABLE (gcm_table, 11, H4); STORE_TABLE (gcm_table, 12, H4h); + + clear_vec_regs(); } unsigned int ASM_FUNC_ATTR @@ -542,6 +545,8 @@ _gcry_ghash_ppc_vpmsum (byte *result, void *gcm_table, vec_store_he (vec_be_swap (cur, bswap_const), 0, result); + clear_vec_regs(); + return 0; } diff --git a/cipher/poly1305-p10le.s b/cipher/poly1305-p10le.s index 4202b41e..d21f8245 100644 --- a/cipher/poly1305-p10le.s +++ b/cipher/poly1305-p10le.s @@ -57,6 +57,45 @@ # .text +.macro clear_vec_regs + xxlxor 0, 0, 0 + xxlxor 1, 1, 1 + xxlxor 2, 2, 2 + xxlxor 3, 3, 3 + xxlxor 4, 4, 4 + xxlxor 5, 5, 5 + xxlxor 6, 6, 6 + xxlxor 7, 7, 7 + xxlxor 8, 8, 8 + xxlxor 9, 9, 9 + xxlxor 10, 10, 10 + xxlxor 11, 11, 11 + xxlxor 12, 12, 12 + xxlxor 13, 13, 13 + # vs14-vs31 (f14-f31) are ABI callee saved. + xxlxor 32, 32, 32 + xxlxor 33, 33, 33 + xxlxor 34, 34, 34 + xxlxor 35, 35, 35 + xxlxor 36, 36, 36 + xxlxor 37, 37, 37 + xxlxor 38, 38, 38 + xxlxor 39, 39, 39 + xxlxor 40, 40, 40 + xxlxor 41, 41, 41 + xxlxor 42, 42, 42 + xxlxor 43, 43, 43 + xxlxor 44, 44, 44 + xxlxor 45, 45, 45 + xxlxor 46, 46, 46 + xxlxor 47, 47, 47 + xxlxor 48, 48, 48 + xxlxor 49, 49, 49 + xxlxor 50, 50, 50 + xxlxor 51, 51, 51 + # vs52-vs63 (v20-v31) are ABI callee saved. 
+.endm + # Block size 16 bytes # key = (r, s) # clamp r &= 0x0FFFFFFC0FFFFFFC 0x0FFFFFFC0FFFFFFF @@ -745,6 +784,8 @@ do_final_update: Out_loop: li 3, 0 + clear_vec_regs + li 14, 256 lvx 20, 14, 1 addi 14, 14, 16 diff --git a/cipher/rijndael-p10le.c b/cipher/rijndael-p10le.c index 65d804f9..448b45ed 100644 --- a/cipher/rijndael-p10le.c +++ b/cipher/rijndael-p10le.c @@ -30,6 +30,8 @@ #ifdef USE_PPC_CRYPTO_WITH_PPC9LE +#include "simd-common-ppc.h" + extern size_t _gcry_ppc10_aes_gcm_encrypt (const void *inp, void *out, size_t len, @@ -113,6 +115,9 @@ _gcry_aes_p10le_gcm_crypt(gcry_cipher_hd_t c, void *outbuf_arg, */ s = ndone / GCRY_GCM_BLOCK_LEN; s = nblocks - s; + + clear_vec_regs(); + return ( s ); } diff --git a/cipher/rijndael-ppc-common.h b/cipher/rijndael-ppc-common.h index bd2ad8b1..611b5871 100644 --- a/cipher/rijndael-ppc-common.h +++ b/cipher/rijndael-ppc-common.h @@ -26,6 +26,7 @@ #ifndef G10_RIJNDAEL_PPC_COMMON_H #define G10_RIJNDAEL_PPC_COMMON_H +#include "simd-common-ppc.h" #include diff --git a/cipher/rijndael-ppc-functions.h b/cipher/rijndael-ppc-functions.h index ec5cda73..eb39717d 100644 --- a/cipher/rijndael-ppc-functions.h +++ b/cipher/rijndael-ppc-functions.h @@ -40,6 +40,8 @@ ENCRYPT_BLOCK_FUNC (const RIJNDAEL_context *ctx, unsigned char *out, AES_ENCRYPT (b, rounds); VEC_STORE_BE (out, 0, b, bige_const); + clear_vec_regs(); + return 0; /* does not use stack */ } @@ -61,6 +63,8 @@ DECRYPT_BLOCK_FUNC (const RIJNDAEL_context *ctx, unsigned char *out, AES_DECRYPT (b, rounds); VEC_STORE_BE (out, 0, b, bige_const); + clear_vec_regs(); + return 0; /* does not use stack */ } @@ -116,6 +120,8 @@ CFB_ENC_FUNC (void *context, unsigned char *iv_arg, void *outbuf_arg, } VEC_STORE_BE (iv_arg, 0, outiv, bige_const); + + clear_vec_regs(); } @@ -373,6 +379,8 @@ ECB_CRYPT_FUNC (void *context, void *outbuf_arg, const void *inbuf_arg, out++; in++; } + + clear_vec_regs(); } @@ -571,6 +579,8 @@ CFB_DEC_FUNC (void *context, unsigned char *iv_arg, void *outbuf_arg, } VEC_STORE_BE (iv_arg, 0, iv, bige_const); + + clear_vec_regs(); } @@ -640,6 +650,8 @@ CBC_ENC_FUNC (void *context, unsigned char *iv_arg, void *outbuf_arg, while (nblocks); VEC_STORE_BE (iv_arg, 0, outiv, bige_const); + + clear_vec_regs(); } @@ -845,6 +857,8 @@ CBC_DEC_FUNC (void *context, unsigned char *iv_arg, void *outbuf_arg, } VEC_STORE_BE (iv_arg, 0, iv, bige_const); + + clear_vec_regs(); } @@ -1078,6 +1092,8 @@ CTR_ENC_FUNC (void *context, unsigned char *ctr_arg, void *outbuf_arg, } VEC_STORE_BE (ctr_arg, 0, ctr, bige_const); + + clear_vec_regs(); } @@ -1584,6 +1600,8 @@ OCB_CRYPT_FUNC (gcry_cipher_hd_t c, void *outbuf_arg, const void *inbuf_arg, VEC_STORE_BE (c->u_ctr.ctr, 0, ctr, bige_const); c->u_mode.ocb.data_nblocks = data_nblocks; + clear_vec_regs(); + return 0; } @@ -1794,6 +1812,8 @@ OCB_AUTH_FUNC (gcry_cipher_hd_t c, void *abuf_arg, size_t nblocks) VEC_STORE_BE (c->u_mode.ocb.aad_sum, 0, ctr, bige_const); c->u_mode.ocb.aad_nblocks = data_nblocks; + clear_vec_regs(); + return 0; } @@ -2295,6 +2315,8 @@ XTS_CRYPT_FUNC (void *context, unsigned char *tweak_arg, void *outbuf_arg, VEC_STORE_BE (tweak_arg, 0, tweak, bige_const); #undef GEN_TWEAK + + clear_vec_regs(); } @@ -2541,4 +2563,6 @@ CTR32LE_ENC_FUNC(void *context, unsigned char *ctr_arg, void *outbuf_arg, #undef VEC_ADD_CTRLE32 VEC_STORE_BE (ctr_arg, 0, vec_reve((block)ctr), bige_const); + + clear_vec_regs(); } diff --git a/cipher/rijndael-ppc.c b/cipher/rijndael-ppc.c index 055b00c0..18fadd6e 100644 --- a/cipher/rijndael-ppc.c +++ b/cipher/rijndael-ppc.c @@ 
-201,6 +201,8 @@ _gcry_aes_ppc8_setkey (RIJNDAEL_context *ctx, const byte *key) } wipememory(tk_vu32, sizeof(tk_vu32)); + + clear_vec_regs(); } @@ -208,6 +210,8 @@ void PPC_OPT_ATTR _gcry_aes_ppc8_prepare_decryption (RIJNDAEL_context *ctx) { internal_aes_ppc_prepare_decryption (ctx); + + clear_vec_regs(); } diff --git a/cipher/sha256-ppc.c b/cipher/sha256-ppc.c index e5839a84..bcc08dad 100644 --- a/cipher/sha256-ppc.c +++ b/cipher/sha256-ppc.c @@ -25,6 +25,7 @@ defined(USE_SHA256) && \ __GNUC__ >= 4 +#include "simd-common-ppc.h" #include #include "bufhelp.h" @@ -590,6 +591,8 @@ sha256_transform_ppc(u32 state[8], const unsigned char *data, size_t nblks) vec_vsx_st (h0_h3, 4 * 0, state); vec_vsx_st (h4_h7, 4 * 4, state); + clear_vec_regs(); + return sizeof(w2) + sizeof(w); } diff --git a/cipher/sha512-ppc.c b/cipher/sha512-ppc.c index d213c241..ed9486ee 100644 --- a/cipher/sha512-ppc.c +++ b/cipher/sha512-ppc.c @@ -25,6 +25,7 @@ defined(USE_SHA512) && \ __GNUC__ >= 4 +#include "simd-common-ppc.h" #include #include "bufhelp.h" @@ -705,6 +706,8 @@ sha512_transform_ppc(u64 state[8], const unsigned char *data, size_t nblks) vec_u64_store (h4, 8 * 4, (unsigned long long *)state); vec_u64_store (h6, 8 * 6, (unsigned long long *)state); + clear_vec_regs(); + return sizeof(w) + sizeof(w2); } diff --git a/cipher/simd-common-ppc.h b/cipher/simd-common-ppc.h new file mode 100644 index 00000000..620a3b51 --- /dev/null +++ b/cipher/simd-common-ppc.h @@ -0,0 +1,77 @@ +/* simd-common-ppc.h - Common macros for PowerPC SIMD code + * + * Copyright (C) 2024 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#ifndef GCRY_SIMD_COMMON_PPC_H +#define GCRY_SIMD_COMMON_PPC_H + +#include + +#define memory_barrier_with_vec(a) __asm__("" : "+wa"(a) :: "memory") + +#define clear_vec_regs() __asm__ volatile("xxlxor 0, 0, 0\n" \ + "xxlxor 1, 1, 1\n" \ + "xxlxor 2, 2, 2\n" \ + "xxlxor 3, 3, 3\n" \ + "xxlxor 4, 4, 4\n" \ + "xxlxor 5, 5, 5\n" \ + "xxlxor 6, 6, 6\n" \ + "xxlxor 7, 7, 7\n" \ + "xxlxor 8, 8, 8\n" \ + "xxlxor 9, 9, 9\n" \ + "xxlxor 10, 10, 10\n" \ + "xxlxor 11, 11, 11\n" \ + "xxlxor 12, 12, 12\n" \ + "xxlxor 13, 13, 13\n" \ + "xxlxor 32, 32, 32\n" \ + "xxlxor 33, 33, 33\n" \ + "xxlxor 34, 34, 34\n" \ + "xxlxor 35, 35, 35\n" \ + "xxlxor 36, 36, 36\n" \ + "xxlxor 37, 37, 37\n" \ + "xxlxor 38, 38, 38\n" \ + "xxlxor 39, 39, 39\n" \ + "xxlxor 40, 40, 40\n" \ + "xxlxor 41, 41, 41\n" \ + "xxlxor 42, 42, 42\n" \ + "xxlxor 43, 43, 43\n" \ + "xxlxor 44, 44, 44\n" \ + "xxlxor 45, 45, 45\n" \ + "xxlxor 46, 46, 46\n" \ + "xxlxor 47, 47, 47\n" \ + "xxlxor 48, 48, 48\n" \ + "xxlxor 49, 49, 49\n" \ + "xxlxor 50, 50, 50\n" \ + "xxlxor 51, 51, 51\n" \ + ::: "vs0", "vs1", "vs2", "vs3", \ + "vs4", "vs5", "vs6", "vs7", \ + "vs8", "vs9", "vs10", "vs11", \ + "vs12", "vs13", \ + /* vs14-vs31 (f14-f31) are */ \ + /* ABI callee saved. 
*/ \ + "vs32", "vs33", "vs34", "vs35", \ + "vs36", "vs37", "vs38", "vs39", \ + "vs40", "vs41", "vs42", "vs43", \ + "vs44", "vs45", "vs46", "vs47", \ + "vs48", "vs49", "vs50", "vs51", \ + /* vs52-vs63 (v20-v31) are */ \ + /* ABI callee saved. */ \ + "memory") + +#endif /* GCRY_SIMD_COMMON_PPC_H */ diff --git a/cipher/sm4-ppc.c b/cipher/sm4-ppc.c index bb2c55e0..2b26c39d 100644 --- a/cipher/sm4-ppc.c +++ b/cipher/sm4-ppc.c @@ -25,6 +25,7 @@ defined(HAVE_GCC_INLINE_ASM_PPC_ALTIVEC) && \ !defined(WORDS_BIGENDIAN) && (__GNUC__ >= 4) +#include "simd-common-ppc.h" #include #include "bufhelp.h" @@ -298,25 +299,28 @@ sm4_ppc_crypt_blk1_16(u32 *rk, byte *out, const byte *in, size_t nblks) if (nblks >= 16) { sm4_ppc_crypt_blk16(rk, out, in); - return; } - - while (nblks >= 8) + else { - sm4_ppc_crypt_blk8(rk, out, in); - in += 8 * 16; - out += 8 * 16; - nblks -= 8; + while (nblks >= 8) + { + sm4_ppc_crypt_blk8(rk, out, in); + in += 8 * 16; + out += 8 * 16; + nblks -= 8; + } + + while (nblks) + { + size_t currblks = nblks > 4 ? 4 : nblks; + sm4_ppc_crypt_blk1_4(rk, out, in, currblks); + in += currblks * 16; + out += currblks * 16; + nblks -= currblks; + } } - while (nblks) - { - size_t currblks = nblks > 4 ? 4 : nblks; - sm4_ppc_crypt_blk1_4(rk, out, in, currblks); - in += currblks * 16; - out += currblks * 16; - nblks -= currblks; - } + clear_vec_regs(); } ASM_FUNC_ATTR_NOINLINE FUNC_ATTR_TARGET_P8 void -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:46 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:46 +0200 Subject: [PATCH 02/11] sm4-aarch64-sve: clear volatile vectors registers In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-2-jussi.kivilinna@iki.fi> * cipher/asm-common-aarch64.h (CLEAR_ALL_REGS): New. * cipher/sm4-armv9-aarch64-sve-ce.S (_gcry_sm4_armv9_sve_ce_cbc_dec, _gcry_sm4_armv9_sve_ce_cfb_dec) (_gcry_sm4_armv9_sve_ce_ctr_enc): Add CLEAR_ALL_REGS. -- Signed-off-by: Jussi Kivilinna --- cipher/asm-common-aarch64.h | 9 +++++++++ cipher/sm4-armv9-aarch64-sve-ce.S | 3 +++ 2 files changed, 12 insertions(+) diff --git a/cipher/asm-common-aarch64.h b/cipher/asm-common-aarch64.h index ff65ea6a..dde7366c 100644 --- a/cipher/asm-common-aarch64.h +++ b/cipher/asm-common-aarch64.h @@ -125,6 +125,15 @@ #define CLEAR_REG(reg) movi reg.16b, #0; +#define CLEAR_ALL_REGS() \ + CLEAR_REG(v0); CLEAR_REG(v1); CLEAR_REG(v2); CLEAR_REG(v3); \ + CLEAR_REG(v4); CLEAR_REG(v5); CLEAR_REG(v6); \ + /* v8-v15 are ABI callee saved. 
*/ \ + CLEAR_REG(v16); CLEAR_REG(v17); CLEAR_REG(v18); CLEAR_REG(v19); \ + CLEAR_REG(v20); CLEAR_REG(v21); CLEAR_REG(v22); CLEAR_REG(v23); \ + CLEAR_REG(v24); CLEAR_REG(v25); CLEAR_REG(v26); CLEAR_REG(v27); \ + CLEAR_REG(v28); CLEAR_REG(v29); CLEAR_REG(v30); CLEAR_REG(v31); + #define VPUSH_ABI \ stp d8, d9, [sp, #-16]!; \ CFI_ADJUST_CFA_OFFSET(16); \ diff --git a/cipher/sm4-armv9-aarch64-sve-ce.S b/cipher/sm4-armv9-aarch64-sve-ce.S index cdf20719..7367cd28 100644 --- a/cipher/sm4-armv9-aarch64-sve-ce.S +++ b/cipher/sm4-armv9-aarch64-sve-ce.S @@ -618,6 +618,7 @@ _gcry_sm4_armv9_sve_ce_cbc_dec: st1 {RIVv.16b}, [x3]; VPOP_ABI; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv9_sve_ce_cbc_dec,.-_gcry_sm4_armv9_sve_ce_cbc_dec;) @@ -792,6 +793,7 @@ _gcry_sm4_armv9_sve_ce_cfb_dec: st1 {RIVv.16b}, [x3]; VPOP_ABI; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv9_sve_ce_cfb_dec,.-_gcry_sm4_armv9_sve_ce_cfb_dec;) @@ -948,6 +950,7 @@ _gcry_sm4_armv9_sve_ce_ctr_enc: rev x8, x8; stp x7, x8, [x3]; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm4_armv9_sve_ce_ctr_enc,.-_gcry_sm4_armv9_sve_ce_ctr_enc;) -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:49 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:49 +0200 Subject: [PATCH 05/11] sm3-aarch64-ce: clear volatile vector registers In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-5-jussi.kivilinna@iki.fi> * cipher/sm3-armv8-aarch64-ce.S: Add CLEAR_ALL_REGS. -- Signed-off-by: Jussi Kivilinna --- cipher/sm3-armv8-aarch64-ce.S | 1 + 1 file changed, 1 insertion(+) diff --git a/cipher/sm3-armv8-aarch64-ce.S b/cipher/sm3-armv8-aarch64-ce.S index 5f5f599d..6b678971 100644 --- a/cipher/sm3-armv8-aarch64-ce.S +++ b/cipher/sm3-armv8-aarch64-ce.S @@ -214,6 +214,7 @@ _gcry_sm3_transform_armv8_ce: ext CTX2.16b, CTX2.16b, CTX2.16b, #8; st1 {CTX1.4s, CTX2.4s}, [x0]; + CLEAR_ALL_REGS(); ret_spec_stop; CFI_ENDPROC(); ELF(.size _gcry_sm3_transform_armv8_ce, .-_gcry_sm3_transform_armv8_ce;) -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:52 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:52 +0200 Subject: [PATCH 08/11] whirlpool-sse2-amd64: clear vectors registers In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-8-jussi.kivilinna@iki.fi> * cipher/whirlpool-sse2-amd64.S (CLEAR_REG): New. (_gcry_whirlpool_transform_amd64): Clear vectors registers at exit. 
-- Signed-off-by: Jussi Kivilinna --- cipher/whirlpool-sse2-amd64.S | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/cipher/whirlpool-sse2-amd64.S b/cipher/whirlpool-sse2-amd64.S index 37648faa..39959a45 100644 --- a/cipher/whirlpool-sse2-amd64.S +++ b/cipher/whirlpool-sse2-amd64.S @@ -27,6 +27,8 @@ .text +#define CLEAR_REG(v) pxor v, v + /* look-up table offsets on RTAB */ #define RC (0) #define C0 (RC + (8 * 10)) @@ -338,6 +340,24 @@ _gcry_whirlpool_transform_amd64: CFI_RESTORE(%r15); addq $STACK_MAX, %rsp; CFI_ADJUST_CFA_OFFSET(-STACK_MAX); + + CLEAR_REG(%xmm0); + CLEAR_REG(%xmm1); + CLEAR_REG(%xmm2); + CLEAR_REG(%xmm3); + CLEAR_REG(%xmm4); + CLEAR_REG(%xmm5); + CLEAR_REG(%xmm6); + CLEAR_REG(%xmm7); + CLEAR_REG(%xmm8); + CLEAR_REG(%xmm9); + CLEAR_REG(%xmm10); + CLEAR_REG(%xmm11); + CLEAR_REG(%xmm12); + CLEAR_REG(%xmm13); + CLEAR_REG(%xmm14); + CLEAR_REG(%xmm15); + .Lskip: movl $(STACK_MAX + 8), %eax; ret_spec_stop; -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:53 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:53 +0200 Subject: [PATCH 09/11] salsa20-amd64: clear vectors registers In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-9-jussi.kivilinna@iki.fi> * cipher/salsa20-amd64.S (CLEAR_REG): New. (_gcry_salsa20_amd64_encrypt_blocks): Clear vectors registers at exit. -- Signed-off-by: Jussi Kivilinna --- cipher/salsa20-amd64.S | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/cipher/salsa20-amd64.S b/cipher/salsa20-amd64.S index 6efb75e0..ce1495c4 100644 --- a/cipher/salsa20-amd64.S +++ b/cipher/salsa20-amd64.S @@ -30,6 +30,8 @@ #include "asm-common-amd64.h" +#define CLEAR_REG(v) pxor v, v + .text .align 16 @@ -926,6 +928,22 @@ _gcry_salsa20_amd64_encrypt_blocks: CFI_DEF_CFA_REGISTER(%rsp) pop %rbx CFI_POP(%rbx) + CLEAR_REG(%xmm0); + CLEAR_REG(%xmm1); + CLEAR_REG(%xmm2); + CLEAR_REG(%xmm3); + CLEAR_REG(%xmm4); + CLEAR_REG(%xmm5); + CLEAR_REG(%xmm6); + CLEAR_REG(%xmm7); + CLEAR_REG(%xmm8); + CLEAR_REG(%xmm9); + CLEAR_REG(%xmm10); + CLEAR_REG(%xmm11); + CLEAR_REG(%xmm12); + CLEAR_REG(%xmm13); + CLEAR_REG(%xmm14); + CLEAR_REG(%xmm15); ret_spec_stop CFI_RESTORE_STATE(); .L_bytes_are_128_or_192: -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:47 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:47 +0200 Subject: [PATCH 03/11] sm4-aarch64: clear volatile vectors registers In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-3-jussi.kivilinna@iki.fi> * cipher/sm4-aarch64.S (clear_volatile_vec_regs): New. (_gcry_sm4_aarch64_crypt_blk1_8, _gcry_sm4_aarch64_crypt) (_gcry_sm4_aarch64_cbc_dec, _gcry_sm4_aarch64_cfb_dec) (_gcry_sm4_aarch64_ctr_enc): Add clear_volatile_vec_regs. -- Signed-off-by: Jussi Kivilinna --- cipher/sm4-aarch64.S | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/cipher/sm4-aarch64.S b/cipher/sm4-aarch64.S index cce6fcc4..bab4b4df 100644 --- a/cipher/sm4-aarch64.S +++ b/cipher/sm4-aarch64.S @@ -110,6 +110,12 @@ ELF(.size _gcry_sm4_aarch64_consts,.-_gcry_sm4_aarch64_consts) zip1 s2.2d, RTMP3.2d, RTMP1.2d; \ zip2 s3.2d, RTMP3.2d, RTMP1.2d; +#define clear_volatile_vec_regs() \ + CLEAR_REG(v0); CLEAR_REG(v1); CLEAR_REG(v2); CLEAR_REG(v3); \ + CLEAR_REG(v4); CLEAR_REG(v5); CLEAR_REG(v6); \ + /* v8-v15 are ABI callee saved. 
*/ \ + /* v16-v31 are loaded with non-secret (SM4 sbox). */ + .text @@ -385,6 +391,7 @@ _gcry_sm4_aarch64_crypt_blk1_8: .Lblk8_store_output_done: VPOP_ABI; + clear_volatile_vec_regs(); ldp x29, x30, [sp], #16; CFI_ADJUST_CFA_OFFSET(-16); CFI_RESTORE(x29); @@ -427,6 +434,7 @@ _gcry_sm4_aarch64_crypt: .Lcrypt_end: VPOP_ABI; + clear_volatile_vec_regs(); ldp x29, x30, [sp], #16; CFI_ADJUST_CFA_OFFSET(-16); CFI_RESTORE(x29); @@ -491,6 +499,7 @@ _gcry_sm4_aarch64_cbc_dec: st1 {RIV.16b}, [x3]; VPOP_ABI; + clear_volatile_vec_regs(); ldp x29, x30, [sp], #16; CFI_ADJUST_CFA_OFFSET(-16); CFI_RESTORE(x29); @@ -554,6 +563,7 @@ _gcry_sm4_aarch64_cfb_dec: st1 {v0.16b}, [x3]; VPOP_ABI; + clear_volatile_vec_regs(); ldp x29, x30, [sp], #16; CFI_ADJUST_CFA_OFFSET(-16); CFI_RESTORE(x29); @@ -633,6 +643,7 @@ _gcry_sm4_aarch64_ctr_enc: stp x7, x8, [x3]; VPOP_ABI; + clear_volatile_vec_regs(); ldp x29, x30, [sp], #16; CFI_ADJUST_CFA_OFFSET(-16); CFI_RESTORE(x29); -- 2.45.2 From jussi.kivilinna at iki.fi Sun Nov 3 20:56:54 2024 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 3 Nov 2024 21:56:54 +0200 Subject: [PATCH 10/11] rijndael-ppc: fix 'may be used uninitialized' warnings In-Reply-To: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> References: <20241103195657.3336817-1-jussi.kivilinna@iki.fi> Message-ID: <20241103195657.3336817-10-jussi.kivilinna@iki.fi> * cipher/rijndael-ppc-common.h (PRELOAD_ROUND_KEYS_ALL): Load rkey10-rkey13 with zero value by default. -- Signed-off-by: Jussi Kivilinna --- cipher/rijndael-ppc-common.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/cipher/rijndael-ppc-common.h b/cipher/rijndael-ppc-common.h index fc8ee526..bd2ad8b1 100644 --- a/cipher/rijndael-ppc-common.h +++ b/cipher/rijndael-ppc-common.h @@ -136,6 +136,7 @@ typedef union #define PRELOAD_ROUND_KEYS_ALL(nrounds) \ do { \ + static const block preload_zero = { 0 }; \ rkey0 = ALIGNED_LOAD (rk, 0); \ rkey1 = ALIGNED_LOAD (rk, 1); \ rkey2 = ALIGNED_LOAD (rk, 2); \ @@ -146,6 +147,10 @@ typedef union rkey7 = ALIGNED_LOAD (rk, 7); \ rkey8 = ALIGNED_LOAD (rk, 8); \ rkey9 = ALIGNED_LOAD (rk, 9); \ + rkey10 = preload_zero; \ + rkey11 = preload_zero; \ + rkey12 = preload_zero; \ + rkey13 = preload_zero; \ if (nrounds >= 12) \ { \ rkey10 = ALIGNED_LOAD (rk, 10); \ -- 2.45.2 From anonymous-book at protonmail.com Sat Nov 16 09:38:22 2024 From: anonymous-book at protonmail.com (Alex Badjurov) Date: Sat, 16 Nov 2024 08:38:22 +0000 Subject: build problem Message-ID: Hello Having a build problem. Read the readme and no any configure flags helping me. Here is the log: https://termbin.com/jo48 ----------- Sent with [Proton Mail](https://proton.me/mail/home) secure email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gniibe at fsij.org Tue Nov 19 03:25:36 2024 From: gniibe at fsij.org (NIIBE Yutaka) Date: Tue, 19 Nov 2024 11:25:36 +0900 Subject: build problem In-Reply-To: References: Message-ID: <878qtg9cyn.fsf@akagi.fsij.org> Alex Badjurov wrote: > /usr/bin/ld: ../src/.libs/libgcrypt.so: undefined reference to `gpgrt_logv_domain at GPG_ERROR_1.0' > /usr/bin/ld: ../src/.libs/libgcrypt.so: undefined reference to `gpgrt_add_post_log_func at GPG_ERROR_1.0' > collect2: error: ld returned 1 exit status For building libgcrypt master (development version), libgpg-error version 1.49 or later is required. It seems that you have relevant version somewhere, for header files, but not for linkage. Please check your libgpg-error library installation. 
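For example, a small test program along these lines (just a rough sketch) shows whether the gpg-error.h seen at compile time and the library found at run time disagree:

  /* check-gpg-error.c - compare libgpg-error header and library versions.
     GPG_ERROR_VERSION comes from the gpg-error.h used at compile time;
     gpg_error_check_version (NULL) reports the library actually linked
     at run time.  Build e.g. with:
       cc check-gpg-error.c $(gpg-error-config --cflags --libs)  */
  #include <stdio.h>
  #include <gpg-error.h>

  int
  main (void)
  {
    printf ("header : %s\n", GPG_ERROR_VERSION);
    printf ("library: %s\n", gpg_error_check_version (NULL));
    return 0;
  }

If the two versions differ (or the library one is older than 1.49), the linker is picking up another installation and the library search path needs to be fixed.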
--

From gniibe at fsij.org Wed Nov 20 08:23:30 2024
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Wed, 20 Nov 2024 16:23:30 +0900
Subject: FIPS 140 service indicator revamp
In-Reply-To: <2035896.hr6ThXoNJa@tauon.atsec.com>
References: <002642d8-4401-4fdb-975e-a5350293db95@atsec.com> <875xp7mmbe.fsf@akagi.fsij.org> <2035896.hr6ThXoNJa@tauon.atsec.com>
Message-ID: <8734jmqsgd.fsf@akagi.fsij.org>

Hello,

Using the function gcry_kdf_derive as an example (as David Sugar kindly
suggested), here are my patches against master, as of today.

I didn't touch the code that rejects arguments with the error
GPG_ERR_INV_VALUE, where it does not compute the value under FIPS mode.
(If the computation should be done, we need to change the
_gcry_kdf_derive function not to reject with GPG_ERR_INV_VALUE, but to
let it continue the computation and set the service indicator.)

In t-kdf.c, I'm not sure whether checking the computed value makes sense
in check_fips_gcry_kdf_derive for GCRY_KDF_SCRYPT.  (I modified the
GCRY_KDF_SCRYPT case so that the check passes.)

Please tell us your comments/opinions/whatever.
--
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fips-140-service-indicator-revamp-gniibe-00.patch
Type: text/x-diff
Size: 9253 bytes
Desc: not available
URL:

From gniibe at fsij.org Thu Nov 21 02:13:05 2024
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Thu, 21 Nov 2024 10:13:05 +0900
Subject: FIPS 140 service indicator revamp
In-Reply-To: <4260523.fjovo90rOp@tauon.atsec.com>
References: <2035896.hr6ThXoNJa@tauon.atsec.com> <8734jmqsgd.fsf@akagi.fsij.org> <4260523.fjovo90rOp@tauon.atsec.com>
Message-ID: <87r0752xum.fsf@akagi.fsij.org>

Stephan Mueller wrote:
> Could you please help me finding the function _gcry_thread_context_set_fsi?

It's in a branch:
  https://dev.gnupg.org/source/libgcrypt/history/gniibe%252Ft7340/

In the commit:
  https://dev.gnupg.org/rCfa87b8de038878704c381ac1e63cafee78e9b798

In the file:
  https://dev.gnupg.org/source/libgcrypt/browse/gniibe%252Ft7340/src/fips.c$81

My intention is that those two functions may be provided by a
non-libgcrypt implementation.  In that case, replacing the functions by
use of ERRNO is possible.

Writing the code, I thought about more general cases; there can be
multiple causes of FIPS 140 non-compliance.  If we change
_gcry_kdf_derive with GCRY_KDF_PBKDF2 so that it finishes its
computation under FIPS mode, the possible causes are:

- passphrase length too small
- salt length too small
- iteration count too small
- key size smaller than the minimum

I wonder if all of these should be recorded within the FIPS service
indicator.  It could be done by implementing a chain of causes (while it
is only an "unsigned long" value now in my branch).  Shall I do
something like:

- record a chain of causes within the FIPS service indicator
  (heavy implementation)
- offer an ERRNO implementation as an alternative (#ifdef, perhaps),
  which only retains the last cause (or the first one)
  (light implementation)

My opinion is that exposing such implementation detail (the use of
ERRNO) is not good, at least as a starting point.
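To make the light option a bit more concrete, here is a rough sketch; the flag names and the limits are invented for illustration only, they are not what the branch implements:

  /* Illustration only: record each cause as a bit in the single
     "unsigned long" FSI value, so that several causes from one call
     can be kept without a chain.  Names and limits are made up for
     this sketch.  */
  #include <stddef.h>

  #define FSI_PBKDF2_SHORT_PASSPHRASE  (1UL << 0)
  #define FSI_PBKDF2_SHORT_SALT        (1UL << 1)
  #define FSI_PBKDF2_LOW_ITERATIONS    (1UL << 2)
  #define FSI_PBKDF2_SHORT_KEY         (1UL << 3)

  /* From the branch; prototype repeated here for the sketch.  */
  extern void _gcry_thread_context_set_fsi (unsigned long fsi);

  static void
  pbkdf2_record_fips_causes (size_t passphraselen, size_t saltlen,
                             unsigned long iterations, size_t keysize)
  {
    unsigned long fsi = 0;

    if (passphraselen < 8)   /* placeholder limits, not the real ones */
      fsi |= FSI_PBKDF2_SHORT_PASSPHRASE;
    if (saltlen < 16)
      fsi |= FSI_PBKDF2_SHORT_SALT;
    if (iterations < 1000)
      fsi |= FSI_PBKDF2_LOW_ITERATIONS;
    if (keysize < 14)
      fsi |= FSI_PBKDF2_SHORT_KEY;

    if (fsi)
      _gcry_thread_context_set_fsi (fsi);
  }

With something like this, an application still reads a single value, but all causes from one call are retained; a chain would only be needed if we want to keep more than the bits can express.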
-- From wk at gnupg.org Thu Nov 21 09:29:09 2024 From: wk at gnupg.org (Werner Koch) Date: Thu, 21 Nov 2024 09:29:09 +0100 Subject: FIPS 140 service indicator revamp In-Reply-To: <8734jmqsgd.fsf@akagi.fsij.org> (NIIBE Yutaka via Gcrypt-devel's message of "Wed, 20 Nov 2024 16:23:30 +0900") References: <002642d8-4401-4fdb-975e-a5350293db95@atsec.com> <875xp7mmbe.fsf@akagi.fsij.org> <2035896.hr6ThXoNJa@tauon.atsec.com> <8734jmqsgd.fsf@akagi.fsij.org> Message-ID: <87ttc17zxm.fsf@jacob.g10code.de> Hi! Thanks for working on that. + default: + /* Record the reason of failure, in the indicator. + * Or putting GPG_ERR_NOT_SUPPORTED would be enough. */ + _gcry_thread_context_set_fsi ((unsigned long)((algo << 16) | subalgo)); + break; I think it is sufficient to return an error code and leave the subalgo alone. This would allow to use a pointer to a gpg_error_t variable which is our common way of returning errors. Or we change gpg_err_code_t _gcry_fips_indicator (unsigned long *p); which currently always returns 0 to also or alternatively return the service indicator value. To avoid introducing a new function this is not easy but in 1.12 we could add a new function as an alternative way to return the value. Or we use an inline function. To return a general failure (independent of the fsi state) we can use any other error code. We may - if really needed - also define a new gpg_err_source_t value for FIPS to clarify that the error comes from the fips interface. Regarding debugging FIPS problems, log_debug or syslog can be used after the library has been put into a verbose logging state. This would allow to track down which problem causes the FIPS failure. Shalom-Salam, Werner -- The pioneers of a warless world are the youth that refuse military service. - A. Einstein -------------- next part -------------- A non-text attachment was scrubbed... Name: openpgp-digital-signature.asc Type: application/pgp-signature Size: 247 bytes Desc: not available URL:
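As an illustration of the variant where _gcry_fips_indicator also reports the value, a rough sketch follows; the getter name is an assumption for the sketch, and this is not an agreed interface:

  /* Sketch only: let the existing accessor, which currently always
     returns 0, report the per-thread service indicator as well.  */
  #include <gpg-error.h>

  extern unsigned long _gcry_thread_context_get_fsi (void); /* name assumed */

  gpg_err_code_t
  _gcry_fips_indicator (unsigned long *p)
  {
    unsigned long fsi = _gcry_thread_context_get_fsi ();

    if (p)
      *p = fsi;

    return fsi ? GPG_ERR_FORBIDDEN : 0;
  }

A caller that does not care about FIPS could keep ignoring the pointer argument, while a FIPS-aware caller gets both the error code and the recorded cause in one call.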