From gniibe at fsij.org Mon Jun 1 09:39:30 2020 From: gniibe at fsij.org (NIIBE Yutaka) Date: Mon, 01 Jun 2020 16:39:30 +0900 Subject: gcry_mpi_invm succeeds if the inverse does not exist In-Reply-To: <6f38bc98-85fe-12b1-9cce-8e96e699e378@iki.fi> References: <87zhbeh921.fsf@iwagami.gniibe.org> <51fc5b09-7b2f-ce2e-bb8c-f653ba907446@iki.fi> <87tv0k9vwj.fsf@jumper.gniibe.org> <6f38bc98-85fe-12b1-9cce-8e96e699e378@iki.fi> Message-ID: <87d06jmdzh.fsf@iwagami.gniibe.org> Jussi Kivilinna wrote: > Cryptofuzz is reporting another heap-buffer-overflow issue in > _gcry_mpi_invm. I've attached reproducer, original from Guido and > as patch applied to tests/basic.c. My fix of 69b55f87053ce2494cd4b38dc600f867bc4355be was not enough. I just pushed another change: 6f8b1d4cb798375e6d830fd6b73c71da93ee5f3f Thank you for your report. -- From mandar.apte409 at gmail.com Tue Jun 2 13:27:23 2020 From: mandar.apte409 at gmail.com (Mandar Apte) Date: Tue, 2 Jun 2020 16:57:23 +0530 Subject: Decrypt using BcryptDecrypt Message-ID: Hello team, I am trying out the Libgcrypt 1.8.5 APIs for AES 256 encryption in CBC mode on a Fedora computer. I have a Windows 10 computer on which I have installed Oracle VirtualBox and am running a Fedora OS machine in it. First, I tried encryption and decryption on Fedora using the Libgcrypt APIs. It worked easily and smoothly with no errors and no data loss. Since cross-platform capability has become a must in the software world nowadays, I am also trying to test encryption and decryption in a cross-platform scenario. I am trying to encrypt a file on Fedora using the Libgcrypt APIs, and to decrypt that encrypted file on Windows. On Windows I am using the Bcrypt library, which also supports AES 256 in CBC mode. The problem I am facing right now is that I am getting an error from the BcryptDecrypt() function on Windows when I try to decrypt the file encrypted on the Fedora box. The surprising thing is that when I pass the entire encrypted file content all at once to BcryptDecrypt(), it is able to decrypt the data correctly with no data loss, but it still returns the error code -1073741762 (0xC000003E), which means "STATUS_DATA_ERROR" on Windows. Hence, I wanted to check whether the Libgcrypt APIs are doing padding internally, since I am not passing any such instruction to the Libgcrypt library explicitly. I have been stuck on this for 2 weeks now. I tried all possible things and checked endianness, byte sizes, etc. on both the Fedora and Windows computers. I need some help here to understand the internal behaviour of the Libgcrypt library. Please help me. Thank you in advance. Best Regards, Mandar -------------- next part -------------- An HTML attachment was scrubbed... URL: From jussi.kivilinna at iki.fi Wed Jun 3 22:08:37 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 3 Jun 2020 23:08:37 +0300 Subject: [PATCH 1/2] Disable all assembly modules with --disable-asm Message-ID: <20200603200838.562876-1-jussi.kivilinna@iki.fi> * configure.ac (try_asm_modules): Update description, "MPI" => "MPI and cipher".
(gcry_cv_gcc_arm_platform_as_ok, gcry_cv_gcc_aarch64_platform_as_ok) (gcry_cv_gcc_inline_asm_ssse3, gcry_cv_gcc_inline_asm_pclmul) (gcry_cv_gcc_inline_asm_shaext, gcry_cv_gcc_inline_asm_sse41) (gcry_cv_gcc_inline_asm_avx, gcry_cv_gcc_inline_asm_avx2) (gcry_cv_gcc_inline_asm_bmi2, gcry_cv_gcc_amd64_platform_as_ok) (gcry_cv_gcc_platform_as_ok_for_intel_syntax) (gcry_cv_cc_arm_arch_is_v6, gcry_cv_gcc_inline_asm_neon) (gcry_cv_gcc_inline_asm_aarch32_crypto) (gcry_cv_gcc_inline_asm_aarch64_neon) (gcry_cv_gcc_inline_asm_aarch64_crypto) (gcry_cv_cc_ppc_altivec, gcry_cv_gcc_inline_asm_ppc_altivec) (gcry_cv_gcc_inline_asm_ppc_arch_3_00): Check for "try_asm_modules". * mpi/config.links: Set "mpi_cpu_arch" to "disabled" with --disable-asm. -- Signed-off-by: Jussi Kivilinna --- configure.ac | 86 +++++++++++++++++++++++++++++++----------------- mpi/config.links | 1 + 2 files changed, 57 insertions(+), 30 deletions(-) diff --git a/configure.ac b/configure.ac index 3bf0179e..0c9100bf 100644 --- a/configure.ac +++ b/configure.ac @@ -535,10 +535,10 @@ AM_CONDITIONAL(USE_RANDOM_DAEMON, test x$use_random_daemon = xyes) # Implementation of --disable-asm. -AC_MSG_CHECKING([whether MPI assembler modules are requested]) +AC_MSG_CHECKING([whether MPI and cipher assembler modules are requested]) AC_ARG_ENABLE([asm], AC_HELP_STRING([--disable-asm], - [Disable MPI assembler modules]), + [Disable MPI and cipher assembler modules]), [try_asm_modules=$enableval], [try_asm_modules=yes]) AC_MSG_RESULT($try_asm_modules) @@ -1140,9 +1140,12 @@ fi # AC_CACHE_CHECK([whether GCC assembler is compatible for ARM assembly implementations], [gcry_cv_gcc_arm_platform_as_ok], - [gcry_cv_gcc_arm_platform_as_ok=no - AC_COMPILE_IFELSE([AC_LANG_SOURCE( - [[__asm__( + [if test "$try_asm_modules" != "yes" ; then + gcry_cv_gcc_arm_platform_as_ok="n/a" + else + gcry_cv_gcc_arm_platform_as_ok=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[__asm__( /* Test if assembler supports UAL syntax. */ ".syntax unified\n\t" ".arm\n\t" /* our assembly code is in ARM mode */ @@ -1153,8 +1156,9 @@ AC_CACHE_CHECK([whether GCC assembler is compatible for ARM assembly implementat /* Test if '.type' and '.size' are supported. 
*/ ".size asmfunc,.-asmfunc;\n\t" ".type asmfunc,%function;\n\t" - );]])], - [gcry_cv_gcc_arm_platform_as_ok=yes])]) + );]])], + [gcry_cv_gcc_arm_platform_as_ok=yes]) + fi]) if test "$gcry_cv_gcc_arm_platform_as_ok" = "yes" ; then AC_DEFINE(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS,1, [Defined if underlying assembler is compatible with ARM assembly implementations]) @@ -1168,15 +1172,19 @@ fi # AC_CACHE_CHECK([whether GCC assembler is compatible for ARMv8/Aarch64 assembly implementations], [gcry_cv_gcc_aarch64_platform_as_ok], - [gcry_cv_gcc_aarch64_platform_as_ok=no - AC_COMPILE_IFELSE([AC_LANG_SOURCE( - [[__asm__( + [if test "$try_asm_modules" != "yes" ; then + gcry_cv_gcc_aarch64_platform_as_ok="n/a" + else + gcry_cv_gcc_aarch64_platform_as_ok=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[__asm__( "asmfunc:\n\t" "eor x0, x0, x30, ror #12;\n\t" "add x0, x0, x30, asr #12;\n\t" "eor v0.16b, v0.16b, v31.16b;\n\t" - );]])], - [gcry_cv_gcc_aarch64_platform_as_ok=yes])]) + );]])], + [gcry_cv_gcc_aarch64_platform_as_ok=yes]) + fi]) if test "$gcry_cv_gcc_aarch64_platform_as_ok" = "yes" ; then AC_DEFINE(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS,1, [Defined if underlying assembler is compatible with ARMv8/Aarch64 assembly implementations]) @@ -1383,7 +1391,8 @@ CFLAGS=$_gcc_cflags_save; # AC_CACHE_CHECK([whether GCC inline assembler supports SSSE3 instructions], [gcry_cv_gcc_inline_asm_ssse3], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_ssse3="n/a" else gcry_cv_gcc_inline_asm_ssse3=no @@ -1406,7 +1415,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports PCLMUL instructions], [gcry_cv_gcc_inline_asm_pclmul], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_pclmul="n/a" else gcry_cv_gcc_inline_asm_pclmul=no @@ -1427,7 +1437,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports SHA Extensions instructions], [gcry_cv_gcc_inline_asm_shaext], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_shaext="n/a" else gcry_cv_gcc_inline_asm_shaext=no @@ -1454,7 +1465,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports SSE4.1 instructions], [gcry_cv_gcc_inline_asm_sse41], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_sse41="n/a" else gcry_cv_gcc_inline_asm_sse41=no @@ -1476,7 +1488,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports AVX instructions], [gcry_cv_gcc_inline_asm_avx], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_avx="n/a" else gcry_cv_gcc_inline_asm_avx=no @@ -1497,7 +1510,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports AVX2 instructions], [gcry_cv_gcc_inline_asm_avx2], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_avx2="n/a" else gcry_cv_gcc_inline_asm_avx2=no @@ -1518,7 +1532,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports BMI2 instructions], [gcry_cv_gcc_inline_asm_bmi2], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_bmi2="n/a" else 
gcry_cv_gcc_inline_asm_bmi2=no @@ -1579,7 +1594,8 @@ fi if test $amd64_as_feature_detection = yes; then AC_CACHE_CHECK([whether GCC assembler is compatible for amd64 assembly implementations], [gcry_cv_gcc_amd64_platform_as_ok], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_amd64_platform_as_ok="n/a" else gcry_cv_gcc_amd64_platform_as_ok=no @@ -1629,7 +1645,8 @@ fi # AC_CACHE_CHECK([whether GCC assembler is compatible for Intel syntax assembly implementations], [gcry_cv_gcc_platform_as_ok_for_intel_syntax], - [if test "$mpi_cpu_arch" != "x86" ; then + [if test "$mpi_cpu_arch" != "x86" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_platform_as_ok_for_intel_syntax="n/a" else gcry_cv_gcc_platform_as_ok_for_intel_syntax=no @@ -1666,7 +1683,8 @@ fi # AC_CACHE_CHECK([whether compiler is configured for ARMv6 or newer architecture], [gcry_cv_cc_arm_arch_is_v6], - [if test "$mpi_cpu_arch" != "arm" ; then + [if test "$mpi_cpu_arch" != "arm" || + test "$try_asm_modules" != "yes" ; then gcry_cv_cc_arm_arch_is_v6="n/a" else gcry_cv_cc_arm_arch_is_v6=no @@ -1699,7 +1717,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports NEON instructions], [gcry_cv_gcc_inline_asm_neon], - [if test "$mpi_cpu_arch" != "arm" ; then + [if test "$mpi_cpu_arch" != "arm" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_neon="n/a" else gcry_cv_gcc_inline_asm_neon=no @@ -1727,7 +1746,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports AArch32 Crypto Extension instructions], [gcry_cv_gcc_inline_asm_aarch32_crypto], - [if test "$mpi_cpu_arch" != "arm" ; then + [if test "$mpi_cpu_arch" != "arm" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_aarch32_crypto="n/a" else gcry_cv_gcc_inline_asm_aarch32_crypto=no @@ -1771,7 +1791,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 NEON instructions], [gcry_cv_gcc_inline_asm_aarch64_neon], - [if test "$mpi_cpu_arch" != "aarch64" ; then + [if test "$mpi_cpu_arch" != "aarch64" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_aarch64_neon="n/a" else gcry_cv_gcc_inline_asm_aarch64_neon=no @@ -1796,7 +1817,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 Crypto Extension instructions], [gcry_cv_gcc_inline_asm_aarch64_crypto], - [if test "$mpi_cpu_arch" != "aarch64" ; then + [if test "$mpi_cpu_arch" != "aarch64" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_aarch64_crypto="n/a" else gcry_cv_gcc_inline_asm_aarch64_crypto=no @@ -1842,7 +1864,8 @@ fi # AC_CACHE_CHECK([whether compiler supports PowerPC AltiVec/VSX intrinsics], [gcry_cv_cc_ppc_altivec], - [if test "$mpi_cpu_arch" != "ppc" ; then + [if test "$mpi_cpu_arch" != "ppc" || + test "$try_asm_modules" != "yes" ; then gcry_cv_cc_ppc_altivec="n/a" else gcry_cv_cc_ppc_altivec=no @@ -1868,7 +1891,8 @@ _gcc_cflags_save=$CFLAGS CFLAGS="$CFLAGS -maltivec -mvsx -mcrypto" if test "$gcry_cv_cc_ppc_altivec" = "no" && - test "$mpi_cpu_arch" = "ppc" ; then + test "$mpi_cpu_arch" = "ppc" && + test "$try_asm_modules" == "yes" ; then AC_CACHE_CHECK([whether compiler supports PowerPC AltiVec/VSX/crypto intrinsics with extra GCC flags], [gcry_cv_cc_ppc_altivec_cflags], [gcry_cv_cc_ppc_altivec_cflags=no @@ -1903,7 +1927,8 @@ CFLAGS=$_gcc_cflags_save; # AC_CACHE_CHECK([whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto instructions], [gcry_cv_gcc_inline_asm_ppc_altivec], - [if test "$mpi_cpu_arch" != 
"ppc" ; then + [if test "$mpi_cpu_arch" != "ppc" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_ppc_altivec="n/a" else gcry_cv_gcc_inline_asm_ppc_altivec=no @@ -1933,7 +1958,8 @@ fi # AC_CACHE_CHECK([whether GCC inline assembler supports PowerISA 3.00 instructions], [gcry_cv_gcc_inline_asm_ppc_arch_3_00], - [if test "$mpi_cpu_arch" != "ppc" ; then + [if test "$mpi_cpu_arch" != "ppc" || + test "$try_asm_modules" != "yes" ; then gcry_cv_gcc_inline_asm_ppc_arch_3_00="n/a" else gcry_cv_gcc_inline_asm_ppc_arch_3_00=no diff --git a/mpi/config.links b/mpi/config.links index 4f43b732..ce6822db 100644 --- a/mpi/config.links +++ b/mpi/config.links @@ -375,6 +375,7 @@ if test "$try_asm_modules" != "yes" ; then path="" mpi_sflags="" mpi_extra_modules="" + mpi_cpu_arch="disabled" fi # Make sure that mpi_cpu_arch is not the empty string. -- 2.25.1 From jussi.kivilinna at iki.fi Wed Jun 3 22:08:38 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Wed, 3 Jun 2020 23:08:38 +0300 Subject: [PATCH 2/2] rijndael: fix UBSAN warning on left shift by 24 places with type 'int' In-Reply-To: <20200603200838.562876-1-jussi.kivilinna@iki.fi> References: <20200603200838.562876-1-jussi.kivilinna@iki.fi> Message-ID: <20200603200838.562876-2-jussi.kivilinna@iki.fi> * cipher/rijndael.c (do_encrypt_fn, do_decrypt_fn): Cast final sbox/inv_sbox look-ups to 'u32' type. -- Fixes following type of UBSAN errors seen from generic C-implementation of rijndael: runtime error: left shift of by 24 places cannot be represented\ in type 'int' where is greater than 127. Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 64 +++++++++++++++++++++++------------------------ 1 file changed, 32 insertions(+), 32 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index a1c4cfc1..3e9bae55 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -886,28 +886,28 @@ do_encrypt_fn (const RIJNDAEL_context *ctx, unsigned char *b, /* Last round is special. 
*/ - sb[0] = (sbox[(byte)(sa[0] >> (0 * 8)) * 4]) << (0 * 8); - sb[3] = (sbox[(byte)(sa[0] >> (1 * 8)) * 4]) << (1 * 8); - sb[2] = (sbox[(byte)(sa[0] >> (2 * 8)) * 4]) << (2 * 8); - sb[1] = (sbox[(byte)(sa[0] >> (3 * 8)) * 4]) << (3 * 8); + sb[0] = ((u32)sbox[(byte)(sa[0] >> (0 * 8)) * 4]) << (0 * 8); + sb[3] = ((u32)sbox[(byte)(sa[0] >> (1 * 8)) * 4]) << (1 * 8); + sb[2] = ((u32)sbox[(byte)(sa[0] >> (2 * 8)) * 4]) << (2 * 8); + sb[1] = ((u32)sbox[(byte)(sa[0] >> (3 * 8)) * 4]) << (3 * 8); sa[0] = rk[r][0] ^ sb[0]; - sb[1] ^= (sbox[(byte)(sa[1] >> (0 * 8)) * 4]) << (0 * 8); - sa[0] ^= (sbox[(byte)(sa[1] >> (1 * 8)) * 4]) << (1 * 8); - sb[3] ^= (sbox[(byte)(sa[1] >> (2 * 8)) * 4]) << (2 * 8); - sb[2] ^= (sbox[(byte)(sa[1] >> (3 * 8)) * 4]) << (3 * 8); + sb[1] ^= ((u32)sbox[(byte)(sa[1] >> (0 * 8)) * 4]) << (0 * 8); + sa[0] ^= ((u32)sbox[(byte)(sa[1] >> (1 * 8)) * 4]) << (1 * 8); + sb[3] ^= ((u32)sbox[(byte)(sa[1] >> (2 * 8)) * 4]) << (2 * 8); + sb[2] ^= ((u32)sbox[(byte)(sa[1] >> (3 * 8)) * 4]) << (3 * 8); sa[1] = rk[r][1] ^ sb[1]; - sb[2] ^= (sbox[(byte)(sa[2] >> (0 * 8)) * 4]) << (0 * 8); - sa[1] ^= (sbox[(byte)(sa[2] >> (1 * 8)) * 4]) << (1 * 8); - sa[0] ^= (sbox[(byte)(sa[2] >> (2 * 8)) * 4]) << (2 * 8); - sb[3] ^= (sbox[(byte)(sa[2] >> (3 * 8)) * 4]) << (3 * 8); + sb[2] ^= ((u32)sbox[(byte)(sa[2] >> (0 * 8)) * 4]) << (0 * 8); + sa[1] ^= ((u32)sbox[(byte)(sa[2] >> (1 * 8)) * 4]) << (1 * 8); + sa[0] ^= ((u32)sbox[(byte)(sa[2] >> (2 * 8)) * 4]) << (2 * 8); + sb[3] ^= ((u32)sbox[(byte)(sa[2] >> (3 * 8)) * 4]) << (3 * 8); sa[2] = rk[r][2] ^ sb[2]; - sb[3] ^= (sbox[(byte)(sa[3] >> (0 * 8)) * 4]) << (0 * 8); - sa[2] ^= (sbox[(byte)(sa[3] >> (1 * 8)) * 4]) << (1 * 8); - sa[1] ^= (sbox[(byte)(sa[3] >> (2 * 8)) * 4]) << (2 * 8); - sa[0] ^= (sbox[(byte)(sa[3] >> (3 * 8)) * 4]) << (3 * 8); + sb[3] ^= ((u32)sbox[(byte)(sa[3] >> (0 * 8)) * 4]) << (0 * 8); + sa[2] ^= ((u32)sbox[(byte)(sa[3] >> (1 * 8)) * 4]) << (1 * 8); + sa[1] ^= ((u32)sbox[(byte)(sa[3] >> (2 * 8)) * 4]) << (2 * 8); + sa[0] ^= ((u32)sbox[(byte)(sa[3] >> (3 * 8)) * 4]) << (3 * 8); sa[3] = rk[r][3] ^ sb[3]; buf_put_le32(b + 0, sa[0]); @@ -1286,28 +1286,28 @@ do_decrypt_fn (const RIJNDAEL_context *ctx, unsigned char *b, sa[3] = rk[1][3] ^ sb[3]; /* Last round is special. 
*/ - sb[0] = inv_sbox[(byte)(sa[0] >> (0 * 8))] << (0 * 8); - sb[1] = inv_sbox[(byte)(sa[0] >> (1 * 8))] << (1 * 8); - sb[2] = inv_sbox[(byte)(sa[0] >> (2 * 8))] << (2 * 8); - sb[3] = inv_sbox[(byte)(sa[0] >> (3 * 8))] << (3 * 8); + sb[0] = (u32)inv_sbox[(byte)(sa[0] >> (0 * 8))] << (0 * 8); + sb[1] = (u32)inv_sbox[(byte)(sa[0] >> (1 * 8))] << (1 * 8); + sb[2] = (u32)inv_sbox[(byte)(sa[0] >> (2 * 8))] << (2 * 8); + sb[3] = (u32)inv_sbox[(byte)(sa[0] >> (3 * 8))] << (3 * 8); sa[0] = sb[0] ^ rk[0][0]; - sb[1] ^= inv_sbox[(byte)(sa[1] >> (0 * 8))] << (0 * 8); - sb[2] ^= inv_sbox[(byte)(sa[1] >> (1 * 8))] << (1 * 8); - sb[3] ^= inv_sbox[(byte)(sa[1] >> (2 * 8))] << (2 * 8); - sa[0] ^= inv_sbox[(byte)(sa[1] >> (3 * 8))] << (3 * 8); + sb[1] ^= (u32)inv_sbox[(byte)(sa[1] >> (0 * 8))] << (0 * 8); + sb[2] ^= (u32)inv_sbox[(byte)(sa[1] >> (1 * 8))] << (1 * 8); + sb[3] ^= (u32)inv_sbox[(byte)(sa[1] >> (2 * 8))] << (2 * 8); + sa[0] ^= (u32)inv_sbox[(byte)(sa[1] >> (3 * 8))] << (3 * 8); sa[1] = sb[1] ^ rk[0][1]; - sb[2] ^= inv_sbox[(byte)(sa[2] >> (0 * 8))] << (0 * 8); - sb[3] ^= inv_sbox[(byte)(sa[2] >> (1 * 8))] << (1 * 8); - sa[0] ^= inv_sbox[(byte)(sa[2] >> (2 * 8))] << (2 * 8); - sa[1] ^= inv_sbox[(byte)(sa[2] >> (3 * 8))] << (3 * 8); + sb[2] ^= (u32)inv_sbox[(byte)(sa[2] >> (0 * 8))] << (0 * 8); + sb[3] ^= (u32)inv_sbox[(byte)(sa[2] >> (1 * 8))] << (1 * 8); + sa[0] ^= (u32)inv_sbox[(byte)(sa[2] >> (2 * 8))] << (2 * 8); + sa[1] ^= (u32)inv_sbox[(byte)(sa[2] >> (3 * 8))] << (3 * 8); sa[2] = sb[2] ^ rk[0][2]; - sb[3] ^= inv_sbox[(byte)(sa[3] >> (0 * 8))] << (0 * 8); - sa[0] ^= inv_sbox[(byte)(sa[3] >> (1 * 8))] << (1 * 8); - sa[1] ^= inv_sbox[(byte)(sa[3] >> (2 * 8))] << (2 * 8); - sa[2] ^= inv_sbox[(byte)(sa[3] >> (3 * 8))] << (3 * 8); + sb[3] ^= (u32)inv_sbox[(byte)(sa[3] >> (0 * 8))] << (0 * 8); + sa[0] ^= (u32)inv_sbox[(byte)(sa[3] >> (1 * 8))] << (1 * 8); + sa[1] ^= (u32)inv_sbox[(byte)(sa[3] >> (2 * 8))] << (2 * 8); + sa[2] ^= (u32)inv_sbox[(byte)(sa[3] >> (3 * 8))] << (3 * 8); sa[3] = sb[3] ^ rk[0][3]; buf_put_le32(b + 0, sa[0]); -- 2.25.1 From wk at gnupg.org Fri Jun 5 10:32:30 2020 From: wk at gnupg.org (Werner Koch) Date: Fri, 05 Jun 2020 10:32:30 +0200 Subject: Decrypt using BcryptDecrypt In-Reply-To: (Mandar Apte via Gcrypt-devel's message of "Tue, 2 Jun 2020 16:57:23 +0530") References: Message-ID: <87k10l6hgh.fsf@wheatstone.g10code.de> On Tue, 2 Jun 2020 16:57, Mandar Apte said: > On windows I am using Bcrypt library which also supports AES 256 in CBC > mode. FWIW, Libgcrypt runs very well on Windows. > Hence, I wanted to check, if the Libgcrypt APIs are doing padding > internally since I am not passing any such instruction to the Libgcrypt > library explicitly? No, Libgcrypt does not do any padding and it expects complete blocks. gcry_cipher_get_algo_blklen() tells you the block length of the cipher algorithm. There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but in this case you need to pass the entire plaintext/ciphertext to the encrypt/decrypt function; there is no way to do this incremental. For the standard padding as used in CMS (S/MIME), you need to handle the padding in your code; here is a snippet if (last_block_is_incomplete) { int i, int npad = blklen - (buflen % blklen); p = buffer; for (n=buflen, i=0; n < bufsize && i < npad; n++, i++) p[n] = npad; gcry_cipher_encrypt (chd, buffer, n, buffer, n); } Shalom-Salam, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. 
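For reference, a self-contained sketch of that padding step combined with a CBC encryption follows. It is only an illustration: the helper name encrypt_cbc_pkcs7 and its parameters are made up for this example, and the caller is assumed to provide a buffer with at least one spare block of room. The padding written here is PKCS#7 style, which is what BCryptDecrypt strips when it is called with the BCRYPT_BLOCK_PADDING flag.

  #include <gcrypt.h>

  /* Pad BUFFER (BUFLEN used bytes, BUFSIZE total) to a multiple of the
     AES block size with PKCS#7 bytes and encrypt it in place with
     AES-256 in CBC mode.  The padded length is returned in R_OUTLEN.  */
  static gcry_error_t
  encrypt_cbc_pkcs7 (const void *key, size_t keylen, const void *iv,
                     unsigned char *buffer, size_t buflen,
                     size_t bufsize, size_t *r_outlen)
  {
    gcry_cipher_hd_t hd;
    gcry_error_t err;
    size_t blklen = gcry_cipher_get_algo_blklen (GCRY_CIPHER_AES256);
    size_t npad = blklen - (buflen % blklen);   /* Always 1..blklen.  */
    size_t n;

    if (buflen + npad > bufsize)
      return gpg_error (GPG_ERR_BUFFER_TOO_SHORT);
    for (n = buflen; n < buflen + npad; n++)
      buffer[n] = npad;                         /* PKCS#7 padding byte.  */
    *r_outlen = buflen + npad;

    err = gcry_cipher_open (&hd, GCRY_CIPHER_AES256,
                            GCRY_CIPHER_MODE_CBC, 0);
    if (err)
      return err;
    err = gcry_cipher_setkey (hd, key, keylen);
    if (!err)
      err = gcry_cipher_setiv (hd, iv, blklen);
    if (!err)
      /* Passing NULL/0 as input encrypts BUFFER in place.  */
      err = gcry_cipher_encrypt (hd, buffer, *r_outlen, NULL, 0);
    gcry_cipher_close (hd);
    return err;
  }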
-------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 227 bytes Desc: not available URL: From mandar.apte409 at gmail.com Fri Jun 5 15:46:47 2020 From: mandar.apte409 at gmail.com (Mandar Apte) Date: Fri, 5 Jun 2020 19:16:47 +0530 Subject: Decrypt using BcryptDecrypt In-Reply-To: <87k10l6hgh.fsf@wheatstone.g10code.de> References: <87k10l6hgh.fsf@wheatstone.g10code.de> Message-ID: Hello Werner, Thank you very much for the response. The way you have shown in the email chain below, I had done same thing in my code as well. Also, I am passing the data of block length size only to gcry_cipher_encrypt and gcry_cipher_decrypt APIs. Now, my goal is to check, if the AES256 encryption/decryption is same for libgcrypt and Bcrypt library. Thats the reason I am trying to decrypt the data, which was encrypted using Libgcrypt APIs, using Bcrypt APIs on windows. I am pretty sure if I use windows version of Libgcrypt my problem wont be there at all. I think I myself have to handle the padding while encrypting using Libgcrypt library APIs. Since, I have to handle padding in my code, is there any APIs in libgcrypt with which I ensure that I am padding the data in standard way? Are there any APIs in Libgcrypt using which I can get padded data along with my plain text data which I can encrypt using gcry_cipher_encrypt? Thank you in advance. Best Regards, Mandar On Fri, 5 Jun 2020, 2:05 pm Werner Koch, wrote: > On Tue, 2 Jun 2020 16:57, Mandar Apte said: > > On windows I am using Bcrypt library which also supports AES 256 in CBC > > mode. > > FWIW, Libgcrypt runs very well on Windows. > > > Hence, I wanted to check, if the Libgcrypt APIs are doing padding > > internally since I am not passing any such instruction to the Libgcrypt > > library explicitly? > > No, Libgcrypt does not do any padding and it expects complete blocks. > gcry_cipher_get_algo_blklen() tells you the block length of the cipher > algorithm. > > There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but > in this case you need to pass the entire plaintext/ciphertext to the > encrypt/decrypt function; there is no way to do this incremental. > > For the standard padding as used in CMS (S/MIME), you need to handle the > padding in your code; here is a snippet > > if (last_block_is_incomplete) > { > int i, > int npad = blklen - (buflen % blklen); > > p = buffer; > for (n=buflen, i=0; n < bufsize && i < npad; n++, i++) > p[n] = npad; > gcry_cipher_encrypt (chd, buffer, n, buffer, n); > } > > > > Shalom-Salam, > > Werner > > > -- > Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tianjia.zhang at linux.alibaba.com Mon Jun 8 13:00:50 2020 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Mon, 8 Jun 2020 19:00:50 +0800 Subject: [PATCH 0/1] Add SM4 symmetric cipher algorithm Message-ID: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com> SM4 (GBT.32907-2016) is a cryptographic standard issued by the Organization of State Commercial Administration of China (OSCCA) as an authorized cryptographic algorithms for the use within China. SMS4 was originally created for use in protecting wireless networks, and is mandated in the Chinese National Standard for Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure) (GB.15629.11-2003). 
Tianjia Zhang (1): Add SM4 symmetric cipher algorithm cipher/Makefile.am | 1 + cipher/cipher.c | 8 ++ cipher/sm4.c | 270 +++++++++++++++++++++++++++++++++++++++++++++ configure.ac | 7 ++ src/cipher.h | 1 + src/gcrypt.h.in | 3 +- tests/basic.c | 3 + 7 files changed, 292 insertions(+), 1 deletion(-) create mode 100644 cipher/sm4.c -- 2.17.1 From tianjia.zhang at linux.alibaba.com Mon Jun 8 13:00:51 2020 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Mon, 8 Jun 2020 19:00:51 +0800 Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm In-Reply-To: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com> References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c. * cipher/cipher.c (cipher_list, cipher_list_algo301): Add _gcry_cipher_spec_sm4. * cipher/sm4.c: New. * configure.ac (available_ciphers): Add sm4. * src/cipher.h: Add declarations for SM4. * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4. * tests/basic.c (check_ciphers): Add sm4 check. Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 1 + cipher/cipher.c | 8 ++ cipher/sm4.c | 270 +++++++++++++++++++++++++++++++++++++++++++++ configure.ac | 7 ++ src/cipher.h | 1 + src/gcrypt.h.in | 3 +- tests/basic.c | 3 + 7 files changed, 292 insertions(+), 1 deletion(-) create mode 100644 cipher/sm4.c diff --git a/cipher/Makefile.am b/cipher/Makefile.am index ef83cc74..56661dcd 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -107,6 +107,7 @@ EXTRA_libcipher_la_SOURCES = \ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S \ + sm4.c \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/cipher.c b/cipher/cipher.c index edcb421a..dfb083a0 100644 --- a/cipher/cipher.c +++ b/cipher/cipher.c @@ -87,6 +87,9 @@ static gcry_cipher_spec_t * const cipher_list[] = #endif #if USE_CHACHA20 &_gcry_cipher_spec_chacha20, +#endif +#if USE_SM4 + &_gcry_cipher_spec_sm4, #endif NULL }; @@ -202,6 +205,11 @@ static gcry_cipher_spec_t * const cipher_list_algo301[] = &_gcry_cipher_spec_gost28147_mesh, #else NULL, +#endif +#if USE_SM4 + &_gcry_cipher_spec_sm4, +#else + NULL, #endif }; diff --git a/cipher/sm4.c b/cipher/sm4.c new file mode 100644 index 00000000..a1bdca10 --- /dev/null +++ b/cipher/sm4.c @@ -0,0 +1,270 @@ +/* sm4.c - SM4 Cipher Algorithm + * Copyright (C) 2020 Alibaba Group. + * Copyright (C) 2020 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +#include +#include +#include + +#include "types.h" /* for byte and u32 typedefs */ +#include "bithelp.h" +#include "g10lib.h" +#include "cipher.h" + +typedef struct +{ + u32 rkey_enc[32]; + u32 rkey_dec[32]; +} SM4_context; + +static const u32 fk[4] = { + 0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc +}; + +static const byte sbox[256] = { + 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7, + 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05, + 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3, + 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99, + 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a, + 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62, + 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95, + 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6, + 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba, + 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8, + 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b, + 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35, + 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2, + 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87, + 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52, + 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e, + 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5, + 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1, + 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55, + 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3, + 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60, + 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f, + 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f, + 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51, + 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f, + 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8, + 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd, + 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0, + 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e, + 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84, + 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20, + 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 +}; + +static const u32 ck[] = { + 0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269, + 0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9, + 0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249, + 0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9, + 0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229, + 0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299, + 0xa0a7aeb5, 0xbcc3cad1, 0xd8dfe6ed, 0xf4fb0209, + 0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279 +}; + +static u32 sm4_t_non_lin_sub(u32 x) +{ + int i; + byte *b = (byte *)&x; + + for (i = 0; i < 4; ++i) + b[i] = sbox[b[i]]; + + return x; +} + +static u32 sm4_key_lin_sub(u32 x) +{ + return x ^ rol(x, 13) ^ rol(x, 23); +} + +static u32 sm4_enc_lin_sub(u32 x) +{ + return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24); +} + +static u32 sm4_key_sub(u32 x) +{ + return sm4_key_lin_sub(sm4_t_non_lin_sub(x)); +} + +static u32 sm4_enc_sub(u32 x) +{ + return sm4_enc_lin_sub(sm4_t_non_lin_sub(x)); +} + +static u32 sm4_round(const u32 *x, const u32 rk) +{ + return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk); +} + +static gcry_err_code_t +sm4_expand_key (SM4_context *ctx, const u32 *key, const unsigned keylen) +{ + u32 rk[4], t; + int i; + + if (keylen != 16) + return GPG_ERR_INV_KEYLEN; + + for (i = 0; i < 4; ++i) + rk[i] = be_bswap32(key[i]) ^ fk[i]; + + for (i = 0; i < 32; ++i) + { + t = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i]); + ctx->rkey_enc[i] = t; + rk[0] = rk[1]; + rk[1] = rk[2]; + rk[2] = rk[3]; + rk[3] = t; + } + + for (i = 0; i < 32; ++i) + ctx->rkey_dec[i] = ctx->rkey_enc[31 - i]; + + return 0; +} + +static gcry_err_code_t +sm4_setkey 
(void *context, const byte *key, const unsigned keylen, + gcry_cipher_hd_t hd) +{ + SM4_context *ctx = context; + (void)hd; + return sm4_expand_key (ctx, (const u32 *)key, keylen); +} + +static void +sm4_do_crypt (const u32 *rk, u32 *out, const u32 *in) +{ + u32 x[4], t; + int i; + + for (i = 0; i < 4; ++i) + x[i] = be_bswap32(in[i]); + + for (i = 0; i < 32; ++i) + { + t = sm4_round(x, rk[i]); + x[0] = x[1]; + x[1] = x[2]; + x[2] = x[3]; + x[3] = t; + } + + for (i = 0; i < 4; ++i) + out[i] = be_bswap32(x[3 - i]); +} + +static unsigned int +sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) +{ + SM4_context *ctx = context; + + sm4_do_crypt (ctx->rkey_enc, (u32 *)outbuf, (const u32 *)inbuf); + return 0; +} + +static unsigned int +sm4_decrypt (void *context, byte *outbuf, const byte *inbuf) +{ + SM4_context *ctx = context; + + sm4_do_crypt (ctx->rkey_dec, (u32 *)outbuf, (const u32 *)inbuf); + return 0; +} + +static const char * +sm4_selftest (void) +{ + SM4_context ctx; + byte scratch[16]; + + static const byte plaintext[16] = { + 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, + 0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10, + }; + static const byte key[16] = { + 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, + 0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10, + }; + static const byte ciphertext[16] = { + 0x68, 0x1E, 0xDF, 0x34, 0xD2, 0x06, 0x96, 0x5E, + 0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46 + }; + + sm4_setkey (&ctx, key, sizeof (key), NULL); + sm4_encrypt (&ctx, scratch, plaintext); + if (memcmp (scratch, ciphertext, sizeof (ciphertext))) + return "SM4 test encryption failed."; + sm4_decrypt (&ctx, scratch, scratch); + if (memcmp (scratch, plaintext, sizeof (plaintext))) + return "SM4 test decryption failed."; + + return NULL; +} + +static gpg_err_code_t +run_selftests (int algo, int extended, selftest_report_func_t report) +{ + const char *what; + const char *errtxt; + + (void)extended; + + if (algo != GCRY_CIPHER_SM4) + return GPG_ERR_CIPHER_ALGO; + + what = "selftest"; + errtxt = sm4_selftest (); + if (errtxt) + goto failed; + + return 0; + + failed: + if (report) + report ("cipher", GCRY_CIPHER_SM4, what, errtxt); + return GPG_ERR_SELFTEST_FAILED; +} + +static gcry_cipher_oid_spec_t sm4_oids[] = + { + { "1.2.156.10197.1.104.1", GCRY_CIPHER_MODE_ECB }, + { "1.2.156.10197.1.104.2", GCRY_CIPHER_MODE_CBC }, + { "1.2.156.10197.1.104.3", GCRY_CIPHER_MODE_OFB }, + { "1.2.156.10197.1.104.4", GCRY_CIPHER_MODE_CFB }, + { NULL } + }; + +gcry_cipher_spec_t _gcry_cipher_spec_sm4 = + { + GCRY_CIPHER_SM4, {0, 0}, + "SM4", NULL, sm4_oids, 16, 128, + sizeof (SM4_context), + sm4_setkey, sm4_encrypt, sm4_decrypt, + NULL, NULL, + run_selftests + }; diff --git a/configure.ac b/configure.ac index 3bf0179e..472758b5 100644 --- a/configure.ac +++ b/configure.ac @@ -212,6 +212,7 @@ LIBGCRYPT_CONFIG_HOST="$host" # Definitions for symmetric ciphers. available_ciphers="arcfour blowfish cast5 des aes twofish serpent rfc2268 seed" available_ciphers="$available_ciphers camellia idea salsa20 gost28147 chacha20" +available_ciphers="$available_ciphers sm4" enabled_ciphers="" # Definitions for public-key ciphers. 
@@ -2533,6 +2534,12 @@ if test "$found" = "1" ; then fi fi +LIST_MEMBER(sm4, $enabled_ciphers) +if test "$found" = "1" ; then + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo" + AC_DEFINE(USE_SM4, 1, [Defined if this module should be included]) +fi + LIST_MEMBER(dsa, $enabled_pubkey_ciphers) if test "$found" = "1" ; then GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dsa.lo" diff --git a/src/cipher.h b/src/cipher.h index 20ccb8c5..c49bbda5 100644 --- a/src/cipher.h +++ b/src/cipher.h @@ -302,6 +302,7 @@ extern gcry_cipher_spec_t _gcry_cipher_spec_salsa20r12; extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147; extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147_mesh; extern gcry_cipher_spec_t _gcry_cipher_spec_chacha20; +extern gcry_cipher_spec_t _gcry_cipher_spec_sm4; /* Declarations for the digest specifications. */ extern gcry_md_spec_t _gcry_digest_spec_crc32; diff --git a/src/gcrypt.h.in b/src/gcrypt.h.in index c0132189..9ddef17b 100644 --- a/src/gcrypt.h.in +++ b/src/gcrypt.h.in @@ -946,7 +946,8 @@ enum gcry_cipher_algos GCRY_CIPHER_SALSA20R12 = 314, GCRY_CIPHER_GOST28147 = 315, GCRY_CIPHER_CHACHA20 = 316, - GCRY_CIPHER_GOST28147_MESH = 317 /* GOST 28247 with optional CryptoPro keymeshing */ + GCRY_CIPHER_GOST28147_MESH = 317, /* GOST 28247 with optional CryptoPro keymeshing */ + GCRY_CIPHER_SM4 = 318 }; /* The Rijndael algorithm is basically AES, so provide some macros. */ diff --git a/tests/basic.c b/tests/basic.c index 2dee1bee..6f2945a5 100644 --- a/tests/basic.c +++ b/tests/basic.c @@ -9444,6 +9444,9 @@ check_ciphers (void) #if USE_GOST28147 GCRY_CIPHER_GOST28147, GCRY_CIPHER_GOST28147_MESH, +#endif +#if USE_SM4 + GCRY_CIPHER_SM4, #endif 0 }; -- 2.17.1 From jussi.kivilinna at iki.fi Tue Jun 9 19:59:51 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 9 Jun 2020 20:59:51 +0300 Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm In-Reply-To: <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com> References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com> <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com> Message-ID: <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi> Hello, Patch looks mostly good. I have add few comments below. On 8.6.2020 14.00, Tianjia Zhang via Gcrypt-devel wrote: > * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c. > * cipher/cipher.c (cipher_list, cipher_list_algo301): > Add _gcry_cipher_spec_sm4. > * cipher/sm4.c: New. > * configure.ac (available_ciphers): Add sm4. > * src/cipher.h: Add declarations for SM4. > * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4. > * tests/basic.c (check_ciphers): Add sm4 check. Please also add GCRY_MAC_CMAC_SM4 support in mac.c/mac-cmac.c. 
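(Concretely, adding CMAC support means registering a cipher-based MAC spec like the sketch below in mac-cmac.c, plus a GCRY_MAC_CMAC_SM4 ID in gcrypt.h.in and entries in the mac.c tables; the v2 patch later in this thread does exactly that.)

  #if USE_SM4
  gcry_mac_spec_t _gcry_mac_type_spec_cmac_sm4 = {
    GCRY_MAC_CMAC_SM4, {0, 0}, "CMAC_SM4",
    &cmac_ops
  };
  #endif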
> > Signed-off-by: Tianjia Zhang > --- > cipher/Makefile.am | 1 + > cipher/cipher.c | 8 ++ > cipher/sm4.c | 270 +++++++++++++++++++++++++++++++++++++++++++++ > configure.ac | 7 ++ > src/cipher.h | 1 + > src/gcrypt.h.in | 3 +- > tests/basic.c | 3 + > 7 files changed, 292 insertions(+), 1 deletion(-) > create mode 100644 cipher/sm4.c > > diff --git a/cipher/Makefile.am b/cipher/Makefile.am > index ef83cc74..56661dcd 100644 > --- a/cipher/Makefile.am > +++ b/cipher/Makefile.am > @@ -107,6 +107,7 @@ EXTRA_libcipher_la_SOURCES = \ > scrypt.c \ > seed.c \ > serpent.c serpent-sse2-amd64.S \ > + sm4.c \ > serpent-avx2-amd64.S serpent-armv7-neon.S \ > sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ > sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ > diff --git a/cipher/cipher.c b/cipher/cipher.c > index edcb421a..dfb083a0 100644 > --- a/cipher/cipher.c > +++ b/cipher/cipher.c > @@ -87,6 +87,9 @@ static gcry_cipher_spec_t * const cipher_list[] = > #endif > #if USE_CHACHA20 > &_gcry_cipher_spec_chacha20, > +#endif > +#if USE_SM4 > + &_gcry_cipher_spec_sm4, > #endif > NULL > }; > @@ -202,6 +205,11 @@ static gcry_cipher_spec_t * const cipher_list_algo301[] = > &_gcry_cipher_spec_gost28147_mesh, > #else > NULL, > +#endif > +#if USE_SM4 > + &_gcry_cipher_spec_sm4, > +#else > + NULL, > #endif > }; > > diff --git a/cipher/sm4.c b/cipher/sm4.c > new file mode 100644 > index 00000000..a1bdca10 > --- /dev/null > +++ b/cipher/sm4.c > @@ -0,0 +1,270 @@ > +/* sm4.c - SM4 Cipher Algorithm > + * Copyright (C) 2020 Alibaba Group. > + * Copyright (C) 2020 Tianjia Zhang > + * > + * This file is part of Libgcrypt. > + * > + * Libgcrypt is free software; you can redistribute it and/or modify > + * it under the terms of the GNU Lesser General Public License as > + * published by the Free Software Foundation; either version 2.1 of > + * the License, or (at your option) any later version. > + * > + * Libgcrypt is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this program; if not, see . 
> + */ > + > +#include > +#include > +#include > + > +#include "types.h" /* for byte and u32 typedefs */ > +#include "bithelp.h" > +#include "g10lib.h" > +#include "cipher.h" > + > +typedef struct > +{ > + u32 rkey_enc[32]; > + u32 rkey_dec[32]; > +} SM4_context; > + > +static const u32 fk[4] = { > + 0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc > +}; > + > +static const byte sbox[256] = { > + 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7, > + 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05, > + 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3, > + 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99, > + 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a, > + 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62, > + 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95, > + 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6, > + 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba, > + 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8, > + 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b, > + 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35, > + 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2, > + 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87, > + 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52, > + 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e, > + 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5, > + 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1, > + 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55, > + 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3, > + 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60, > + 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f, > + 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f, > + 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51, > + 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f, > + 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8, > + 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd, > + 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0, > + 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e, > + 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84, > + 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20, > + 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 > +}; > + > +static const u32 ck[] = { > + 0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269, > + 0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9, > + 0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249, > + 0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9, > + 0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229, > + 0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299, > + 0xa0a7aeb5, 0xbcc3cad1, 0xd8dfe6ed, 0xf4fb0209, > + 0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279 > +}; > + > +static u32 sm4_t_non_lin_sub(u32 x) > +{ > + int i; > + byte *b = (byte *)&x; > + > + for (i = 0; i < 4; ++i) > + b[i] = sbox[b[i]]; > + > + return x; > +} > + > +static u32 sm4_key_lin_sub(u32 x) > +{ > + return x ^ rol(x, 13) ^ rol(x, 23); > +} > + > +static u32 sm4_enc_lin_sub(u32 x) > +{ > + return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24); > +} > + > +static u32 sm4_key_sub(u32 x) > +{ > + return sm4_key_lin_sub(sm4_t_non_lin_sub(x)); > +} > + > +static u32 sm4_enc_sub(u32 x) > +{ > + return sm4_enc_lin_sub(sm4_t_non_lin_sub(x)); > +} > + > +static u32 sm4_round(const u32 *x, const u32 rk) > +{ > + return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk); > +} > + > +static gcry_err_code_t > +sm4_expand_key (SM4_context *ctx, const u32 *key, const unsigned keylen) > +{ > + u32 rk[4], t; > + int i; > + > + if (keylen != 16) > + return GPG_ERR_INV_KEYLEN; > + > + for (i = 0; i < 4; ++i) > + rk[i] = be_bswap32(key[i]) ^ fk[i]; > + > + for (i = 0; i < 32; ++i) > + { > + t = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ 
rk[3] ^ ck[i]); > + ctx->rkey_enc[i] = t; > + rk[0] = rk[1]; > + rk[1] = rk[2]; > + rk[2] = rk[3]; > + rk[3] = t; > + } > + > + for (i = 0; i < 32; ++i) > + ctx->rkey_dec[i] = ctx->rkey_enc[31 - i]; > + > + return 0; > +} > + > +static gcry_err_code_t > +sm4_setkey (void *context, const byte *key, const unsigned keylen, > + gcry_cipher_hd_t hd) > +{ > + SM4_context *ctx = context; > + (void)hd; > + return sm4_expand_key (ctx, (const u32 *)key, keylen); Casting byte pointer to word pointer here. 'key' here might not be aligned 4 bytes and can then cause seg-fault in 'sm4_expand_key' on architectures that do not handle unaligned memory accesses automatically. It's better to change 'sm4_expand_key' take 'key' as byte pointer and use 'buf_get_be32' for reading from it. > +} > + > +static void > +sm4_do_crypt (const u32 *rk, u32 *out, const u32 *in) Likewise, better to use byte pointer for 'out' and 'in' here and use 'buf_get_be32' and 'buf_put_be32' for reading and writing. > +{ > + u32 x[4], t; > + int i; > + > + for (i = 0; i < 4; ++i) > + x[i] = be_bswap32(in[i]); > + > + for (i = 0; i < 32; ++i) > + { > + t = sm4_round(x, rk[i]); > + x[0] = x[1]; > + x[1] = x[2]; > + x[2] = x[3]; > + x[3] = t; > + } > + > + for (i = 0; i < 4; ++i) > + out[i] = be_bswap32(x[3 - i]); > +} > + > +static unsigned int > +sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) > +{ > + SM4_context *ctx = context; > + > + sm4_do_crypt (ctx->rkey_enc, (u32 *)outbuf, (const u32 *)inbuf); > + return 0; > +} > + > +static unsigned int > +sm4_decrypt (void *context, byte *outbuf, const byte *inbuf) > +{ > + SM4_context *ctx = context; > + > + sm4_do_crypt (ctx->rkey_dec, (u32 *)outbuf, (const u32 *)inbuf); > + return 0; > +} Encrypt/decrypt functions should return 'stack burn' depth in bytes. Good estimate for this is size of variables + size of arguments + size of call return pointer. So in this case, "4*6+sizeof(void*)*4". 
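(A sketch of what these two review comments amount to, not the final patch: the key is read with buf_get_be32() so unaligned key buffers are safe, sm4_do_crypt is assumed to be changed likewise to take byte pointers and use buf_get_be32()/buf_put_be32(), sm4.c additionally includes "bufhelp.h", and the block functions return a stack-burn estimate.)

  static gcry_err_code_t
  sm4_expand_key (SM4_context *ctx, const byte *key, const unsigned keylen)
  {
    u32 rk[4], t;
    int i;

    if (keylen != 16)
      return GPG_ERR_INV_KEYLEN;

    /* buf_get_be32 handles unaligned reads, so KEY may have any
       alignment here.  */
    for (i = 0; i < 4; ++i)
      rk[i] = buf_get_be32 (key + 4 * i) ^ fk[i];

    for (i = 0; i < 32; ++i)
      {
        t = rk[0] ^ sm4_key_sub (rk[1] ^ rk[2] ^ rk[3] ^ ck[i]);
        ctx->rkey_enc[i] = t;
        rk[0] = rk[1];
        rk[1] = rk[2];
        rk[2] = rk[3];
        rk[3] = t;
      }

    for (i = 0; i < 32; ++i)
      ctx->rkey_dec[i] = ctx->rkey_enc[31 - i];

    return 0;
  }

  static unsigned int
  sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
  {
    SM4_context *ctx = context;

    sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf);

    /* Stack burn depth: local variables of sm4_do_crypt plus its
       arguments and return address, as estimated above.  */
    return 4*6 + sizeof(void*)*4;
  }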
> + > +static const char * > +sm4_selftest (void) > +{ > + SM4_context ctx; > + byte scratch[16]; > + > + static const byte plaintext[16] = { > + 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, > + 0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10, > + }; > + static const byte key[16] = { > + 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, > + 0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10, > + }; > + static const byte ciphertext[16] = { > + 0x68, 0x1E, 0xDF, 0x34, 0xD2, 0x06, 0x96, 0x5E, > + 0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46 > + }; > + > + sm4_setkey (&ctx, key, sizeof (key), NULL); > + sm4_encrypt (&ctx, scratch, plaintext); > + if (memcmp (scratch, ciphertext, sizeof (ciphertext))) > + return "SM4 test encryption failed."; > + sm4_decrypt (&ctx, scratch, scratch); > + if (memcmp (scratch, plaintext, sizeof (plaintext))) > + return "SM4 test decryption failed."; > + > + return NULL; > +} > + > +static gpg_err_code_t > +run_selftests (int algo, int extended, selftest_report_func_t report) > +{ > + const char *what; > + const char *errtxt; > + > + (void)extended; > + > + if (algo != GCRY_CIPHER_SM4) > + return GPG_ERR_CIPHER_ALGO; > + > + what = "selftest"; > + errtxt = sm4_selftest (); > + if (errtxt) > + goto failed; > + > + return 0; > + > + failed: > + if (report) > + report ("cipher", GCRY_CIPHER_SM4, what, errtxt); > + return GPG_ERR_SELFTEST_FAILED; > +} > + > +static gcry_cipher_oid_spec_t sm4_oids[] = > + { > + { "1.2.156.10197.1.104.1", GCRY_CIPHER_MODE_ECB }, > + { "1.2.156.10197.1.104.2", GCRY_CIPHER_MODE_CBC }, > + { "1.2.156.10197.1.104.3", GCRY_CIPHER_MODE_OFB }, > + { "1.2.156.10197.1.104.4", GCRY_CIPHER_MODE_CFB }, > + { NULL } > + }; > + > +gcry_cipher_spec_t _gcry_cipher_spec_sm4 = > + { > + GCRY_CIPHER_SM4, {0, 0}, > + "SM4", NULL, sm4_oids, 16, 128, > + sizeof (SM4_context), > + sm4_setkey, sm4_encrypt, sm4_decrypt, > + NULL, NULL, > + run_selftests > + }; > diff --git a/configure.ac b/configure.ac > index 3bf0179e..472758b5 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -212,6 +212,7 @@ LIBGCRYPT_CONFIG_HOST="$host" > # Definitions for symmetric ciphers. > available_ciphers="arcfour blowfish cast5 des aes twofish serpent rfc2268 seed" > available_ciphers="$available_ciphers camellia idea salsa20 gost28147 chacha20" > +available_ciphers="$available_ciphers sm4" > enabled_ciphers="" > > # Definitions for public-key ciphers. > @@ -2533,6 +2534,12 @@ if test "$found" = "1" ; then > fi > fi > > +LIST_MEMBER(sm4, $enabled_ciphers) > +if test "$found" = "1" ; then > + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo" > + AC_DEFINE(USE_SM4, 1, [Defined if this module should be included]) > +fi > + > LIST_MEMBER(dsa, $enabled_pubkey_ciphers) > if test "$found" = "1" ; then > GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dsa.lo" > diff --git a/src/cipher.h b/src/cipher.h > index 20ccb8c5..c49bbda5 100644 > --- a/src/cipher.h > +++ b/src/cipher.h > @@ -302,6 +302,7 @@ extern gcry_cipher_spec_t _gcry_cipher_spec_salsa20r12; > extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147; > extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147_mesh; > extern gcry_cipher_spec_t _gcry_cipher_spec_chacha20; > +extern gcry_cipher_spec_t _gcry_cipher_spec_sm4; > > /* Declarations for the digest specifications. 
*/ > extern gcry_md_spec_t _gcry_digest_spec_crc32; > diff --git a/src/gcrypt.h.in b/src/gcrypt.h.in > index c0132189..9ddef17b 100644 > --- a/src/gcrypt.h.in > +++ b/src/gcrypt.h.in > @@ -946,7 +946,8 @@ enum gcry_cipher_algos > GCRY_CIPHER_SALSA20R12 = 314, > GCRY_CIPHER_GOST28147 = 315, > GCRY_CIPHER_CHACHA20 = 316, > - GCRY_CIPHER_GOST28147_MESH = 317 /* GOST 28247 with optional CryptoPro keymeshing */ > + GCRY_CIPHER_GOST28147_MESH = 317, /* GOST 28247 with optional CryptoPro keymeshing */ > + GCRY_CIPHER_SM4 = 318 > }; > > /* The Rijndael algorithm is basically AES, so provide some macros. */ > diff --git a/tests/basic.c b/tests/basic.c > index 2dee1bee..6f2945a5 100644 > --- a/tests/basic.c > +++ b/tests/basic.c > @@ -9444,6 +9444,9 @@ check_ciphers (void) > #if USE_GOST28147 > GCRY_CIPHER_GOST28147, > GCRY_CIPHER_GOST28147_MESH, > +#endif > +#if USE_SM4 > + GCRY_CIPHER_SM4, > #endif > 0 > }; > It would be nice to have some extra test-vectors in basic.c for common cipher modes with SM4. There's few such vectors at following Internet-Draft that could be used: https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10#appendix-A.2 -Jussi From mandar.apte409 at gmail.com Wed Jun 10 18:38:09 2020 From: mandar.apte409 at gmail.com (Mandar Apte) Date: Wed, 10 Jun 2020 22:08:09 +0530 Subject: Decrypt using BcryptDecrypt In-Reply-To: References: <87k10l6hgh.fsf@wheatstone.g10code.de> Message-ID: Hello Team, Are there any APIs in Libgcrypt using which I can get padded data along with my plain text data which I can encrypt using gcry_cipher_encrypt? Thanks in advance. Best Regards, Mandar On Fri, 5 Jun 2020, 7:16 pm Mandar Apte, wrote: > Hello Werner, > > Thank you very much for the response. > > The way you have shown in the email chain below, I had done same thing in > my code as well. Also, I am passing the data of block length size only to > gcry_cipher_encrypt and gcry_cipher_decrypt APIs. > Now, my goal is to check, if the AES256 encryption/decryption is same for > libgcrypt and Bcrypt library. Thats the reason I am trying to decrypt the > data, which was encrypted using Libgcrypt APIs, using Bcrypt APIs on > windows. > > I am pretty sure if I use windows version of Libgcrypt my problem wont be > there at all. > > I think I myself have to handle the padding while encrypting using > Libgcrypt library APIs. > > Since, I have to handle padding in my code, is there any APIs in libgcrypt > with which I ensure that I am padding the data in standard way? > Are there any APIs in Libgcrypt using which I can get padded data along > with my plain text data which I can encrypt using gcry_cipher_encrypt? > > > Thank you in advance. > Best Regards, > Mandar > > > > On Fri, 5 Jun 2020, 2:05 pm Werner Koch, wrote: > >> On Tue, 2 Jun 2020 16:57, Mandar Apte said: >> > On windows I am using Bcrypt library which also supports AES 256 in CBC >> > mode. >> >> FWIW, Libgcrypt runs very well on Windows. >> >> > Hence, I wanted to check, if the Libgcrypt APIs are doing padding >> > internally since I am not passing any such instruction to the Libgcrypt >> > library explicitly? >> >> No, Libgcrypt does not do any padding and it expects complete blocks. >> gcry_cipher_get_algo_blklen() tells you the block length of the cipher >> algorithm. >> >> There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but >> in this case you need to pass the entire plaintext/ciphertext to the >> encrypt/decrypt function; there is no way to do this incremental. 
>> >> For the standard padding as used in CMS (S/MIME), you need to handle the >> padding in your code; here is a snippet >> >> if (last_block_is_incomplete) >> { >> int i, >> int npad = blklen - (buflen % blklen); >> >> p = buffer; >> for (n=buflen, i=0; n < bufsize && i < npad; n++, i++) >> p[n] = npad; >> gcry_cipher_encrypt (chd, buffer, n, buffer, n); >> } >> >> >> >> Shalom-Salam, >> >> Werner >> >> >> -- >> Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jussi.kivilinna at iki.fi Sat Jun 13 23:34:36 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 14 Jun 2020 00:34:36 +0300 Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm In-Reply-To: <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi> References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com> <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com> <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi> Message-ID: On 9.6.2020 20.59, Jussi Kivilinna wrote: > Hello, > > Patch looks mostly good. I have add few comments below. > > On 8.6.2020 14.00, Tianjia Zhang via Gcrypt-devel wrote: >> * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c. >> * cipher/cipher.c (cipher_list, cipher_list_algo301): >> Add _gcry_cipher_spec_sm4. >> * cipher/sm4.c: New. >> * configure.ac (available_ciphers): Add sm4. >> * src/cipher.h: Add declarations for SM4. >> * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4. >> * tests/basic.c (check_ciphers): Add sm4 check. > > Please also add GCRY_MAC_CMAC_SM4 support in mac.c/mac-cmac.c. > Oh, and please add also SM4 to documentation, doc/gcrypt.texi. -Jussi From jussi.kivilinna at iki.fi Sat Jun 13 23:54:56 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sun, 14 Jun 2020 00:54:56 +0300 Subject: [PATCH] doc: add GCRY_MD_SM3, GCRY_MAC_HMAC_SM3 and GCRY_MAC_GOST28147_IMIT Message-ID: <20200613215456.608907-1-jussi.kivilinna@iki.fi> * doc/gcrypt.texi: add GCRY_MD_SM3, GCRY_MAC_HMAC_SM3 and GCRY_MAC_GOST28147_IMIT. -- Signed-off-by: Jussi Kivilinna --- doc/gcrypt.texi | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi index ad5ada87..4eaf6d8d 100644 --- a/doc/gcrypt.texi +++ b/doc/gcrypt.texi @@ -3157,6 +3157,7 @@ are also supported. @cindex MD2, MD4, MD5 @cindex TIGER, TIGER1, TIGER2 @cindex HAVAL + at cindex SM3 @cindex Whirlpool @cindex BLAKE2b-512, BLAKE2b-384, BLAKE2b-256, BLAKE2b-160 @cindex BLAKE2s-256, BLAKE2s-224, BLAKE2s-160, BLAKE2s-128 @@ -3324,6 +3325,9 @@ See RFC 7693 for the specification. This is the BLAKE2s-128 algorithm which yields a message digest of 16 bytes. See RFC 7693 for the specification. + at item GCRY_MD_SM3 +This is the SM3 algorithm which yields a message digest of 32 bytes. + @end table @c end table of hash algorithms @@ -3703,6 +3707,7 @@ provided by Libgcrypt. @cindex HMAC-RIPE-MD-160 @cindex HMAC-MD2, HMAC-MD4, HMAC-MD5 @cindex HMAC-TIGER1 + at cindex HMAC-SM3 @cindex HMAC-Whirlpool @cindex HMAC-Stribog-256, HMAC-Stribog-512 @cindex HMAC-GOSTR-3411-94 @@ -3816,6 +3821,10 @@ algorithm. This is HMAC message authentication algorithm based on the BLAKE2s-128 hash algorithm. + at item GCRY_MAC_HMAC_SM3 +This is HMAC message authentication algorithm based on the SM3 hash +algorithm. + @item GCRY_MAC_CMAC_AES This is CMAC (Cipher-based MAC) message authentication algorithm based on the AES block cipher algorithm. @@ -3904,6 +3913,9 @@ key and one-time nonce. 
This is Poly1305-SEED message authentication algorithm, used with key and one-time nonce. + at item GCRY_MAC_GOST28147_IMIT +This is MAC construction defined in GOST 28147-89 (see RFC 5830 Section 8). + @end table @c end table of MAC algorithms -- 2.25.1 From mandar.apte409 at gmail.com Mon Jun 15 09:02:23 2020 From: mandar.apte409 at gmail.com (Mandar Apte) Date: Mon, 15 Jun 2020 12:32:23 +0530 Subject: Decrypt using BcryptDecrypt In-Reply-To: References: <87k10l6hgh.fsf@wheatstone.g10code.de> Message-ID: Any help regarding request in below email ? On Wed, 10 Jun 2020, 10:08 pm Mandar Apte, wrote: > Hello Team, > > Are there any APIs in Libgcrypt using which I can get padded > data along with my plain text data which I can encrypt using > gcry_cipher_encrypt? > > > Thanks in advance. > Best Regards, > Mandar > > On Fri, 5 Jun 2020, 7:16 pm Mandar Apte, wrote: > >> Hello Werner, >> >> Thank you very much for the response. >> >> The way you have shown in the email chain below, I had done same thing in >> my code as well. Also, I am passing the data of block length size only to >> gcry_cipher_encrypt and gcry_cipher_decrypt APIs. >> Now, my goal is to check, if the AES256 encryption/decryption is same for >> libgcrypt and Bcrypt library. Thats the reason I am trying to decrypt the >> data, which was encrypted using Libgcrypt APIs, using Bcrypt APIs on >> windows. >> >> I am pretty sure if I use windows version of Libgcrypt my problem wont be >> there at all. >> >> I think I myself have to handle the padding while encrypting using >> Libgcrypt library APIs. >> >> Since, I have to handle padding in my code, is there any APIs in >> libgcrypt with which I ensure that I am padding the data in standard way? >> > > > Are there any APIs in Libgcrypt using which I can get padded data along >> with my plain text data which I can encrypt using gcry_cipher_encrypt? >> >> >> Thank you in advance. >> Best Regards, >> Mandar >> >> >> >> On Fri, 5 Jun 2020, 2:05 pm Werner Koch, wrote: >> >>> On Tue, 2 Jun 2020 16:57, Mandar Apte said: >>> > On windows I am using Bcrypt library which also supports AES 256 in CBC >>> > mode. >>> >>> FWIW, Libgcrypt runs very well on Windows. >>> >>> > Hence, I wanted to check, if the Libgcrypt APIs are doing padding >>> > internally since I am not passing any such instruction to the Libgcrypt >>> > library explicitly? >>> >>> No, Libgcrypt does not do any padding and it expects complete blocks. >>> gcry_cipher_get_algo_blklen() tells you the block length of the cipher >>> algorithm. >>> >>> There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but >>> in this case you need to pass the entire plaintext/ciphertext to the >>> encrypt/decrypt function; there is no way to do this incremental. >>> >>> For the standard padding as used in CMS (S/MIME), you need to handle the >>> padding in your code; here is a snippet >>> >>> if (last_block_is_incomplete) >>> { >>> int i, >>> int npad = blklen - (buflen % blklen); >>> >>> p = buffer; >>> for (n=buflen, i=0; n < bufsize && i < npad; n++, i++) >>> p[n] = npad; >>> gcry_cipher_encrypt (chd, buffer, n, buffer, n); >>> } >>> >>> >>> >>> Shalom-Salam, >>> >>> Werner >>> >>> >>> -- >>> Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. >>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tianjia.zhang at linux.alibaba.com Tue Jun 16 11:09:27 2020 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 16 Jun 2020 17:09:27 +0800 Subject: [PATCH v2 0/2] Add SM4 symmetric cipher algorithm Message-ID: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> SM4 (GBT.32907-2016) is a cryptographic standard issued by the Organization of State Commercial Administration of China (OSCCA) as an authorized cryptographic algorithm for use within China. SMS4 was originally created for use in protecting wireless networks, and is mandated in the Chinese National Standard for Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure) (GB.15629.11-2003). Tianjia Zhang (2): Add SM4 symmetric cipher algorithm tests: Add basic test-vectors for SM4 cipher/Makefile.am | 1 + cipher/cipher.c | 8 ++ cipher/mac-cmac.c | 6 + cipher/mac-internal.h | 3 + cipher/mac.c | 10 +- cipher/sm4.c | 275 ++++++++++++++++++++++++++++++++++++++++++ configure.ac | 7 ++ doc/gcrypt.texi | 6 + src/cipher.h | 1 + src/gcrypt.h.in | 4 +- tests/basic.c | 99 +++++++++++++++ 11 files changed, 418 insertions(+), 2 deletions(-) create mode 100644 cipher/sm4.c -- 2.17.1 From tianjia.zhang at linux.alibaba.com Tue Jun 16 11:09:28 2020 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 16 Jun 2020 17:09:28 +0800 Subject: [PATCH v2 1/2] Add SM4 symmetric cipher algorithm In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20200616090929.102931-2-tianjia.zhang@linux.alibaba.com> * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c. * cipher/cipher.c (cipher_list, cipher_list_algo301): Add _gcry_cipher_spec_sm4. * cipher/mac-cmac.c: Add cmac SM4. * cipher/mac-internal.h: Declare spec_cmac_sm4. * cipher/mac.c (mac_list, mac_list_algo201): Add cmac SM4. * cipher/sm4.c: New. * configure.ac (available_ciphers): Add sm4. * doc/gcrypt.texi: Add SM4 documentation. * src/cipher.h: Add declarations for SM4 and cmac SM4. * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.
Signed-off-by: Tianjia Zhang --- cipher/Makefile.am | 1 + cipher/cipher.c | 8 ++ cipher/mac-cmac.c | 6 + cipher/mac-internal.h | 3 + cipher/mac.c | 10 +- cipher/sm4.c | 275 ++++++++++++++++++++++++++++++++++++++++++ configure.ac | 7 ++ doc/gcrypt.texi | 6 + src/cipher.h | 1 + src/gcrypt.h.in | 4 +- 10 files changed, 319 insertions(+), 2 deletions(-) create mode 100644 cipher/sm4.c diff --git a/cipher/Makefile.am b/cipher/Makefile.am index ef83cc74..56661dcd 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -107,6 +107,7 @@ EXTRA_libcipher_la_SOURCES = \ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S \ + sm4.c \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/cipher.c b/cipher/cipher.c index edcb421a..dfb083a0 100644 --- a/cipher/cipher.c +++ b/cipher/cipher.c @@ -87,6 +87,9 @@ static gcry_cipher_spec_t * const cipher_list[] = #endif #if USE_CHACHA20 &_gcry_cipher_spec_chacha20, +#endif +#if USE_SM4 + &_gcry_cipher_spec_sm4, #endif NULL }; @@ -202,6 +205,11 @@ static gcry_cipher_spec_t * const cipher_list_algo301[] = &_gcry_cipher_spec_gost28147_mesh, #else NULL, +#endif +#if USE_SM4 + &_gcry_cipher_spec_sm4, +#else + NULL, #endif }; diff --git a/cipher/mac-cmac.c b/cipher/mac-cmac.c index aee5bb63..3fb5b373 100644 --- a/cipher/mac-cmac.c +++ b/cipher/mac-cmac.c @@ -225,3 +225,9 @@ gcry_mac_spec_t _gcry_mac_type_spec_cmac_gost28147 = { &cmac_ops }; #endif +#if USE_SM4 +gcry_mac_spec_t _gcry_mac_type_spec_cmac_sm4 = { + GCRY_MAC_CMAC_SM4, {0, 0}, "CMAC_SM4", + &cmac_ops +}; +#endif diff --git a/cipher/mac-internal.h b/cipher/mac-internal.h index 1936150c..8c13520b 100644 --- a/cipher/mac-internal.h +++ b/cipher/mac-internal.h @@ -229,6 +229,9 @@ extern gcry_mac_spec_t _gcry_mac_type_spec_cmac_gost28147; #if USE_GOST28147 extern gcry_mac_spec_t _gcry_mac_type_spec_gost28147_imit; #endif +#if USE_SM4 +extern gcry_mac_spec_t _gcry_mac_type_spec_cmac_sm4; +#endif /* * The GMAC algorithm specifications (mac-gmac.c). diff --git a/cipher/mac.c b/cipher/mac.c index 0abc0d33..933be74c 100644 --- a/cipher/mac.c +++ b/cipher/mac.c @@ -130,6 +130,9 @@ static gcry_mac_spec_t * const mac_list[] = { &_gcry_mac_type_spec_gost28147_imit, #endif &_gcry_mac_type_spec_poly1305mac, +#if USE_SM4 + &_gcry_mac_type_spec_cmac_sm4, +#endif NULL, }; @@ -300,7 +303,12 @@ static gcry_mac_spec_t * const mac_list_algo201[] = NULL, #endif #if USE_GOST28147 - &_gcry_mac_type_spec_cmac_gost28147 + &_gcry_mac_type_spec_cmac_gost28147, +#else + NULL, +#endif +#if USE_SM4 + &_gcry_mac_type_spec_cmac_sm4 #else NULL #endif diff --git a/cipher/sm4.c b/cipher/sm4.c new file mode 100644 index 00000000..061ee26e --- /dev/null +++ b/cipher/sm4.c @@ -0,0 +1,275 @@ +/* sm4.c - SM4 Cipher Algorithm + * Copyright (C) 2020 Alibaba Group. + * Copyright (C) 2020 Tianjia Zhang + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +#include +#include +#include + +#include "types.h" /* for byte and u32 typedefs */ +#include "bithelp.h" +#include "g10lib.h" +#include "cipher.h" +#include "bufhelp.h" + +typedef struct +{ + u32 rkey_enc[32]; + u32 rkey_dec[32]; +} SM4_context; + +static const u32 fk[4] = { + 0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc +}; + +static const byte sbox[256] = { + 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7, + 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05, + 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3, + 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99, + 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a, + 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62, + 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95, + 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6, + 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba, + 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8, + 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b, + 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35, + 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2, + 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87, + 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52, + 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e, + 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5, + 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1, + 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55, + 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3, + 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60, + 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f, + 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f, + 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51, + 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f, + 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8, + 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd, + 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0, + 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e, + 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84, + 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20, + 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 +}; + +static const u32 ck[] = { + 0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269, + 0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9, + 0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249, + 0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9, + 0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229, + 0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299, + 0xa0a7aeb5, 0xbcc3cad1, 0xd8dfe6ed, 0xf4fb0209, + 0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279 +}; + +static u32 sm4_t_non_lin_sub(u32 x) +{ + int i; + byte *b = (byte *)&x; + + for (i = 0; i < 4; ++i) + b[i] = sbox[b[i]]; + + return x; +} + +static u32 sm4_key_lin_sub(u32 x) +{ + return x ^ rol(x, 13) ^ rol(x, 23); +} + +static u32 sm4_enc_lin_sub(u32 x) +{ + return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24); +} + +static u32 sm4_key_sub(u32 x) +{ + return sm4_key_lin_sub(sm4_t_non_lin_sub(x)); +} + +static u32 sm4_enc_sub(u32 x) +{ + return sm4_enc_lin_sub(sm4_t_non_lin_sub(x)); +} + +static u32 sm4_round(const u32 *x, const u32 rk) +{ + return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk); +} + +static gcry_err_code_t +sm4_expand_key (SM4_context *ctx, const byte *key, const unsigned keylen) +{ + u32 rk[4], t; + int i; + + if (keylen != 16) + return GPG_ERR_INV_KEYLEN; + + for (i = 0; i < 4; ++i) + rk[i] = buf_get_be32(&key[i*4]) ^ fk[i]; + + for (i = 0; i < 32; ++i) + { + t = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i]); + ctx->rkey_enc[i] = t; + rk[0] = rk[1]; + rk[1] = rk[2]; + rk[2] = 
rk[3]; + rk[3] = t; + } + + for (i = 0; i < 32; ++i) + ctx->rkey_dec[i] = ctx->rkey_enc[31 - i]; + + return 0; +} + +static gcry_err_code_t +sm4_setkey (void *context, const byte *key, const unsigned keylen, + gcry_cipher_hd_t hd) +{ + SM4_context *ctx = context; + int rc = sm4_expand_key (ctx, key, keylen); + (void)hd; + _gcry_burn_stack (4*5 + sizeof(int)*2); + return rc; +} + +static void +sm4_do_crypt (const u32 *rk, byte *out, const byte *in) +{ + u32 x[4], t; + int i; + + for (i = 0; i < 4; ++i) + x[i] = buf_get_be32(&in[i*4]); + + for (i = 0; i < 32; ++i) + { + t = sm4_round(x, rk[i]); + x[0] = x[1]; + x[1] = x[2]; + x[2] = x[3]; + x[3] = t; + } + + for (i = 0; i < 4; ++i) + buf_put_be32(&out[i*4], x[3 - i]); +} + +static unsigned int +sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) +{ + SM4_context *ctx = context; + + sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf); + return /*burn_stack*/ 4*6+sizeof(void*)*4; +} + +static unsigned int +sm4_decrypt (void *context, byte *outbuf, const byte *inbuf) +{ + SM4_context *ctx = context; + + sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf); + return /*burn_stack*/ 4*6+sizeof(void*)*4; +} + +static const char * +sm4_selftest (void) +{ + SM4_context ctx; + byte scratch[16]; + + static const byte plaintext[16] = { + 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, + 0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10, + }; + static const byte key[16] = { + 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, + 0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10, + }; + static const byte ciphertext[16] = { + 0x68, 0x1E, 0xDF, 0x34, 0xD2, 0x06, 0x96, 0x5E, + 0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46 + }; + + sm4_setkey (&ctx, key, sizeof (key), NULL); + sm4_encrypt (&ctx, scratch, plaintext); + if (memcmp (scratch, ciphertext, sizeof (ciphertext))) + return "SM4 test encryption failed."; + sm4_decrypt (&ctx, scratch, scratch); + if (memcmp (scratch, plaintext, sizeof (plaintext))) + return "SM4 test decryption failed."; + + return NULL; +} + +static gpg_err_code_t +run_selftests (int algo, int extended, selftest_report_func_t report) +{ + const char *what; + const char *errtxt; + + (void)extended; + + if (algo != GCRY_CIPHER_SM4) + return GPG_ERR_CIPHER_ALGO; + + what = "selftest"; + errtxt = sm4_selftest (); + if (errtxt) + goto failed; + + return 0; + + failed: + if (report) + report ("cipher", GCRY_CIPHER_SM4, what, errtxt); + return GPG_ERR_SELFTEST_FAILED; +} + + +static gcry_cipher_oid_spec_t sm4_oids[] = + { + { "1.2.156.10197.1.104.1", GCRY_CIPHER_MODE_ECB }, + { "1.2.156.10197.1.104.2", GCRY_CIPHER_MODE_CBC }, + { "1.2.156.10197.1.104.3", GCRY_CIPHER_MODE_OFB }, + { "1.2.156.10197.1.104.4", GCRY_CIPHER_MODE_CFB }, + { "1.2.156.10197.1.104.7", GCRY_CIPHER_MODE_CTR }, + { NULL } + }; + +gcry_cipher_spec_t _gcry_cipher_spec_sm4 = + { + GCRY_CIPHER_SM4, {0, 0}, + "SM4", NULL, sm4_oids, 16, 128, + sizeof (SM4_context), + sm4_setkey, sm4_encrypt, sm4_decrypt, + NULL, NULL, + run_selftests + }; diff --git a/configure.ac b/configure.ac index 0c9100bf..f77476e0 100644 --- a/configure.ac +++ b/configure.ac @@ -212,6 +212,7 @@ LIBGCRYPT_CONFIG_HOST="$host" # Definitions for symmetric ciphers. available_ciphers="arcfour blowfish cast5 des aes twofish serpent rfc2268 seed" available_ciphers="$available_ciphers camellia idea salsa20 gost28147 chacha20" +available_ciphers="$available_ciphers sm4" enabled_ciphers="" # Definitions for public-key ciphers. 
@@ -2559,6 +2560,12 @@ if test "$found" = "1" ; then fi fi +LIST_MEMBER(sm4, $enabled_ciphers) +if test "$found" = "1" ; then + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo" + AC_DEFINE(USE_SM4, 1, [Defined if this module should be included]) +fi + LIST_MEMBER(dsa, $enabled_pubkey_ciphers) if test "$found" = "1" ; then GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dsa.lo" diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi index ad5ada87..9d1db3d4 100644 --- a/doc/gcrypt.texi +++ b/doc/gcrypt.texi @@ -1641,6 +1641,12 @@ if it has to be used for the selected parameter set. @cindex ChaCha20 This is the ChaCha20 stream cipher. + at item GCRY_CIPHER_SM4 + at cindex SM4 (cipher) +A 128 bit cipher by the State Cryptography Administration +of China (SCA). See + at uref{https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10}. + @end table @node Available cipher modes diff --git a/src/cipher.h b/src/cipher.h index 20ccb8c5..c49bbda5 100644 --- a/src/cipher.h +++ b/src/cipher.h @@ -302,6 +302,7 @@ extern gcry_cipher_spec_t _gcry_cipher_spec_salsa20r12; extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147; extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147_mesh; extern gcry_cipher_spec_t _gcry_cipher_spec_chacha20; +extern gcry_cipher_spec_t _gcry_cipher_spec_sm4; /* Declarations for the digest specifications. */ extern gcry_md_spec_t _gcry_digest_spec_crc32; diff --git a/src/gcrypt.h.in b/src/gcrypt.h.in index c0132189..5668e625 100644 --- a/src/gcrypt.h.in +++ b/src/gcrypt.h.in @@ -946,7 +946,8 @@ enum gcry_cipher_algos GCRY_CIPHER_SALSA20R12 = 314, GCRY_CIPHER_GOST28147 = 315, GCRY_CIPHER_CHACHA20 = 316, - GCRY_CIPHER_GOST28147_MESH = 317 /* GOST 28247 with optional CryptoPro keymeshing */ + GCRY_CIPHER_GOST28147_MESH = 317, /* GOST 28247 with optional CryptoPro keymeshing */ + GCRY_CIPHER_SM4 = 318 }; /* The Rijndael algorithm is basically AES, so provide some macros. */ @@ -1484,6 +1485,7 @@ enum gcry_mac_algos GCRY_MAC_CMAC_RFC2268 = 209, GCRY_MAC_CMAC_IDEA = 210, GCRY_MAC_CMAC_GOST28147 = 211, + GCRY_MAC_CMAC_SM4 = 212, GCRY_MAC_GMAC_AES = 401, GCRY_MAC_GMAC_CAMELLIA = 402, -- 2.17.1 From tianjia.zhang at linux.alibaba.com Tue Jun 16 11:09:29 2020 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 16 Jun 2020 17:09:29 +0800 Subject: [PATCH v2 2/2] tests: Add basic test-vectors for SM4 In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20200616090929.102931-3-tianjia.zhang@linux.alibaba.com> The added test vectors are from: https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10#appendix-A.2 * tests/basic.c (check_ciphers): Add SM4 check and test-vectors. 
Signed-off-by: Tianjia Zhang --- tests/basic.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 99 insertions(+) diff --git a/tests/basic.c b/tests/basic.c index 2dee1bee..5acbab84 100644 --- a/tests/basic.c +++ b/tests/basic.c @@ -845,6 +845,30 @@ check_ecb_cipher (void) { } } }, + { GCRY_CIPHER_SM4, + "\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10", + 0, + { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd" + "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb", + 16, + 32, + "\x5e\xc8\x14\x3d\xe5\x09\xcf\xf7\xb5\x17\x9f\x8f\x47\x4b\x86\x19" + "\x2f\x1d\x30\x5a\x7f\xb1\x7d\xf9\x85\xf8\x1c\x84\x82\x19\x23\x04" }, + { } + } + }, + { GCRY_CIPHER_SM4, + "\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef", + 0, + { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd" + "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb", + 16, + 32, + "\xc5\x87\x68\x97\xe4\xa5\x9b\xbb\xa7\x2a\x10\xc8\x38\x72\x24\x5b" + "\x12\xdd\x90\xbc\x2d\x20\x06\x92\xb5\x29\xa4\x15\x5a\xc9\xe6\x00" }, + { } + } + }, }; gcry_cipher_hd_t hde, hdd; unsigned char out[MAX_DATA_LEN]; @@ -2059,6 +2083,38 @@ check_ctr_cipher (void) } }, #endif /*USE_CAST5*/ + { GCRY_CIPHER_SM4, + "\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10", + "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + { { "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb" + "\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd\xdd\xdd\xdd\xdd" + "\xee\xee\xee\xee\xee\xee\xee\xee\xff\xff\xff\xff\xff\xff\xff\xff" + "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb", + 64, + "\xac\x32\x36\xcb\x97\x0c\xc2\x07\x91\x36\x4c\x39\x5a\x13\x42\xd1" + "\xa3\xcb\xc1\x87\x8c\x6f\x30\xcd\x07\x4c\xce\x38\x5c\xdd\x70\xc7" + "\xf2\x34\xbc\x0e\x24\xc1\x19\x80\xfd\x12\x86\x31\x0c\xe3\x7b\x92" + "\x6e\x02\xfc\xd0\xfa\xa0\xba\xf3\x8b\x29\x33\x85\x1d\x82\x45\x14" }, + + { "", 0, "" } + } + }, + { GCRY_CIPHER_SM4, + "\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef", + "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + { { "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb" + "\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd\xdd\xdd\xdd\xdd" + "\xee\xee\xee\xee\xee\xee\xee\xee\xff\xff\xff\xff\xff\xff\xff\xff" + "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb", + 64, + "\x5d\xcc\xcd\x25\xb9\x5a\xb0\x74\x17\xa0\x85\x12\xee\x16\x0e\x2f" + "\x8f\x66\x15\x21\xcb\xba\xb4\x4c\xc8\x71\x38\x44\x5b\xc2\x9e\x5c" + "\x0a\xe0\x29\x72\x05\xd6\x27\x04\x17\x3b\x21\x23\x9b\x88\x7f\x6c" + "\x8c\xb5\xb8\x00\x91\x7a\x24\x88\x28\x4b\xde\x9e\x16\xea\x29\x06" }, + + { "", 0, "" } + } + }, { 0, "", "", @@ -2559,6 +2615,26 @@ check_cfb_cipher (void) "1.2.643.2.2.31.2" }, #endif + { GCRY_CIPHER_SM4, 0, + "\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10", + "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd" + "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb", + 32, + "\xac\x32\x36\xcb\x86\x1d\xd3\x16\xe6\x41\x3b\x4e\x3c\x75\x24\xb7" + "\x69\xd4\xc5\x4e\xd4\x33\xb9\xa0\x34\x60\x09\xbe\xb3\x7b\x2b\x3f" }, + } + }, + { GCRY_CIPHER_SM4, 0, + "\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef", + "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd" + 
"\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb", + 32, + "\x5d\xcc\xcd\x25\xa8\x4b\xa1\x65\x60\xd7\xf2\x65\x88\x70\x68\x49" + "\x0d\x9b\x86\xff\x20\xc3\xbf\xe1\x15\xff\xa0\x2c\xa6\x19\x2c\xc5" }, + } + }, }; gcry_cipher_hd_t hde, hdd; unsigned char out[MAX_DATA_LEN]; @@ -2753,6 +2829,26 @@ check_ofb_cipher (void) 16, "\x01\x26\x14\x1d\x67\xf3\x7b\xe8\x53\x8f\x5a\x8b\xe7\x40\xe4\x84" } } + }, + { GCRY_CIPHER_SM4, + "\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10", + "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd" + "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb", + 32, + "\xac\x32\x36\xcb\x86\x1d\xd3\x16\xe6\x41\x3b\x4e\x3c\x75\x24\xb7" + "\x1d\x01\xac\xa2\x48\x7c\xa5\x82\xcb\xf5\x46\x3e\x66\x98\x53\x9b" }, + } + }, + { GCRY_CIPHER_SM4, + "\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef", + "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f", + { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd" + "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb", + 32, + "\x5d\xcc\xcd\x25\xa8\x4b\xa1\x65\x60\xd7\xf2\x65\x88\x70\x68\x49" + "\x33\xfa\x16\xbd\x5c\xd9\xc8\x56\xca\xca\xa1\xe1\x01\x89\x7a\x97" }, + } } }; gcry_cipher_hd_t hde, hdd; @@ -9444,6 +9540,9 @@ check_ciphers (void) #if USE_GOST28147 GCRY_CIPHER_GOST28147, GCRY_CIPHER_GOST28147_MESH, +#endif +#if USE_SM4 + GCRY_CIPHER_SM4, #endif 0 }; -- 2.17.1 From tianjia.zhang at linux.alibaba.com Tue Jun 16 11:19:51 2020 From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang) Date: Tue, 16 Jun 2020 17:19:51 +0800 Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm In-Reply-To: References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com> <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com> <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi> Message-ID: <817a569d-d15a-ca45-3eae-ef0e86d1adeb@linux.alibaba.com> On 2020/6/14 5:34, Jussi Kivilinna wrote: > On 9.6.2020 20.59, Jussi Kivilinna wrote: >> Hello, >> >> Patch looks mostly good. I have add few comments below. >> >> On 8.6.2020 14.00, Tianjia Zhang via Gcrypt-devel wrote: >>> * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c. >>> * cipher/cipher.c (cipher_list, cipher_list_algo301): >>> Add _gcry_cipher_spec_sm4. >>> * cipher/sm4.c: New. >>> * configure.ac (available_ciphers): Add sm4. >>> * src/cipher.h: Add declarations for SM4. >>> * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4. >>> * tests/basic.c (check_ciphers): Add sm4 check. >> >> Please also add GCRY_MAC_CMAC_SM4 support in mac.c/mac-cmac.c. >> > > Oh, and please add also SM4 to documentation, doc/gcrypt.texi. > > -Jussi > Thanks for your suggestion, I have submitted v2 patch. Thanks and best, Tianjia From jussi.kivilinna at iki.fi Tue Jun 16 21:28:22 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 16 Jun 2020 22:28:22 +0300 Subject: Optimization for SM4 and x86-64/AES-NI implementations In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> Message-ID: <20200616192825.1584395-1-jussi.kivilinna@iki.fi> This patch-set adds optimizations for C implementation of SM4 cipher and AES-NI accelerated AVX and AVX2 assembly implementations. Performance improvement for whole patch-set is presented below. Intermediate results are listed in each patch separately. 
As summary, on x86-64, generic C implementation is ~2 to ~4 times faster than original C implementation. AES-NI implementations speed-up parallelizable cipher modes and there AES-NI/AVX is ~11 times faster and AES-NI/AVX2 ~18 times faster original C implementation. Benchmark on AMD Ryzen 7 3700X: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 17.69 ns/B 53.92 MiB/s 76.50 c/B 4326 ECB dec | 17.74 ns/B 53.77 MiB/s 76.72 c/B 4325 CBC enc | 18.14 ns/B 52.56 MiB/s 78.47 c/B 4325 CBC dec | 18.05 ns/B 52.83 MiB/s 78.09 c/B 4326 CFB enc | 18.19 ns/B 52.44 MiB/s 78.67 c/B 4326 CFB dec | 18.16 ns/B 52.53 MiB/s 78.53 c/B 4326 OFB enc | 16.82 ns/B 56.70 MiB/s 72.96 c/B 4338 OFB dec | 16.87 ns/B 56.53 MiB/s 72.96 c/B 4325 CTR enc | 18.17 ns/B 52.47 MiB/s 78.62 c/B 4326 CTR dec | 18.02 ns/B 52.94 MiB/s 77.92 c/B 4325 XTS enc | 17.70 ns/B 53.87 MiB/s 76.11 c/B 4300 XTS dec | 17.65 ns/B 54.04 MiB/s 76.28 c/B 4323?1 CCM enc | 33.76 ns/B 28.25 MiB/s 146.9 c/B 4350 CCM dec | 34.07 ns/B 27.99 MiB/s 147.4 c/B 4326 CCM auth | 16.97 ns/B 56.19 MiB/s 73.41 c/B 4325 EAX enc | 34.02 ns/B 28.03 MiB/s 147.1 c/B 4325 EAX dec | 36.56 ns/B 26.08 MiB/s 159.1 c/B 4350 EAX auth | 17.02 ns/B 56.03 MiB/s 73.62 c/B 4325 GCM enc | 16.76 ns/B 56.90 MiB/s 72.50 c/B 4325 GCM dec | 18.01 ns/B 52.94 MiB/s 78.37 c/B 4350 GCM auth | 0.120 ns/B 7975 MiB/s 0.517 c/B 4325 OCB enc | 18.19 ns/B 52.43 MiB/s 78.68 c/B 4325 OCB dec | 18.15 ns/B 52.54 MiB/s 78.51 c/B 4325 OCB auth | 16.87 ns/B 56.54 MiB/s 72.95 c/B 4325 After: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 8.32 ns/B 114.6 MiB/s 36.01 c/B 4325 ECB dec | 8.31 ns/B 114.7 MiB/s 35.75 c/B 4300 CBC enc | 8.94 ns/B 106.7 MiB/s 38.67 c/B 4325 CBC dec | 0.984 ns/B 969.2 MiB/s 4.23 c/B 4300 CFB enc | 8.92 ns/B 107.0 MiB/s 38.57 c/B 4325 CFB dec | 0.989 ns/B 964.1 MiB/s 4.23 c/B 4275 OFB enc | 8.45 ns/B 112.8 MiB/s 36.35 c/B 4300 OFB dec | 8.40 ns/B 113.5 MiB/s 36.34 c/B 4325 CTR enc | 1.00 ns/B 952.6 MiB/s 4.31 c/B 4300 CTR dec | 0.999 ns/B 954.9 MiB/s 4.29 c/B 4300 XTS enc | 8.81 ns/B 108.3 MiB/s 38.11 c/B 4326 XTS dec | 8.81 ns/B 108.3 MiB/s 38.09 c/B 4325 CCM enc | 9.93 ns/B 96.07 MiB/s 42.69 c/B 4300 CCM dec | 9.91 ns/B 96.20 MiB/s 42.89 c/B 4326 CCM auth | 8.89 ns/B 107.3 MiB/s 38.45 c/B 4326 EAX enc | 9.91 ns/B 96.27 MiB/s 42.85 c/B 4325 EAX dec | 9.91 ns/B 96.19 MiB/s 42.80 c/B 4317 EAX auth | 8.95 ns/B 106.6 MiB/s 38.71 c/B 4325 GCM enc | 1.11 ns/B 856.8 MiB/s 4.79 c/B 4300 GCM dec | 1.12 ns/B 849.4 MiB/s 4.80 c/B 4275 GCM auth | 0.117 ns/B 8154 MiB/s 0.509 c/B 4350 OCB enc | 0.999 ns/B 954.8 MiB/s 4.29 c/B 4300 OCB dec | 1.00 ns/B 952.1 MiB/s 4.31 c/B 4300 OCB auth | 0.989 ns/B 964.4 MiB/s 4.25 c/B 4300 From jussi.kivilinna at iki.fi Tue Jun 16 21:28:23 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 16 Jun 2020 22:28:23 +0300 Subject: [PATCH 1/3] Optimizations for SM4 cipher In-Reply-To: <20200616192825.1584395-1-jussi.kivilinna@iki.fi> References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> <20200616192825.1584395-1-jussi.kivilinna@iki.fi> Message-ID: <20200616192825.1584395-2-jussi.kivilinna@iki.fi> * cipher/cipher.c (_gcry_cipher_open_internal): Add SM4 bulk functions. * cipher/sm4.c (ATTR_ALIGNED_64): New. (sbox): Convert to ... (sbox_table): ... this structure for sbox hardening as is done for AES and GCM. (prefetch_sbox_table): New. (sm4_t_non_lin_sub): Make inline; Optimize sbox access pattern. (sm4_key_lin_sub): Make inline; Tune slightly. (sm4_key_sub, sm4_enc_sub): Make inline. 
(sm4_round): Make inline; Take 'x' as separate parameters instead of array. (sm4_expand_key): Return void; Drop keylen; Unroll loops by 4; Wipe sensitive variables at end; Move key-length check to 'sm4_setkey'. (sm4_setkey): Add initial self-test step; Add key-length check; Remove burn stack (as variables wiped in 'sm4_expand_key'). (sm4_do_crypt): Return burn stack depth; Unroll loops by 4. (sm4_encrypt, sm4_decrypt): Prefetch sbox table; Return burn stack from 'sm4_do_crypt', as allows tail-call optimization by compiler. (sm4_do_crypt_blks2): New two parallel block function for greater instruction level parallelism. (sm4_crypt_blocks, _gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec) (_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New bulk processing functions. (selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): New bulk processing self-tests. (sm4_selftest): Clear SM4 context before use; Use 'sm4_expand_key' instead of 'sm4_setkey'; Call bulk processing self-tests. * src/cipher.h (_gcry_sm4_ctr_enc, _gcry_sm4_ctr_dec) (_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New. * tests/basic.c (check_ocb_cipher): Add SM4-OCB test vector. -- Benchmark on AMD Ryzen 7 3700X (x86-64): Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 17.69 ns/B 53.92 MiB/s 76.50 c/B 4326 ECB dec | 17.74 ns/B 53.77 MiB/s 76.72 c/B 4325 CBC enc | 18.14 ns/B 52.56 MiB/s 78.47 c/B 4325 CBC dec | 18.05 ns/B 52.83 MiB/s 78.09 c/B 4326 CFB enc | 18.19 ns/B 52.44 MiB/s 78.67 c/B 4326 CFB dec | 18.16 ns/B 52.53 MiB/s 78.53 c/B 4326 OFB enc | 16.82 ns/B 56.70 MiB/s 72.96 c/B 4338 OFB dec | 16.87 ns/B 56.53 MiB/s 72.96 c/B 4325 CTR enc | 18.17 ns/B 52.47 MiB/s 78.62 c/B 4326 CTR dec | 18.02 ns/B 52.94 MiB/s 77.92 c/B 4325 XTS enc | 17.70 ns/B 53.87 MiB/s 76.11 c/B 4300 XTS dec | 17.65 ns/B 54.04 MiB/s 76.28 c/B 4323?1 CCM enc | 33.76 ns/B 28.25 MiB/s 146.9 c/B 4350 CCM dec | 34.07 ns/B 27.99 MiB/s 147.4 c/B 4326 CCM auth | 16.97 ns/B 56.19 MiB/s 73.41 c/B 4325 EAX enc | 34.02 ns/B 28.03 MiB/s 147.1 c/B 4325 EAX dec | 36.56 ns/B 26.08 MiB/s 159.1 c/B 4350 EAX auth | 17.02 ns/B 56.03 MiB/s 73.62 c/B 4325 GCM enc | 16.76 ns/B 56.90 MiB/s 72.50 c/B 4325 GCM dec | 18.01 ns/B 52.94 MiB/s 78.37 c/B 4350 GCM auth | 0.120 ns/B 7975 MiB/s 0.517 c/B 4325 OCB enc | 18.19 ns/B 52.43 MiB/s 78.68 c/B 4325 OCB dec | 18.15 ns/B 52.54 MiB/s 78.51 c/B 4325 OCB auth | 16.87 ns/B 56.54 MiB/s 72.95 c/B 4325 After (non-parallalizeble modes ~2.0x faster, parallel modes ~3.8x): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 8.28 ns/B 115.1 MiB/s 35.84 c/B 4327?1 ECB dec | 8.33 ns/B 114.4 MiB/s 36.13 c/B 4336?1 CBC enc | 8.94 ns/B 106.7 MiB/s 38.66 c/B 4325 CBC dec | 4.78 ns/B 199.7 MiB/s 20.42 c/B 4275 CFB enc | 8.95 ns/B 106.5 MiB/s 38.72 c/B 4325 CFB dec | 4.81 ns/B 198.2 MiB/s 20.57 c/B 4275 OFB enc | 8.48 ns/B 112.5 MiB/s 36.66 c/B 4325 OFB dec | 8.42 ns/B 113.3 MiB/s 36.41 c/B 4325 CTR enc | 4.81 ns/B 198.2 MiB/s 20.69 c/B 4300 CTR dec | 4.80 ns/B 198.8 MiB/s 20.63 c/B 4300 XTS enc | 8.75 ns/B 109.0 MiB/s 37.83 c/B 4325 XTS dec | 8.86 ns/B 107.7 MiB/s 38.30 c/B 4326 CCM enc | 13.74 ns/B 69.42 MiB/s 59.42 c/B 4325 CCM dec | 13.77 ns/B 69.25 MiB/s 59.57 c/B 4326 CCM auth | 8.87 ns/B 107.5 MiB/s 38.36 c/B 4325 EAX enc | 13.76 ns/B 69.29 MiB/s 59.54 c/B 4326 EAX dec | 13.77 ns/B 69.25 MiB/s 59.57 c/B 4325 EAX auth | 8.89 ns/B 107.3 MiB/s 38.44 c/B 4325 GCM enc | 4.96 ns/B 192.3 MiB/s 21.20 c/B 4275 GCM dec | 4.91 ns/B 194.4 MiB/s 21.10 c/B 4300 GCM auth | 0.116 ns/B 8232 MiB/s 0.504 c/B 
4351 OCB enc | 4.88 ns/B 195.5 MiB/s 20.86 c/B 4275 OCB dec | 4.85 ns/B 196.6 MiB/s 20.86 c/B 4301 OCB auth | 4.80 ns/B 198.9 MiB/s 20.62 c/B 4301 Benchmark on ARM Cortex-A53 (aarch64): Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 84.08 ns/B 11.34 MiB/s 54.48 c/B 648.0 ECB dec | 84.07 ns/B 11.34 MiB/s 54.47 c/B 648.0 CBC enc | 84.90 ns/B 11.23 MiB/s 55.01 c/B 647.9 CBC dec | 84.69 ns/B 11.26 MiB/s 54.87 c/B 648.0 CFB enc | 84.55 ns/B 11.28 MiB/s 54.79 c/B 648.0 CFB dec | 84.55 ns/B 11.28 MiB/s 54.78 c/B 648.0 OFB enc | 84.45 ns/B 11.29 MiB/s 54.72 c/B 647.9 OFB dec | 84.45 ns/B 11.29 MiB/s 54.72 c/B 648.0 CTR enc | 85.42 ns/B 11.16 MiB/s 55.35 c/B 648.0 CTR dec | 85.42 ns/B 11.16 MiB/s 55.35 c/B 648.0 XTS enc | 88.72 ns/B 10.75 MiB/s 57.49 c/B 648.0 XTS dec | 88.71 ns/B 10.75 MiB/s 57.48 c/B 648.0 CCM enc | 170.2 ns/B 5.60 MiB/s 110.3 c/B 647.9 CCM dec | 170.2 ns/B 5.60 MiB/s 110.3 c/B 648.0 CCM auth | 84.27 ns/B 11.32 MiB/s 54.60 c/B 648.0 EAX enc | 170.6 ns/B 5.59 MiB/s 110.5 c/B 648.0 EAX dec | 170.6 ns/B 5.59 MiB/s 110.5 c/B 648.0 EAX auth | 84.51 ns/B 11.29 MiB/s 54.76 c/B 648.0 GCM enc | 86.99 ns/B 10.96 MiB/s 56.36 c/B 648.0 GCM dec | 87.00 ns/B 10.96 MiB/s 56.37 c/B 648.0 GCM auth | 1.56 ns/B 609.9 MiB/s 1.01 c/B 648.0 OCB enc | 86.77 ns/B 10.99 MiB/s 56.22 c/B 648.0 OCB dec | 86.77 ns/B 10.99 MiB/s 56.22 c/B 648.0 OCB auth | 86.20 ns/B 11.06 MiB/s 55.85 c/B 648.0 After (non-parallalizable modes ~30% faster, parallel modes ~80%): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz ECB enc | 64.85 ns/B 14.71 MiB/s 42.02 c/B 648.0 ECB dec | 64.78 ns/B 14.72 MiB/s 41.98 c/B 648.0 CBC enc | 64.53 ns/B 14.78 MiB/s 41.81 c/B 647.9 CBC dec | 45.09 ns/B 21.15 MiB/s 29.21 c/B 648.0 CFB enc | 64.56 ns/B 14.77 MiB/s 41.84 c/B 648.0 CFB dec | 45.52 ns/B 20.95 MiB/s 29.49 c/B 647.9 OFB enc | 64.14 ns/B 14.87 MiB/s 41.56 c/B 648.0 OFB dec | 64.14 ns/B 14.87 MiB/s 41.56 c/B 648.0 CTR enc | 45.54 ns/B 20.94 MiB/s 29.51 c/B 648.0 CTR dec | 45.53 ns/B 20.95 MiB/s 29.50 c/B 648.0 XTS enc | 67.88 ns/B 14.05 MiB/s 43.98 c/B 648.0 XTS dec | 67.69 ns/B 14.09 MiB/s 43.86 c/B 648.0 CCM enc | 110.6 ns/B 8.62 MiB/s 71.66 c/B 648.0 CCM dec | 110.2 ns/B 8.65 MiB/s 71.42 c/B 648.0 CCM auth | 64.87 ns/B 14.70 MiB/s 42.04 c/B 648.0 EAX enc | 109.9 ns/B 8.68 MiB/s 71.22 c/B 648.0 EAX dec | 109.9 ns/B 8.68 MiB/s 71.22 c/B 648.0 EAX auth | 64.37 ns/B 14.81 MiB/s 41.71 c/B 648.0 GCM enc | 47.07 ns/B 20.26 MiB/s 30.51 c/B 648.0 GCM dec | 47.08 ns/B 20.26 MiB/s 30.51 c/B 648.0 GCM auth | 1.55 ns/B 614.7 MiB/s 1.01 c/B 648.0 OCB enc | 48.38 ns/B 19.71 MiB/s 31.35 c/B 648.0 OCB dec | 48.11 ns/B 19.82 MiB/s 31.17 c/B 648.0 OCB auth | 46.71 ns/B 20.42 MiB/s 30.27 c/B 648.0 Signed-off-by: Jussi Kivilinna --- cipher/cipher.c | 9 + cipher/sm4.c | 709 ++++++++++++++++++++++++++++++++++++++++++------ src/cipher.h | 16 ++ tests/basic.c | 2 + 4 files changed, 648 insertions(+), 88 deletions(-) diff --git a/cipher/cipher.c b/cipher/cipher.c index dfb083a0..c77c9682 100644 --- a/cipher/cipher.c +++ b/cipher/cipher.c @@ -707,6 +707,15 @@ _gcry_cipher_open_internal (gcry_cipher_hd_t *handle, h->bulk.ocb_auth = _gcry_serpent_ocb_auth; break; #endif /*USE_SERPENT*/ +#ifdef USE_SM4 + case GCRY_CIPHER_SM4: + h->bulk.cbc_dec = _gcry_sm4_cbc_dec; + h->bulk.cfb_dec = _gcry_sm4_cfb_dec; + h->bulk.ctr_enc = _gcry_sm4_ctr_enc; + h->bulk.ocb_crypt = _gcry_sm4_ocb_crypt; + h->bulk.ocb_auth = _gcry_sm4_ocb_auth; + break; +#endif /*USE_SM4*/ #ifdef USE_TWOFISH case GCRY_CIPHER_TWOFISH: case GCRY_CIPHER_TWOFISH128: 
diff --git a/cipher/sm4.c b/cipher/sm4.c index 061ee26e..621532fa 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -1,6 +1,7 @@ /* sm4.c - SM4 Cipher Algorithm * Copyright (C) 2020 Alibaba Group. * Copyright (C) 2020 Tianjia Zhang + * Copyright (C) 2020 Jussi Kivilinna * * This file is part of Libgcrypt. * @@ -27,6 +28,17 @@ #include "g10lib.h" #include "cipher.h" #include "bufhelp.h" +#include "cipher-internal.h" +#include "cipher-selftest.h" + +/* Helper macro to force alignment to 64 bytes. */ +#ifdef HAVE_GCC_ATTRIBUTE_ALIGNED +# define ATTR_ALIGNED_64 __attribute__ ((aligned (64))) +#else +# define ATTR_ALIGNED_64 +#endif + +static const char *sm4_selftest (void); typedef struct { @@ -34,46 +46,60 @@ typedef struct u32 rkey_dec[32]; } SM4_context; -static const u32 fk[4] = { +static const u32 fk[4] = +{ 0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc }; -static const byte sbox[256] = { - 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7, - 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05, - 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3, - 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99, - 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a, - 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62, - 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95, - 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6, - 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba, - 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8, - 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b, - 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35, - 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2, - 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87, - 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52, - 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e, - 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5, - 0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1, - 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55, - 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3, - 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60, - 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f, - 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f, - 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51, - 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f, - 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8, - 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd, - 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0, - 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e, - 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84, - 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20, - 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 -}; +static struct +{ + volatile u32 counter_head; + u32 cacheline_align[64 / 4 - 1]; + byte S[256]; + volatile u32 counter_tail; +} sbox_table ATTR_ALIGNED_64 = + { + 0, + { 0, }, + { + 0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7, + 0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05, + 0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3, + 0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99, + 0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a, + 0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62, + 0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95, + 0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6, + 0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba, + 0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8, + 0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b, + 0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35, + 0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2, + 0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87, + 0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52, + 0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e, + 0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5, + 0xa3, 0xf7, 
0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1, + 0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55, + 0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3, + 0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60, + 0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f, + 0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f, + 0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51, + 0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f, + 0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8, + 0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd, + 0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0, + 0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e, + 0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84, + 0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20, + 0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48 + }, + 0 + }; -static const u32 ck[] = { +static const u32 ck[] = +{ 0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269, 0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9, 0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249, @@ -84,68 +110,96 @@ static const u32 ck[] = { 0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279 }; -static u32 sm4_t_non_lin_sub(u32 x) +static inline void prefetch_sbox_table(void) { - int i; - byte *b = (byte *)&x; + const volatile byte *vtab = (void *)&sbox_table; + + /* Modify counters to trigger copy-on-write and unsharing if physical pages + * of look-up table are shared between processes. Modifying counters also + * causes checksums for pages to change and hint same-page merging algorithm + * that these pages are frequently changing. */ + sbox_table.counter_head++; + sbox_table.counter_tail++; + + /* Prefetch look-up table to cache. */ + (void)vtab[0 * 32]; + (void)vtab[1 * 32]; + (void)vtab[2 * 32]; + (void)vtab[3 * 32]; + (void)vtab[4 * 32]; + (void)vtab[5 * 32]; + (void)vtab[6 * 32]; + (void)vtab[7 * 32]; + (void)vtab[8 * 32 - 1]; +} - for (i = 0; i < 4; ++i) - b[i] = sbox[b[i]]; +static inline u32 sm4_t_non_lin_sub(u32 x) +{ + u32 out; - return x; + out = (u32)sbox_table.S[(x >> 0) & 0xff] << 0; + out |= (u32)sbox_table.S[(x >> 8) & 0xff] << 8; + out |= (u32)sbox_table.S[(x >> 16) & 0xff] << 16; + out |= (u32)sbox_table.S[(x >> 24) & 0xff] << 24; + + return out; } -static u32 sm4_key_lin_sub(u32 x) +static inline u32 sm4_key_lin_sub(u32 x) { return x ^ rol(x, 13) ^ rol(x, 23); } -static u32 sm4_enc_lin_sub(u32 x) +static inline u32 sm4_enc_lin_sub(u32 x) { - return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24); + u32 xrol2 = rol(x, 2); + return x ^ xrol2 ^ rol(xrol2, 8) ^ rol(xrol2, 16) ^ rol(x, 24); } -static u32 sm4_key_sub(u32 x) +static inline u32 sm4_key_sub(u32 x) { return sm4_key_lin_sub(sm4_t_non_lin_sub(x)); } -static u32 sm4_enc_sub(u32 x) +static inline u32 sm4_enc_sub(u32 x) { return sm4_enc_lin_sub(sm4_t_non_lin_sub(x)); } -static u32 sm4_round(const u32 *x, const u32 rk) +static inline u32 +sm4_round(const u32 x0, const u32 x1, const u32 x2, const u32 x3, const u32 rk) { - return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk); + return x0 ^ sm4_enc_sub(x1 ^ x2 ^ x3 ^ rk); } -static gcry_err_code_t -sm4_expand_key (SM4_context *ctx, const byte *key, const unsigned keylen) +static void +sm4_expand_key (SM4_context *ctx, const byte *key) { - u32 rk[4], t; + u32 rk[4]; int i; - if (keylen != 16) - return GPG_ERR_INV_KEYLEN; + rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0]; + rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1]; + rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2]; + rk[3] = buf_get_be32(key + 4 * 3) ^ fk[3]; - for (i = 0; i < 4; ++i) - rk[i] = buf_get_be32(&key[i*4]) ^ fk[i]; - - for (i = 0; i < 32; ++i) + for (i = 0; i < 32; i += 4) { - t = 
rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i]); - ctx->rkey_enc[i] = t; - rk[0] = rk[1]; - rk[1] = rk[2]; - rk[2] = rk[3]; - rk[3] = t; + rk[0] = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i + 0]); + rk[1] = rk[1] ^ sm4_key_sub(rk[2] ^ rk[3] ^ rk[0] ^ ck[i + 1]); + rk[2] = rk[2] ^ sm4_key_sub(rk[3] ^ rk[0] ^ rk[1] ^ ck[i + 2]); + rk[3] = rk[3] ^ sm4_key_sub(rk[0] ^ rk[1] ^ rk[2] ^ ck[i + 3]); + ctx->rkey_enc[i + 0] = rk[0]; + ctx->rkey_enc[i + 1] = rk[1]; + ctx->rkey_enc[i + 2] = rk[2]; + ctx->rkey_enc[i + 3] = rk[3]; + ctx->rkey_dec[31 - i - 0] = rk[0]; + ctx->rkey_dec[31 - i - 1] = rk[1]; + ctx->rkey_dec[31 - i - 2] = rk[2]; + ctx->rkey_dec[31 - i - 3] = rk[3]; } - for (i = 0; i < 32; ++i) - ctx->rkey_dec[i] = ctx->rkey_enc[31 - i]; - - return 0; + wipememory (rk, sizeof(rk)); } static gcry_err_code_t @@ -153,32 +207,53 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, gcry_cipher_hd_t hd) { SM4_context *ctx = context; - int rc = sm4_expand_key (ctx, key, keylen); + static int init = 0; + static const char *selftest_failed = NULL; + (void)hd; - _gcry_burn_stack (4*5 + sizeof(int)*2); - return rc; + + if (!init) + { + init = 1; + selftest_failed = sm4_selftest(); + if (selftest_failed) + log_error("%s\n", selftest_failed); + } + if (selftest_failed) + return GPG_ERR_SELFTEST_FAILED; + + if (keylen != 16) + return GPG_ERR_INV_KEYLEN; + + sm4_expand_key (ctx, key); + return 0; } -static void +static unsigned int sm4_do_crypt (const u32 *rk, byte *out, const byte *in) { - u32 x[4], t; + u32 x[4]; int i; - for (i = 0; i < 4; ++i) - x[i] = buf_get_be32(&in[i*4]); + x[0] = buf_get_be32(in + 0 * 4); + x[1] = buf_get_be32(in + 1 * 4); + x[2] = buf_get_be32(in + 2 * 4); + x[3] = buf_get_be32(in + 3 * 4); - for (i = 0; i < 32; ++i) + for (i = 0; i < 32; i += 4) { - t = sm4_round(x, rk[i]); - x[0] = x[1]; - x[1] = x[2]; - x[2] = x[3]; - x[3] = t; + x[0] = sm4_round(x[0], x[1], x[2], x[3], rk[i + 0]); + x[1] = sm4_round(x[1], x[2], x[3], x[0], rk[i + 1]); + x[2] = sm4_round(x[2], x[3], x[0], x[1], rk[i + 2]); + x[3] = sm4_round(x[3], x[0], x[1], x[2], rk[i + 3]); } - for (i = 0; i < 4; ++i) - buf_put_be32(&out[i*4], x[3 - i]); + buf_put_be32(out + 0 * 4, x[3 - 0]); + buf_put_be32(out + 1 * 4, x[3 - 1]); + buf_put_be32(out + 2 * 4, x[3 - 2]); + buf_put_be32(out + 3 * 4, x[3 - 3]); + + return /*burn_stack*/ 4*6+sizeof(void*)*4; } static unsigned int @@ -186,8 +261,9 @@ sm4_encrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; - sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf); - return /*burn_stack*/ 4*6+sizeof(void*)*4; + prefetch_sbox_table (); + + return sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf); } static unsigned int @@ -195,8 +271,453 @@ sm4_decrypt (void *context, byte *outbuf, const byte *inbuf) { SM4_context *ctx = context; - sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf); - return /*burn_stack*/ 4*6+sizeof(void*)*4; + prefetch_sbox_table (); + + return sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf); +} + +static unsigned int +sm4_do_crypt_blks2 (const u32 *rk, byte *out, const byte *in) +{ + u32 x[4]; + u32 y[4]; + u32 k; + int i; + + /* Encrypts/Decrypts two blocks for higher instruction level + * parallelism. 
*/ + + x[0] = buf_get_be32(in + 0 * 4); + x[1] = buf_get_be32(in + 1 * 4); + x[2] = buf_get_be32(in + 2 * 4); + x[3] = buf_get_be32(in + 3 * 4); + y[0] = buf_get_be32(in + 4 * 4); + y[1] = buf_get_be32(in + 5 * 4); + y[2] = buf_get_be32(in + 6 * 4); + y[3] = buf_get_be32(in + 7 * 4); + + for (i = 0; i < 32; i += 4) + { + k = rk[i + 0]; + x[0] = sm4_round(x[0], x[1], x[2], x[3], k); + y[0] = sm4_round(y[0], y[1], y[2], y[3], k); + k = rk[i + 1]; + x[1] = sm4_round(x[1], x[2], x[3], x[0], k); + y[1] = sm4_round(y[1], y[2], y[3], y[0], k); + k = rk[i + 2]; + x[2] = sm4_round(x[2], x[3], x[0], x[1], k); + y[2] = sm4_round(y[2], y[3], y[0], y[1], k); + k = rk[i + 3]; + x[3] = sm4_round(x[3], x[0], x[1], x[2], k); + y[3] = sm4_round(y[3], y[0], y[1], y[2], k); + } + + buf_put_be32(out + 0 * 4, x[3 - 0]); + buf_put_be32(out + 1 * 4, x[3 - 1]); + buf_put_be32(out + 2 * 4, x[3 - 2]); + buf_put_be32(out + 3 * 4, x[3 - 3]); + buf_put_be32(out + 4 * 4, y[3 - 0]); + buf_put_be32(out + 5 * 4, y[3 - 1]); + buf_put_be32(out + 6 * 4, y[3 - 2]); + buf_put_be32(out + 7 * 4, y[3 - 3]); + + return /*burn_stack*/ 4*10+sizeof(void*)*4; +} + +static unsigned int +sm4_crypt_blocks (const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) +{ + unsigned int burn_depth = 0; + unsigned int nburn; + + while (num_blks >= 2) + { + nburn = sm4_do_crypt_blks2 (rk, out, in); + burn_depth = nburn > burn_depth ? nburn : burn_depth; + out += 2 * 16; + in += 2 * 16; + num_blks -= 2; + } + + while (num_blks) + { + nburn = sm4_do_crypt (rk, out, in); + burn_depth = nburn > burn_depth ? nburn : burn_depth; + out += 16; + in += 16; + num_blks--; + } + + if (burn_depth) + burn_depth += sizeof(void *) * 5; + return burn_depth; +} + +/* Bulk encryption of complete blocks in CTR mode. This function is only + intended for the bulk encryption feature of cipher.c. CTR is expected to be + of size 16. */ +void +_gcry_sm4_ctr_enc(void *context, unsigned char *ctr, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + SM4_context *ctx = context; + byte *outbuf = outbuf_arg; + const byte *inbuf = inbuf_arg; + int burn_stack_depth = 0; + + /* Process remaining blocks. */ + if (nblocks) + { + unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks); + byte tmpbuf[16 * 8]; + unsigned int tmp_used = 16; + + if (0) + ; + else + { + prefetch_sbox_table (); + crypt_blk1_8 = sm4_crypt_blocks; + } + + /* Process remaining blocks. */ + while (nblocks) + { + size_t curr_blks = nblocks > 8 ? 8 : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + cipher_block_cpy (tmpbuf + 0 * 16, ctr, 16); + for (i = 1; i < curr_blks; i++) + { + cipher_block_cpy (&tmpbuf[i * 16], ctr, 16); + cipher_block_add (&tmpbuf[i * 16], i, 16); + } + cipher_block_add (ctr, curr_blks, 16); + + burn_stack_depth = crypt_blk1_8 (ctx->rkey_enc, tmpbuf, tmpbuf, + curr_blks); + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor (outbuf, &tmpbuf[i * 16], inbuf, 16); + outbuf += 16; + inbuf += 16; + } + + nblocks -= curr_blks; + } + + wipememory(tmpbuf, tmp_used); + } + + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); +} + +/* Bulk decryption of complete blocks in CBC mode. This function is only + intended for the bulk encryption feature of cipher.c. 
*/ +void +_gcry_sm4_cbc_dec(void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + SM4_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + int burn_stack_depth = 0; + + /* Process remaining blocks. */ + if (nblocks) + { + unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks); + unsigned char savebuf[16 * 8]; + unsigned int tmp_used = 16; + + if (0) + ; + else + { + prefetch_sbox_table (); + crypt_blk1_8 = sm4_crypt_blocks; + } + + /* Process remaining blocks. */ + while (nblocks) + { + size_t curr_blks = nblocks > 8 ? 8 : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + burn_stack_depth = crypt_blk1_8 (ctx->rkey_dec, savebuf, inbuf, + curr_blks); + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor_n_copy_2(outbuf, &savebuf[i * 16], iv, inbuf, + 16); + outbuf += 16; + inbuf += 16; + } + + nblocks -= curr_blks; + } + + wipememory(savebuf, tmp_used); + } + + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); +} + +/* Bulk decryption of complete blocks in CFB mode. This function is only + intended for the bulk encryption feature of cipher.c. */ +void +_gcry_sm4_cfb_dec(void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks) +{ + SM4_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + int burn_stack_depth = 0; + + /* Process remaining blocks. */ + if (nblocks) + { + unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks); + unsigned char ivbuf[16 * 8]; + unsigned int tmp_used = 16; + + if (0) + ; + else + { + prefetch_sbox_table (); + crypt_blk1_8 = sm4_crypt_blocks; + } + + /* Process remaining blocks. */ + while (nblocks) + { + size_t curr_blks = nblocks > 8 ? 8 : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + cipher_block_cpy (&ivbuf[0 * 16], iv, 16); + for (i = 1; i < curr_blks; i++) + cipher_block_cpy (&ivbuf[i * 16], &inbuf[(i - 1) * 16], 16); + cipher_block_cpy (iv, &inbuf[(i - 1) * 16], 16); + + burn_stack_depth = crypt_blk1_8 (ctx->rkey_enc, ivbuf, ivbuf, + curr_blks); + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor (outbuf, inbuf, &ivbuf[i * 16], 16); + outbuf += 16; + inbuf += 16; + } + + nblocks -= curr_blks; + } + + wipememory(ivbuf, tmp_used); + } + + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); +} + +/* Bulk encryption/decryption of complete blocks in OCB mode. */ +size_t +_gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks, int encrypt) +{ + SM4_context *ctx = (void *)&c->context.c; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + u64 blkn = c->u_mode.ocb.data_nblocks; + int burn_stack_depth = 0; + + if (nblocks) + { + unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks); + const u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec; + unsigned char tmpbuf[16 * 8]; + unsigned int tmp_used = 16; + + if (0) + ; + else + { + prefetch_sbox_table (); + crypt_blk1_8 = sm4_crypt_blocks; + } + + while (nblocks) + { + size_t curr_blks = nblocks > 8 ? 
8 : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + for (i = 0; i < curr_blks; i++) + { + const unsigned char *l = ocb_get_l(c, ++blkn); + + /* Checksum_i = Checksum_{i-1} xor P_i */ + if (encrypt) + cipher_block_xor_1(c->u_ctr.ctr, &inbuf[i * 16], 16); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + cipher_block_xor_2dst (&tmpbuf[i * 16], c->u_iv.iv, l, 16); + cipher_block_xor (&outbuf[i * 16], &inbuf[i * 16], + c->u_iv.iv, 16); + } + + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + crypt_blk1_8 (rk, outbuf, outbuf, curr_blks); + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor_1 (&outbuf[i * 16], &tmpbuf[i * 16], 16); + + /* Checksum_i = Checksum_{i-1} xor P_i */ + if (!encrypt) + cipher_block_xor_1(c->u_ctr.ctr, &outbuf[i * 16], 16); + } + + outbuf += curr_blks * 16; + inbuf += curr_blks * 16; + nblocks -= curr_blks; + } + + wipememory(tmpbuf, tmp_used); + } + + c->u_mode.ocb.data_nblocks = blkn; + + if (burn_stack_depth) + _gcry_burn_stack(burn_stack_depth); + + return 0; +} + +/* Bulk authentication of complete blocks in OCB mode. */ +size_t +_gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) +{ + SM4_context *ctx = (void *)&c->context.c; + const unsigned char *abuf = abuf_arg; + u64 blkn = c->u_mode.ocb.aad_nblocks; + + if (nblocks) + { + unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks); + unsigned char tmpbuf[16 * 8]; + unsigned int tmp_used = 16; + + if (0) + ; + else + { + prefetch_sbox_table (); + crypt_blk1_8 = sm4_crypt_blocks; + } + + while (nblocks) + { + size_t curr_blks = nblocks > 8 ? 8 : nblocks; + size_t i; + + if (curr_blks * 16 > tmp_used) + tmp_used = curr_blks * 16; + + for (i = 0; i < curr_blks; i++) + { + const unsigned char *l = ocb_get_l(c, ++blkn); + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + cipher_block_xor_2dst (&tmpbuf[i * 16], + c->u_mode.ocb.aad_offset, l, 16); + cipher_block_xor_1 (&tmpbuf[i * 16], &abuf[i * 16], 16); + } + + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + crypt_blk1_8 (ctx->rkey_enc, tmpbuf, tmpbuf, curr_blks); + + for (i = 0; i < curr_blks; i++) + { + cipher_block_xor_1 (c->u_mode.ocb.aad_sum, &tmpbuf[i * 16], 16); + } + + abuf += curr_blks * 16; + nblocks -= curr_blks; + } + + wipememory(tmpbuf, tmp_used); + } + + c->u_mode.ocb.aad_nblocks = blkn; + + return 0; +} + +/* Run the self-tests for SM4-CTR, tests IV increment of bulk CTR + encryption. Returns NULL on success. */ +static const char* +selftest_ctr_128 (void) +{ + const int nblocks = 16 - 1; + const int blocksize = 16; + const int context_size = sizeof(SM4_context); + + return _gcry_selftest_helper_ctr("SM4", &sm4_setkey, + &sm4_encrypt, &_gcry_sm4_ctr_enc, nblocks, blocksize, + context_size); +} + +/* Run the self-tests for SM4-CBC, tests bulk CBC decryption. + Returns NULL on success. */ +static const char* +selftest_cbc_128 (void) +{ + const int nblocks = 16 - 1; + const int blocksize = 16; + const int context_size = sizeof(SM4_context); + + return _gcry_selftest_helper_cbc("SM4", &sm4_setkey, + &sm4_encrypt, &_gcry_sm4_cbc_dec, nblocks, blocksize, + context_size); +} + +/* Run the self-tests for SM4-CFB, tests bulk CFB decryption. + Returns NULL on success. 
*/ +static const char* +selftest_cfb_128 (void) +{ + const int nblocks = 16 - 1; + const int blocksize = 16; + const int context_size = sizeof(SM4_context); + + return _gcry_selftest_helper_cfb("SM4", &sm4_setkey, + &sm4_encrypt, &_gcry_sm4_cfb_dec, nblocks, blocksize, + context_size); } static const char * @@ -204,6 +725,7 @@ sm4_selftest (void) { SM4_context ctx; byte scratch[16]; + const char *r; static const byte plaintext[16] = { 0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF, @@ -218,7 +740,9 @@ sm4_selftest (void) 0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46 }; - sm4_setkey (&ctx, key, sizeof (key), NULL); + memset (&ctx, 0, sizeof(ctx)); + + sm4_expand_key (&ctx, key); sm4_encrypt (&ctx, scratch, plaintext); if (memcmp (scratch, ciphertext, sizeof (ciphertext))) return "SM4 test encryption failed."; @@ -226,6 +750,15 @@ sm4_selftest (void) if (memcmp (scratch, plaintext, sizeof (plaintext))) return "SM4 test decryption failed."; + if ( (r = selftest_ctr_128 ()) ) + return r; + + if ( (r = selftest_cbc_128 ()) ) + return r; + + if ( (r = selftest_cfb_128 ()) ) + return r; + return NULL; } diff --git a/src/cipher.h b/src/cipher.h index c49bbda5..decdc4d1 100644 --- a/src/cipher.h +++ b/src/cipher.h @@ -241,6 +241,22 @@ size_t _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, size_t _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks); +/*-- sm4.c --*/ +void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks); +void _gcry_sm4_cbc_dec (void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks); +void _gcry_sm4_cfb_dec (void *context, unsigned char *iv, + void *outbuf_arg, const void *inbuf_arg, + size_t nblocks); +size_t _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, + const void *inbuf_arg, size_t nblocks, + int encrypt); +size_t _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, + size_t nblocks); + /*-- twofish.c --*/ void _gcry_twofish_ctr_enc (void *context, unsigned char *ctr, void *outbuf_arg, const void *inbuf_arg, diff --git a/tests/basic.c b/tests/basic.c index 5acbab84..8ccb9c66 100644 --- a/tests/basic.c +++ b/tests/basic.c @@ -7035,6 +7035,8 @@ check_ocb_cipher (void) "\x99\xeb\x35\xb0\x62\x4e\x7b\xf1\x5e\x9f\xed\x32\x78\x90\x0b\xd0"); check_ocb_cipher_largebuf(GCRY_CIPHER_SERPENT256, 32, "\x71\x66\x2f\x68\xbf\xdd\xcc\xb1\xbf\x81\x56\x5f\x01\x73\xeb\x44"); + check_ocb_cipher_largebuf(GCRY_CIPHER_SM4, 16, + "\x2c\x0b\x31\x0b\xf4\x71\x9b\x01\xf4\x18\x5d\xf1\xe9\x3d\xed\x6b"); /* Check that the AAD data is correctly buffered. */ check_ocb_cipher_splitaad (); -- 2.25.1 From jussi.kivilinna at iki.fi Tue Jun 16 21:28:24 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 16 Jun 2020 22:28:24 +0300 Subject: [PATCH 2/3] Add SM4 x86-64/AES-NI/AVX implementation In-Reply-To: <20200616192825.1584395-1-jussi.kivilinna@iki.fi> References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> <20200616192825.1584395-1-jussi.kivilinna@iki.fi> Message-ID: <20200616192825.1584395-3-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Add 'sm4-aesni-avx-amd64.S'. * cipher/sm4-aesni-avx-amd64.S: New. * cipher/sm4.c (USE_AESNI_AVX, ASM_FUNC_ABI): New. (SM4_context) [USE_AESNI_AVX]: Add 'use_aesni_avx'. 
[USE_AESNI_AVX] (_gcry_sm4_aesni_avx_expand_key) (_gcry_sm4_aesni_avx_crypt_blk1_8, _gcry_sm4_aesni_avx_ctr_enc) (_gcry_sm4_aesni_avx_cbc_dec, _gcry_sm4_aesni_avx_cfb_dec) (_gcry_sm4_aesni_avx_ocb_enc, _gcry_sm4_aesni_avx_ocb_dec) (_gcry_sm4_aesni_avx_ocb_auth): New. (sm4_expand_key) [USE_AESNI_AVX]: Use AES-NI/AVX key setup. (sm4_setkey): Enable AES-NI/AVX if supported by HW. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX]: Add AES-NI/AVX bulk functions. * configure.ac: Add ''sm4-aesni-avx-amd64.lo'. -- This patch adds x86-64/AES-NI/AVX bulk encryption/decryption and key setup for SM4 cipher. Bulk functions process eight blocks in parallel. Benchmark on AMD Ryzen 7 3700X: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 8.94 ns/B 106.7 MiB/s 38.66 c/B 4325 CBC dec | 4.78 ns/B 199.7 MiB/s 20.42 c/B 4275 CFB enc | 8.95 ns/B 106.5 MiB/s 38.72 c/B 4325 CFB dec | 4.81 ns/B 198.2 MiB/s 20.57 c/B 4275 CTR enc | 4.81 ns/B 198.2 MiB/s 20.69 c/B 4300 CTR dec | 4.80 ns/B 198.8 MiB/s 20.63 c/B 4300 GCM auth | 0.116 ns/B 8232 MiB/s 0.504 c/B 4351 OCB enc | 4.88 ns/B 195.5 MiB/s 20.86 c/B 4275 OCB dec | 4.85 ns/B 196.6 MiB/s 20.86 c/B 4301 OCB auth | 4.80 ns/B 198.9 MiB/s 20.62 c/B 4301 After (~3.0x faster): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 8.98 ns/B 106.2 MiB/s 38.62 c/B 4300 CBC dec | 1.55 ns/B 613.7 MiB/s 6.64 c/B 4275 CFB enc | 8.96 ns/B 106.4 MiB/s 38.52 c/B 4300 CFB dec | 1.54 ns/B 617.4 MiB/s 6.60 c/B 4275 CTR enc | 1.57 ns/B 607.8 MiB/s 6.75 c/B 4300 CTR dec | 1.57 ns/B 608.9 MiB/s 6.74 c/B 4300 OCB enc | 1.58 ns/B 603.8 MiB/s 6.75 c/B 4275 OCB dec | 1.57 ns/B 605.7 MiB/s 6.73 c/B 4275 OCB auth | 1.53 ns/B 624.5 MiB/s 6.57 c/B 4300 Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 2 +- cipher/sm4-aesni-avx-amd64.S | 987 +++++++++++++++++++++++++++++++++++ cipher/sm4.c | 232 ++++++++ configure.ac | 7 + 4 files changed, 1227 insertions(+), 1 deletion(-) create mode 100644 cipher/sm4-aesni-avx-amd64.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 56661dcd..427922c6 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -107,7 +107,7 @@ EXTRA_libcipher_la_SOURCES = \ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S \ - sm4.c \ + sm4.c sm4-aesni-avx-amd64.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-aesni-avx-amd64.S b/cipher/sm4-aesni-avx-amd64.S new file mode 100644 index 00000000..3610b98c --- /dev/null +++ b/cipher/sm4-aesni-avx-amd64.S @@ -0,0 +1,987 @@ +/* sm4-avx-aesni-amd64.S - AES-NI/AVX implementation of SM4 cipher + * + * Copyright (C) 2020 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . 
+ */ + +/* Based on SM4 AES-NI work by Markku-Juhani O. Saarinen at: + * https://github.com/mjosaarinen/sm4ni + */ + +#include + +#ifdef __x86_64 +#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \ + defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT) + +#include "asm-common-amd64.h" + +/* vector registers */ +#define RX0 %xmm0 +#define RX1 %xmm1 +#define MASK_4BIT %xmm2 +#define RTMP0 %xmm3 +#define RTMP1 %xmm4 +#define RTMP2 %xmm5 +#define RTMP3 %xmm6 +#define RTMP4 %xmm7 + +#define RA0 %xmm8 +#define RA1 %xmm9 +#define RA2 %xmm10 +#define RA3 %xmm11 + +#define RB0 %xmm12 +#define RB1 %xmm13 +#define RB2 %xmm14 +#define RB3 %xmm15 + +#define RNOT %xmm0 +#define RBSWAP %xmm1 + +/********************************************************************** + helper macros + **********************************************************************/ + +/* Transpose four 32-bit words between 128-bit vectors. */ +#define transpose_4x4(x0, x1, x2, x3, t1, t2) \ + vpunpckhdq x1, x0, t2; \ + vpunpckldq x1, x0, x0; \ + \ + vpunpckldq x3, x2, t1; \ + vpunpckhdq x3, x2, x2; \ + \ + vpunpckhqdq t1, x0, x1; \ + vpunpcklqdq t1, x0, x0; \ + \ + vpunpckhqdq x2, t2, x3; \ + vpunpcklqdq x2, t2, x2; + +/* post-SubByte transform. */ +#define transform_pre(x, lo_t, hi_t, mask4bit, tmp0) \ + vpand x, mask4bit, tmp0; \ + vpandn x, mask4bit, x; \ + vpsrld $4, x, x; \ + \ + vpshufb tmp0, lo_t, tmp0; \ + vpshufb x, hi_t, x; \ + vpxor tmp0, x, x; + +/* post-SubByte transform. Note: x has been XOR'ed with mask4bit by + * 'vaeslastenc' instruction. */ +#define transform_post(x, lo_t, hi_t, mask4bit, tmp0) \ + vpandn mask4bit, x, tmp0; \ + vpsrld $4, x, x; \ + vpand x, mask4bit, x; \ + \ + vpshufb tmp0, lo_t, tmp0; \ + vpshufb x, hi_t, x; \ + vpxor tmp0, x, x; + +/********************************************************************** + 4-way && 8-way SM4 with AES-NI and AVX + **********************************************************************/ + +.text +.align 16 + +/* + * Following four affine transform look-up tables are from work by + * Markku-Juhani O. Saarinen, at https://github.com/mjosaarinen/sm4ni + * + * These allow exposing SM4 S-Box from AES SubByte. + */ + +/* pre-SubByte affine transform, from SM4 field to AES field. */ +.Lpre_tf_lo_s: + .quad 0x9197E2E474720701, 0xC7C1B4B222245157 +.Lpre_tf_hi_s: + .quad 0xE240AB09EB49A200, 0xF052B91BF95BB012 + +/* post-SubByte affine transform, from AES field to SM4 field. 
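+ * The lo/hi quadword pairs are 16-byte tables indexed by the low/high nibble of each byte via vpshufb (see transform_pre/transform_post above).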
*/ +.Lpost_tf_lo_s: + .quad 0x5B67F2CEA19D0834, 0xEDD14478172BBE82 +.Lpost_tf_hi_s: + .quad 0xAE7201DD73AFDC00, 0x11CDBE62CC1063BF + +/* For isolating SubBytes from AESENCLAST, inverse shift row */ +.Linv_shift_row: + .byte 0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b + .byte 0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03 + +/* Inverse shift row + Rotate left by 8 bits on 32-bit words with vpshufb */ +.Linv_shift_row_rol_8: + .byte 0x07, 0x00, 0x0d, 0x0a, 0x0b, 0x04, 0x01, 0x0e + .byte 0x0f, 0x08, 0x05, 0x02, 0x03, 0x0c, 0x09, 0x06 + +/* Inverse shift row + Rotate left by 16 bits on 32-bit words with vpshufb */ +.Linv_shift_row_rol_16: + .byte 0x0a, 0x07, 0x00, 0x0d, 0x0e, 0x0b, 0x04, 0x01 + .byte 0x02, 0x0f, 0x08, 0x05, 0x06, 0x03, 0x0c, 0x09 + +/* Inverse shift row + Rotate left by 24 bits on 32-bit words with vpshufb */ +.Linv_shift_row_rol_24: + .byte 0x0d, 0x0a, 0x07, 0x00, 0x01, 0x0e, 0x0b, 0x04 + .byte 0x05, 0x02, 0x0f, 0x08, 0x09, 0x06, 0x03, 0x0c + +/* For CTR-mode IV byteswap */ +.Lbswap128_mask: + .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 + +/* For input word byte-swap */ +.Lbswap32_mask: + .byte 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 + +.align 4 +/* 4-bit mask */ +.L0f0f0f0f: + .long 0x0f0f0f0f + +.align 8 +.globl _gcry_sm4_aesni_avx_expand_key +ELF(.type _gcry_sm4_aesni_avx_expand_key, at function;) +_gcry_sm4_aesni_avx_expand_key: + /* input: + * %rdi: 128-bit key + * %rsi: rkey_enc + * %rdx: rkey_dec + * %rcx: fk array + * %r8: ck array + */ + CFI_STARTPROC(); + + vmovd 0*4(%rdi), RA0; + vmovd 1*4(%rdi), RA1; + vmovd 2*4(%rdi), RA2; + vmovd 3*4(%rdi), RA3; + + vmovdqa .Lbswap32_mask rRIP, RTMP2; + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + + vmovd 0*4(%rcx), RB0; + vmovd 1*4(%rcx), RB1; + vmovd 2*4(%rcx), RB2; + vmovd 3*4(%rcx), RB3; + vpxor RB0, RA0, RA0; + vpxor RB1, RA1, RA1; + vpxor RB2, RA2, RA2; + vpxor RB3, RA3, RA3; + + vbroadcastss .L0f0f0f0f rRIP, MASK_4BIT; + vmovdqa .Lpre_tf_lo_s rRIP, RTMP4; + vmovdqa .Lpre_tf_hi_s rRIP, RB0; + vmovdqa .Lpost_tf_lo_s rRIP, RB1; + vmovdqa .Lpost_tf_hi_s rRIP, RB2; + vmovdqa .Linv_shift_row rRIP, RB3; + +#define ROUND(round, s0, s1, s2, s3) \ + vbroadcastss (4*(round))(%r8), RX0; \ + vpxor s1, RX0, RX0; \ + vpxor s2, RX0, RX0; \ + vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + transform_pre(RX0, RTMP4, RB0, MASK_4BIT, RTMP0); \ + vaesenclast MASK_4BIT, RX0, RX0; \ + transform_post(RX0, RB1, RB2, MASK_4BIT, RTMP0); \ + \ + /* linear part */ \ + vpshufb RB3, RX0, RX0; \ + vpxor RX0, s0, s0; /* s0 ^ x */ \ + vpslld $13, RX0, RTMP0; \ + vpsrld $19, RX0, RTMP1; \ + vpslld $23, RX0, RTMP2; \ + vpsrld $9, RX0, RTMP3; \ + vpxor RTMP0, RTMP1, RTMP1; \ + vpxor RTMP2, RTMP3, RTMP3; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,13) */ \ + vpxor RTMP3, s0, s0; /* s0 ^ x ^ rol(x,13) ^ rol(x,23) */ + + leaq (32*4)(%r8), %rax; + leaq (32*4)(%rdx), %rdx; +.align 16 +.Lroundloop_expand_key: + leaq (-4*4)(%rdx), %rdx; + ROUND(0, RA0, RA1, RA2, RA3); + ROUND(1, RA1, RA2, RA3, RA0); + ROUND(2, RA2, RA3, RA0, RA1); + ROUND(3, RA3, RA0, RA1, RA2); + leaq (4*4)(%r8), %r8; + vmovd RA0, (0*4)(%rsi); + vmovd RA1, (1*4)(%rsi); + vmovd RA2, (2*4)(%rsi); + vmovd RA3, (3*4)(%rsi); + vmovd RA0, (3*4)(%rdx); + vmovd RA1, (2*4)(%rdx); + vmovd RA2, (1*4)(%rdx); + vmovd RA3, (0*4)(%rdx); + leaq (4*4)(%rsi), %rsi; + cmpq %rax, %r8; + jne .Lroundloop_expand_key; + +#undef ROUND + + vzeroall; + ret; + CFI_ENDPROC(); +ELF(.size 
_gcry_sm4_aesni_avx_expand_key,.-_gcry_sm4_aesni_avx_expand_key;) + +.align 8 +ELF(.type sm4_aesni_avx_crypt_blk1_4, at function;) +sm4_aesni_avx_crypt_blk1_4: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (1..4 blocks) + * %rdx: src (1..4 blocks) + * %rcx: num blocks (1..4) + */ + CFI_STARTPROC(); + + vmovdqu 0*16(%rdx), RA0; + vmovdqa RA0, RA1; + vmovdqa RA0, RA2; + vmovdqa RA0, RA3; + cmpq $2, %rcx; + jb .Lblk4_load_input_done; + vmovdqu 1*16(%rdx), RA1; + je .Lblk4_load_input_done; + vmovdqu 2*16(%rdx), RA2; + cmpq $3, %rcx; + je .Lblk4_load_input_done; + vmovdqu 3*16(%rdx), RA3; + +.Lblk4_load_input_done: + + vmovdqa .Lbswap32_mask rRIP, RTMP2; + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + + vbroadcastss .L0f0f0f0f rRIP, MASK_4BIT; + vmovdqa .Lpre_tf_lo_s rRIP, RTMP4; + vmovdqa .Lpre_tf_hi_s rRIP, RB0; + vmovdqa .Lpost_tf_lo_s rRIP, RB1; + vmovdqa .Lpost_tf_hi_s rRIP, RB2; + vmovdqa .Linv_shift_row rRIP, RB3; + vmovdqa .Linv_shift_row_rol_8 rRIP, RTMP2; + vmovdqa .Linv_shift_row_rol_16 rRIP, RTMP3; + transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1); + +#define ROUND(round, s0, s1, s2, s3) \ + vbroadcastss (4*(round))(%rdi), RX0; \ + vpxor s1, RX0, RX0; \ + vpxor s2, RX0, RX0; \ + vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + transform_pre(RX0, RTMP4, RB0, MASK_4BIT, RTMP0); \ + vaesenclast MASK_4BIT, RX0, RX0; \ + transform_post(RX0, RB1, RB2, MASK_4BIT, RTMP0); \ + \ + /* linear part */ \ + vpshufb RB3, RX0, RTMP0; \ + vpxor RTMP0, s0, s0; /* s0 ^ x */ \ + vpshufb RTMP2, RX0, RTMP1; \ + vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) */ \ + vpshufb RTMP3, RX0, RTMP1; \ + vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb .Linv_shift_row_rol_24 rRIP, RX0, RTMP1; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP0, RTMP1; \ + vpsrld $30, RTMP0, RTMP0; \ + vpxor RTMP0, s0, s0; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ + + leaq (32*4)(%rdi), %rax; +.align 16 +.Lroundloop_blk4: + ROUND(0, RA0, RA1, RA2, RA3); + ROUND(1, RA1, RA2, RA3, RA0); + ROUND(2, RA2, RA3, RA0, RA1); + ROUND(3, RA3, RA0, RA1, RA2); + leaq (4*4)(%rdi), %rdi; + cmpq %rax, %rdi; + jne .Lroundloop_blk4; + +#undef ROUND + + vmovdqa .Lbswap128_mask rRIP, RTMP2; + + transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1); + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + + vmovdqu RA0, 0*16(%rsi); + cmpq $2, %rcx; + jb .Lblk4_store_output_done; + vmovdqu RA1, 1*16(%rsi); + je .Lblk4_store_output_done; + vmovdqu RA2, 2*16(%rsi); + cmpq $3, %rcx; + je .Lblk4_store_output_done; + vmovdqu RA3, 3*16(%rsi); + +.Lblk4_store_output_done: + vzeroall; + xorl %eax, %eax; + ret; + CFI_ENDPROC(); +ELF(.size sm4_aesni_avx_crypt_blk1_4,.-sm4_aesni_avx_crypt_blk1_4;) + +.align 8 +ELF(.type __sm4_crypt_blk8, at function;) +__sm4_crypt_blk8: + /* input: + * %rdi: round key array, CTX + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel + * ciphertext blocks + * output: + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel plaintext + * blocks + */ + CFI_STARTPROC(); + + vmovdqa .Lbswap32_mask rRIP, RTMP2; + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + vpshufb RTMP2, RB0, RB0; + vpshufb RTMP2, RB1, RB1; + vpshufb RTMP2, RB2, RB2; + vpshufb RTMP2, RB3, RB3; + + vbroadcastss .L0f0f0f0f rRIP, MASK_4BIT; + transpose_4x4(RA0, RA1, RA2, 
RA3, RTMP0, RTMP1); + transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1); + +#define ROUND(round, s0, s1, s2, s3, r0, r1, r2, r3) \ + vbroadcastss (4*(round))(%rdi), RX0; \ + vmovdqa .Lpre_tf_lo_s rRIP, RTMP4; \ + vmovdqa .Lpre_tf_hi_s rRIP, RTMP1; \ + vmovdqa RX0, RX1; \ + vpxor s1, RX0, RX0; \ + vpxor s2, RX0, RX0; \ + vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \ + vmovdqa .Lpost_tf_lo_s rRIP, RTMP2; \ + vmovdqa .Lpost_tf_hi_s rRIP, RTMP3; \ + vpxor r1, RX1, RX1; \ + vpxor r2, RX1, RX1; \ + vpxor r3, RX1, RX1; /* r1 ^ r2 ^ r3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + transform_pre(RX0, RTMP4, RTMP1, MASK_4BIT, RTMP0); \ + transform_pre(RX1, RTMP4, RTMP1, MASK_4BIT, RTMP0); \ + vmovdqa .Linv_shift_row rRIP, RTMP4; \ + vaesenclast MASK_4BIT, RX0, RX0; \ + vaesenclast MASK_4BIT, RX1, RX1; \ + transform_post(RX0, RTMP2, RTMP3, MASK_4BIT, RTMP0); \ + transform_post(RX1, RTMP2, RTMP3, MASK_4BIT, RTMP0); \ + \ + /* linear part */ \ + vpshufb RTMP4, RX0, RTMP0; \ + vpxor RTMP0, s0, s0; /* s0 ^ x */ \ + vpshufb RTMP4, RX1, RTMP2; \ + vmovdqa .Linv_shift_row_rol_8 rRIP, RTMP4; \ + vpxor RTMP2, r0, r0; /* r0 ^ x */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vmovdqa .Linv_shift_row_rol_16 rRIP, RTMP4; \ + vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vmovdqa .Linv_shift_row_rol_24 rRIP, RTMP4; \ + vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP0, RTMP1; \ + vpsrld $30, RTMP0, RTMP0; \ + vpxor RTMP0, s0, s0; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP2, RTMP3; \ + vpsrld $30, RTMP2, RTMP2; \ + vpxor RTMP2, r0, r0; \ + vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ + + leaq (32*4)(%rdi), %rax; +.align 16 +.Lroundloop_blk8: + ROUND(0, RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3); + ROUND(1, RA1, RA2, RA3, RA0, RB1, RB2, RB3, RB0); + ROUND(2, RA2, RA3, RA0, RA1, RB2, RB3, RB0, RB1); + ROUND(3, RA3, RA0, RA1, RA2, RB3, RB0, RB1, RB2); + leaq (4*4)(%rdi), %rdi; + cmpq %rax, %rdi; + jne .Lroundloop_blk8; + +#undef ROUND + + vmovdqa .Lbswap128_mask rRIP, RTMP2; + + transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1); + transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1); + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + vpshufb RTMP2, RB0, RB0; + vpshufb RTMP2, RB1, RB1; + vpshufb RTMP2, RB2, RB2; + vpshufb RTMP2, RB3, RB3; + + ret; + CFI_ENDPROC(); +ELF(.size __sm4_crypt_blk8,.-__sm4_crypt_blk8;) + +.align 8 +.globl _gcry_sm4_aesni_avx_crypt_blk1_8 +ELF(.type _gcry_sm4_aesni_avx_crypt_blk1_8, at function;) +_gcry_sm4_aesni_avx_crypt_blk1_8: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (1..8 blocks) + * %rdx: src (1..8 blocks) + * %rcx: num blocks (1..8) + */ + CFI_STARTPROC(); + + cmpq $5, %rcx; + jb sm4_aesni_avx_crypt_blk1_4; + vmovdqu (0 * 16)(%rdx), RA0; + vmovdqu (1 * 16)(%rdx), RA1; + vmovdqu (2 * 16)(%rdx), RA2; + vmovdqu (3 * 16)(%rdx), RA3; + vmovdqu (4 * 16)(%rdx), RB0; + vmovdqa RB0, RB1; + vmovdqa RB0, RB2; + vmovdqa RB0, RB3; + je .Lblk8_load_input_done; + vmovdqu (5 * 16)(%rdx), RB1; + cmpq $7, %rcx; + jb .Lblk8_load_input_done; + vmovdqu (6 * 
16)(%rdx), RB2; + je .Lblk8_load_input_done; + vmovdqu (7 * 16)(%rdx), RB3; + +.Lblk8_load_input_done: + call __sm4_crypt_blk8; + + cmpq $6, %rcx; + vmovdqu RA0, (0 * 16)(%rsi); + vmovdqu RA1, (1 * 16)(%rsi); + vmovdqu RA2, (2 * 16)(%rsi); + vmovdqu RA3, (3 * 16)(%rsi); + vmovdqu RB0, (4 * 16)(%rsi); + jb .Lblk8_store_output_done; + vmovdqu RB1, (5 * 16)(%rsi); + je .Lblk8_store_output_done; + vmovdqu RB2, (6 * 16)(%rsi); + cmpq $7, %rcx; + je .Lblk8_store_output_done; + vmovdqu RB3, (7 * 16)(%rsi); + +.Lblk8_store_output_done: + vzeroall; + xorl %eax, %eax; + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx_crypt_blk1_8,.-_gcry_sm4_aesni_avx_crypt_blk1_8;) + +.align 8 +.globl _gcry_sm4_aesni_avx_ctr_enc +ELF(.type _gcry_sm4_aesni_avx_ctr_enc, at function;) +_gcry_sm4_aesni_avx_ctr_enc: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (8 blocks) + * %rdx: src (8 blocks) + * %rcx: iv (big endian, 128bit) + */ + CFI_STARTPROC(); + + /* load IV and byteswap */ + vmovdqu (%rcx), RA0; + + vmovdqa .Lbswap128_mask rRIP, RBSWAP; + vpshufb RBSWAP, RA0, RTMP0; /* be => le */ + + vpcmpeqd RNOT, RNOT, RNOT; + vpsrldq $8, RNOT, RNOT; /* low: -1, high: 0 */ + +#define inc_le128(x, minus_one, tmp) \ + vpcmpeqq minus_one, x, tmp; \ + vpsubq minus_one, x, x; \ + vpslldq $8, tmp, tmp; \ + vpsubq tmp, x, x; + + /* construct IVs */ + inc_le128(RTMP0, RNOT, RTMP2); /* +1 */ + vpshufb RBSWAP, RTMP0, RA1; + inc_le128(RTMP0, RNOT, RTMP2); /* +2 */ + vpshufb RBSWAP, RTMP0, RA2; + inc_le128(RTMP0, RNOT, RTMP2); /* +3 */ + vpshufb RBSWAP, RTMP0, RA3; + inc_le128(RTMP0, RNOT, RTMP2); /* +4 */ + vpshufb RBSWAP, RTMP0, RB0; + inc_le128(RTMP0, RNOT, RTMP2); /* +5 */ + vpshufb RBSWAP, RTMP0, RB1; + inc_le128(RTMP0, RNOT, RTMP2); /* +6 */ + vpshufb RBSWAP, RTMP0, RB2; + inc_le128(RTMP0, RNOT, RTMP2); /* +7 */ + vpshufb RBSWAP, RTMP0, RB3; + inc_le128(RTMP0, RNOT, RTMP2); /* +8 */ + vpshufb RBSWAP, RTMP0, RTMP1; + + /* store new IV */ + vmovdqu RTMP1, (%rcx); + + call __sm4_crypt_blk8; + + vpxor (0 * 16)(%rdx), RA0, RA0; + vpxor (1 * 16)(%rdx), RA1, RA1; + vpxor (2 * 16)(%rdx), RA2, RA2; + vpxor (3 * 16)(%rdx), RA3, RA3; + vpxor (4 * 16)(%rdx), RB0, RB0; + vpxor (5 * 16)(%rdx), RB1, RB1; + vpxor (6 * 16)(%rdx), RB2, RB2; + vpxor (7 * 16)(%rdx), RB3, RB3; + + vmovdqu RA0, (0 * 16)(%rsi); + vmovdqu RA1, (1 * 16)(%rsi); + vmovdqu RA2, (2 * 16)(%rsi); + vmovdqu RA3, (3 * 16)(%rsi); + vmovdqu RB0, (4 * 16)(%rsi); + vmovdqu RB1, (5 * 16)(%rsi); + vmovdqu RB2, (6 * 16)(%rsi); + vmovdqu RB3, (7 * 16)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx_ctr_enc,.-_gcry_sm4_aesni_avx_ctr_enc;) + +.align 8 +.globl _gcry_sm4_aesni_avx_cbc_dec +ELF(.type _gcry_sm4_aesni_avx_cbc_dec, at function;) +_gcry_sm4_aesni_avx_cbc_dec: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (8 blocks) + * %rdx: src (8 blocks) + * %rcx: iv + */ + CFI_STARTPROC(); + + vmovdqu (0 * 16)(%rdx), RA0; + vmovdqu (1 * 16)(%rdx), RA1; + vmovdqu (2 * 16)(%rdx), RA2; + vmovdqu (3 * 16)(%rdx), RA3; + vmovdqu (4 * 16)(%rdx), RB0; + vmovdqu (5 * 16)(%rdx), RB1; + vmovdqu (6 * 16)(%rdx), RB2; + vmovdqu (7 * 16)(%rdx), RB3; + + call __sm4_crypt_blk8; + + vmovdqu (7 * 16)(%rdx), RNOT; + vpxor (%rcx), RA0, RA0; + vpxor (0 * 16)(%rdx), RA1, RA1; + vpxor (1 * 16)(%rdx), RA2, RA2; + vpxor (2 * 16)(%rdx), RA3, RA3; + vpxor (3 * 16)(%rdx), RB0, RB0; + vpxor (4 * 16)(%rdx), RB1, RB1; + vpxor (5 * 16)(%rdx), RB2, RB2; + vpxor (6 * 16)(%rdx), RB3, RB3; + vmovdqu RNOT, (%rcx); /* store new IV */ + + vmovdqu RA0, (0 * 16)(%rsi); + 
vmovdqu RA1, (1 * 16)(%rsi); + vmovdqu RA2, (2 * 16)(%rsi); + vmovdqu RA3, (3 * 16)(%rsi); + vmovdqu RB0, (4 * 16)(%rsi); + vmovdqu RB1, (5 * 16)(%rsi); + vmovdqu RB2, (6 * 16)(%rsi); + vmovdqu RB3, (7 * 16)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx_cbc_dec,.-_gcry_sm4_aesni_avx_cbc_dec;) + +.align 8 +.globl _gcry_sm4_aesni_avx_cfb_dec +ELF(.type _gcry_sm4_aesni_avx_cfb_dec, at function;) +_gcry_sm4_aesni_avx_cfb_dec: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (8 blocks) + * %rdx: src (8 blocks) + * %rcx: iv + */ + CFI_STARTPROC(); + + /* Load input */ + vmovdqu (%rcx), RA0; + vmovdqu 0 * 16(%rdx), RA1; + vmovdqu 1 * 16(%rdx), RA2; + vmovdqu 2 * 16(%rdx), RA3; + vmovdqu 3 * 16(%rdx), RB0; + vmovdqu 4 * 16(%rdx), RB1; + vmovdqu 5 * 16(%rdx), RB2; + vmovdqu 6 * 16(%rdx), RB3; + + /* Update IV */ + vmovdqu 7 * 16(%rdx), RNOT; + vmovdqu RNOT, (%rcx); + + call __sm4_crypt_blk8; + + vpxor (0 * 16)(%rdx), RA0, RA0; + vpxor (1 * 16)(%rdx), RA1, RA1; + vpxor (2 * 16)(%rdx), RA2, RA2; + vpxor (3 * 16)(%rdx), RA3, RA3; + vpxor (4 * 16)(%rdx), RB0, RB0; + vpxor (5 * 16)(%rdx), RB1, RB1; + vpxor (6 * 16)(%rdx), RB2, RB2; + vpxor (7 * 16)(%rdx), RB3, RB3; + + vmovdqu RA0, (0 * 16)(%rsi); + vmovdqu RA1, (1 * 16)(%rsi); + vmovdqu RA2, (2 * 16)(%rsi); + vmovdqu RA3, (3 * 16)(%rsi); + vmovdqu RB0, (4 * 16)(%rsi); + vmovdqu RB1, (5 * 16)(%rsi); + vmovdqu RB2, (6 * 16)(%rsi); + vmovdqu RB3, (7 * 16)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx_cfb_dec,.-_gcry_sm4_aesni_avx_cfb_dec;) + +.align 8 +.globl _gcry_sm4_aesni_avx_ocb_enc +ELF(.type _gcry_sm4_aesni_avx_ocb_enc, at function;) + +_gcry_sm4_aesni_avx_ocb_enc: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (8 blocks) + * %rdx: src (8 blocks) + * %rcx: offset + * %r8 : checksum + * %r9 : L pointers (void *L[8]) + */ + CFI_STARTPROC(); + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rcx), RTMP0; + vmovdqu (%r8), RTMP1; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + +#define OCB_INPUT(n, lreg, xreg) \ + vmovdqu (n * 16)(%rdx), xreg; \ + vpxor (lreg), RTMP0, RTMP0; \ + vpxor xreg, RTMP1, RTMP1; \ + vpxor RTMP0, xreg, xreg; \ + vmovdqu RTMP0, (n * 16)(%rsi); + movq (0 * 8)(%r9), %r10; + movq (1 * 8)(%r9), %r11; + movq (2 * 8)(%r9), %r12; + movq (3 * 8)(%r9), %r13; + OCB_INPUT(0, %r10, RA0); + OCB_INPUT(1, %r11, RA1); + OCB_INPUT(2, %r12, RA2); + OCB_INPUT(3, %r13, RA3); + movq (4 * 8)(%r9), %r10; + movq (5 * 8)(%r9), %r11; + movq (6 * 8)(%r9), %r12; + movq (7 * 8)(%r9), %r13; + OCB_INPUT(4, %r10, RB0); + OCB_INPUT(5, %r11, RB1); + OCB_INPUT(6, %r12, RB2); + OCB_INPUT(7, %r13, RB3); +#undef OCB_INPUT + + vmovdqu RTMP0, (%rcx); + vmovdqu RTMP1, (%r8); + + movq (0 * 8)(%rsp), %r10; + CFI_RESTORE(%r10); + movq (1 * 8)(%rsp), %r11; + CFI_RESTORE(%r11); + movq (2 * 8)(%rsp), %r12; + CFI_RESTORE(%r12); + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r13); + + call __sm4_crypt_blk8; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vpxor (0 * 16)(%rsi), RA0, RA0; + vpxor (1 * 16)(%rsi), RA1, RA1; + vpxor (2 * 16)(%rsi), RA2, RA2; + vpxor (3 * 16)(%rsi), RA3, RA3; + vpxor (4 * 16)(%rsi), RB0, RB0; + vpxor (5 * 
16)(%rsi), RB1, RB1; + vpxor (6 * 16)(%rsi), RB2, RB2; + vpxor (7 * 16)(%rsi), RB3, RB3; + + vmovdqu RA0, (0 * 16)(%rsi); + vmovdqu RA1, (1 * 16)(%rsi); + vmovdqu RA2, (2 * 16)(%rsi); + vmovdqu RA3, (3 * 16)(%rsi); + vmovdqu RB0, (4 * 16)(%rsi); + vmovdqu RB1, (5 * 16)(%rsi); + vmovdqu RB2, (6 * 16)(%rsi); + vmovdqu RB3, (7 * 16)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx_ocb_enc,.-_gcry_sm4_aesni_avx_ocb_enc;) + +.align 8 +.globl _gcry_sm4_aesni_avx_ocb_dec +ELF(.type _gcry_sm4_aesni_avx_ocb_dec, at function;) + +_gcry_sm4_aesni_avx_ocb_dec: + /* input: + * %rdi: round key array, CTX + * %rsi: dst (8 blocks) + * %rdx: src (8 blocks) + * %rcx: offset + * %r8 : checksum + * %r9 : L pointers (void *L[8]) + */ + CFI_STARTPROC(); + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + movdqu (%rcx), RTMP0; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i) */ + +#define OCB_INPUT(n, lreg, xreg) \ + vmovdqu (n * 16)(%rdx), xreg; \ + vpxor (lreg), RTMP0, RTMP0; \ + vpxor RTMP0, xreg, xreg; \ + vmovdqu RTMP0, (n * 16)(%rsi); + movq (0 * 8)(%r9), %r10; + movq (1 * 8)(%r9), %r11; + movq (2 * 8)(%r9), %r12; + movq (3 * 8)(%r9), %r13; + OCB_INPUT(0, %r10, RA0); + OCB_INPUT(1, %r11, RA1); + OCB_INPUT(2, %r12, RA2); + OCB_INPUT(3, %r13, RA3); + movq (4 * 8)(%r9), %r10; + movq (5 * 8)(%r9), %r11; + movq (6 * 8)(%r9), %r12; + movq (7 * 8)(%r9), %r13; + OCB_INPUT(4, %r10, RB0); + OCB_INPUT(5, %r11, RB1); + OCB_INPUT(6, %r12, RB2); + OCB_INPUT(7, %r13, RB3); +#undef OCB_INPUT + + vmovdqu RTMP0, (%rcx); + + movq (0 * 8)(%rsp), %r10; + CFI_RESTORE(%r10); + movq (1 * 8)(%rsp), %r11; + CFI_RESTORE(%r11); + movq (2 * 8)(%rsp), %r12; + CFI_RESTORE(%r12); + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r13); + + call __sm4_crypt_blk8; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vmovdqu (%r8), RTMP0; + + vpxor (0 * 16)(%rsi), RA0, RA0; + vpxor (1 * 16)(%rsi), RA1, RA1; + vpxor (2 * 16)(%rsi), RA2, RA2; + vpxor (3 * 16)(%rsi), RA3, RA3; + vpxor (4 * 16)(%rsi), RB0, RB0; + vpxor (5 * 16)(%rsi), RB1, RB1; + vpxor (6 * 16)(%rsi), RB2, RB2; + vpxor (7 * 16)(%rsi), RB3, RB3; + + /* Checksum_i = Checksum_{i-1} xor P_i */ + + vmovdqu RA0, (0 * 16)(%rsi); + vpxor RA0, RTMP0, RTMP0; + vmovdqu RA1, (1 * 16)(%rsi); + vpxor RA1, RTMP0, RTMP0; + vmovdqu RA2, (2 * 16)(%rsi); + vpxor RA2, RTMP0, RTMP0; + vmovdqu RA3, (3 * 16)(%rsi); + vpxor RA3, RTMP0, RTMP0; + vmovdqu RB0, (4 * 16)(%rsi); + vpxor RB0, RTMP0, RTMP0; + vmovdqu RB1, (5 * 16)(%rsi); + vpxor RB1, RTMP0, RTMP0; + vmovdqu RB2, (6 * 16)(%rsi); + vpxor RB2, RTMP0, RTMP0; + vmovdqu RB3, (7 * 16)(%rsi); + vpxor RB3, RTMP0, RTMP0; + + vmovdqu RTMP0, (%r8); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx_ocb_dec,.-_gcry_sm4_aesni_avx_ocb_dec;) + +.align 8 +.globl _gcry_sm4_aesni_avx_ocb_auth +ELF(.type _gcry_sm4_aesni_avx_ocb_auth, at function;) + +_gcry_sm4_aesni_avx_ocb_auth: + /* input: + * %rdi: round key array, CTX + * %rsi: abuf (8 blocks) + * %rdx: offset + * %rcx: checksum + * %r8 : L pointers (void *L[8]) + */ + CFI_STARTPROC(); + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + 
CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rdx), RTMP0; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i) */ + +#define OCB_INPUT(n, lreg, xreg) \ + vmovdqu (n * 16)(%rsi), xreg; \ + vpxor (lreg), RTMP0, RTMP0; \ + vpxor RTMP0, xreg, xreg; + movq (0 * 8)(%r8), %r10; + movq (1 * 8)(%r8), %r11; + movq (2 * 8)(%r8), %r12; + movq (3 * 8)(%r8), %r13; + OCB_INPUT(0, %r10, RA0); + OCB_INPUT(1, %r11, RA1); + OCB_INPUT(2, %r12, RA2); + OCB_INPUT(3, %r13, RA3); + movq (4 * 8)(%r8), %r10; + movq (5 * 8)(%r8), %r11; + movq (6 * 8)(%r8), %r12; + movq (7 * 8)(%r8), %r13; + OCB_INPUT(4, %r10, RB0); + OCB_INPUT(5, %r11, RB1); + OCB_INPUT(6, %r12, RB2); + OCB_INPUT(7, %r13, RB3); +#undef OCB_INPUT + + vmovdqu RTMP0, (%rdx); + + movq (0 * 8)(%rsp), %r10; + CFI_RESTORE(%r10); + movq (1 * 8)(%rsp), %r11; + CFI_RESTORE(%r11); + movq (2 * 8)(%rsp), %r12; + CFI_RESTORE(%r12); + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r13); + + call __sm4_crypt_blk8; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vmovdqu (%rcx), RTMP0; + vpxor RB0, RA0, RA0; + vpxor RB1, RA1, RA1; + vpxor RB2, RA2, RA2; + vpxor RB3, RA3, RA3; + + vpxor RTMP0, RA3, RA3; + vpxor RA2, RA0, RA0; + vpxor RA3, RA1, RA1; + + vpxor RA1, RA0, RA0; + vmovdqu RA0, (%rcx); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx_ocb_auth,.-_gcry_sm4_aesni_avx_ocb_auth;) + +#endif /*defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT)*/ +#endif /*__x86_64*/ diff --git a/cipher/sm4.c b/cipher/sm4.c index 621532fa..da75cf87 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -38,12 +38,35 @@ # define ATTR_ALIGNED_64 #endif +/* USE_AESNI_AVX inidicates whether to compile with Intel AES-NI/AVX code. */ +#undef USE_AESNI_AVX +#if defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT) +# if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_AESNI_AVX 1 +# endif +#endif + +/* Assembly implementations use SystemV ABI, ABI conversion and additional + * stack to store XMM6-XMM15 needed on Win64. 
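+ * Declaring the extern assembly routines with __attribute__((sysv_abi)) below lets GCC emit that conversion at each call site.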
*/ +#undef ASM_FUNC_ABI +#if defined(USE_AESNI_AVX) +# ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS +# define ASM_FUNC_ABI __attribute__((sysv_abi)) +# else +# define ASM_FUNC_ABI +# endif +#endif + static const char *sm4_selftest (void); typedef struct { u32 rkey_enc[32]; u32 rkey_dec[32]; +#ifdef USE_AESNI_AVX + unsigned int use_aesni_avx:1; +#endif } SM4_context; static const u32 fk[4] = @@ -110,6 +133,45 @@ static const u32 ck[] = 0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279 }; +#ifdef USE_AESNI_AVX +extern void _gcry_sm4_aesni_avx_expand_key(const byte *key, u32 *rk_enc, + u32 *rk_dec, const u32 *fk, + const u32 *ck) ASM_FUNC_ABI; + +extern unsigned int +_gcry_sm4_aesni_avx_crypt_blk1_8(const u32 *rk, byte *out, const byte *in, + unsigned int num_blks) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, byte *ctr) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, byte *iv) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, byte *iv) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx_ocb_enc(const u32 *rk_enc, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[8]) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx_ocb_dec(const u32 *rk_dec, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[8]) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx_ocb_auth(const u32 *rk_enc, + const unsigned char *abuf, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[8]) ASM_FUNC_ABI; +#endif /* USE_AESNI_AVX */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -178,6 +240,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key) u32 rk[4]; int i; +#ifdef USE_AESNI_AVX + if (ctx->use_aesni_avx) + { + _gcry_sm4_aesni_avx_expand_key (key, ctx->rkey_enc, ctx->rkey_dec, + fk, ck); + return; + } +#endif + rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0]; rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1]; rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2]; @@ -209,8 +280,10 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, SM4_context *ctx = context; static int init = 0; static const char *selftest_failed = NULL; + unsigned int hwf = _gcry_get_hw_features (); (void)hd; + (void)hwf; if (!init) { @@ -225,6 +298,10 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, if (keylen != 16) return GPG_ERR_INV_KEYLEN; +#ifdef USE_AESNI_AVX + ctx->use_aesni_avx = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX); +#endif + sm4_expand_key (ctx, key); return 0; } @@ -367,6 +444,21 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, const byte *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX + if (ctx->use_aesni_avx) + { + /* Process data in 8 block chunks. */ + while (nblocks >= 8) + { + _gcry_sm4_aesni_avx_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr); + + nblocks -= 8; + outbuf += 8 * 16; + inbuf += 8 * 16; + } + } +#endif + /* Process remaining blocks. 
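     Left-over blocks (fewer than eight) go through the crypt_blk1_8 helper chosen below: the AES-NI/AVX routine when enabled, otherwise the generic C implementation.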
*/ if (nblocks) { @@ -377,6 +469,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, if (0) ; +#ifdef USE_AESNI_AVX + else if (ctx->use_aesni_avx) + { + crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8; + } +#endif else { prefetch_sbox_table (); @@ -432,6 +530,21 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX + if (ctx->use_aesni_avx) + { + /* Process data in 8 block chunks. */ + while (nblocks >= 8) + { + _gcry_sm4_aesni_avx_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv); + + nblocks -= 8; + outbuf += 8 * 16; + inbuf += 8 * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -442,6 +555,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, if (0) ; +#ifdef USE_AESNI_AVX + else if (ctx->use_aesni_avx) + { + crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8; + } +#endif else { prefetch_sbox_table (); @@ -490,6 +609,21 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX + if (ctx->use_aesni_avx) + { + /* Process data in 8 block chunks. */ + while (nblocks >= 8) + { + _gcry_sm4_aesni_avx_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv); + + nblocks -= 8; + outbuf += 8 * 16; + inbuf += 8 * 16; + } + } +#endif + /* Process remaining blocks. */ if (nblocks) { @@ -500,6 +634,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, if (0) ; +#ifdef USE_AESNI_AVX + else if (ctx->use_aesni_avx) + { + crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8; + } +#endif else { prefetch_sbox_table (); @@ -551,6 +691,48 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, u64 blkn = c->u_mode.ocb.data_nblocks; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX + if (ctx->use_aesni_avx) + { + u64 Ls[8]; + unsigned int n = 8 - (blkn % 8); + u64 *l; + + if (nblocks >= 8) + { + /* Use u64 to store pointers for x32 support (assembly function + * assumes 64-bit pointers). */ + Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; + Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(7 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + l = &Ls[(7 + n) % 8]; + + /* Process data in 8 block chunks. 
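+           * Ls[] above lists the OCB offset masks L_{ntz(i)} for the eight block positions; only the slot for the block whose global index is a multiple of eight varies between chunks, so it is refreshed through ocb_get_l() inside the loop.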
*/ + while (nblocks >= 8) + { + blkn += 8; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 8); + + if (encrypt) + _gcry_sm4_aesni_avx_ocb_enc(ctx->rkey_enc, outbuf, inbuf, + c->u_iv.iv, c->u_ctr.ctr, Ls); + else + _gcry_sm4_aesni_avx_ocb_dec(ctx->rkey_dec, outbuf, inbuf, + c->u_iv.iv, c->u_ctr.ctr, Ls); + + nblocks -= 8; + outbuf += 8 * 16; + inbuf += 8 * 16; + } + } + } +#endif + if (nblocks) { unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, @@ -561,6 +743,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, if (0) ; +#ifdef USE_AESNI_AVX + else if (ctx->use_aesni_avx) + { + crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8; + } +#endif else { prefetch_sbox_table (); @@ -625,6 +813,44 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) const unsigned char *abuf = abuf_arg; u64 blkn = c->u_mode.ocb.aad_nblocks; +#ifdef USE_AESNI_AVX + if (ctx->use_aesni_avx) + { + u64 Ls[8]; + unsigned int n = 8 - (blkn % 8); + u64 *l; + + if (nblocks >= 8) + { + /* Use u64 to store pointers for x32 support (assembly function + * assumes 64-bit pointers). */ + Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; + Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(7 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + l = &Ls[(7 + n) % 8]; + + /* Process data in 8 block chunks. */ + while (nblocks >= 8) + { + blkn += 8; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 8); + + _gcry_sm4_aesni_avx_ocb_auth(ctx->rkey_enc, abuf, + c->u_mode.ocb.aad_offset, + c->u_mode.ocb.aad_sum, Ls); + + nblocks -= 8; + abuf += 8 * 16; + } + } + } +#endif + if (nblocks) { unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in, @@ -634,6 +860,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) if (0) ; +#ifdef USE_AESNI_AVX + else if (ctx->use_aesni_avx) + { + crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8; + } +#endif else { prefetch_sbox_table (); diff --git a/configure.ac b/configure.ac index f77476e0..2458acfc 100644 --- a/configure.ac +++ b/configure.ac @@ -2564,6 +2564,13 @@ LIST_MEMBER(sm4, $enabled_ciphers) if test "$found" = "1" ; then GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo" AC_DEFINE(USE_SM4, 1, [Defined if this module should be included]) + + case "${host}" in + x86_64-*-*) + # Build with the assembly implementation + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4-aesni-avx-amd64.lo" + ;; + esac fi LIST_MEMBER(dsa, $enabled_pubkey_ciphers) -- 2.25.1 From jussi.kivilinna at iki.fi Tue Jun 16 21:28:25 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 16 Jun 2020 22:28:25 +0300 Subject: [PATCH 3/3] Add SM4 x86-64/AES-NI/AVX2 implementation In-Reply-To: <20200616192825.1584395-1-jussi.kivilinna@iki.fi> References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> <20200616192825.1584395-1-jussi.kivilinna@iki.fi> Message-ID: <20200616192825.1584395-4-jussi.kivilinna@iki.fi> * cipher/Makefile.am: Add 'sm4-aesni-avx2-amd64.S'. * cipher/sm4-aesni-avx2-amd64.S: New. * cipher/sm4.c (USE_AESNI_AVX2): New. (SM4_context) [USE_AESNI_AVX2]: Add 'use_aesni_avx2'. 
[USE_AESNI_AVX2] (_gcry_sm4_aesni_avx2_ctr_enc) (_gcry_sm4_aesni_avx2_cbc_dec, _gcry_sm4_aesni_avx2_cfb_dec) (_gcry_sm4_aesni_avx2_ocb_enc, _gcry_sm4_aesni_avx2_ocb_dec) (_gcry_sm4_aesni_avx_ocb_auth): New. (sm4_setkey): Enable AES-NI/AVX2 if supported by HW. (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec) (_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX2]: Add AES-NI/AVX2 bulk functions. * configure.ac: Add ''sm4-aesni-avx2-amd64.lo'. -- This patch adds x86-64/AES-NI/AVX2 bulk encryption/decryption. Bulk functions process 16 blocks in parallel. Benchmark on AMD Ryzen 7 3700X: Before: SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 8.98 ns/B 106.2 MiB/s 38.62 c/B 4300 CBC dec | 1.55 ns/B 613.7 MiB/s 6.64 c/B 4275 CFB enc | 8.96 ns/B 106.4 MiB/s 38.52 c/B 4300 CFB dec | 1.54 ns/B 617.4 MiB/s 6.60 c/B 4275 CTR enc | 1.57 ns/B 607.8 MiB/s 6.75 c/B 4300 CTR dec | 1.57 ns/B 608.9 MiB/s 6.74 c/B 4300 OCB enc | 1.58 ns/B 603.8 MiB/s 6.75 c/B 4275 OCB dec | 1.57 ns/B 605.7 MiB/s 6.73 c/B 4275 OCB auth | 1.53 ns/B 624.5 MiB/s 6.57 c/B 4300 After (~56% faster than AES-NI/AVX impl.): SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC enc | 8.93 ns/B 106.8 MiB/s 38.61 c/B 4326 CBC dec | 0.984 ns/B 969.5 MiB/s 4.23 c/B 4300 CFB enc | 8.93 ns/B 106.8 MiB/s 38.62 c/B 4325 CFB dec | 0.983 ns/B 970.3 MiB/s 4.23 c/B 4300 CTR enc | 0.998 ns/B 955.1 MiB/s 4.29 c/B 4300 CTR dec | 0.996 ns/B 957.4 MiB/s 4.28 c/B 4300 OCB enc | 1.00 ns/B 951.8 MiB/s 4.31 c/B 4300 OCB dec | 1.00 ns/B 951.8 MiB/s 4.31 c/B 4300 OCB auth | 0.993 ns/B 960.2 MiB/s 4.28 c/B 4304?2 Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 2 +- cipher/sm4-aesni-avx2-amd64.S | 851 ++++++++++++++++++++++++++++++++++ cipher/sm4.c | 186 +++++++- configure.ac | 1 + 4 files changed, 1038 insertions(+), 2 deletions(-) create mode 100644 cipher/sm4-aesni-avx2-amd64.S diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 427922c6..4798d456 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -107,7 +107,7 @@ EXTRA_libcipher_la_SOURCES = \ scrypt.c \ seed.c \ serpent.c serpent-sse2-amd64.S \ - sm4.c sm4-aesni-avx-amd64.S \ + sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \ serpent-avx2-amd64.S serpent-armv7-neon.S \ sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \ sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \ diff --git a/cipher/sm4-aesni-avx2-amd64.S b/cipher/sm4-aesni-avx2-amd64.S new file mode 100644 index 00000000..6e46c0dc --- /dev/null +++ b/cipher/sm4-aesni-avx2-amd64.S @@ -0,0 +1,851 @@ +/* sm4-avx2-amd64.S - AVX2 implementation of SM4 cipher + * + * Copyright (C) 2020 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, see . + */ + +/* Based on SM4 AES-NI work by Markku-Juhani O. 
Saarinen at: + * https://github.com/mjosaarinen/sm4ni + */ + +#include + +#ifdef __x86_64 +#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \ + defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) + +#include "asm-common-amd64.h" + +/* vector registers */ +#define RX0 %ymm0 +#define RX1 %ymm1 +#define MASK_4BIT %ymm2 +#define RTMP0 %ymm3 +#define RTMP1 %ymm4 +#define RTMP2 %ymm5 +#define RTMP3 %ymm6 +#define RTMP4 %ymm7 + +#define RA0 %ymm8 +#define RA1 %ymm9 +#define RA2 %ymm10 +#define RA3 %ymm11 + +#define RB0 %ymm12 +#define RB1 %ymm13 +#define RB2 %ymm14 +#define RB3 %ymm15 + +#define RNOT %ymm0 +#define RBSWAP %ymm1 + +#define RX0x %xmm0 +#define RX1x %xmm1 +#define MASK_4BITx %xmm2 + +#define RNOTx %xmm0 +#define RBSWAPx %xmm1 + +#define RTMP0x %xmm3 +#define RTMP1x %xmm4 +#define RTMP2x %xmm5 +#define RTMP3x %xmm6 +#define RTMP4x %xmm7 + +/********************************************************************** + helper macros + **********************************************************************/ + +/* Transpose four 32-bit words between 128-bit vector lanes. */ +#define transpose_4x4(x0, x1, x2, x3, t1, t2) \ + vpunpckhdq x1, x0, t2; \ + vpunpckldq x1, x0, x0; \ + \ + vpunpckldq x3, x2, t1; \ + vpunpckhdq x3, x2, x2; \ + \ + vpunpckhqdq t1, x0, x1; \ + vpunpcklqdq t1, x0, x0; \ + \ + vpunpckhqdq x2, t2, x3; \ + vpunpcklqdq x2, t2, x2; + +/* post-SubByte transform. */ +#define transform_pre(x, lo_t, hi_t, mask4bit, tmp0) \ + vpand x, mask4bit, tmp0; \ + vpandn x, mask4bit, x; \ + vpsrld $4, x, x; \ + \ + vpshufb tmp0, lo_t, tmp0; \ + vpshufb x, hi_t, x; \ + vpxor tmp0, x, x; + +/* post-SubByte transform. Note: x has been XOR'ed with mask4bit by + * 'vaeslastenc' instruction. */ +#define transform_post(x, lo_t, hi_t, mask4bit, tmp0) \ + vpandn mask4bit, x, tmp0; \ + vpsrld $4, x, x; \ + vpand x, mask4bit, x; \ + \ + vpshufb tmp0, lo_t, tmp0; \ + vpshufb x, hi_t, x; \ + vpxor tmp0, x, x; + +/********************************************************************** + 16-way SM4 with AES-NI and AVX + **********************************************************************/ + +.text +.align 16 + +/* + * Following four affine transform look-up tables are from work by + * Markku-Juhani O. Saarinen, at https://github.com/mjosaarinen/sm4ni + * + * These allow exposing SM4 S-Box from AES SubByte. + */ + +/* pre-SubByte affine transform, from SM4 field to AES field. */ +.Lpre_tf_lo_s: + .quad 0x9197E2E474720701, 0xC7C1B4B222245157 +.Lpre_tf_hi_s: + .quad 0xE240AB09EB49A200, 0xF052B91BF95BB012 + +/* post-SubByte affine transform, from AES field to SM4 field. 
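+ * In the 256-bit code these 16-byte tables are replicated into both 128-bit lanes with vbroadcasti128 before the vpshufb nibble lookups.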
*/ +.Lpost_tf_lo_s: + .quad 0x5B67F2CEA19D0834, 0xEDD14478172BBE82 +.Lpost_tf_hi_s: + .quad 0xAE7201DD73AFDC00, 0x11CDBE62CC1063BF + +/* For isolating SubBytes from AESENCLAST, inverse shift row */ +.Linv_shift_row: + .byte 0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b + .byte 0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03 + +/* Inverse shift row + Rotate left by 8 bits on 32-bit words with vpshufb */ +.Linv_shift_row_rol_8: + .byte 0x07, 0x00, 0x0d, 0x0a, 0x0b, 0x04, 0x01, 0x0e + .byte 0x0f, 0x08, 0x05, 0x02, 0x03, 0x0c, 0x09, 0x06 + +/* Inverse shift row + Rotate left by 16 bits on 32-bit words with vpshufb */ +.Linv_shift_row_rol_16: + .byte 0x0a, 0x07, 0x00, 0x0d, 0x0e, 0x0b, 0x04, 0x01 + .byte 0x02, 0x0f, 0x08, 0x05, 0x06, 0x03, 0x0c, 0x09 + +/* Inverse shift row + Rotate left by 24 bits on 32-bit words with vpshufb */ +.Linv_shift_row_rol_24: + .byte 0x0d, 0x0a, 0x07, 0x00, 0x01, 0x0e, 0x0b, 0x04 + .byte 0x05, 0x02, 0x0f, 0x08, 0x09, 0x06, 0x03, 0x0c + +/* For CTR-mode IV byteswap */ +.Lbswap128_mask: + .byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 + +/* For input word byte-swap */ +.Lbswap32_mask: + .byte 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 + +.align 4 +/* 4-bit mask */ +.L0f0f0f0f: + .long 0x0f0f0f0f + +.align 8 +ELF(.type __sm4_crypt_blk16, at function;) +__sm4_crypt_blk16: + /* input: + * %rdi: ctx, CTX + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: sixteen parallel + * plaintext blocks + * output: + * RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: sixteen parallel + * ciphertext blocks + */ + CFI_STARTPROC(); + + vbroadcasti128 .Lbswap32_mask rRIP, RTMP2; + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + vpshufb RTMP2, RB0, RB0; + vpshufb RTMP2, RB1, RB1; + vpshufb RTMP2, RB2, RB2; + vpshufb RTMP2, RB3, RB3; + + vpbroadcastd .L0f0f0f0f rRIP, MASK_4BIT; + transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1); + transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1); + +#define ROUND(round, s0, s1, s2, s3, r0, r1, r2, r3) \ + vpbroadcastd (4*(round))(%rdi), RX0; \ + vbroadcasti128 .Lpre_tf_lo_s rRIP, RTMP4; \ + vbroadcasti128 .Lpre_tf_hi_s rRIP, RTMP1; \ + vmovdqa RX0, RX1; \ + vpxor s1, RX0, RX0; \ + vpxor s2, RX0, RX0; \ + vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \ + vbroadcasti128 .Lpost_tf_lo_s rRIP, RTMP2; \ + vbroadcasti128 .Lpost_tf_hi_s rRIP, RTMP3; \ + vpxor r1, RX1, RX1; \ + vpxor r2, RX1, RX1; \ + vpxor r3, RX1, RX1; /* r1 ^ r2 ^ r3 ^ rk */ \ + \ + /* sbox, non-linear part */ \ + transform_pre(RX0, RTMP4, RTMP1, MASK_4BIT, RTMP0); \ + transform_pre(RX1, RTMP4, RTMP1, MASK_4BIT, RTMP0); \ + vextracti128 $1, RX0, RTMP4x; \ + vextracti128 $1, RX1, RTMP0x; \ + vaesenclast MASK_4BITx, RX0x, RX0x; \ + vaesenclast MASK_4BITx, RTMP4x, RTMP4x; \ + vaesenclast MASK_4BITx, RX1x, RX1x; \ + vaesenclast MASK_4BITx, RTMP0x, RTMP0x; \ + vinserti128 $1, RTMP4x, RX0, RX0; \ + vbroadcasti128 .Linv_shift_row rRIP, RTMP4; \ + vinserti128 $1, RTMP0x, RX1, RX1; \ + transform_post(RX0, RTMP2, RTMP3, MASK_4BIT, RTMP0); \ + transform_post(RX1, RTMP2, RTMP3, MASK_4BIT, RTMP0); \ + \ + /* linear part */ \ + vpshufb RTMP4, RX0, RTMP0; \ + vpxor RTMP0, s0, s0; /* s0 ^ x */ \ + vpshufb RTMP4, RX1, RTMP2; \ + vbroadcasti128 .Linv_shift_row_rol_8 rRIP, RTMP4; \ + vpxor RTMP2, r0, r0; /* r0 ^ x */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vbroadcasti128 .Linv_shift_row_rol_16 rRIP, RTMP4; \ + vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) */ \ + vpshufb RTMP4, RX0, 
RTMP1; \ + vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vbroadcasti128 .Linv_shift_row_rol_24 rRIP, RTMP4; \ + vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) ^ rol(x,16) */ \ + vpshufb RTMP4, RX0, RTMP1; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP0, RTMP1; \ + vpsrld $30, RTMP0, RTMP0; \ + vpxor RTMP0, s0, s0; \ + vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ \ + vpshufb RTMP4, RX1, RTMP3; \ + vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,24) */ \ + vpslld $2, RTMP2, RTMP3; \ + vpsrld $30, RTMP2, RTMP2; \ + vpxor RTMP2, r0, r0; \ + vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ + + leaq (32*4)(%rdi), %rax; +.align 16 +.Lroundloop_blk8: + ROUND(0, RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3); + ROUND(1, RA1, RA2, RA3, RA0, RB1, RB2, RB3, RB0); + ROUND(2, RA2, RA3, RA0, RA1, RB2, RB3, RB0, RB1); + ROUND(3, RA3, RA0, RA1, RA2, RB3, RB0, RB1, RB2); + leaq (4*4)(%rdi), %rdi; + cmpq %rax, %rdi; + jne .Lroundloop_blk8; + +#undef ROUND + + vbroadcasti128 .Lbswap128_mask rRIP, RTMP2; + + transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1); + transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1); + vpshufb RTMP2, RA0, RA0; + vpshufb RTMP2, RA1, RA1; + vpshufb RTMP2, RA2, RA2; + vpshufb RTMP2, RA3, RA3; + vpshufb RTMP2, RB0, RB0; + vpshufb RTMP2, RB1, RB1; + vpshufb RTMP2, RB2, RB2; + vpshufb RTMP2, RB3, RB3; + + ret; + CFI_ENDPROC(); +ELF(.size __sm4_crypt_blk16,.-__sm4_crypt_blk16;) + +#define inc_le128(x, minus_one, tmp) \ + vpcmpeqq minus_one, x, tmp; \ + vpsubq minus_one, x, x; \ + vpslldq $8, tmp, tmp; \ + vpsubq tmp, x, x; + +.align 8 +.globl _gcry_sm4_aesni_avx2_ctr_enc +ELF(.type _gcry_sm4_aesni_avx2_ctr_enc, at function;) +_gcry_sm4_aesni_avx2_ctr_enc: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: iv (big endian, 128bit) + */ + CFI_STARTPROC(); + + movq 8(%rcx), %rax; + bswapq %rax; + + vzeroupper; + + vbroadcasti128 .Lbswap128_mask rRIP, RTMP3; + vpcmpeqd RNOT, RNOT, RNOT; + vpsrldq $8, RNOT, RNOT; /* ab: -1:0 ; cd: -1:0 */ + vpaddq RNOT, RNOT, RTMP2; /* ab: -2:0 ; cd: -2:0 */ + + /* load IV and byteswap */ + vmovdqu (%rcx), RTMP4x; + vpshufb RTMP3x, RTMP4x, RTMP4x; + vmovdqa RTMP4x, RTMP0x; + inc_le128(RTMP4x, RNOTx, RTMP1x); + vinserti128 $1, RTMP4x, RTMP0, RTMP0; + vpshufb RTMP3, RTMP0, RA0; /* +1 ; +0 */ + + /* check need for handling 64-bit overflow and carry */ + cmpq $(0xffffffffffffffff - 16), %rax; + ja .Lhandle_ctr_carry; + + /* construct IVs */ + vpsubq RTMP2, RTMP0, RTMP0; /* +3 ; +2 */ + vpshufb RTMP3, RTMP0, RA1; + vpsubq RTMP2, RTMP0, RTMP0; /* +5 ; +4 */ + vpshufb RTMP3, RTMP0, RA2; + vpsubq RTMP2, RTMP0, RTMP0; /* +7 ; +6 */ + vpshufb RTMP3, RTMP0, RA3; + vpsubq RTMP2, RTMP0, RTMP0; /* +9 ; +8 */ + vpshufb RTMP3, RTMP0, RB0; + vpsubq RTMP2, RTMP0, RTMP0; /* +11 ; +10 */ + vpshufb RTMP3, RTMP0, RB1; + vpsubq RTMP2, RTMP0, RTMP0; /* +13 ; +12 */ + vpshufb RTMP3, RTMP0, RB2; + vpsubq RTMP2, RTMP0, RTMP0; /* +15 ; +14 */ + vpshufb RTMP3, RTMP0, RB3; + vpsubq RTMP2, RTMP0, RTMP0; /* +16 */ + vpshufb RTMP3x, RTMP0x, RTMP0x; + + jmp .Lctr_carry_done; + +.Lhandle_ctr_carry: + /* construct IVs */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RA1; /* +3 ; +2 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RA2; /* +5 ; +4 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RA3; /* +7 ; +6 */ + 
inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB0; /* +9 ; +8 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB1; /* +11 ; +10 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB2; /* +13 ; +12 */ + inc_le128(RTMP0, RNOT, RTMP1); + inc_le128(RTMP0, RNOT, RTMP1); + vpshufb RTMP3, RTMP0, RB3; /* +15 ; +14 */ + inc_le128(RTMP0, RNOT, RTMP1); + vextracti128 $1, RTMP0, RTMP0x; + vpshufb RTMP3x, RTMP0x, RTMP0x; /* +16 */ + +.align 4 +.Lctr_carry_done: + /* store new IV */ + vmovdqu RTMP0x, (%rcx); + + call __sm4_crypt_blk16; + + vpxor (0 * 32)(%rdx), RA0, RA0; + vpxor (1 * 32)(%rdx), RA1, RA1; + vpxor (2 * 32)(%rdx), RA2, RA2; + vpxor (3 * 32)(%rdx), RA3, RA3; + vpxor (4 * 32)(%rdx), RB0, RB0; + vpxor (5 * 32)(%rdx), RB1, RB1; + vpxor (6 * 32)(%rdx), RB2, RB2; + vpxor (7 * 32)(%rdx), RB3, RB3; + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx2_ctr_enc,.-_gcry_sm4_aesni_avx2_ctr_enc;) + +.align 8 +.globl _gcry_sm4_aesni_avx2_cbc_dec +ELF(.type _gcry_sm4_aesni_avx2_cbc_dec, at function;) +_gcry_sm4_aesni_avx2_cbc_dec: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: iv + */ + CFI_STARTPROC(); + + vzeroupper; + + vmovdqu (0 * 32)(%rdx), RA0; + vmovdqu (1 * 32)(%rdx), RA1; + vmovdqu (2 * 32)(%rdx), RA2; + vmovdqu (3 * 32)(%rdx), RA3; + vmovdqu (4 * 32)(%rdx), RB0; + vmovdqu (5 * 32)(%rdx), RB1; + vmovdqu (6 * 32)(%rdx), RB2; + vmovdqu (7 * 32)(%rdx), RB3; + + call __sm4_crypt_blk16; + + vmovdqu (%rcx), RNOTx; + vinserti128 $1, (%rdx), RNOT, RNOT; + vpxor RNOT, RA0, RA0; + vpxor (0 * 32 + 16)(%rdx), RA1, RA1; + vpxor (1 * 32 + 16)(%rdx), RA2, RA2; + vpxor (2 * 32 + 16)(%rdx), RA3, RA3; + vpxor (3 * 32 + 16)(%rdx), RB0, RB0; + vpxor (4 * 32 + 16)(%rdx), RB1, RB1; + vpxor (5 * 32 + 16)(%rdx), RB2, RB2; + vpxor (6 * 32 + 16)(%rdx), RB3, RB3; + vmovdqu (7 * 32 + 16)(%rdx), RNOTx; + vmovdqu RNOTx, (%rcx); /* store new IV */ + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx2_cbc_dec,.-_gcry_sm4_aesni_avx2_cbc_dec;) + +.align 8 +.globl _gcry_sm4_aesni_avx2_cfb_dec +ELF(.type _gcry_sm4_aesni_avx2_cfb_dec, at function;) +_gcry_sm4_aesni_avx2_cfb_dec: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: iv + */ + CFI_STARTPROC(); + + vzeroupper; + + /* Load input */ + vmovdqu (%rcx), RNOTx; + vinserti128 $1, (%rdx), RNOT, RA0; + vmovdqu (0 * 32 + 16)(%rdx), RA1; + vmovdqu (1 * 32 + 16)(%rdx), RA2; + vmovdqu (2 * 32 + 16)(%rdx), RA3; + vmovdqu (3 * 32 + 16)(%rdx), RB0; + vmovdqu (4 * 32 + 16)(%rdx), RB1; + vmovdqu (5 * 32 + 16)(%rdx), RB2; + vmovdqu (6 * 32 + 16)(%rdx), RB3; + + /* Update IV */ + vmovdqu (7 * 32 + 16)(%rdx), RNOTx; + vmovdqu RNOTx, (%rcx); + + call __sm4_crypt_blk16; + + vpxor (0 * 32)(%rdx), RA0, RA0; + vpxor (1 * 32)(%rdx), RA1, RA1; + vpxor (2 * 32)(%rdx), RA2, RA2; + vpxor (3 * 32)(%rdx), RA3, RA3; + vpxor (4 * 32)(%rdx), RB0, RB0; + vpxor (5 * 
32)(%rdx), RB1, RB1; + vpxor (6 * 32)(%rdx), RB2, RB2; + vpxor (7 * 32)(%rdx), RB3, RB3; + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx2_cfb_dec,.-_gcry_sm4_aesni_avx2_cfb_dec;) + +.align 8 +.globl _gcry_sm4_aesni_avx2_ocb_enc +ELF(.type _gcry_sm4_aesni_avx2_ocb_enc, at function;) + +_gcry_sm4_aesni_avx2_ocb_enc: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: offset + * %r8 : checksum + * %r9 : L pointers (void *L[16]) + */ + CFI_STARTPROC(); + + vzeroupper; + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rcx), RTMP0x; + vmovdqu (%r8), RTMP1x; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Checksum_i = Checksum_{i-1} xor P_i */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + +#define OCB_INPUT(n, l0reg, l1reg, yreg) \ + vmovdqu (n * 32)(%rdx), yreg; \ + vpxor (l0reg), RTMP0x, RNOTx; \ + vpxor (l1reg), RNOTx, RTMP0x; \ + vinserti128 $1, RTMP0x, RNOT, RNOT; \ + vpxor yreg, RTMP1, RTMP1; \ + vpxor yreg, RNOT, yreg; \ + vmovdqu RNOT, (n * 32)(%rsi); + + movq (0 * 8)(%r9), %r10; + movq (1 * 8)(%r9), %r11; + movq (2 * 8)(%r9), %r12; + movq (3 * 8)(%r9), %r13; + OCB_INPUT(0, %r10, %r11, RA0); + OCB_INPUT(1, %r12, %r13, RA1); + movq (4 * 8)(%r9), %r10; + movq (5 * 8)(%r9), %r11; + movq (6 * 8)(%r9), %r12; + movq (7 * 8)(%r9), %r13; + OCB_INPUT(2, %r10, %r11, RA2); + OCB_INPUT(3, %r12, %r13, RA3); + movq (8 * 8)(%r9), %r10; + movq (9 * 8)(%r9), %r11; + movq (10 * 8)(%r9), %r12; + movq (11 * 8)(%r9), %r13; + OCB_INPUT(4, %r10, %r11, RB0); + OCB_INPUT(5, %r12, %r13, RB1); + movq (12 * 8)(%r9), %r10; + movq (13 * 8)(%r9), %r11; + movq (14 * 8)(%r9), %r12; + movq (15 * 8)(%r9), %r13; + OCB_INPUT(6, %r10, %r11, RB2); + OCB_INPUT(7, %r12, %r13, RB3); +#undef OCB_INPUT + + vextracti128 $1, RTMP1, RNOTx; + vmovdqu RTMP0x, (%rcx); + vpxor RNOTx, RTMP1x, RTMP1x; + vmovdqu RTMP1x, (%r8); + + movq (0 * 8)(%rsp), %r10; + movq (1 * 8)(%rsp), %r11; + movq (2 * 8)(%rsp), %r12; + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r10); + CFI_RESTORE(%r11); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + + call __sm4_crypt_blk16; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vpxor (0 * 32)(%rsi), RA0, RA0; + vpxor (1 * 32)(%rsi), RA1, RA1; + vpxor (2 * 32)(%rsi), RA2, RA2; + vpxor (3 * 32)(%rsi), RA3, RA3; + vpxor (4 * 32)(%rsi), RB0, RB0; + vpxor (5 * 32)(%rsi), RB1, RB1; + vpxor (6 * 32)(%rsi), RB2, RB2; + vpxor (7 * 32)(%rsi), RB3, RB3; + + vmovdqu RA0, (0 * 32)(%rsi); + vmovdqu RA1, (1 * 32)(%rsi); + vmovdqu RA2, (2 * 32)(%rsi); + vmovdqu RA3, (3 * 32)(%rsi); + vmovdqu RB0, (4 * 32)(%rsi); + vmovdqu RB1, (5 * 32)(%rsi); + vmovdqu RB2, (6 * 32)(%rsi); + vmovdqu RB3, (7 * 32)(%rsi); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx2_ocb_enc,.-_gcry_sm4_aesni_avx2_ocb_enc;) + +.align 8 +.globl _gcry_sm4_aesni_avx2_ocb_dec +ELF(.type _gcry_sm4_aesni_avx2_ocb_dec, at function;) + +_gcry_sm4_aesni_avx2_ocb_dec: + /* input: + * %rdi: ctx, CTX + * %rsi: dst (16 blocks) + * %rdx: src (16 blocks) + * %rcx: 
offset + * %r8 : checksum + * %r9 : L pointers (void *L[16]) + */ + CFI_STARTPROC(); + + vzeroupper; + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rcx), RTMP0x; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i) */ + +#define OCB_INPUT(n, l0reg, l1reg, yreg) \ + vmovdqu (n * 32)(%rdx), yreg; \ + vpxor (l0reg), RTMP0x, RNOTx; \ + vpxor (l1reg), RNOTx, RTMP0x; \ + vinserti128 $1, RTMP0x, RNOT, RNOT; \ + vpxor yreg, RNOT, yreg; \ + vmovdqu RNOT, (n * 32)(%rsi); + + movq (0 * 8)(%r9), %r10; + movq (1 * 8)(%r9), %r11; + movq (2 * 8)(%r9), %r12; + movq (3 * 8)(%r9), %r13; + OCB_INPUT(0, %r10, %r11, RA0); + OCB_INPUT(1, %r12, %r13, RA1); + movq (4 * 8)(%r9), %r10; + movq (5 * 8)(%r9), %r11; + movq (6 * 8)(%r9), %r12; + movq (7 * 8)(%r9), %r13; + OCB_INPUT(2, %r10, %r11, RA2); + OCB_INPUT(3, %r12, %r13, RA3); + movq (8 * 8)(%r9), %r10; + movq (9 * 8)(%r9), %r11; + movq (10 * 8)(%r9), %r12; + movq (11 * 8)(%r9), %r13; + OCB_INPUT(4, %r10, %r11, RB0); + OCB_INPUT(5, %r12, %r13, RB1); + movq (12 * 8)(%r9), %r10; + movq (13 * 8)(%r9), %r11; + movq (14 * 8)(%r9), %r12; + movq (15 * 8)(%r9), %r13; + OCB_INPUT(6, %r10, %r11, RB2); + OCB_INPUT(7, %r12, %r13, RB3); +#undef OCB_INPUT + + vmovdqu RTMP0x, (%rcx); + + movq (0 * 8)(%rsp), %r10; + movq (1 * 8)(%rsp), %r11; + movq (2 * 8)(%rsp), %r12; + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r10); + CFI_RESTORE(%r11); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + + call __sm4_crypt_blk16; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vmovdqu (%r8), RTMP1x; + + vpxor (0 * 32)(%rsi), RA0, RA0; + vpxor (1 * 32)(%rsi), RA1, RA1; + vpxor (2 * 32)(%rsi), RA2, RA2; + vpxor (3 * 32)(%rsi), RA3, RA3; + vpxor (4 * 32)(%rsi), RB0, RB0; + vpxor (5 * 32)(%rsi), RB1, RB1; + vpxor (6 * 32)(%rsi), RB2, RB2; + vpxor (7 * 32)(%rsi), RB3, RB3; + + /* Checksum_i = Checksum_{i-1} xor P_i */ + + vmovdqu RA0, (0 * 32)(%rsi); + vpxor RA0, RTMP1, RTMP1; + vmovdqu RA1, (1 * 32)(%rsi); + vpxor RA1, RTMP1, RTMP1; + vmovdqu RA2, (2 * 32)(%rsi); + vpxor RA2, RTMP1, RTMP1; + vmovdqu RA3, (3 * 32)(%rsi); + vpxor RA3, RTMP1, RTMP1; + vmovdqu RB0, (4 * 32)(%rsi); + vpxor RB0, RTMP1, RTMP1; + vmovdqu RB1, (5 * 32)(%rsi); + vpxor RB1, RTMP1, RTMP1; + vmovdqu RB2, (6 * 32)(%rsi); + vpxor RB2, RTMP1, RTMP1; + vmovdqu RB3, (7 * 32)(%rsi); + vpxor RB3, RTMP1, RTMP1; + + vextracti128 $1, RTMP1, RNOTx; + vpxor RNOTx, RTMP1x, RTMP1x; + vmovdqu RTMP1x, (%r8); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx2_ocb_dec,.-_gcry_sm4_aesni_avx2_ocb_dec;) + +.align 8 +.globl _gcry_sm4_aesni_avx2_ocb_auth +ELF(.type _gcry_sm4_aesni_avx2_ocb_auth, at function;) + +_gcry_sm4_aesni_avx2_ocb_auth: + /* input: + * %rdi: ctx, CTX + * %rsi: abuf (16 blocks) + * %rdx: offset + * %rcx: checksum + * %r8 : L pointers (void *L[16]) + */ + CFI_STARTPROC(); + + vzeroupper; + + subq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(4 * 8); + + movq %r10, (0 * 8)(%rsp); + movq %r11, (1 * 8)(%rsp); + movq %r12, (2 * 8)(%rsp); + movq %r13, (3 * 8)(%rsp); + CFI_REL_OFFSET(%r10, 0 * 8); + CFI_REL_OFFSET(%r11, 1 * 8); + CFI_REL_OFFSET(%r12, 2 * 8); + CFI_REL_OFFSET(%r13, 3 * 8); + + vmovdqu (%rdx), RTMP0x; + + /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */ + /* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i 
xor Offset_i) */ + +#define OCB_INPUT(n, l0reg, l1reg, yreg) \ + vmovdqu (n * 32)(%rsi), yreg; \ + vpxor (l0reg), RTMP0x, RNOTx; \ + vpxor (l1reg), RNOTx, RTMP0x; \ + vinserti128 $1, RTMP0x, RNOT, RNOT; \ + vpxor yreg, RNOT, yreg; + + movq (0 * 8)(%r8), %r10; + movq (1 * 8)(%r8), %r11; + movq (2 * 8)(%r8), %r12; + movq (3 * 8)(%r8), %r13; + OCB_INPUT(0, %r10, %r11, RA0); + OCB_INPUT(1, %r12, %r13, RA1); + movq (4 * 8)(%r8), %r10; + movq (5 * 8)(%r8), %r11; + movq (6 * 8)(%r8), %r12; + movq (7 * 8)(%r8), %r13; + OCB_INPUT(2, %r10, %r11, RA2); + OCB_INPUT(3, %r12, %r13, RA3); + movq (8 * 8)(%r8), %r10; + movq (9 * 8)(%r8), %r11; + movq (10 * 8)(%r8), %r12; + movq (11 * 8)(%r8), %r13; + OCB_INPUT(4, %r10, %r11, RB0); + OCB_INPUT(5, %r12, %r13, RB1); + movq (12 * 8)(%r8), %r10; + movq (13 * 8)(%r8), %r11; + movq (14 * 8)(%r8), %r12; + movq (15 * 8)(%r8), %r13; + OCB_INPUT(6, %r10, %r11, RB2); + OCB_INPUT(7, %r12, %r13, RB3); +#undef OCB_INPUT + + vmovdqu RTMP0x, (%rdx); + + movq (0 * 8)(%rsp), %r10; + movq (1 * 8)(%rsp), %r11; + movq (2 * 8)(%rsp), %r12; + movq (3 * 8)(%rsp), %r13; + CFI_RESTORE(%r10); + CFI_RESTORE(%r11); + CFI_RESTORE(%r12); + CFI_RESTORE(%r13); + + call __sm4_crypt_blk16; + + addq $(4 * 8), %rsp; + CFI_ADJUST_CFA_OFFSET(-4 * 8); + + vpxor RA0, RB0, RA0; + vpxor RA1, RB1, RA1; + vpxor RA2, RB2, RA2; + vpxor RA3, RB3, RA3; + + vpxor RA1, RA0, RA0; + vpxor RA3, RA2, RA2; + + vpxor RA2, RA0, RTMP1; + + vextracti128 $1, RTMP1, RNOTx; + vpxor (%rcx), RTMP1x, RTMP1x; + vpxor RNOTx, RTMP1x, RTMP1x; + vmovdqu RTMP1x, (%rcx); + + vzeroall; + + ret; + CFI_ENDPROC(); +ELF(.size _gcry_sm4_aesni_avx2_ocb_auth,.-_gcry_sm4_aesni_avx2_ocb_auth;) + +#endif /*defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT)*/ +#endif /*__x86_64*/ diff --git a/cipher/sm4.c b/cipher/sm4.c index da75cf87..0da095a5 100644 --- a/cipher/sm4.c +++ b/cipher/sm4.c @@ -47,10 +47,19 @@ # endif #endif +/* USE_AESNI_AVX inidicates whether to compile with Intel AES-NI/AVX2 code. */ +#undef USE_AESNI_AVX2 +#if defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT) +# if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \ + defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) +# define USE_AESNI_AVX2 1 +# endif +#endif + /* Assembly implementations use SystemV ABI, ABI conversion and additional * stack to store XMM6-XMM15 needed on Win64. 
*/ #undef ASM_FUNC_ABI -#if defined(USE_AESNI_AVX) +#if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2) # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS # define ASM_FUNC_ABI __attribute__((sysv_abi)) # else @@ -67,6 +76,9 @@ typedef struct #ifdef USE_AESNI_AVX unsigned int use_aesni_avx:1; #endif +#ifdef USE_AESNI_AVX2 + unsigned int use_aesni_avx2:1; +#endif } SM4_context; static const u32 fk[4] = @@ -172,6 +184,40 @@ extern void _gcry_sm4_aesni_avx_ocb_auth(const u32 *rk_enc, const u64 Ls[8]) ASM_FUNC_ABI; #endif /* USE_AESNI_AVX */ +#ifdef USE_AESNI_AVX2 +extern void _gcry_sm4_aesni_avx2_ctr_enc(const u32 *rk_enc, byte *out, + const byte *in, + byte *ctr) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx2_cbc_dec(const u32 *rk_dec, byte *out, + const byte *in, + byte *iv) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx2_cfb_dec(const u32 *rk_enc, byte *out, + const byte *in, + byte *iv) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx2_ocb_enc(const u32 *rk_enc, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[16]) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx2_ocb_dec(const u32 *rk_dec, + unsigned char *out, + const unsigned char *in, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[16]) ASM_FUNC_ABI; + +extern void _gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc, + const unsigned char *abuf, + unsigned char *offset, + unsigned char *checksum, + const u64 Ls[16]) ASM_FUNC_ABI; +#endif /* USE_AESNI_AVX2 */ + static inline void prefetch_sbox_table(void) { const volatile byte *vtab = (void *)&sbox_table; @@ -301,6 +347,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen, #ifdef USE_AESNI_AVX ctx->use_aesni_avx = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX); #endif +#ifdef USE_AESNI_AVX2 + ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2); +#endif sm4_expand_key (ctx, key); return 0; @@ -444,6 +493,21 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr, const byte *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2) + { + /* Process data in 16 block chunks. */ + while (nblocks >= 16) + { + _gcry_sm4_aesni_avx2_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } +#endif + #ifdef USE_AESNI_AVX if (ctx->use_aesni_avx) { @@ -530,6 +594,21 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2) + { + /* Process data in 16 block chunks. */ + while (nblocks >= 16) + { + _gcry_sm4_aesni_avx2_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } +#endif + #ifdef USE_AESNI_AVX if (ctx->use_aesni_avx) { @@ -609,6 +688,21 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv, const unsigned char *inbuf = inbuf_arg; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2) + { + /* Process data in 16 block chunks. 
*/ + while (nblocks >= 16) + { + _gcry_sm4_aesni_avx2_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } +#endif + #ifdef USE_AESNI_AVX if (ctx->use_aesni_avx) { @@ -691,6 +785,53 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg, u64 blkn = c->u_mode.ocb.data_nblocks; int burn_stack_depth = 0; +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2) + { + u64 Ls[16]; + unsigned int n = 16 - (blkn % 16); + u64 *l; + int i; + + if (nblocks >= 16) + { + for (i = 0; i < 16; i += 8) + { + /* Use u64 to store pointers for x32 support (assembly function + * assumes 64-bit pointers). */ + Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; + Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + } + + Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + l = &Ls[(15 + n) % 16]; + + /* Process data in 16 block chunks. */ + while (nblocks >= 16) + { + blkn += 16; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 16); + + if (encrypt) + _gcry_sm4_aesni_avx2_ocb_enc(ctx->rkey_enc, outbuf, inbuf, + c->u_iv.iv, c->u_ctr.ctr, Ls); + else + _gcry_sm4_aesni_avx2_ocb_dec(ctx->rkey_dec, outbuf, inbuf, + c->u_iv.iv, c->u_ctr.ctr, Ls); + + nblocks -= 16; + outbuf += 16 * 16; + inbuf += 16 * 16; + } + } + } +#endif + #ifdef USE_AESNI_AVX if (ctx->use_aesni_avx) { @@ -813,6 +954,49 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks) const unsigned char *abuf = abuf_arg; u64 blkn = c->u_mode.ocb.aad_nblocks; +#ifdef USE_AESNI_AVX2 + if (ctx->use_aesni_avx2) + { + u64 Ls[16]; + unsigned int n = 16 - (blkn % 16); + u64 *l; + int i; + + if (nblocks >= 16) + { + for (i = 0; i < 16; i += 8) + { + /* Use u64 to store pointers for x32 support (assembly function + * assumes 64-bit pointers). */ + Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2]; + Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1]; + Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0]; + } + + Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3]; + l = &Ls[(15 + n) % 16]; + + /* Process data in 16 block chunks. 
*/ + while (nblocks >= 16) + { + blkn += 16; + *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 16); + + _gcry_sm4_aesni_avx2_ocb_auth(ctx->rkey_enc, abuf, + c->u_mode.ocb.aad_offset, + c->u_mode.ocb.aad_sum, Ls); + + nblocks -= 16; + abuf += 16 * 16; + } + } + } +#endif + #ifdef USE_AESNI_AVX if (ctx->use_aesni_avx) { diff --git a/configure.ac b/configure.ac index 2458acfc..1f03e79f 100644 --- a/configure.ac +++ b/configure.ac @@ -2569,6 +2569,7 @@ if test "$found" = "1" ; then x86_64-*-*) # Build with the assembly implementation GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4-aesni-avx-amd64.lo" + GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4-aesni-avx2-amd64.lo" ;; esac fi -- 2.25.1 From jussi.kivilinna at iki.fi Tue Jun 16 21:35:49 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Tue, 16 Jun 2020 22:35:49 +0300 Subject: [PATCH v2 0/2] Add SM4 symmetric cipher algorithm In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com> Message-ID: Hello, On 16.6.2020 12.09, Tianjia Zhang via Gcrypt-devel wrote: > SM4 (GBT.32907-2016) is a cryptographic standard issued by the > Organization of State Commercial Administration of China (OSCCA) > as an authorized cryptographic algorithms for the use within China. > > SMS4 was originally created for use in protecting wireless > networks, and is mandated in the Chinese National Standard for > Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure) > (GB.15629.11-2003). > > Tianjia Zhang (2): > Add SM4 symmetric cipher algorithm > tests: Add basic test-vectors for SM4 > Thanks, pushed to master with small fixes. -Jussi From jussi.kivilinna at iki.fi Sat Jun 20 14:08:28 2020 From: jussi.kivilinna at iki.fi (Jussi Kivilinna) Date: Sat, 20 Jun 2020 15:08:28 +0300 Subject: [PATCH] Camellia AES-NI/AVX/AVX2 size optimization Message-ID: <20200620120828.2892006-1-jussi.kivilinna@iki.fi> * cipher/camellia-aesni-avx-amd64.S: Use loop for handling repeating '(enc|dec)_rounds16/fls16' portions of encryption/decryption. * cipher/camellia-aesni-avx2-amd64.S: Use loop for handling repeating '(enc|dec)_rounds32/fls32' portions of encryption/decryption. -- Use rounds+fls loop to reduce binary size of Camellia AES-NI/AVX/AVX2 implementations. This also gives small performance boost on AMD Zen2. 
Before: text data bss dec hex filename 63877 0 0 63877 f985 cipher/.libs/camellia-aesni-avx2-amd64.o 59623 0 0 59623 e8e7 cipher/.libs/camellia-aesni-avx-amd64.o After: text data bss dec hex filename 22999 0 0 22999 59d7 cipher/.libs/camellia-aesni-avx2-amd64.o 25047 0 0 25047 61d7 cipher/.libs/camellia-aesni-avx-amd64.o Benchmark on AMD Ryzen 7 3700X: Before: Cipher: CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.670 ns/B 1424 MiB/s 2.88 c/B 4300 CFB dec | 0.667 ns/B 1430 MiB/s 2.87 c/B 4300 CTR enc | 0.677 ns/B 1410 MiB/s 2.91 c/B 4300 CTR dec | 0.676 ns/B 1412 MiB/s 2.90 c/B 4300 OCB enc | 0.696 ns/B 1370 MiB/s 2.98 c/B 4275 OCB dec | 0.698 ns/B 1367 MiB/s 2.98 c/B 4275 OCB auth | 0.683 ns/B 1395 MiB/s 2.94 c/B 4300 After (~8% faster): CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz CBC dec | 0.611 ns/B 1561 MiB/s 2.64 c/B 4313 CFB dec | 0.616 ns/B 1549 MiB/s 2.65 c/B 4312 CTR enc | 0.625 ns/B 1525 MiB/s 2.69 c/B 4300 CTR dec | 0.625 ns/B 1526 MiB/s 2.69 c/B 4299 OCB enc | 0.639 ns/B 1493 MiB/s 2.75 c/B 4307 OCB dec | 0.642 ns/B 1485 MiB/s 2.76 c/B 4301 OCB auth | 0.631 ns/B 1512 MiB/s 2.71 c/B 4300 Signed-off-by: Jussi Kivilinna --- cipher/camellia-aesni-avx-amd64.S | 136 +++++++++++------------------ cipher/camellia-aesni-avx2-amd64.S | 135 +++++++++++----------------- 2 files changed, 106 insertions(+), 165 deletions(-) diff --git a/cipher/camellia-aesni-avx-amd64.S b/cipher/camellia-aesni-avx-amd64.S index 4671bcfe..64cabaa5 100644 --- a/cipher/camellia-aesni-avx-amd64.S +++ b/cipher/camellia-aesni-avx-amd64.S @@ -1,6 +1,6 @@ /* camellia-avx-aesni-amd64.S - AES-NI/AVX implementation of Camellia cipher * - * Copyright (C) 2013-2015 Jussi Kivilinna + * Copyright (C) 2013-2015,2020 Jussi Kivilinna * * This file is part of Libgcrypt. 
* @@ -35,7 +35,6 @@ /* register macros */ #define CTX %rdi -#define RIO %r8 /********************************************************************** helper macros @@ -772,6 +771,7 @@ __camellia_enc_blk16: /* input: * %rdi: ctx, CTX * %rax: temporary storage, 256 bytes + * %r8d: 24 for 16 byte key, 32 for larger * %xmm0..%xmm15: 16 plaintext blocks * output: * %xmm0..%xmm15: 16 encrypted blocks, order swapped: @@ -781,42 +781,32 @@ __camellia_enc_blk16: leaq 8 * 16(%rax), %rcx; + leaq (-8 * 8)(CTX, %r8, 8), %r8; + inpack16_post(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %rax, %rcx); +.align 8 +.Lenc_loop: enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %rax, %rcx, 0); - fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, - ((key_table + (8) * 8) + 0)(CTX), - ((key_table + (8) * 8) + 4)(CTX), - ((key_table + (8) * 8) + 8)(CTX), - ((key_table + (8) * 8) + 12)(CTX)); - - enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, %rax, %rcx, 8); + cmpq %r8, CTX; + je .Lenc_done; + leaq (8 * 8)(CTX), CTX; fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, - ((key_table + (16) * 8) + 0)(CTX), - ((key_table + (16) * 8) + 4)(CTX), - ((key_table + (16) * 8) + 8)(CTX), - ((key_table + (16) * 8) + 12)(CTX)); - - enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, %rax, %rcx, 16); - - movl $24, %r8d; - cmpl $128, key_bitlength(CTX); - jne .Lenc_max32; + ((key_table) + 0)(CTX), + ((key_table) + 4)(CTX), + ((key_table) + 8)(CTX), + ((key_table) + 12)(CTX)); + jmp .Lenc_loop; +.align 8 .Lenc_done: /* load CD for output */ vmovdqu 0 * 16(%rcx), %xmm8; @@ -830,27 +820,9 @@ __camellia_enc_blk16: outunpack16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, (key_table)(CTX, %r8, 8), (%rax), 1 * 16(%rax)); + %xmm15, ((key_table) + 8 * 8)(%r8), (%rax), 1 * 16(%rax)); ret; - -.align 8 -.Lenc_max32: - movl $32, %r8d; - - fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, - ((key_table + (24) * 8) + 0)(CTX), - ((key_table + (24) * 8) + 4)(CTX), - ((key_table + (24) * 8) + 8)(CTX), - ((key_table + (24) * 8) + 12)(CTX)); - - enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, %rax, %rcx, 24); - - jmp .Lenc_done; CFI_ENDPROC(); ELF(.size __camellia_enc_blk16,.-__camellia_enc_blk16;) @@ -869,44 +841,38 @@ __camellia_dec_blk16: */ CFI_STARTPROC(); + movq %r8, %rcx; + movq CTX, %r8 + leaq (-8 * 8)(CTX, %rcx, 8), CTX; + leaq 8 * 16(%rax), %rcx; inpack16_post(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %rax, %rcx); - cmpl $32, %r8d; - je .Ldec_max32; - -.Ldec_max24: +.align 8 +.Ldec_loop: dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, %rax, %rcx, 16); - - fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %rcx, %xmm8, %xmm9, 
%xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, - ((key_table + (16) * 8) + 8)(CTX), - ((key_table + (16) * 8) + 12)(CTX), - ((key_table + (16) * 8) + 0)(CTX), - ((key_table + (16) * 8) + 4)(CTX)); + %xmm15, %rax, %rcx, 0); - dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, %rax, %rcx, 8); + cmpq %r8, CTX; + je .Ldec_done; fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, - ((key_table + (8) * 8) + 8)(CTX), - ((key_table + (8) * 8) + 12)(CTX), - ((key_table + (8) * 8) + 0)(CTX), - ((key_table + (8) * 8) + 4)(CTX)); + ((key_table) + 8)(CTX), + ((key_table) + 12)(CTX), + ((key_table) + 0)(CTX), + ((key_table) + 4)(CTX)); - dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, %rax, %rcx, 0); + leaq (-8 * 8)(CTX), CTX; + jmp .Ldec_loop; +.align 8 +.Ldec_done: /* load CD for output */ vmovdqu 0 * 16(%rcx), %xmm8; vmovdqu 1 * 16(%rcx), %xmm9; @@ -922,22 +888,6 @@ __camellia_dec_blk16: %xmm15, (key_table)(CTX), (%rax), 1 * 16(%rax)); ret; - -.align 8 -.Ldec_max32: - dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, %rax, %rcx, 24); - - fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, - %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, - %xmm15, - ((key_table + (24) * 8) + 8)(CTX), - ((key_table + (24) * 8) + 12)(CTX), - ((key_table + (24) * 8) + 0)(CTX), - ((key_table + (24) * 8) + 4)(CTX)); - - jmp .Ldec_max24; CFI_ENDPROC(); ELF(.size __camellia_dec_blk16,.-__camellia_dec_blk16;) @@ -967,6 +917,11 @@ _gcry_camellia_aesni_avx_ctr_enc: vzeroupper; + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + subq $(16 * 16), %rsp; andq $~31, %rsp; movq %rsp, %rax; @@ -1163,6 +1118,11 @@ _gcry_camellia_aesni_avx_cfb_dec: vzeroupper; + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + subq $(16 * 16), %rsp; andq $~31, %rsp; movq %rsp, %rax; @@ -1307,6 +1267,11 @@ _gcry_camellia_aesni_avx_ocb_enc: vmovdqu %xmm14, (%rcx); vmovdqu %xmm15, (%r8); + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %r10d; + cmovel %r10d, %r8d; /* max */ + /* inpack16_pre: */ vmovq (key_table)(CTX), %xmm15; vpshufb .Lpack_bswap rRIP, %xmm15, %xmm15; @@ -1617,6 +1582,11 @@ _gcry_camellia_aesni_avx_ocb_auth: OCB_INPUT(15, %r13, %xmm0); #undef OCB_INPUT + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %r10d; + cmovel %r10d, %r8d; /* max */ + vmovdqu %xmm15, (%rdx); movq %rcx, %r10; diff --git a/cipher/camellia-aesni-avx2-amd64.S b/cipher/camellia-aesni-avx2-amd64.S index 517e6880..f620f040 100644 --- a/cipher/camellia-aesni-avx2-amd64.S +++ b/cipher/camellia-aesni-avx2-amd64.S @@ -1,6 +1,6 @@ /* camellia-avx2-aesni-amd64.S - AES-NI/AVX2 implementation of Camellia cipher * - * Copyright (C) 2013-2015 Jussi Kivilinna + * Copyright (C) 2013-2015,2020 Jussi Kivilinna * * This file is part of Libgcrypt. 
* @@ -751,6 +751,7 @@ __camellia_enc_blk32: /* input: * %rdi: ctx, CTX * %rax: temporary storage, 512 bytes + * %r8d: 24 for 16 byte key, 32 for larger * %ymm0..%ymm15: 32 plaintext blocks * output: * %ymm0..%ymm15: 32 encrypted blocks, order swapped: @@ -760,42 +761,32 @@ __camellia_enc_blk32: leaq 8 * 32(%rax), %rcx; + leaq (-8 * 8)(CTX, %r8, 8), %r8; + inpack32_post(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, %ymm15, %rax, %rcx); +.align 8 +.Lenc_loop: enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, %ymm15, %rax, %rcx, 0); - fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, - ((key_table + (8) * 8) + 0)(CTX), - ((key_table + (8) * 8) + 4)(CTX), - ((key_table + (8) * 8) + 8)(CTX), - ((key_table + (8) * 8) + 12)(CTX)); - - enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, %rax, %rcx, 8); + cmpq %r8, CTX; + je .Lenc_done; + leaq (8 * 8)(CTX), CTX; fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, %ymm15, - ((key_table + (16) * 8) + 0)(CTX), - ((key_table + (16) * 8) + 4)(CTX), - ((key_table + (16) * 8) + 8)(CTX), - ((key_table + (16) * 8) + 12)(CTX)); - - enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, %rax, %rcx, 16); - - movl $24, %r8d; - cmpl $128, key_bitlength(CTX); - jne .Lenc_max32; + ((key_table) + 0)(CTX), + ((key_table) + 4)(CTX), + ((key_table) + 8)(CTX), + ((key_table) + 12)(CTX)); + jmp .Lenc_loop; +.align 8 .Lenc_done: /* load CD for output */ vmovdqu 0 * 32(%rcx), %ymm8; @@ -809,27 +800,9 @@ __camellia_enc_blk32: outunpack32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, (key_table)(CTX, %r8, 8), (%rax), 1 * 32(%rax)); + %ymm15, ((key_table) + 8 * 8)(%r8), (%rax), 1 * 32(%rax)); ret; - -.align 8 -.Lenc_max32: - movl $32, %r8d; - - fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, - ((key_table + (24) * 8) + 0)(CTX), - ((key_table + (24) * 8) + 4)(CTX), - ((key_table + (24) * 8) + 8)(CTX), - ((key_table + (24) * 8) + 12)(CTX)); - - enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, %rax, %rcx, 24); - - jmp .Lenc_done; CFI_ENDPROC(); ELF(.size __camellia_enc_blk32,.-__camellia_enc_blk32;) @@ -848,44 +821,38 @@ __camellia_dec_blk32: */ CFI_STARTPROC(); + movq %r8, %rcx; + movq CTX, %r8 + leaq (-8 * 8)(CTX, %rcx, 8), CTX; + leaq 8 * 32(%rax), %rcx; inpack32_post(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, %ymm15, %rax, %rcx); - cmpl $32, %r8d; - je .Ldec_max32; - -.Ldec_max24: +.align 8 +.Ldec_loop: dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, %rax, %rcx, 16); - - fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, - ((key_table + (16) * 8) + 8)(CTX), - ((key_table + (16) * 8) + 12)(CTX), - ((key_table + (16) * 8) + 0)(CTX), - 
((key_table + (16) * 8) + 4)(CTX)); + %ymm15, %rax, %rcx, 0); - dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, %rax, %rcx, 8); + cmpq %r8, CTX; + je .Ldec_done; fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, %ymm15, - ((key_table + (8) * 8) + 8)(CTX), - ((key_table + (8) * 8) + 12)(CTX), - ((key_table + (8) * 8) + 0)(CTX), - ((key_table + (8) * 8) + 4)(CTX)); + ((key_table) + 8)(CTX), + ((key_table) + 12)(CTX), + ((key_table) + 0)(CTX), + ((key_table) + 4)(CTX)); - dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, %rax, %rcx, 0); + leaq (-8 * 8)(CTX), CTX; + jmp .Ldec_loop; +.align 8 +.Ldec_done: /* load CD for output */ vmovdqu 0 * 32(%rcx), %ymm8; vmovdqu 1 * 32(%rcx), %ymm9; @@ -901,22 +868,6 @@ __camellia_dec_blk32: %ymm15, (key_table)(CTX), (%rax), 1 * 32(%rax)); ret; - -.align 8 -.Ldec_max32: - dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, %rax, %rcx, 24); - - fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7, - %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14, - %ymm15, - ((key_table + (24) * 8) + 8)(CTX), - ((key_table + (24) * 8) + 12)(CTX), - ((key_table + (24) * 8) + 0)(CTX), - ((key_table + (24) * 8) + 4)(CTX)); - - jmp .Ldec_max24; CFI_ENDPROC(); ELF(.size __camellia_dec_blk32,.-__camellia_dec_blk32;) @@ -949,6 +900,11 @@ _gcry_camellia_aesni_avx2_ctr_enc: vzeroupper; + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + subq $(16 * 32), %rsp; andq $~63, %rsp; movq %rsp, %rax; @@ -1216,6 +1172,11 @@ _gcry_camellia_aesni_avx2_cfb_dec: vzeroupper; + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %eax; + cmovel %eax, %r8d; /* max */ + subq $(16 * 32), %rsp; andq $~63, %rsp; movq %rsp, %rax; @@ -1384,6 +1345,11 @@ _gcry_camellia_aesni_avx2_ocb_enc: vpxor %xmm13, %xmm15, %xmm15; vmovdqu %xmm15, (%r8); + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %r10d; + cmovel %r10d, %r8d; /* max */ + /* inpack16_pre: */ vpbroadcastq (key_table)(CTX), %ymm15; vpshufb .Lpack_bswap rRIP, %ymm15, %ymm15; @@ -1742,6 +1708,11 @@ _gcry_camellia_aesni_avx2_ocb_auth: vmovdqu %xmm14, (%rdx); + cmpl $128, key_bitlength(CTX); + movl $32, %r8d; + movl $24, %r10d; + cmovel %r10d, %r8d; /* max */ + movq %rcx, %r10; /* inpack16_pre: */ -- 2.25.1 From stanermetin at gmail.com Sun Jun 21 23:59:29 2020 From: stanermetin at gmail.com (Taner) Date: Sun, 21 Jun 2020 23:59:29 +0200 Subject: Leave Message-ID: Hello, I would like to get out from mailing group. Thank you, -------------- next part -------------- An HTML attachment was scrubbed... URL: From mandar.apte409 at gmail.com Mon Jun 22 14:44:31 2020 From: mandar.apte409 at gmail.com (mandar.apte409 at gmail.com) Date: Mon, 22 Jun 2020 05:44:31 -0700 (MST) Subject: Generate ECDH shared key - NIST P256 Message-ID: <518964648.31228.1592829871223.JavaMail.administrator@n7.nabble.com> Hi all, I am trying to generate ECC shared secret key using Libgcrypt 1.8.5. Based on documentation of Libgcrypt, I used gcry_pk_genkey() to generate public-private key pair on server and client. The S-Expression I used is "(genkey(ecdh(curve NIST-P256)(use-fips186)))" to generate public-private key pair based on ECC NIST P256 curve. 
Now I need to generate the shared secret key (ECDH agreement) from the local private key and the remote public key. I see that there is no single function like OpenSSL's ECDH_compute_key() for deriving the shared secret. After browsing many websites, white papers, etc., I gathered that gcry_pk_encrypt() is supposed to be used to generate the shared secret. But when I tried that function, the two sides ended up with different shared secrets: on the client side I use the client's private key and the server's public key, and on the server side I use the server's private key and the client's public key. Can anyone help me generate a shared secret key with Libgcrypt 1.8.5? Any help is highly appreciated. Thank you in advance. Best Regards, Mandar _____________________________________ Sent from http://gnupg.10057.n7.nabble.com
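One common way to compute an ECDH agreement with libgcrypt is to bypass gcry_pk_encrypt() and use the low-level EC interface instead: each side multiplies the peer's public point Q by its own secret scalar d, and the affine X coordinate of [d_client]Q_server equals that of [d_server]Q_client because both equal [d_client * d_server]G. The sketch below is only an illustration of that approach, not code from this thread; the function name ecdh_shared_x and the variable names are invented for the example, it assumes the key S-expressions returned by gcry_pk_genkey() (and the peer's public-key S-expression) can be passed to gcry_mpi_ec_new() directly, and error handling is abbreviated.

#include <gcrypt.h>

/* Derive the raw ECDH shared value as the affine X coordinate of
 * [d_own]Q_peer.  'own_key' is our key pair S-expression, 'peer_pub'
 * is the peer's public-key S-expression.  Returns NULL on failure. */
static gcry_mpi_t
ecdh_shared_x (gcry_sexp_t own_key, gcry_sexp_t peer_pub)
{
  gcry_ctx_t own_ctx = NULL, peer_ctx = NULL;
  gcry_mpi_t d = NULL, x = NULL, y = NULL;
  gcry_mpi_point_t q = NULL, shared = NULL;

  /* Build EC contexts; the curve is taken from the key S-expressions. */
  if (gcry_mpi_ec_new (&own_ctx, own_key, NULL)
      || gcry_mpi_ec_new (&peer_ctx, peer_pub, NULL))
    goto leave;

  d = gcry_mpi_ec_get_mpi ("d", own_ctx, 1);     /* own secret scalar   */
  q = gcry_mpi_ec_get_point ("q", peer_ctx, 1);  /* peer public point Q */
  if (!d || !q)
    goto leave;

  shared = gcry_mpi_point_new (0);
  gcry_mpi_ec_mul (shared, d, q, own_ctx);       /* shared = [d]Q       */

  x = gcry_mpi_new (0);
  y = gcry_mpi_new (0);
  if (gcry_mpi_ec_get_affine (x, y, shared, own_ctx))
    {
      /* Point at infinity: invalid peer key, reject. */
      gcry_mpi_release (x);
      x = NULL;
    }

 leave:
  gcry_mpi_release (y);
  gcry_mpi_release (d);
  gcry_mpi_point_release (q);
  gcry_mpi_point_release (shared);
  gcry_ctx_release (own_ctx);
  gcry_ctx_release (peer_ctx);
  return x;
}

Both sides call this with their own key pair and the other side's public key and should obtain the same X value. The raw X coordinate is normally not used as a symmetric key directly; it would be fed through a KDF (for example a hash) first.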