From gniibe at fsij.org  Mon Jun  1 09:39:30 2020
From: gniibe at fsij.org (NIIBE Yutaka)
Date: Mon, 01 Jun 2020 16:39:30 +0900
Subject: gcry_mpi_invm succeeds if the inverse does not exist
In-Reply-To: <6f38bc98-85fe-12b1-9cce-8e96e699e378@iki.fi>
References: <CAO5O-ELb_27vpXnAXeeJZxnovftEqe_aFe0JsbUs7KMhtEuD=Q@mail.gmail.com>
 <87zhbeh921.fsf@iwagami.gniibe.org>
 <CAO5O-EJipQu2VBbtTiO_bt=7mJ35aT_AFd-OJz4F9swR7SfWbg@mail.gmail.com>
 <51fc5b09-7b2f-ce2e-bb8c-f653ba907446@iki.fi>
 <CAO5O-EJwyViAdfUY-hnZ0hgBYWkLR7DK9M2-RGvbyT-t-BEsPg@mail.gmail.com>
 <87tv0k9vwj.fsf@jumper.gniibe.org>
 <6f38bc98-85fe-12b1-9cce-8e96e699e378@iki.fi>
Message-ID: <87d06jmdzh.fsf@iwagami.gniibe.org>

Jussi Kivilinna <jussi.kivilinna at iki.fi> wrote:
> Cryptofuzz is reporting another heap-buffer-overflow issue in
> _gcry_mpi_invm. I've attached reproducer, original from Guido and
> as patch applied to tests/basic.c.

My fix of 69b55f87053ce2494cd4b38dc600f867bc4355be was not enough.
I just push another change:

	6f8b1d4cb798375e6d830fd6b73c71da93ee5f3f

Thank you for your report.
-- 


From mandar.apte409 at gmail.com  Tue Jun  2 13:27:23 2020
From: mandar.apte409 at gmail.com (Mandar Apte)
Date: Tue, 2 Jun 2020 16:57:23 +0530
Subject: Decrypt using BcryptDecrypt
Message-ID: <CAGHdk0hiZOcSLPi57xzGfyB2B8fdUDxZox2S7p=8uqGN+zXGAw@mail.gmail.com>

Hello team,

           I am trying out Libgcrypt 1.85 APIs for AES 256 encryption in
CBC mode on Fedora computer. I have a windows 10 computer on which I have
installed oracle virtual box and running a Fedora OS machine in it.

Firstly, I tried encryption and decryption on Fedora using Libgcrypt APIs.
It worked so easy and Smooth with no error and data loss.

Since nowadays cross platform capability has become a MUST point in
software world, I am also trying to test encryption and encryption in cross
platform scenario.

 I am trying to encrypt file on fedora using Libgcrypt APIs, and decrypt
that encrypted file on windows.

On windows I am using Bcrypt library which also supports AES 256 in CBC
mode.

The problem I am facing right now is, I am getting an error from
BcryptDecrypt() function on windows when I try to decrypt the file
encrypted on Fedora box.

Though the surprising thing is when I pass the entire encrypted file
content all at once to BcryptDecrypt() it is able to decrypt the data
correctly with no data loss, but it still returns error code "-1073741762
(0xC000003E) which means as" STATUS_DATA_ERROR" in windows.

Hence, I wanted to check, if the Libgcrypt APIs are doing padding
internally since I am not passing any such instruction to the Libgcrypt
library explicitly?

I am kind of stuck in this since 2 weeks now. I tried all possible things,
checked endianess, byte size etc on both Fedora and windows computer.

I need some help here to know internal behaviour if Libgcrypt library.

Please help me.

Thank you in advance.

Best Regards,
Mandar
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20200602/21d9bec9/attachment-0001.html>

From jussi.kivilinna at iki.fi  Wed Jun  3 22:08:37 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Wed,  3 Jun 2020 23:08:37 +0300
Subject: [PATCH 1/2] Disable all assembly modules with --disable-asm
Message-ID: <20200603200838.562876-1-jussi.kivilinna@iki.fi>

* configure.ac (try_asm_modules): Update description,
"MPI" => "MPI and cipher".
(gcry_cv_gcc_arm_platform_as_ok, gcry_cv_gcc_aarch64_platform_as_ok)
(gcry_cv_gcc_inline_asm_ssse3, gcry_cv_gcc_inline_asm_pclmul)
(gcry_cv_gcc_inline_asm_shaext, gcry_cv_gcc_inline_asm_sse41)
(gcry_cv_gcc_inline_asm_avx, gcry_cv_gcc_inline_asm_avx2)
(gcry_cv_gcc_inline_asm_bmi2, gcry_cv_gcc_amd64_platform_as_ok)
(gcry_cv_gcc_platform_as_ok_for_intel_syntax)
(gcry_cv_cc_arm_arch_is_v6, gcry_cv_gcc_inline_asm_neon)
(gcry_cv_gcc_inline_asm_aarch32_crypto)
(gcry_cv_gcc_inline_asm_aarch64_neon)
(gcry_cv_gcc_inline_asm_aarch64_crypto)
(gcry_cv_cc_ppc_altivec, gcry_cv_gcc_inline_asm_ppc_altivec)
(gcry_cv_gcc_inline_asm_ppc_arch_3_00): Check for "try_asm_modules".
* mpi/config.links: Set "mpi_cpu_arch" to "disabled"
with --disable-asm.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 configure.ac     | 86 +++++++++++++++++++++++++++++++-----------------
 mpi/config.links |  1 +
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/configure.ac b/configure.ac
index 3bf0179e..0c9100bf 100644
--- a/configure.ac
+++ b/configure.ac
@@ -535,10 +535,10 @@ AM_CONDITIONAL(USE_RANDOM_DAEMON, test x$use_random_daemon = xyes)
 
 
 # Implementation of --disable-asm.
-AC_MSG_CHECKING([whether MPI assembler modules are requested])
+AC_MSG_CHECKING([whether MPI and cipher assembler modules are requested])
 AC_ARG_ENABLE([asm],
               AC_HELP_STRING([--disable-asm],
-	                     [Disable MPI assembler modules]),
+	                     [Disable MPI and cipher assembler modules]),
               [try_asm_modules=$enableval],
               [try_asm_modules=yes])
 AC_MSG_RESULT($try_asm_modules)
@@ -1140,9 +1140,12 @@ fi
 #
 AC_CACHE_CHECK([whether GCC assembler is compatible for ARM assembly implementations],
        [gcry_cv_gcc_arm_platform_as_ok],
-       [gcry_cv_gcc_arm_platform_as_ok=no
-        AC_COMPILE_IFELSE([AC_LANG_SOURCE(
-          [[__asm__(
+       [if test "$try_asm_modules" != "yes" ; then
+          gcry_cv_gcc_arm_platform_as_ok="n/a"
+        else
+          gcry_cv_gcc_arm_platform_as_ok=no
+          AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+            [[__asm__(
                 /* Test if assembler supports UAL syntax.  */
                 ".syntax unified\n\t"
                 ".arm\n\t" /* our assembly code is in ARM mode  */
@@ -1153,8 +1156,9 @@ AC_CACHE_CHECK([whether GCC assembler is compatible for ARM assembly implementat
                 /* Test if '.type' and '.size' are supported.  */
                 ".size asmfunc,.-asmfunc;\n\t"
                 ".type asmfunc,%function;\n\t"
-            );]])],
-          [gcry_cv_gcc_arm_platform_as_ok=yes])])
+              );]])],
+            [gcry_cv_gcc_arm_platform_as_ok=yes])
+        fi])
 if test "$gcry_cv_gcc_arm_platform_as_ok" = "yes" ; then
    AC_DEFINE(HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS,1,
      [Defined if underlying assembler is compatible with ARM assembly implementations])
@@ -1168,15 +1172,19 @@ fi
 #
 AC_CACHE_CHECK([whether GCC assembler is compatible for ARMv8/Aarch64 assembly implementations],
        [gcry_cv_gcc_aarch64_platform_as_ok],
-       [gcry_cv_gcc_aarch64_platform_as_ok=no
-        AC_COMPILE_IFELSE([AC_LANG_SOURCE(
-          [[__asm__(
+       [if test "$try_asm_modules" != "yes" ; then
+          gcry_cv_gcc_aarch64_platform_as_ok="n/a"
+        else
+          gcry_cv_gcc_aarch64_platform_as_ok=no
+          AC_COMPILE_IFELSE([AC_LANG_SOURCE(
+            [[__asm__(
                 "asmfunc:\n\t"
                 "eor x0, x0, x30, ror #12;\n\t"
                 "add x0, x0, x30, asr #12;\n\t"
                 "eor v0.16b, v0.16b, v31.16b;\n\t"
-            );]])],
-          [gcry_cv_gcc_aarch64_platform_as_ok=yes])])
+              );]])],
+            [gcry_cv_gcc_aarch64_platform_as_ok=yes])
+        fi])
 if test "$gcry_cv_gcc_aarch64_platform_as_ok" = "yes" ; then
    AC_DEFINE(HAVE_COMPATIBLE_GCC_AARCH64_PLATFORM_AS,1,
      [Defined if underlying assembler is compatible with ARMv8/Aarch64 assembly implementations])
@@ -1383,7 +1391,8 @@ CFLAGS=$_gcc_cflags_save;
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports SSSE3 instructions],
        [gcry_cv_gcc_inline_asm_ssse3],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_ssse3="n/a"
         else
           gcry_cv_gcc_inline_asm_ssse3=no
@@ -1406,7 +1415,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports PCLMUL instructions],
        [gcry_cv_gcc_inline_asm_pclmul],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_pclmul="n/a"
         else
           gcry_cv_gcc_inline_asm_pclmul=no
@@ -1427,7 +1437,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports SHA Extensions instructions],
        [gcry_cv_gcc_inline_asm_shaext],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_shaext="n/a"
         else
           gcry_cv_gcc_inline_asm_shaext=no
@@ -1454,7 +1465,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports SSE4.1 instructions],
        [gcry_cv_gcc_inline_asm_sse41],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_sse41="n/a"
         else
           gcry_cv_gcc_inline_asm_sse41=no
@@ -1476,7 +1488,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports AVX instructions],
        [gcry_cv_gcc_inline_asm_avx],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_avx="n/a"
         else
           gcry_cv_gcc_inline_asm_avx=no
@@ -1497,7 +1510,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports AVX2 instructions],
        [gcry_cv_gcc_inline_asm_avx2],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_avx2="n/a"
         else
           gcry_cv_gcc_inline_asm_avx2=no
@@ -1518,7 +1532,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports BMI2 instructions],
        [gcry_cv_gcc_inline_asm_bmi2],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_bmi2="n/a"
         else
           gcry_cv_gcc_inline_asm_bmi2=no
@@ -1579,7 +1594,8 @@ fi
 if test $amd64_as_feature_detection = yes; then
   AC_CACHE_CHECK([whether GCC assembler is compatible for amd64 assembly implementations],
        [gcry_cv_gcc_amd64_platform_as_ok],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_amd64_platform_as_ok="n/a"
         else
           gcry_cv_gcc_amd64_platform_as_ok=no
@@ -1629,7 +1645,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC assembler is compatible for Intel syntax assembly implementations],
        [gcry_cv_gcc_platform_as_ok_for_intel_syntax],
-       [if test "$mpi_cpu_arch" != "x86" ; then
+       [if test "$mpi_cpu_arch" != "x86" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_platform_as_ok_for_intel_syntax="n/a"
         else
           gcry_cv_gcc_platform_as_ok_for_intel_syntax=no
@@ -1666,7 +1683,8 @@ fi
 #
 AC_CACHE_CHECK([whether compiler is configured for ARMv6 or newer architecture],
        [gcry_cv_cc_arm_arch_is_v6],
-       [if test "$mpi_cpu_arch" != "arm" ; then
+       [if test "$mpi_cpu_arch" != "arm" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_cc_arm_arch_is_v6="n/a"
         else
           gcry_cv_cc_arm_arch_is_v6=no
@@ -1699,7 +1717,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports NEON instructions],
        [gcry_cv_gcc_inline_asm_neon],
-       [if test "$mpi_cpu_arch" != "arm" ; then
+       [if test "$mpi_cpu_arch" != "arm" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_neon="n/a"
         else
           gcry_cv_gcc_inline_asm_neon=no
@@ -1727,7 +1746,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports AArch32 Crypto Extension instructions],
        [gcry_cv_gcc_inline_asm_aarch32_crypto],
-       [if test "$mpi_cpu_arch" != "arm" ; then
+       [if test "$mpi_cpu_arch" != "arm" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_aarch32_crypto="n/a"
         else
           gcry_cv_gcc_inline_asm_aarch32_crypto=no
@@ -1771,7 +1791,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 NEON instructions],
        [gcry_cv_gcc_inline_asm_aarch64_neon],
-       [if test "$mpi_cpu_arch" != "aarch64" ; then
+       [if test "$mpi_cpu_arch" != "aarch64" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_aarch64_neon="n/a"
         else
           gcry_cv_gcc_inline_asm_aarch64_neon=no
@@ -1796,7 +1817,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports AArch64 Crypto Extension instructions],
        [gcry_cv_gcc_inline_asm_aarch64_crypto],
-       [if test "$mpi_cpu_arch" != "aarch64" ; then
+       [if test "$mpi_cpu_arch" != "aarch64" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_aarch64_crypto="n/a"
         else
           gcry_cv_gcc_inline_asm_aarch64_crypto=no
@@ -1842,7 +1864,8 @@ fi
 #
 AC_CACHE_CHECK([whether compiler supports PowerPC AltiVec/VSX intrinsics],
       [gcry_cv_cc_ppc_altivec],
-      [if test "$mpi_cpu_arch" != "ppc" ; then
+      [if test "$mpi_cpu_arch" != "ppc" ||
+	  test "$try_asm_modules" != "yes" ; then
 	gcry_cv_cc_ppc_altivec="n/a"
       else
 	gcry_cv_cc_ppc_altivec=no
@@ -1868,7 +1891,8 @@ _gcc_cflags_save=$CFLAGS
 CFLAGS="$CFLAGS -maltivec -mvsx -mcrypto"
 
 if test "$gcry_cv_cc_ppc_altivec" = "no" &&
-    test "$mpi_cpu_arch" = "ppc" ; then
+    test "$mpi_cpu_arch" = "ppc" &&
+    test "$try_asm_modules" == "yes" ; then
   AC_CACHE_CHECK([whether compiler supports PowerPC AltiVec/VSX/crypto intrinsics with extra GCC flags],
     [gcry_cv_cc_ppc_altivec_cflags],
     [gcry_cv_cc_ppc_altivec_cflags=no
@@ -1903,7 +1927,8 @@ CFLAGS=$_gcc_cflags_save;
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports PowerPC AltiVec/VSX/crypto instructions],
        [gcry_cv_gcc_inline_asm_ppc_altivec],
-       [if test "$mpi_cpu_arch" != "ppc" ; then
+       [if test "$mpi_cpu_arch" != "ppc" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_ppc_altivec="n/a"
         else
           gcry_cv_gcc_inline_asm_ppc_altivec=no
@@ -1933,7 +1958,8 @@ fi
 #
 AC_CACHE_CHECK([whether GCC inline assembler supports PowerISA 3.00 instructions],
        [gcry_cv_gcc_inline_asm_ppc_arch_3_00],
-       [if test "$mpi_cpu_arch" != "ppc" ; then
+       [if test "$mpi_cpu_arch" != "ppc" ||
+           test "$try_asm_modules" != "yes" ; then
           gcry_cv_gcc_inline_asm_ppc_arch_3_00="n/a"
         else
           gcry_cv_gcc_inline_asm_ppc_arch_3_00=no
diff --git a/mpi/config.links b/mpi/config.links
index 4f43b732..ce6822db 100644
--- a/mpi/config.links
+++ b/mpi/config.links
@@ -375,6 +375,7 @@ if test "$try_asm_modules" != "yes" ; then
     path=""
     mpi_sflags=""
     mpi_extra_modules=""
+    mpi_cpu_arch="disabled"
 fi
 
 # Make sure that mpi_cpu_arch is not the empty string.
-- 
2.25.1


From jussi.kivilinna at iki.fi  Wed Jun  3 22:08:38 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Wed,  3 Jun 2020 23:08:38 +0300
Subject: [PATCH 2/2] rijndael: fix UBSAN warning on left shift by 24 places
 with type 'int'
In-Reply-To: <20200603200838.562876-1-jussi.kivilinna@iki.fi>
References: <20200603200838.562876-1-jussi.kivilinna@iki.fi>
Message-ID: <20200603200838.562876-2-jussi.kivilinna@iki.fi>

* cipher/rijndael.c (do_encrypt_fn, do_decrypt_fn): Cast final
sbox/inv_sbox look-ups to 'u32' type.
--

Fixes following type of UBSAN errors seen from generic C-implementation
of rijndael:

 runtime error: left shift of <xx> by 24 places cannot be represented\
 in type 'int'

where <xx> is greater than 127.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/rijndael.c | 64 +++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/cipher/rijndael.c b/cipher/rijndael.c
index a1c4cfc1..3e9bae55 100644
--- a/cipher/rijndael.c
+++ b/cipher/rijndael.c
@@ -886,28 +886,28 @@ do_encrypt_fn (const RIJNDAEL_context *ctx, unsigned char *b,
 
   /* Last round is special. */
 
-  sb[0] = (sbox[(byte)(sa[0] >> (0 * 8)) * 4]) << (0 * 8);
-  sb[3] = (sbox[(byte)(sa[0] >> (1 * 8)) * 4]) << (1 * 8);
-  sb[2] = (sbox[(byte)(sa[0] >> (2 * 8)) * 4]) << (2 * 8);
-  sb[1] = (sbox[(byte)(sa[0] >> (3 * 8)) * 4]) << (3 * 8);
+  sb[0] = ((u32)sbox[(byte)(sa[0] >> (0 * 8)) * 4]) << (0 * 8);
+  sb[3] = ((u32)sbox[(byte)(sa[0] >> (1 * 8)) * 4]) << (1 * 8);
+  sb[2] = ((u32)sbox[(byte)(sa[0] >> (2 * 8)) * 4]) << (2 * 8);
+  sb[1] = ((u32)sbox[(byte)(sa[0] >> (3 * 8)) * 4]) << (3 * 8);
   sa[0] = rk[r][0] ^ sb[0];
 
-  sb[1] ^= (sbox[(byte)(sa[1] >> (0 * 8)) * 4]) << (0 * 8);
-  sa[0] ^= (sbox[(byte)(sa[1] >> (1 * 8)) * 4]) << (1 * 8);
-  sb[3] ^= (sbox[(byte)(sa[1] >> (2 * 8)) * 4]) << (2 * 8);
-  sb[2] ^= (sbox[(byte)(sa[1] >> (3 * 8)) * 4]) << (3 * 8);
+  sb[1] ^= ((u32)sbox[(byte)(sa[1] >> (0 * 8)) * 4]) << (0 * 8);
+  sa[0] ^= ((u32)sbox[(byte)(sa[1] >> (1 * 8)) * 4]) << (1 * 8);
+  sb[3] ^= ((u32)sbox[(byte)(sa[1] >> (2 * 8)) * 4]) << (2 * 8);
+  sb[2] ^= ((u32)sbox[(byte)(sa[1] >> (3 * 8)) * 4]) << (3 * 8);
   sa[1] = rk[r][1] ^ sb[1];
 
-  sb[2] ^= (sbox[(byte)(sa[2] >> (0 * 8)) * 4]) << (0 * 8);
-  sa[1] ^= (sbox[(byte)(sa[2] >> (1 * 8)) * 4]) << (1 * 8);
-  sa[0] ^= (sbox[(byte)(sa[2] >> (2 * 8)) * 4]) << (2 * 8);
-  sb[3] ^= (sbox[(byte)(sa[2] >> (3 * 8)) * 4]) << (3 * 8);
+  sb[2] ^= ((u32)sbox[(byte)(sa[2] >> (0 * 8)) * 4]) << (0 * 8);
+  sa[1] ^= ((u32)sbox[(byte)(sa[2] >> (1 * 8)) * 4]) << (1 * 8);
+  sa[0] ^= ((u32)sbox[(byte)(sa[2] >> (2 * 8)) * 4]) << (2 * 8);
+  sb[3] ^= ((u32)sbox[(byte)(sa[2] >> (3 * 8)) * 4]) << (3 * 8);
   sa[2] = rk[r][2] ^ sb[2];
 
-  sb[3] ^= (sbox[(byte)(sa[3] >> (0 * 8)) * 4]) << (0 * 8);
-  sa[2] ^= (sbox[(byte)(sa[3] >> (1 * 8)) * 4]) << (1 * 8);
-  sa[1] ^= (sbox[(byte)(sa[3] >> (2 * 8)) * 4]) << (2 * 8);
-  sa[0] ^= (sbox[(byte)(sa[3] >> (3 * 8)) * 4]) << (3 * 8);
+  sb[3] ^= ((u32)sbox[(byte)(sa[3] >> (0 * 8)) * 4]) << (0 * 8);
+  sa[2] ^= ((u32)sbox[(byte)(sa[3] >> (1 * 8)) * 4]) << (1 * 8);
+  sa[1] ^= ((u32)sbox[(byte)(sa[3] >> (2 * 8)) * 4]) << (2 * 8);
+  sa[0] ^= ((u32)sbox[(byte)(sa[3] >> (3 * 8)) * 4]) << (3 * 8);
   sa[3] = rk[r][3] ^ sb[3];
 
   buf_put_le32(b + 0, sa[0]);
@@ -1286,28 +1286,28 @@ do_decrypt_fn (const RIJNDAEL_context *ctx, unsigned char *b,
   sa[3] = rk[1][3] ^ sb[3];
 
   /* Last round is special. */
-  sb[0] = inv_sbox[(byte)(sa[0] >> (0 * 8))] << (0 * 8);
-  sb[1] = inv_sbox[(byte)(sa[0] >> (1 * 8))] << (1 * 8);
-  sb[2] = inv_sbox[(byte)(sa[0] >> (2 * 8))] << (2 * 8);
-  sb[3] = inv_sbox[(byte)(sa[0] >> (3 * 8))] << (3 * 8);
+  sb[0] = (u32)inv_sbox[(byte)(sa[0] >> (0 * 8))] << (0 * 8);
+  sb[1] = (u32)inv_sbox[(byte)(sa[0] >> (1 * 8))] << (1 * 8);
+  sb[2] = (u32)inv_sbox[(byte)(sa[0] >> (2 * 8))] << (2 * 8);
+  sb[3] = (u32)inv_sbox[(byte)(sa[0] >> (3 * 8))] << (3 * 8);
   sa[0] = sb[0] ^ rk[0][0];
 
-  sb[1] ^= inv_sbox[(byte)(sa[1] >> (0 * 8))] << (0 * 8);
-  sb[2] ^= inv_sbox[(byte)(sa[1] >> (1 * 8))] << (1 * 8);
-  sb[3] ^= inv_sbox[(byte)(sa[1] >> (2 * 8))] << (2 * 8);
-  sa[0] ^= inv_sbox[(byte)(sa[1] >> (3 * 8))] << (3 * 8);
+  sb[1] ^= (u32)inv_sbox[(byte)(sa[1] >> (0 * 8))] << (0 * 8);
+  sb[2] ^= (u32)inv_sbox[(byte)(sa[1] >> (1 * 8))] << (1 * 8);
+  sb[3] ^= (u32)inv_sbox[(byte)(sa[1] >> (2 * 8))] << (2 * 8);
+  sa[0] ^= (u32)inv_sbox[(byte)(sa[1] >> (3 * 8))] << (3 * 8);
   sa[1] = sb[1] ^ rk[0][1];
 
-  sb[2] ^= inv_sbox[(byte)(sa[2] >> (0 * 8))] << (0 * 8);
-  sb[3] ^= inv_sbox[(byte)(sa[2] >> (1 * 8))] << (1 * 8);
-  sa[0] ^= inv_sbox[(byte)(sa[2] >> (2 * 8))] << (2 * 8);
-  sa[1] ^= inv_sbox[(byte)(sa[2] >> (3 * 8))] << (3 * 8);
+  sb[2] ^= (u32)inv_sbox[(byte)(sa[2] >> (0 * 8))] << (0 * 8);
+  sb[3] ^= (u32)inv_sbox[(byte)(sa[2] >> (1 * 8))] << (1 * 8);
+  sa[0] ^= (u32)inv_sbox[(byte)(sa[2] >> (2 * 8))] << (2 * 8);
+  sa[1] ^= (u32)inv_sbox[(byte)(sa[2] >> (3 * 8))] << (3 * 8);
   sa[2] = sb[2] ^ rk[0][2];
 
-  sb[3] ^= inv_sbox[(byte)(sa[3] >> (0 * 8))] << (0 * 8);
-  sa[0] ^= inv_sbox[(byte)(sa[3] >> (1 * 8))] << (1 * 8);
-  sa[1] ^= inv_sbox[(byte)(sa[3] >> (2 * 8))] << (2 * 8);
-  sa[2] ^= inv_sbox[(byte)(sa[3] >> (3 * 8))] << (3 * 8);
+  sb[3] ^= (u32)inv_sbox[(byte)(sa[3] >> (0 * 8))] << (0 * 8);
+  sa[0] ^= (u32)inv_sbox[(byte)(sa[3] >> (1 * 8))] << (1 * 8);
+  sa[1] ^= (u32)inv_sbox[(byte)(sa[3] >> (2 * 8))] << (2 * 8);
+  sa[2] ^= (u32)inv_sbox[(byte)(sa[3] >> (3 * 8))] << (3 * 8);
   sa[3] = sb[3] ^ rk[0][3];
 
   buf_put_le32(b + 0, sa[0]);
-- 
2.25.1


From wk at gnupg.org  Fri Jun  5 10:32:30 2020
From: wk at gnupg.org (Werner Koch)
Date: Fri, 05 Jun 2020 10:32:30 +0200
Subject: Decrypt using BcryptDecrypt
In-Reply-To: <CAGHdk0hiZOcSLPi57xzGfyB2B8fdUDxZox2S7p=8uqGN+zXGAw@mail.gmail.com>
 (Mandar Apte via Gcrypt-devel's message of "Tue, 2 Jun 2020 16:57:23
 +0530")
References: <CAGHdk0hiZOcSLPi57xzGfyB2B8fdUDxZox2S7p=8uqGN+zXGAw@mail.gmail.com>
Message-ID: <87k10l6hgh.fsf@wheatstone.g10code.de>

On Tue,  2 Jun 2020 16:57, Mandar Apte said:
> On windows I am using Bcrypt library which also supports AES 256 in CBC
> mode.

FWIW, Libgcrypt runs very well on Windows.

> Hence, I wanted to check, if the Libgcrypt APIs are doing padding
> internally since I am not passing any such instruction to the Libgcrypt
> library explicitly?

No, Libgcrypt does not do any padding and it expects complete blocks.
gcry_cipher_get_algo_blklen() tells you the block length of the cipher
algorithm.

There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but
in this case you need to pass the entire plaintext/ciphertext to the
encrypt/decrypt function; there is no way to do this incremental.

For the standard padding as used in CMS (S/MIME), you need to handle the
padding in your code; here is a snippet

 if (last_block_is_incomplete)
   { 
      int i,
      int npad = blklen - (buflen % blklen);
      
      p = buffer;
      for (n=buflen, i=0; n < bufsize && i < npad; n++, i++)
        p[n] = npad;
      gcry_cipher_encrypt (chd, buffer, n, buffer, n);
    }


Shalom-Salam,

   Werner


--
Die Gedanken sind frei.  Ausnahmen regelt ein Bundesgesetz.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20200605/5aa06331/attachment.sig>

From mandar.apte409 at gmail.com  Fri Jun  5 15:46:47 2020
From: mandar.apte409 at gmail.com (Mandar Apte)
Date: Fri, 5 Jun 2020 19:16:47 +0530
Subject: Decrypt using BcryptDecrypt
In-Reply-To: <87k10l6hgh.fsf@wheatstone.g10code.de>
References: <CAGHdk0hiZOcSLPi57xzGfyB2B8fdUDxZox2S7p=8uqGN+zXGAw@mail.gmail.com>
 <87k10l6hgh.fsf@wheatstone.g10code.de>
Message-ID: <CAGHdk0jY55ow0Bj0RjSbrO4dU95LVt9MBrRqe5sUJJsScGOCag@mail.gmail.com>

Hello Werner,

          Thank you very much for the response.

The way you have shown in the email chain below, I had done same thing in
my code as well. Also, I am passing the data of block length size only to
gcry_cipher_encrypt and gcry_cipher_decrypt APIs.
Now, my goal is to check, if the AES256 encryption/decryption is same for
libgcrypt and Bcrypt library. Thats the reason I am trying to decrypt the
data, which was encrypted using Libgcrypt APIs, using Bcrypt APIs on
windows.

I am pretty sure if I use windows version of Libgcrypt my problem wont be
there at all.

I think I myself have to handle the padding while encrypting using
Libgcrypt library APIs.

Since, I have to handle padding in my code, is there any APIs in libgcrypt
with which I ensure that I am padding the data in standard way? Are there
any APIs in Libgcrypt using which I can get padded data along with my plain
text data which I can encrypt using gcry_cipher_encrypt?


Thank you in advance.
Best Regards,
Mandar


On Fri, 5 Jun 2020, 2:05 pm Werner Koch, <wk at gnupg.org> wrote:

> On Tue,  2 Jun 2020 16:57, Mandar Apte said:
> > On windows I am using Bcrypt library which also supports AES 256 in CBC
> > mode.
>
> FWIW, Libgcrypt runs very well on Windows.
>
> > Hence, I wanted to check, if the Libgcrypt APIs are doing padding
> > internally since I am not passing any such instruction to the Libgcrypt
> > library explicitly?
>
> No, Libgcrypt does not do any padding and it expects complete blocks.
> gcry_cipher_get_algo_blklen() tells you the block length of the cipher
> algorithm.
>
> There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but
> in this case you need to pass the entire plaintext/ciphertext to the
> encrypt/decrypt function; there is no way to do this incremental.
>
> For the standard padding as used in CMS (S/MIME), you need to handle the
> padding in your code; here is a snippet
>
>  if (last_block_is_incomplete)
>    {
>       int i,
>       int npad = blklen - (buflen % blklen);
>
>       p = buffer;
>       for (n=buflen, i=0; n < bufsize && i < npad; n++, i++)
>         p[n] = npad;
>       gcry_cipher_encrypt (chd, buffer, n, buffer, n);
>     }
>
>
>
> Shalom-Salam,
>
>    Werner
>
>
> --
> Die Gedanken sind frei.  Ausnahmen regelt ein Bundesgesetz.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20200605/0b7a4f2f/attachment-0001.html>

From tianjia.zhang at linux.alibaba.com  Mon Jun  8 13:00:50 2020
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Mon,  8 Jun 2020 19:00:50 +0800
Subject: [PATCH 0/1] Add SM4 symmetric cipher algorithm
Message-ID: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com>

SM4 (GBT.32907-2016) is a cryptographic standard issued by the
Organization of State Commercial Administration of China (OSCCA)
as an authorized cryptographic algorithms for the use within China.

SMS4 was originally created for use in protecting wireless
networks, and is mandated in the Chinese National Standard for
Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure)
(GB.15629.11-2003).

Tianjia Zhang (1):
  Add SM4 symmetric cipher algorithm

 cipher/Makefile.am |   1 +
 cipher/cipher.c    |   8 ++
 cipher/sm4.c       | 270 +++++++++++++++++++++++++++++++++++++++++++++
 configure.ac       |   7 ++
 src/cipher.h       |   1 +
 src/gcrypt.h.in    |   3 +-
 tests/basic.c      |   3 +
 7 files changed, 292 insertions(+), 1 deletion(-)
 create mode 100644 cipher/sm4.c

-- 
2.17.1


From tianjia.zhang at linux.alibaba.com  Mon Jun  8 13:00:51 2020
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Mon,  8 Jun 2020 19:00:51 +0800
Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm
In-Reply-To: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com>
References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com>

* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c.
* cipher/cipher.c (cipher_list, cipher_list_algo301):
Add _gcry_cipher_spec_sm4.
* cipher/sm4.c: New.
* configure.ac (available_ciphers): Add sm4.
* src/cipher.h: Add declarations for SM4.
* src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.
* tests/basic.c (check_ciphers): Add sm4 check.

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 cipher/Makefile.am |   1 +
 cipher/cipher.c    |   8 ++
 cipher/sm4.c       | 270 +++++++++++++++++++++++++++++++++++++++++++++
 configure.ac       |   7 ++
 src/cipher.h       |   1 +
 src/gcrypt.h.in    |   3 +-
 tests/basic.c      |   3 +
 7 files changed, 292 insertions(+), 1 deletion(-)
 create mode 100644 cipher/sm4.c

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index ef83cc74..56661dcd 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -107,6 +107,7 @@ EXTRA_libcipher_la_SOURCES = \
 	scrypt.c \
 	seed.c \
 	serpent.c serpent-sse2-amd64.S \
+	sm4.c \
 	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
diff --git a/cipher/cipher.c b/cipher/cipher.c
index edcb421a..dfb083a0 100644
--- a/cipher/cipher.c
+++ b/cipher/cipher.c
@@ -87,6 +87,9 @@ static gcry_cipher_spec_t * const cipher_list[] =
 #endif
 #if USE_CHACHA20
      &_gcry_cipher_spec_chacha20,
+#endif
+#if USE_SM4
+     &_gcry_cipher_spec_sm4,
 #endif
     NULL
   };
@@ -202,6 +205,11 @@ static gcry_cipher_spec_t * const cipher_list_algo301[] =
     &_gcry_cipher_spec_gost28147_mesh,
 #else
     NULL,
+#endif
+#if USE_SM4
+     &_gcry_cipher_spec_sm4,
+#else
+    NULL,
 #endif
   };
 
diff --git a/cipher/sm4.c b/cipher/sm4.c
new file mode 100644
index 00000000..a1bdca10
--- /dev/null
+++ b/cipher/sm4.c
@@ -0,0 +1,270 @@
+/* sm4.c  -  SM4 Cipher Algorithm
+ * Copyright (C) 2020 Alibaba Group.
+ * Copyright (C) 2020 Tianjia Zhang
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "types.h"  /* for byte and u32 typedefs */
+#include "bithelp.h"
+#include "g10lib.h"
+#include "cipher.h"
+
+typedef struct
+{
+  u32 rkey_enc[32];
+  u32 rkey_dec[32];
+} SM4_context;
+
+static const u32 fk[4] = {
+  0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc
+};
+
+static const byte sbox[256] = {
+  0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7,
+  0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05,
+  0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3,
+  0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
+  0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a,
+  0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62,
+  0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95,
+  0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6,
+  0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba,
+  0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8,
+  0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b,
+  0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35,
+  0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2,
+  0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87,
+  0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52,
+  0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e,
+  0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5,
+  0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1,
+  0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55,
+  0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3,
+  0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60,
+  0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f,
+  0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f,
+  0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51,
+  0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f,
+  0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8,
+  0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd,
+  0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0,
+  0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e,
+  0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84,
+  0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20,
+  0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48
+};
+
+static const u32 ck[] = {
+  0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269,
+  0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9,
+  0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249,
+  0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9,
+  0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229,
+  0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299,
+  0xa0a7aeb5, 0xbcc3cad1, 0xd8dfe6ed, 0xf4fb0209,
+  0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279
+};
+
+static u32 sm4_t_non_lin_sub(u32 x)
+{
+  int i;
+  byte *b = (byte *)&x;
+
+  for (i = 0; i < 4; ++i)
+    b[i] = sbox[b[i]];
+
+  return x;
+}
+
+static u32 sm4_key_lin_sub(u32 x)
+{
+  return x ^ rol(x, 13) ^ rol(x, 23);
+}
+
+static u32 sm4_enc_lin_sub(u32 x)
+{
+  return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24);
+}
+
+static u32 sm4_key_sub(u32 x)
+{
+  return sm4_key_lin_sub(sm4_t_non_lin_sub(x));
+}
+
+static u32 sm4_enc_sub(u32 x)
+{
+  return sm4_enc_lin_sub(sm4_t_non_lin_sub(x));
+}
+
+static u32 sm4_round(const u32 *x, const u32 rk)
+{
+  return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk);
+}
+
+static gcry_err_code_t
+sm4_expand_key (SM4_context *ctx, const u32 *key, const unsigned keylen)
+{
+  u32 rk[4], t;
+  int i;
+
+  if (keylen != 16)
+    return GPG_ERR_INV_KEYLEN;
+
+  for (i = 0; i < 4; ++i)
+    rk[i] = be_bswap32(key[i]) ^ fk[i];
+
+  for (i = 0; i < 32; ++i)
+    {
+      t = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i]);
+      ctx->rkey_enc[i] = t;
+      rk[0] = rk[1];
+      rk[1] = rk[2];
+      rk[2] = rk[3];
+      rk[3] = t;
+    }
+
+  for (i = 0; i < 32; ++i)
+    ctx->rkey_dec[i] = ctx->rkey_enc[31 - i];
+
+  return 0;
+}
+
+static gcry_err_code_t
+sm4_setkey (void *context, const byte *key, const unsigned keylen,
+            gcry_cipher_hd_t hd)
+{
+  SM4_context *ctx = context;
+  (void)hd;
+  return sm4_expand_key (ctx, (const u32 *)key, keylen);
+}
+
+static void
+sm4_do_crypt (const u32 *rk, u32 *out, const u32 *in)
+{
+  u32 x[4], t;
+  int i;
+
+  for (i = 0; i < 4; ++i)
+    x[i] = be_bswap32(in[i]);
+
+  for (i = 0; i < 32; ++i)
+    {
+      t = sm4_round(x, rk[i]);
+      x[0] = x[1];
+      x[1] = x[2];
+      x[2] = x[3];
+      x[3] = t;
+    }
+
+  for (i = 0; i < 4; ++i)
+    out[i] = be_bswap32(x[3 - i]);
+}
+
+static unsigned int
+sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
+{
+  SM4_context *ctx = context;
+
+  sm4_do_crypt (ctx->rkey_enc, (u32 *)outbuf, (const u32 *)inbuf);
+  return 0;
+}
+
+static unsigned int
+sm4_decrypt (void *context, byte *outbuf, const byte *inbuf)
+{
+  SM4_context *ctx = context;
+
+  sm4_do_crypt (ctx->rkey_dec, (u32 *)outbuf, (const u32 *)inbuf);
+  return 0;
+}
+
+static const char *
+sm4_selftest (void)
+{
+  SM4_context ctx;
+  byte scratch[16];
+
+  static const byte plaintext[16] = {
+    0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
+    0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
+  };
+  static const byte key[16] = {
+    0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
+    0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
+  };
+  static const byte ciphertext[16] = {
+    0x68, 0x1E, 0xDF, 0x34, 0xD2, 0x06, 0x96, 0x5E,
+    0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46
+  };
+
+  sm4_setkey (&ctx, key, sizeof (key), NULL);
+  sm4_encrypt (&ctx, scratch, plaintext);
+  if (memcmp (scratch, ciphertext, sizeof (ciphertext)))
+    return "SM4 test encryption failed.";
+  sm4_decrypt (&ctx, scratch, scratch);
+  if (memcmp (scratch, plaintext, sizeof (plaintext)))
+    return "SM4 test decryption failed.";
+
+  return NULL;
+}
+
+static gpg_err_code_t
+run_selftests (int algo, int extended, selftest_report_func_t report)
+{
+  const char *what;
+  const char *errtxt;
+
+  (void)extended;
+
+  if (algo != GCRY_CIPHER_SM4)
+    return GPG_ERR_CIPHER_ALGO;
+
+  what = "selftest";
+  errtxt = sm4_selftest ();
+  if (errtxt)
+    goto failed;
+
+  return 0;
+
+ failed:
+  if (report)
+    report ("cipher", GCRY_CIPHER_SM4, what, errtxt);
+  return GPG_ERR_SELFTEST_FAILED;
+}
+
+static gcry_cipher_oid_spec_t sm4_oids[] =
+  {
+    { "1.2.156.10197.1.104.1", GCRY_CIPHER_MODE_ECB },
+    { "1.2.156.10197.1.104.2", GCRY_CIPHER_MODE_CBC },
+    { "1.2.156.10197.1.104.3", GCRY_CIPHER_MODE_OFB },
+    { "1.2.156.10197.1.104.4", GCRY_CIPHER_MODE_CFB },
+    { NULL }
+  };
+
+gcry_cipher_spec_t _gcry_cipher_spec_sm4 =
+  {
+    GCRY_CIPHER_SM4, {0, 0},
+    "SM4", NULL, sm4_oids, 16, 128,
+    sizeof (SM4_context),
+    sm4_setkey, sm4_encrypt, sm4_decrypt,
+    NULL, NULL,
+    run_selftests
+  };
diff --git a/configure.ac b/configure.ac
index 3bf0179e..472758b5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -212,6 +212,7 @@ LIBGCRYPT_CONFIG_HOST="$host"
 # Definitions for symmetric ciphers.
 available_ciphers="arcfour blowfish cast5 des aes twofish serpent rfc2268 seed"
 available_ciphers="$available_ciphers camellia idea salsa20 gost28147 chacha20"
+available_ciphers="$available_ciphers sm4"
 enabled_ciphers=""
 
 # Definitions for public-key ciphers.
@@ -2533,6 +2534,12 @@ if test "$found" = "1" ; then
    fi
 fi
 
+LIST_MEMBER(sm4, $enabled_ciphers)
+if test "$found" = "1" ; then
+   GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo"
+   AC_DEFINE(USE_SM4, 1, [Defined if this module should be included])
+fi
+
 LIST_MEMBER(dsa, $enabled_pubkey_ciphers)
 if test "$found" = "1" ; then
    GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dsa.lo"
diff --git a/src/cipher.h b/src/cipher.h
index 20ccb8c5..c49bbda5 100644
--- a/src/cipher.h
+++ b/src/cipher.h
@@ -302,6 +302,7 @@ extern gcry_cipher_spec_t _gcry_cipher_spec_salsa20r12;
 extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147;
 extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147_mesh;
 extern gcry_cipher_spec_t _gcry_cipher_spec_chacha20;
+extern gcry_cipher_spec_t _gcry_cipher_spec_sm4;
 
 /* Declarations for the digest specifications.  */
 extern gcry_md_spec_t _gcry_digest_spec_crc32;
diff --git a/src/gcrypt.h.in b/src/gcrypt.h.in
index c0132189..9ddef17b 100644
--- a/src/gcrypt.h.in
+++ b/src/gcrypt.h.in
@@ -946,7 +946,8 @@ enum gcry_cipher_algos
     GCRY_CIPHER_SALSA20R12  = 314,
     GCRY_CIPHER_GOST28147   = 315,
     GCRY_CIPHER_CHACHA20    = 316,
-    GCRY_CIPHER_GOST28147_MESH   = 317 /* GOST 28247 with optional CryptoPro keymeshing */
+    GCRY_CIPHER_GOST28147_MESH   = 317, /* GOST 28247 with optional CryptoPro keymeshing */
+    GCRY_CIPHER_SM4         = 318
   };
 
 /* The Rijndael algorithm is basically AES, so provide some macros. */
diff --git a/tests/basic.c b/tests/basic.c
index 2dee1bee..6f2945a5 100644
--- a/tests/basic.c
+++ b/tests/basic.c
@@ -9444,6 +9444,9 @@ check_ciphers (void)
 #if USE_GOST28147
     GCRY_CIPHER_GOST28147,
     GCRY_CIPHER_GOST28147_MESH,
+#endif
+#if USE_SM4
+    GCRY_CIPHER_SM4,
 #endif
     0
   };
-- 
2.17.1


From jussi.kivilinna at iki.fi  Tue Jun  9 19:59:51 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 9 Jun 2020 20:59:51 +0300
Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm
In-Reply-To: <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com>
References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com>
 <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com>
Message-ID: <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi>

Hello,

Patch looks mostly good. I have add few comments below.

On 8.6.2020 14.00, Tianjia Zhang via Gcrypt-devel wrote:
> * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c.
> * cipher/cipher.c (cipher_list, cipher_list_algo301):
> Add _gcry_cipher_spec_sm4.
> * cipher/sm4.c: New.
> * configure.ac (available_ciphers): Add sm4.
> * src/cipher.h: Add declarations for SM4.
> * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.
> * tests/basic.c (check_ciphers): Add sm4 check.

Please also add GCRY_MAC_CMAC_SM4 support in mac.c/mac-cmac.c.

> 
> Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
> ---
>  cipher/Makefile.am |   1 +
>  cipher/cipher.c    |   8 ++
>  cipher/sm4.c       | 270 +++++++++++++++++++++++++++++++++++++++++++++
>  configure.ac       |   7 ++
>  src/cipher.h       |   1 +
>  src/gcrypt.h.in    |   3 +-
>  tests/basic.c      |   3 +
>  7 files changed, 292 insertions(+), 1 deletion(-)
>  create mode 100644 cipher/sm4.c
> 
> diff --git a/cipher/Makefile.am b/cipher/Makefile.am
> index ef83cc74..56661dcd 100644
> --- a/cipher/Makefile.am
> +++ b/cipher/Makefile.am
> @@ -107,6 +107,7 @@ EXTRA_libcipher_la_SOURCES = \
>  	scrypt.c \
>  	seed.c \
>  	serpent.c serpent-sse2-amd64.S \
> +	sm4.c \
>  	serpent-avx2-amd64.S serpent-armv7-neon.S \
>  	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
>  	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
> diff --git a/cipher/cipher.c b/cipher/cipher.c
> index edcb421a..dfb083a0 100644
> --- a/cipher/cipher.c
> +++ b/cipher/cipher.c
> @@ -87,6 +87,9 @@ static gcry_cipher_spec_t * const cipher_list[] =
>  #endif
>  #if USE_CHACHA20
>       &_gcry_cipher_spec_chacha20,
> +#endif
> +#if USE_SM4
> +     &_gcry_cipher_spec_sm4,
>  #endif
>      NULL
>    };
> @@ -202,6 +205,11 @@ static gcry_cipher_spec_t * const cipher_list_algo301[] =
>      &_gcry_cipher_spec_gost28147_mesh,
>  #else
>      NULL,
> +#endif
> +#if USE_SM4
> +     &_gcry_cipher_spec_sm4,
> +#else
> +    NULL,
>  #endif
>    };
>  
> diff --git a/cipher/sm4.c b/cipher/sm4.c
> new file mode 100644
> index 00000000..a1bdca10
> --- /dev/null
> +++ b/cipher/sm4.c
> @@ -0,0 +1,270 @@
> +/* sm4.c  -  SM4 Cipher Algorithm
> + * Copyright (C) 2020 Alibaba Group.
> + * Copyright (C) 2020 Tianjia Zhang
> + *
> + * This file is part of Libgcrypt.
> + *
> + * Libgcrypt is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU Lesser General Public License as
> + * published by the Free Software Foundation; either version 2.1 of
> + * the License, or (at your option) any later version.
> + *
> + * Libgcrypt is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <config.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +
> +#include "types.h"  /* for byte and u32 typedefs */
> +#include "bithelp.h"
> +#include "g10lib.h"
> +#include "cipher.h"
> +
> +typedef struct
> +{
> +  u32 rkey_enc[32];
> +  u32 rkey_dec[32];
> +} SM4_context;
> +
> +static const u32 fk[4] = {
> +  0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc
> +};
> +
> +static const byte sbox[256] = {
> +  0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7,
> +  0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05,
> +  0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3,
> +  0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
> +  0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a,
> +  0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62,
> +  0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95,
> +  0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6,
> +  0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba,
> +  0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8,
> +  0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b,
> +  0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35,
> +  0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2,
> +  0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87,
> +  0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52,
> +  0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e,
> +  0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5,
> +  0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1,
> +  0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55,
> +  0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3,
> +  0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60,
> +  0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f,
> +  0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f,
> +  0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51,
> +  0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f,
> +  0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8,
> +  0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd,
> +  0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0,
> +  0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e,
> +  0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84,
> +  0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20,
> +  0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48
> +};
> +
> +static const u32 ck[] = {
> +  0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269,
> +  0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9,
> +  0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249,
> +  0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9,
> +  0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229,
> +  0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299,
> +  0xa0a7aeb5, 0xbcc3cad1, 0xd8dfe6ed, 0xf4fb0209,
> +  0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279
> +};
> +
> +static u32 sm4_t_non_lin_sub(u32 x)
> +{
> +  int i;
> +  byte *b = (byte *)&x;
> +
> +  for (i = 0; i < 4; ++i)
> +    b[i] = sbox[b[i]];
> +
> +  return x;
> +}
> +
> +static u32 sm4_key_lin_sub(u32 x)
> +{
> +  return x ^ rol(x, 13) ^ rol(x, 23);
> +}
> +
> +static u32 sm4_enc_lin_sub(u32 x)
> +{
> +  return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24);
> +}
> +
> +static u32 sm4_key_sub(u32 x)
> +{
> +  return sm4_key_lin_sub(sm4_t_non_lin_sub(x));
> +}
> +
> +static u32 sm4_enc_sub(u32 x)
> +{
> +  return sm4_enc_lin_sub(sm4_t_non_lin_sub(x));
> +}
> +
> +static u32 sm4_round(const u32 *x, const u32 rk)
> +{
> +  return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk);
> +}
> +
> +static gcry_err_code_t
> +sm4_expand_key (SM4_context *ctx, const u32 *key, const unsigned keylen)
> +{
> +  u32 rk[4], t;
> +  int i;
> +
> +  if (keylen != 16)
> +    return GPG_ERR_INV_KEYLEN;
> +
> +  for (i = 0; i < 4; ++i)
> +    rk[i] = be_bswap32(key[i]) ^ fk[i];
> +
> +  for (i = 0; i < 32; ++i)
> +    {
> +      t = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i]);
> +      ctx->rkey_enc[i] = t;
> +      rk[0] = rk[1];
> +      rk[1] = rk[2];
> +      rk[2] = rk[3];
> +      rk[3] = t;
> +    }
> +
> +  for (i = 0; i < 32; ++i)
> +    ctx->rkey_dec[i] = ctx->rkey_enc[31 - i];
> +
> +  return 0;
> +}
> +
> +static gcry_err_code_t
> +sm4_setkey (void *context, const byte *key, const unsigned keylen,
> +            gcry_cipher_hd_t hd)
> +{
> +  SM4_context *ctx = context;
> +  (void)hd;
> +  return sm4_expand_key (ctx, (const u32 *)key, keylen);

Casting byte pointer to word pointer here. 'key' here might not be
aligned 4 bytes and can then cause seg-fault in 'sm4_expand_key' on
architectures that do not handle unaligned memory accesses
automatically.

It's better to change 'sm4_expand_key' take 'key' as byte pointer
and use 'buf_get_be32' for reading from it.

> +}
> +
> +static void
> +sm4_do_crypt (const u32 *rk, u32 *out, const u32 *in)

Likewise, better to use byte pointer for 'out' and 'in' here and
use 'buf_get_be32' and 'buf_put_be32' for reading and writing.

> +{
> +  u32 x[4], t;
> +  int i;
> +
> +  for (i = 0; i < 4; ++i)
> +    x[i] = be_bswap32(in[i]);
> +
> +  for (i = 0; i < 32; ++i)
> +    {
> +      t = sm4_round(x, rk[i]);
> +      x[0] = x[1];
> +      x[1] = x[2];
> +      x[2] = x[3];
> +      x[3] = t;
> +    }
> +
> +  for (i = 0; i < 4; ++i)
> +    out[i] = be_bswap32(x[3 - i]);
> +}
> +
> +static unsigned int
> +sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
> +{
> +  SM4_context *ctx = context;
> +
> +  sm4_do_crypt (ctx->rkey_enc, (u32 *)outbuf, (const u32 *)inbuf);
> +  return 0;
> +}
> +
> +static unsigned int
> +sm4_decrypt (void *context, byte *outbuf, const byte *inbuf)
> +{
> +  SM4_context *ctx = context;
> +
> +  sm4_do_crypt (ctx->rkey_dec, (u32 *)outbuf, (const u32 *)inbuf);
> +  return 0;
> +}

Encrypt/decrypt functions should return 'stack burn' depth in bytes.
Good estimate for this is size of variables + size of arguments + 
size of call return pointer. So in this case, "4*6+sizeof(void*)*4".

> +
> +static const char *
> +sm4_selftest (void)
> +{
> +  SM4_context ctx;
> +  byte scratch[16];
> +
> +  static const byte plaintext[16] = {
> +    0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
> +    0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
> +  };
> +  static const byte key[16] = {
> +    0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
> +    0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
> +  };
> +  static const byte ciphertext[16] = {
> +    0x68, 0x1E, 0xDF, 0x34, 0xD2, 0x06, 0x96, 0x5E,
> +    0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46
> +  };
> +
> +  sm4_setkey (&ctx, key, sizeof (key), NULL);
> +  sm4_encrypt (&ctx, scratch, plaintext);
> +  if (memcmp (scratch, ciphertext, sizeof (ciphertext)))
> +    return "SM4 test encryption failed.";
> +  sm4_decrypt (&ctx, scratch, scratch);
> +  if (memcmp (scratch, plaintext, sizeof (plaintext)))
> +    return "SM4 test decryption failed.";
> +
> +  return NULL;
> +}
> +
> +static gpg_err_code_t
> +run_selftests (int algo, int extended, selftest_report_func_t report)
> +{
> +  const char *what;
> +  const char *errtxt;
> +
> +  (void)extended;
> +
> +  if (algo != GCRY_CIPHER_SM4)
> +    return GPG_ERR_CIPHER_ALGO;
> +
> +  what = "selftest";
> +  errtxt = sm4_selftest ();
> +  if (errtxt)
> +    goto failed;
> +
> +  return 0;
> +
> + failed:
> +  if (report)
> +    report ("cipher", GCRY_CIPHER_SM4, what, errtxt);
> +  return GPG_ERR_SELFTEST_FAILED;
> +}
> +
> +static gcry_cipher_oid_spec_t sm4_oids[] =
> +  {
> +    { "1.2.156.10197.1.104.1", GCRY_CIPHER_MODE_ECB },
> +    { "1.2.156.10197.1.104.2", GCRY_CIPHER_MODE_CBC },
> +    { "1.2.156.10197.1.104.3", GCRY_CIPHER_MODE_OFB },
> +    { "1.2.156.10197.1.104.4", GCRY_CIPHER_MODE_CFB },
> +    { NULL }
> +  };
> +
> +gcry_cipher_spec_t _gcry_cipher_spec_sm4 =
> +  {
> +    GCRY_CIPHER_SM4, {0, 0},
> +    "SM4", NULL, sm4_oids, 16, 128,
> +    sizeof (SM4_context),
> +    sm4_setkey, sm4_encrypt, sm4_decrypt,
> +    NULL, NULL,
> +    run_selftests
> +  };
> diff --git a/configure.ac b/configure.ac
> index 3bf0179e..472758b5 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -212,6 +212,7 @@ LIBGCRYPT_CONFIG_HOST="$host"
>  # Definitions for symmetric ciphers.
>  available_ciphers="arcfour blowfish cast5 des aes twofish serpent rfc2268 seed"
>  available_ciphers="$available_ciphers camellia idea salsa20 gost28147 chacha20"
> +available_ciphers="$available_ciphers sm4"
>  enabled_ciphers=""
>  
>  # Definitions for public-key ciphers.
> @@ -2533,6 +2534,12 @@ if test "$found" = "1" ; then
>     fi
>  fi
>  
> +LIST_MEMBER(sm4, $enabled_ciphers)
> +if test "$found" = "1" ; then
> +   GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo"
> +   AC_DEFINE(USE_SM4, 1, [Defined if this module should be included])
> +fi
> +
>  LIST_MEMBER(dsa, $enabled_pubkey_ciphers)
>  if test "$found" = "1" ; then
>     GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dsa.lo"
> diff --git a/src/cipher.h b/src/cipher.h
> index 20ccb8c5..c49bbda5 100644
> --- a/src/cipher.h
> +++ b/src/cipher.h
> @@ -302,6 +302,7 @@ extern gcry_cipher_spec_t _gcry_cipher_spec_salsa20r12;
>  extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147;
>  extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147_mesh;
>  extern gcry_cipher_spec_t _gcry_cipher_spec_chacha20;
> +extern gcry_cipher_spec_t _gcry_cipher_spec_sm4;
>  
>  /* Declarations for the digest specifications.  */
>  extern gcry_md_spec_t _gcry_digest_spec_crc32;
> diff --git a/src/gcrypt.h.in b/src/gcrypt.h.in
> index c0132189..9ddef17b 100644
> --- a/src/gcrypt.h.in
> +++ b/src/gcrypt.h.in
> @@ -946,7 +946,8 @@ enum gcry_cipher_algos
>      GCRY_CIPHER_SALSA20R12  = 314,
>      GCRY_CIPHER_GOST28147   = 315,
>      GCRY_CIPHER_CHACHA20    = 316,
> -    GCRY_CIPHER_GOST28147_MESH   = 317 /* GOST 28247 with optional CryptoPro keymeshing */
> +    GCRY_CIPHER_GOST28147_MESH   = 317, /* GOST 28247 with optional CryptoPro keymeshing */
> +    GCRY_CIPHER_SM4         = 318
>    };
>  
>  /* The Rijndael algorithm is basically AES, so provide some macros. */
> diff --git a/tests/basic.c b/tests/basic.c
> index 2dee1bee..6f2945a5 100644
> --- a/tests/basic.c
> +++ b/tests/basic.c
> @@ -9444,6 +9444,9 @@ check_ciphers (void)
>  #if USE_GOST28147
>      GCRY_CIPHER_GOST28147,
>      GCRY_CIPHER_GOST28147_MESH,
> +#endif
> +#if USE_SM4
> +    GCRY_CIPHER_SM4,
>  #endif
>      0
>    };
> 

It would be nice to have some extra test-vectors in basic.c for
common cipher modes with SM4. There's few such vectors at following
Internet-Draft that could be used:
 https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10#appendix-A.2

-Jussi


From mandar.apte409 at gmail.com  Wed Jun 10 18:38:09 2020
From: mandar.apte409 at gmail.com (Mandar Apte)
Date: Wed, 10 Jun 2020 22:08:09 +0530
Subject: Decrypt using BcryptDecrypt
In-Reply-To: <CAGHdk0jY55ow0Bj0RjSbrO4dU95LVt9MBrRqe5sUJJsScGOCag@mail.gmail.com>
References: <CAGHdk0hiZOcSLPi57xzGfyB2B8fdUDxZox2S7p=8uqGN+zXGAw@mail.gmail.com>
 <87k10l6hgh.fsf@wheatstone.g10code.de>
 <CAGHdk0jY55ow0Bj0RjSbrO4dU95LVt9MBrRqe5sUJJsScGOCag@mail.gmail.com>
Message-ID: <CAGHdk0iKfs4Tbkfn3Wyu+tjtNccEhni+AEfscn2-a2P9M2-paw@mail.gmail.com>

Hello Team,

          Are there any APIs in Libgcrypt using which I can get padded data
along with my plain text data which I can encrypt using
gcry_cipher_encrypt?


Thanks in advance.
Best Regards,
Mandar

On Fri, 5 Jun 2020, 7:16 pm Mandar Apte, <mandar.apte409 at gmail.com> wrote:

> Hello Werner,
>
>           Thank you very much for the response.
>
> The way you have shown in the email chain below, I had done same thing in
> my code as well. Also, I am passing the data of block length size only to
> gcry_cipher_encrypt and gcry_cipher_decrypt APIs.
> Now, my goal is to check, if the AES256 encryption/decryption is same for
> libgcrypt and Bcrypt library. Thats the reason I am trying to decrypt the
> data, which was encrypted using Libgcrypt APIs, using Bcrypt APIs on
> windows.
>
> I am pretty sure if I use windows version of Libgcrypt my problem wont be
> there at all.
>
> I think I myself have to handle the padding while encrypting using
> Libgcrypt library APIs.
>
> Since, I have to handle padding in my code, is there any APIs in libgcrypt
> with which I ensure that I am padding the data in standard way?
>


Are there any APIs in Libgcrypt using which I can get padded data along
> with my plain text data which I can encrypt using gcry_cipher_encrypt?
>
>
> Thank you in advance.
> Best Regards,
> Mandar
>
>
>
> On Fri, 5 Jun 2020, 2:05 pm Werner Koch, <wk at gnupg.org> wrote:
>
>> On Tue,  2 Jun 2020 16:57, Mandar Apte said:
>> > On windows I am using Bcrypt library which also supports AES 256 in CBC
>> > mode.
>>
>> FWIW, Libgcrypt runs very well on Windows.
>>
>> > Hence, I wanted to check, if the Libgcrypt APIs are doing padding
>> > internally since I am not passing any such instruction to the Libgcrypt
>> > library explicitly?
>>
>> No, Libgcrypt does not do any padding and it expects complete blocks.
>> gcry_cipher_get_algo_blklen() tells you the block length of the cipher
>> algorithm.
>>
>> There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but
>> in this case you need to pass the entire plaintext/ciphertext to the
>> encrypt/decrypt function; there is no way to do this incremental.
>>
>> For the standard padding as used in CMS (S/MIME), you need to handle the
>> padding in your code; here is a snippet
>>
>>  if (last_block_is_incomplete)
>>    {
>>       int i,
>>       int npad = blklen - (buflen % blklen);
>>
>>       p = buffer;
>>       for (n=buflen, i=0; n < bufsize && i < npad; n++, i++)
>>         p[n] = npad;
>>       gcry_cipher_encrypt (chd, buffer, n, buffer, n);
>>     }
>>
>>
>>
>> Shalom-Salam,
>>
>>    Werner
>>
>>
>> --
>> Die Gedanken sind frei.  Ausnahmen regelt ein Bundesgesetz.
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20200610/603b85f7/attachment.html>

From jussi.kivilinna at iki.fi  Sat Jun 13 23:34:36 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 14 Jun 2020 00:34:36 +0300
Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm
In-Reply-To: <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi>
References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com>
 <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com>
 <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi>
Message-ID: <cb2e9d1b-a19b-fc07-ebd6-0d2a7132581b@iki.fi>

On 9.6.2020 20.59, Jussi Kivilinna wrote:
> Hello,
> 
> Patch looks mostly good. I have add few comments below.
> 
> On 8.6.2020 14.00, Tianjia Zhang via Gcrypt-devel wrote:
>> * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c.
>> * cipher/cipher.c (cipher_list, cipher_list_algo301):
>> Add _gcry_cipher_spec_sm4.
>> * cipher/sm4.c: New.
>> * configure.ac (available_ciphers): Add sm4.
>> * src/cipher.h: Add declarations for SM4.
>> * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.
>> * tests/basic.c (check_ciphers): Add sm4 check.
> 
> Please also add GCRY_MAC_CMAC_SM4 support in mac.c/mac-cmac.c.
> 

Oh, and please add also SM4 to documentation, doc/gcrypt.texi.

-Jussi


From jussi.kivilinna at iki.fi  Sat Jun 13 23:54:56 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sun, 14 Jun 2020 00:54:56 +0300
Subject: [PATCH] doc: add GCRY_MD_SM3,
 GCRY_MAC_HMAC_SM3 and GCRY_MAC_GOST28147_IMIT
Message-ID: <20200613215456.608907-1-jussi.kivilinna@iki.fi>

* doc/gcrypt.texi: add GCRY_MD_SM3, GCRY_MAC_HMAC_SM3 and
GCRY_MAC_GOST28147_IMIT.
--

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 doc/gcrypt.texi | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi
index ad5ada87..4eaf6d8d 100644
--- a/doc/gcrypt.texi
+++ b/doc/gcrypt.texi
@@ -3157,6 +3157,7 @@ are also supported.
 @cindex MD2, MD4, MD5
 @cindex TIGER, TIGER1, TIGER2
 @cindex HAVAL
+ at cindex SM3
 @cindex Whirlpool
 @cindex BLAKE2b-512, BLAKE2b-384, BLAKE2b-256, BLAKE2b-160
 @cindex BLAKE2s-256, BLAKE2s-224, BLAKE2s-160, BLAKE2s-128
@@ -3324,6 +3325,9 @@ See RFC 7693 for the specification.
 This is the BLAKE2s-128 algorithm which yields a message digest of 16 bytes.
 See RFC 7693 for the specification.
 
+ at item GCRY_MD_SM3
+This is the SM3 algorithm which yields a message digest of 32 bytes.
+
 @end table
 @c end table of hash algorithms
 
@@ -3703,6 +3707,7 @@ provided by Libgcrypt.
 @cindex HMAC-RIPE-MD-160
 @cindex HMAC-MD2, HMAC-MD4, HMAC-MD5
 @cindex HMAC-TIGER1
+ at cindex HMAC-SM3
 @cindex HMAC-Whirlpool
 @cindex HMAC-Stribog-256, HMAC-Stribog-512
 @cindex HMAC-GOSTR-3411-94
@@ -3816,6 +3821,10 @@ algorithm.
 This is HMAC message authentication algorithm based on the BLAKE2s-128 hash
 algorithm.
 
+ at item GCRY_MAC_HMAC_SM3
+This is HMAC message authentication algorithm based on the SM3 hash
+algorithm.
+
 @item GCRY_MAC_CMAC_AES
 This is CMAC (Cipher-based MAC) message authentication algorithm based on
 the AES block cipher algorithm.
@@ -3904,6 +3913,9 @@ key and one-time nonce.
 This is Poly1305-SEED message authentication algorithm, used with
 key and one-time nonce.
 
+ at item GCRY_MAC_GOST28147_IMIT
+This is MAC construction defined in GOST 28147-89 (see RFC 5830 Section 8).
+
 @end table
 @c end table of MAC algorithms
 
-- 
2.25.1


From mandar.apte409 at gmail.com  Mon Jun 15 09:02:23 2020
From: mandar.apte409 at gmail.com (Mandar Apte)
Date: Mon, 15 Jun 2020 12:32:23 +0530
Subject: Decrypt using BcryptDecrypt
In-Reply-To: <CAGHdk0iKfs4Tbkfn3Wyu+tjtNccEhni+AEfscn2-a2P9M2-paw@mail.gmail.com>
References: <CAGHdk0hiZOcSLPi57xzGfyB2B8fdUDxZox2S7p=8uqGN+zXGAw@mail.gmail.com>
 <87k10l6hgh.fsf@wheatstone.g10code.de>
 <CAGHdk0jY55ow0Bj0RjSbrO4dU95LVt9MBrRqe5sUJJsScGOCag@mail.gmail.com>
 <CAGHdk0iKfs4Tbkfn3Wyu+tjtNccEhni+AEfscn2-a2P9M2-paw@mail.gmail.com>
Message-ID: <CAGHdk0iPVJrBMHGjhGAvLshMaD402O__bau0sT_HzhHT66K6Ug@mail.gmail.com>

Any help regarding request in below email ?

On Wed, 10 Jun 2020, 10:08 pm Mandar Apte, <mandar.apte409 at gmail.com> wrote:

> Hello Team,
>
>           Are there any APIs in Libgcrypt using which I can get padded
> data along with my plain text data which I can encrypt using
> gcry_cipher_encrypt?
>
>
> Thanks in advance.
> Best Regards,
> Mandar
>
> On Fri, 5 Jun 2020, 7:16 pm Mandar Apte, <mandar.apte409 at gmail.com> wrote:
>
>> Hello Werner,
>>
>>           Thank you very much for the response.
>>
>> The way you have shown in the email chain below, I had done same thing in
>> my code as well. Also, I am passing the data of block length size only to
>> gcry_cipher_encrypt and gcry_cipher_decrypt APIs.
>> Now, my goal is to check, if the AES256 encryption/decryption is same for
>> libgcrypt and Bcrypt library. Thats the reason I am trying to decrypt the
>> data, which was encrypted using Libgcrypt APIs, using Bcrypt APIs on
>> windows.
>>
>> I am pretty sure if I use windows version of Libgcrypt my problem wont be
>> there at all.
>>
>> I think I myself have to handle the padding while encrypting using
>> Libgcrypt library APIs.
>>
>> Since, I have to handle padding in my code, is there any APIs in
>> libgcrypt with which I ensure that I am padding the data in standard way?
>>
>
>
> Are there any APIs in Libgcrypt using which I can get padded data along
>> with my plain text data which I can encrypt using gcry_cipher_encrypt?
>>
>>
>> Thank you in advance.
>> Best Regards,
>> Mandar
>>
>>
>>
>> On Fri, 5 Jun 2020, 2:05 pm Werner Koch, <wk at gnupg.org> wrote:
>>
>>> On Tue,  2 Jun 2020 16:57, Mandar Apte said:
>>> > On windows I am using Bcrypt library which also supports AES 256 in CBC
>>> > mode.
>>>
>>> FWIW, Libgcrypt runs very well on Windows.
>>>
>>> > Hence, I wanted to check, if the Libgcrypt APIs are doing padding
>>> > internally since I am not passing any such instruction to the Libgcrypt
>>> > library explicitly?
>>>
>>> No, Libgcrypt does not do any padding and it expects complete blocks.
>>> gcry_cipher_get_algo_blklen() tells you the block length of the cipher
>>> algorithm.
>>>
>>> There is a flag to enable ciphertext stealing (GCRY_CIPHER_CBC_CTS) but
>>> in this case you need to pass the entire plaintext/ciphertext to the
>>> encrypt/decrypt function; there is no way to do this incremental.
>>>
>>> For the standard padding as used in CMS (S/MIME), you need to handle the
>>> padding in your code; here is a snippet
>>>
>>>  if (last_block_is_incomplete)
>>>    {
>>>       int i,
>>>       int npad = blklen - (buflen % blklen);
>>>
>>>       p = buffer;
>>>       for (n=buflen, i=0; n < bufsize && i < npad; n++, i++)
>>>         p[n] = npad;
>>>       gcry_cipher_encrypt (chd, buffer, n, buffer, n);
>>>     }
>>>
>>>
>>>
>>> Shalom-Salam,
>>>
>>>    Werner
>>>
>>>
>>> --
>>> Die Gedanken sind frei.  Ausnahmen regelt ein Bundesgesetz.
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20200615/114966d0/attachment-0001.html>

From tianjia.zhang at linux.alibaba.com  Tue Jun 16 11:09:27 2020
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue, 16 Jun 2020 17:09:27 +0800
Subject: [PATCH v2 0/2] Add SM4 symmetric cipher algorithm
Message-ID: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>

SM4 (GBT.32907-2016) is a cryptographic standard issued by the
Organization of State Commercial Administration of China (OSCCA)
as an authorized cryptographic algorithms for the use within China.

SMS4 was originally created for use in protecting wireless
networks, and is mandated in the Chinese National Standard for
Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure)
(GB.15629.11-2003).

Tianjia Zhang (2):
  Add SM4 symmetric cipher algorithm
  tests: Add basic test-vectors for SM4

 cipher/Makefile.am    |   1 +
 cipher/cipher.c       |   8 ++
 cipher/mac-cmac.c     |   6 +
 cipher/mac-internal.h |   3 +
 cipher/mac.c          |  10 +-
 cipher/sm4.c          | 275 ++++++++++++++++++++++++++++++++++++++++++
 configure.ac          |   7 ++
 doc/gcrypt.texi       |   6 +
 src/cipher.h          |   1 +
 src/gcrypt.h.in       |   4 +-
 tests/basic.c         |  99 +++++++++++++++
 11 files changed, 418 insertions(+), 2 deletions(-)
 create mode 100644 cipher/sm4.c

-- 
2.17.1


From tianjia.zhang at linux.alibaba.com  Tue Jun 16 11:09:28 2020
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue, 16 Jun 2020 17:09:28 +0800
Subject: [PATCH v2 1/2] Add SM4 symmetric cipher algorithm
In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <20200616090929.102931-2-tianjia.zhang@linux.alibaba.com>

* cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c.
* cipher/cipher.c (cipher_list, cipher_list_algo301):
Add _gcry_cipher_spec_sm4.
* cipher/mac-cmac.c: Add cmac SM4.
* cipher/mac-internal.h: Declare spec_cmac_sm4.
* cipher/mac.c (mac_list, mac_list_algo201): Add cmac SM4.
* cipher/sm4.c: New.
* configure.ac (available_ciphers): Add sm4.
* doc/gcrypt.texi: Add SM4 document.
* src/cipher.h: Add declarations for SM4 and cmac SM4.
* src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 cipher/Makefile.am    |   1 +
 cipher/cipher.c       |   8 ++
 cipher/mac-cmac.c     |   6 +
 cipher/mac-internal.h |   3 +
 cipher/mac.c          |  10 +-
 cipher/sm4.c          | 275 ++++++++++++++++++++++++++++++++++++++++++
 configure.ac          |   7 ++
 doc/gcrypt.texi       |   6 +
 src/cipher.h          |   1 +
 src/gcrypt.h.in       |   4 +-
 10 files changed, 319 insertions(+), 2 deletions(-)
 create mode 100644 cipher/sm4.c

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index ef83cc74..56661dcd 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -107,6 +107,7 @@ EXTRA_libcipher_la_SOURCES = \
 	scrypt.c \
 	seed.c \
 	serpent.c serpent-sse2-amd64.S \
+	sm4.c \
 	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
diff --git a/cipher/cipher.c b/cipher/cipher.c
index edcb421a..dfb083a0 100644
--- a/cipher/cipher.c
+++ b/cipher/cipher.c
@@ -87,6 +87,9 @@ static gcry_cipher_spec_t * const cipher_list[] =
 #endif
 #if USE_CHACHA20
      &_gcry_cipher_spec_chacha20,
+#endif
+#if USE_SM4
+     &_gcry_cipher_spec_sm4,
 #endif
     NULL
   };
@@ -202,6 +205,11 @@ static gcry_cipher_spec_t * const cipher_list_algo301[] =
     &_gcry_cipher_spec_gost28147_mesh,
 #else
     NULL,
+#endif
+#if USE_SM4
+     &_gcry_cipher_spec_sm4,
+#else
+    NULL,
 #endif
   };
 
diff --git a/cipher/mac-cmac.c b/cipher/mac-cmac.c
index aee5bb63..3fb5b373 100644
--- a/cipher/mac-cmac.c
+++ b/cipher/mac-cmac.c
@@ -225,3 +225,9 @@ gcry_mac_spec_t _gcry_mac_type_spec_cmac_gost28147 = {
   &cmac_ops
 };
 #endif
+#if USE_SM4
+gcry_mac_spec_t _gcry_mac_type_spec_cmac_sm4 = {
+  GCRY_MAC_CMAC_SM4, {0, 0}, "CMAC_SM4",
+  &cmac_ops
+};
+#endif
diff --git a/cipher/mac-internal.h b/cipher/mac-internal.h
index 1936150c..8c13520b 100644
--- a/cipher/mac-internal.h
+++ b/cipher/mac-internal.h
@@ -229,6 +229,9 @@ extern gcry_mac_spec_t _gcry_mac_type_spec_cmac_gost28147;
 #if USE_GOST28147
 extern gcry_mac_spec_t _gcry_mac_type_spec_gost28147_imit;
 #endif
+#if USE_SM4
+extern gcry_mac_spec_t _gcry_mac_type_spec_cmac_sm4;
+#endif
 
 /*
  * The GMAC algorithm specifications (mac-gmac.c).
diff --git a/cipher/mac.c b/cipher/mac.c
index 0abc0d33..933be74c 100644
--- a/cipher/mac.c
+++ b/cipher/mac.c
@@ -130,6 +130,9 @@ static gcry_mac_spec_t * const mac_list[] = {
   &_gcry_mac_type_spec_gost28147_imit,
 #endif
   &_gcry_mac_type_spec_poly1305mac,
+#if USE_SM4
+  &_gcry_mac_type_spec_cmac_sm4,
+#endif
   NULL,
 };
 
@@ -300,7 +303,12 @@ static gcry_mac_spec_t * const mac_list_algo201[] =
     NULL,
 #endif
 #if USE_GOST28147
-    &_gcry_mac_type_spec_cmac_gost28147
+    &_gcry_mac_type_spec_cmac_gost28147,
+#else
+    NULL,
+#endif
+#if USE_SM4
+    &_gcry_mac_type_spec_cmac_sm4
 #else
     NULL
 #endif
diff --git a/cipher/sm4.c b/cipher/sm4.c
new file mode 100644
index 00000000..061ee26e
--- /dev/null
+++ b/cipher/sm4.c
@@ -0,0 +1,275 @@
+/* sm4.c  -  SM4 Cipher Algorithm
+ * Copyright (C) 2020 Alibaba Group.
+ * Copyright (C) 2020 Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <config.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "types.h"  /* for byte and u32 typedefs */
+#include "bithelp.h"
+#include "g10lib.h"
+#include "cipher.h"
+#include "bufhelp.h"
+
+typedef struct
+{
+  u32 rkey_enc[32];
+  u32 rkey_dec[32];
+} SM4_context;
+
+static const u32 fk[4] = {
+  0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc
+};
+
+static const byte sbox[256] = {
+  0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7,
+  0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05,
+  0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3,
+  0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
+  0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a,
+  0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62,
+  0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95,
+  0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6,
+  0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba,
+  0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8,
+  0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b,
+  0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35,
+  0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2,
+  0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87,
+  0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52,
+  0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e,
+  0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5,
+  0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1,
+  0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55,
+  0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3,
+  0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60,
+  0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f,
+  0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f,
+  0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51,
+  0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f,
+  0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8,
+  0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd,
+  0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0,
+  0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e,
+  0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84,
+  0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20,
+  0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48
+};
+
+static const u32 ck[] = {
+  0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269,
+  0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9,
+  0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249,
+  0x50575e65, 0x6c737a81, 0x888f969d, 0xa4abb2b9,
+  0xc0c7ced5, 0xdce3eaf1, 0xf8ff060d, 0x141b2229,
+  0x30373e45, 0x4c535a61, 0x686f767d, 0x848b9299,
+  0xa0a7aeb5, 0xbcc3cad1, 0xd8dfe6ed, 0xf4fb0209,
+  0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279
+};
+
+static u32 sm4_t_non_lin_sub(u32 x)
+{
+  int i;
+  byte *b = (byte *)&x;
+
+  for (i = 0; i < 4; ++i)
+    b[i] = sbox[b[i]];
+
+  return x;
+}
+
+static u32 sm4_key_lin_sub(u32 x)
+{
+  return x ^ rol(x, 13) ^ rol(x, 23);
+}
+
+static u32 sm4_enc_lin_sub(u32 x)
+{
+  return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24);
+}
+
+static u32 sm4_key_sub(u32 x)
+{
+  return sm4_key_lin_sub(sm4_t_non_lin_sub(x));
+}
+
+static u32 sm4_enc_sub(u32 x)
+{
+  return sm4_enc_lin_sub(sm4_t_non_lin_sub(x));
+}
+
+static u32 sm4_round(const u32 *x, const u32 rk)
+{
+  return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk);
+}
+
+static gcry_err_code_t
+sm4_expand_key (SM4_context *ctx, const byte *key, const unsigned keylen)
+{
+  u32 rk[4], t;
+  int i;
+
+  if (keylen != 16)
+    return GPG_ERR_INV_KEYLEN;
+
+  for (i = 0; i < 4; ++i)
+    rk[i] = buf_get_be32(&key[i*4]) ^ fk[i];
+
+  for (i = 0; i < 32; ++i)
+    {
+      t = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i]);
+      ctx->rkey_enc[i] = t;
+      rk[0] = rk[1];
+      rk[1] = rk[2];
+      rk[2] = rk[3];
+      rk[3] = t;
+    }
+
+  for (i = 0; i < 32; ++i)
+    ctx->rkey_dec[i] = ctx->rkey_enc[31 - i];
+
+  return 0;
+}
+
+static gcry_err_code_t
+sm4_setkey (void *context, const byte *key, const unsigned keylen,
+            gcry_cipher_hd_t hd)
+{
+  SM4_context *ctx = context;
+  int rc = sm4_expand_key (ctx, key, keylen);
+  (void)hd;
+  _gcry_burn_stack (4*5 + sizeof(int)*2);
+  return rc;
+}
+
+static void
+sm4_do_crypt (const u32 *rk, byte *out, const byte *in)
+{
+  u32 x[4], t;
+  int i;
+
+  for (i = 0; i < 4; ++i)
+    x[i] = buf_get_be32(&in[i*4]);
+
+  for (i = 0; i < 32; ++i)
+    {
+      t = sm4_round(x, rk[i]);
+      x[0] = x[1];
+      x[1] = x[2];
+      x[2] = x[3];
+      x[3] = t;
+    }
+
+  for (i = 0; i < 4; ++i)
+    buf_put_be32(&out[i*4], x[3 - i]);
+}
+
+static unsigned int
+sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
+{
+  SM4_context *ctx = context;
+
+  sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf);
+  return /*burn_stack*/ 4*6+sizeof(void*)*4;
+}
+
+static unsigned int
+sm4_decrypt (void *context, byte *outbuf, const byte *inbuf)
+{
+  SM4_context *ctx = context;
+
+  sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf);
+  return /*burn_stack*/ 4*6+sizeof(void*)*4;
+}
+
+static const char *
+sm4_selftest (void)
+{
+  SM4_context ctx;
+  byte scratch[16];
+
+  static const byte plaintext[16] = {
+    0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
+    0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
+  };
+  static const byte key[16] = {
+    0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
+    0xFE, 0xDC, 0xBA, 0x98, 0x76, 0x54, 0x32, 0x10,
+  };
+  static const byte ciphertext[16] = {
+    0x68, 0x1E, 0xDF, 0x34, 0xD2, 0x06, 0x96, 0x5E,
+    0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46
+  };
+
+  sm4_setkey (&ctx, key, sizeof (key), NULL);
+  sm4_encrypt (&ctx, scratch, plaintext);
+  if (memcmp (scratch, ciphertext, sizeof (ciphertext)))
+    return "SM4 test encryption failed.";
+  sm4_decrypt (&ctx, scratch, scratch);
+  if (memcmp (scratch, plaintext, sizeof (plaintext)))
+    return "SM4 test decryption failed.";
+
+  return NULL;
+}
+
+static gpg_err_code_t
+run_selftests (int algo, int extended, selftest_report_func_t report)
+{
+  const char *what;
+  const char *errtxt;
+
+  (void)extended;
+
+  if (algo != GCRY_CIPHER_SM4)
+    return GPG_ERR_CIPHER_ALGO;
+
+  what = "selftest";
+  errtxt = sm4_selftest ();
+  if (errtxt)
+    goto failed;
+
+  return 0;
+
+ failed:
+  if (report)
+    report ("cipher", GCRY_CIPHER_SM4, what, errtxt);
+  return GPG_ERR_SELFTEST_FAILED;
+}
+
+
+static gcry_cipher_oid_spec_t sm4_oids[] =
+  {
+    { "1.2.156.10197.1.104.1", GCRY_CIPHER_MODE_ECB },
+    { "1.2.156.10197.1.104.2", GCRY_CIPHER_MODE_CBC },
+    { "1.2.156.10197.1.104.3", GCRY_CIPHER_MODE_OFB },
+    { "1.2.156.10197.1.104.4", GCRY_CIPHER_MODE_CFB },
+    { "1.2.156.10197.1.104.7", GCRY_CIPHER_MODE_CTR },
+    { NULL }
+  };
+
+gcry_cipher_spec_t _gcry_cipher_spec_sm4 =
+  {
+    GCRY_CIPHER_SM4, {0, 0},
+    "SM4", NULL, sm4_oids, 16, 128,
+    sizeof (SM4_context),
+    sm4_setkey, sm4_encrypt, sm4_decrypt,
+    NULL, NULL,
+    run_selftests
+  };
diff --git a/configure.ac b/configure.ac
index 0c9100bf..f77476e0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -212,6 +212,7 @@ LIBGCRYPT_CONFIG_HOST="$host"
 # Definitions for symmetric ciphers.
 available_ciphers="arcfour blowfish cast5 des aes twofish serpent rfc2268 seed"
 available_ciphers="$available_ciphers camellia idea salsa20 gost28147 chacha20"
+available_ciphers="$available_ciphers sm4"
 enabled_ciphers=""
 
 # Definitions for public-key ciphers.
@@ -2559,6 +2560,12 @@ if test "$found" = "1" ; then
    fi
 fi
 
+LIST_MEMBER(sm4, $enabled_ciphers)
+if test "$found" = "1" ; then
+   GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo"
+   AC_DEFINE(USE_SM4, 1, [Defined if this module should be included])
+fi
+
 LIST_MEMBER(dsa, $enabled_pubkey_ciphers)
 if test "$found" = "1" ; then
    GCRYPT_PUBKEY_CIPHERS="$GCRYPT_PUBKEY_CIPHERS dsa.lo"
diff --git a/doc/gcrypt.texi b/doc/gcrypt.texi
index ad5ada87..9d1db3d4 100644
--- a/doc/gcrypt.texi
+++ b/doc/gcrypt.texi
@@ -1641,6 +1641,12 @@ if it has to be used for the selected parameter set.
 @cindex ChaCha20
 This is the ChaCha20 stream cipher.
 
+ at item GCRY_CIPHER_SM4
+ at cindex SM4 (cipher)
+A 128 bit cipher by the State Cryptography Administration
+of China (SCA).  See
+ at uref{https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10}.
+
 @end table
 
 @node Available cipher modes
diff --git a/src/cipher.h b/src/cipher.h
index 20ccb8c5..c49bbda5 100644
--- a/src/cipher.h
+++ b/src/cipher.h
@@ -302,6 +302,7 @@ extern gcry_cipher_spec_t _gcry_cipher_spec_salsa20r12;
 extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147;
 extern gcry_cipher_spec_t _gcry_cipher_spec_gost28147_mesh;
 extern gcry_cipher_spec_t _gcry_cipher_spec_chacha20;
+extern gcry_cipher_spec_t _gcry_cipher_spec_sm4;
 
 /* Declarations for the digest specifications.  */
 extern gcry_md_spec_t _gcry_digest_spec_crc32;
diff --git a/src/gcrypt.h.in b/src/gcrypt.h.in
index c0132189..5668e625 100644
--- a/src/gcrypt.h.in
+++ b/src/gcrypt.h.in
@@ -946,7 +946,8 @@ enum gcry_cipher_algos
     GCRY_CIPHER_SALSA20R12  = 314,
     GCRY_CIPHER_GOST28147   = 315,
     GCRY_CIPHER_CHACHA20    = 316,
-    GCRY_CIPHER_GOST28147_MESH   = 317 /* GOST 28247 with optional CryptoPro keymeshing */
+    GCRY_CIPHER_GOST28147_MESH   = 317, /* GOST 28247 with optional CryptoPro keymeshing */
+    GCRY_CIPHER_SM4         = 318
   };
 
 /* The Rijndael algorithm is basically AES, so provide some macros. */
@@ -1484,6 +1485,7 @@ enum gcry_mac_algos
     GCRY_MAC_CMAC_RFC2268       = 209,
     GCRY_MAC_CMAC_IDEA          = 210,
     GCRY_MAC_CMAC_GOST28147     = 211,
+    GCRY_MAC_CMAC_SM4           = 212,
 
     GCRY_MAC_GMAC_AES           = 401,
     GCRY_MAC_GMAC_CAMELLIA      = 402,
-- 
2.17.1


From tianjia.zhang at linux.alibaba.com  Tue Jun 16 11:09:29 2020
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue, 16 Jun 2020 17:09:29 +0800
Subject: [PATCH v2 2/2] tests: Add basic test-vectors for SM4
In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <20200616090929.102931-3-tianjia.zhang@linux.alibaba.com>

The added test vectors are from:
https://tools.ietf.org/html/draft-ribose-cfrg-sm4-10#appendix-A.2

* tests/basic.c (check_ciphers): Add SM4 check and test-vectors.

Signed-off-by: Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
---
 tests/basic.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/tests/basic.c b/tests/basic.c
index 2dee1bee..5acbab84 100644
--- a/tests/basic.c
+++ b/tests/basic.c
@@ -845,6 +845,30 @@ check_ecb_cipher (void)
 	  { }
 	}
       },
+      { GCRY_CIPHER_SM4,
+	"\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10",
+	0,
+	{ { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd"
+	    "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb",
+	    16,
+	    32,
+	    "\x5e\xc8\x14\x3d\xe5\x09\xcf\xf7\xb5\x17\x9f\x8f\x47\x4b\x86\x19"
+	    "\x2f\x1d\x30\x5a\x7f\xb1\x7d\xf9\x85\xf8\x1c\x84\x82\x19\x23\x04" },
+	  { }
+	}
+      },
+      { GCRY_CIPHER_SM4,
+	"\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef",
+	0,
+	{ { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd"
+	    "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb",
+	    16,
+	    32,
+	    "\xc5\x87\x68\x97\xe4\xa5\x9b\xbb\xa7\x2a\x10\xc8\x38\x72\x24\x5b"
+	    "\x12\xdd\x90\xbc\x2d\x20\x06\x92\xb5\x29\xa4\x15\x5a\xc9\xe6\x00" },
+	  { }
+	}
+      },
     };
   gcry_cipher_hd_t hde, hdd;
   unsigned char out[MAX_DATA_LEN];
@@ -2059,6 +2083,38 @@ check_ctr_cipher (void)
 	}
       },
 #endif /*USE_CAST5*/
+      {	GCRY_CIPHER_SM4,
+        "\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10",
+        "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+        { { "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb"
+            "\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd\xdd\xdd\xdd\xdd"
+            "\xee\xee\xee\xee\xee\xee\xee\xee\xff\xff\xff\xff\xff\xff\xff\xff"
+            "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb",
+            64,
+            "\xac\x32\x36\xcb\x97\x0c\xc2\x07\x91\x36\x4c\x39\x5a\x13\x42\xd1"
+            "\xa3\xcb\xc1\x87\x8c\x6f\x30\xcd\x07\x4c\xce\x38\x5c\xdd\x70\xc7"
+            "\xf2\x34\xbc\x0e\x24\xc1\x19\x80\xfd\x12\x86\x31\x0c\xe3\x7b\x92"
+            "\x6e\x02\xfc\xd0\xfa\xa0\xba\xf3\x8b\x29\x33\x85\x1d\x82\x45\x14" },
+
+          { "", 0, "" }
+        }
+      },
+      {	GCRY_CIPHER_SM4,
+        "\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef",
+        "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+        { { "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb"
+            "\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd\xdd\xdd\xdd\xdd"
+            "\xee\xee\xee\xee\xee\xee\xee\xee\xff\xff\xff\xff\xff\xff\xff\xff"
+            "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb",
+            64,
+            "\x5d\xcc\xcd\x25\xb9\x5a\xb0\x74\x17\xa0\x85\x12\xee\x16\x0e\x2f"
+            "\x8f\x66\x15\x21\xcb\xba\xb4\x4c\xc8\x71\x38\x44\x5b\xc2\x9e\x5c"
+            "\x0a\xe0\x29\x72\x05\xd6\x27\x04\x17\x3b\x21\x23\x9b\x88\x7f\x6c"
+            "\x8c\xb5\xb8\x00\x91\x7a\x24\x88\x28\x4b\xde\x9e\x16\xea\x29\x06" },
+
+          { "", 0, "" }
+        }
+      },
       {	0,
 	"",
 	"",
@@ -2559,6 +2615,26 @@ check_cfb_cipher (void)
 	"1.2.643.2.2.31.2"
       },
 #endif
+      { GCRY_CIPHER_SM4, 0,
+        "\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10",
+        "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+	{ { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd"
+            "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb",
+            32,
+            "\xac\x32\x36\xcb\x86\x1d\xd3\x16\xe6\x41\x3b\x4e\x3c\x75\x24\xb7"
+            "\x69\xd4\xc5\x4e\xd4\x33\xb9\xa0\x34\x60\x09\xbe\xb3\x7b\x2b\x3f" },
+	}
+      },
+      { GCRY_CIPHER_SM4, 0,
+        "\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef",
+        "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+	{ { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd"
+            "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb",
+            32,
+            "\x5d\xcc\xcd\x25\xa8\x4b\xa1\x65\x60\xd7\xf2\x65\x88\x70\x68\x49"
+            "\x0d\x9b\x86\xff\x20\xc3\xbf\xe1\x15\xff\xa0\x2c\xa6\x19\x2c\xc5" },
+	}
+      },
     };
   gcry_cipher_hd_t hde, hdd;
   unsigned char out[MAX_DATA_LEN];
@@ -2753,6 +2829,26 @@ check_ofb_cipher (void)
             16,
             "\x01\x26\x14\x1d\x67\xf3\x7b\xe8\x53\x8f\x5a\x8b\xe7\x40\xe4\x84" }
         }
+      },
+      { GCRY_CIPHER_SM4,
+        "\x01\x23\x45\x67\x89\xab\xcd\xef\xfe\xdc\xba\x98\x76\x54\x32\x10",
+        "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+        { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd"
+            "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb",
+            32,
+            "\xac\x32\x36\xcb\x86\x1d\xd3\x16\xe6\x41\x3b\x4e\x3c\x75\x24\xb7"
+            "\x1d\x01\xac\xa2\x48\x7c\xa5\x82\xcb\xf5\x46\x3e\x66\x98\x53\x9b" },
+        }
+      },
+      { GCRY_CIPHER_SM4,
+        "\xfe\xdc\xba\x98\x76\x54\x32\x10\x01\x23\x45\x67\x89\xab\xcd\xef",
+        "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+        { { "\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb\xcc\xcc\xcc\xcc\xdd\xdd\xdd\xdd"
+            "\xee\xee\xee\xee\xff\xff\xff\xff\xaa\xaa\xaa\xaa\xbb\xbb\xbb\xbb",
+            32,
+            "\x5d\xcc\xcd\x25\xa8\x4b\xa1\x65\x60\xd7\xf2\x65\x88\x70\x68\x49"
+            "\x33\xfa\x16\xbd\x5c\xd9\xc8\x56\xca\xca\xa1\xe1\x01\x89\x7a\x97" },
+        }
       }
     };
   gcry_cipher_hd_t hde, hdd;
@@ -9444,6 +9540,9 @@ check_ciphers (void)
 #if USE_GOST28147
     GCRY_CIPHER_GOST28147,
     GCRY_CIPHER_GOST28147_MESH,
+#endif
+#if USE_SM4
+    GCRY_CIPHER_SM4,
 #endif
     0
   };
-- 
2.17.1


From tianjia.zhang at linux.alibaba.com  Tue Jun 16 11:19:51 2020
From: tianjia.zhang at linux.alibaba.com (Tianjia Zhang)
Date: Tue, 16 Jun 2020 17:19:51 +0800
Subject: [PATCH 1/1] Add SM4 symmetric cipher algorithm
In-Reply-To: <cb2e9d1b-a19b-fc07-ebd6-0d2a7132581b@iki.fi>
References: <20200608110051.49173-1-tianjia.zhang@linux.alibaba.com>
 <20200608110051.49173-2-tianjia.zhang@linux.alibaba.com>
 <0f400063-f273-eee0-a29e-137b8c9651d2@iki.fi>
 <cb2e9d1b-a19b-fc07-ebd6-0d2a7132581b@iki.fi>
Message-ID: <817a569d-d15a-ca45-3eae-ef0e86d1adeb@linux.alibaba.com>


On 2020/6/14 5:34, Jussi Kivilinna wrote:
> On 9.6.2020 20.59, Jussi Kivilinna wrote:
>> Hello,
>>
>> Patch looks mostly good. I have add few comments below.
>>
>> On 8.6.2020 14.00, Tianjia Zhang via Gcrypt-devel wrote:
>>> * cipher/Makefile.am (EXTRA_libcipher_la_SOURCES): Add sm4.c.
>>> * cipher/cipher.c (cipher_list, cipher_list_algo301):
>>> Add _gcry_cipher_spec_sm4.
>>> * cipher/sm4.c: New.
>>> * configure.ac (available_ciphers): Add sm4.
>>> * src/cipher.h: Add declarations for SM4.
>>> * src/gcrypt.h.in (gcry_cipher_algos): Add algorithm ID for SM4.
>>> * tests/basic.c (check_ciphers): Add sm4 check.
>>
>> Please also add GCRY_MAC_CMAC_SM4 support in mac.c/mac-cmac.c.
>>
> 
> Oh, and please add also SM4 to documentation, doc/gcrypt.texi.
> 
> -Jussi
> 

Thanks for your suggestion, I have submitted v2 patch.

Thanks and best,
Tianjia


From jussi.kivilinna at iki.fi  Tue Jun 16 21:28:22 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 16 Jun 2020 22:28:22 +0300
Subject: Optimization for SM4 and x86-64/AES-NI implementations
In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <20200616192825.1584395-1-jussi.kivilinna@iki.fi>

This patch-set adds optimizations for C implementation of SM4 cipher and
AES-NI accelerated AVX and AVX2 assembly implementations. Performance
improvement for whole patch-set is presented below. Intermediate results
are listed in each patch separately. 

As summary, on x86-64, generic C implementation is ~2 to ~4 times faster
than original C implementation. AES-NI implementations speed-up
parallelizable cipher modes and there AES-NI/AVX is ~11 times faster
and AES-NI/AVX2 ~18 times faster original C implementation.

Benchmark on AMD Ryzen 7 3700X:

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     17.69 ns/B     53.92 MiB/s     76.50 c/B      4326
        ECB dec |     17.74 ns/B     53.77 MiB/s     76.72 c/B      4325
        CBC enc |     18.14 ns/B     52.56 MiB/s     78.47 c/B      4325
        CBC dec |     18.05 ns/B     52.83 MiB/s     78.09 c/B      4326
        CFB enc |     18.19 ns/B     52.44 MiB/s     78.67 c/B      4326
        CFB dec |     18.16 ns/B     52.53 MiB/s     78.53 c/B      4326
        OFB enc |     16.82 ns/B     56.70 MiB/s     72.96 c/B      4338
        OFB dec |     16.87 ns/B     56.53 MiB/s     72.96 c/B      4325
        CTR enc |     18.17 ns/B     52.47 MiB/s     78.62 c/B      4326
        CTR dec |     18.02 ns/B     52.94 MiB/s     77.92 c/B      4325
        XTS enc |     17.70 ns/B     53.87 MiB/s     76.11 c/B      4300
        XTS dec |     17.65 ns/B     54.04 MiB/s     76.28 c/B      4323?1
        CCM enc |     33.76 ns/B     28.25 MiB/s     146.9 c/B      4350
        CCM dec |     34.07 ns/B     27.99 MiB/s     147.4 c/B      4326
       CCM auth |     16.97 ns/B     56.19 MiB/s     73.41 c/B      4325
        EAX enc |     34.02 ns/B     28.03 MiB/s     147.1 c/B      4325
        EAX dec |     36.56 ns/B     26.08 MiB/s     159.1 c/B      4350
       EAX auth |     17.02 ns/B     56.03 MiB/s     73.62 c/B      4325
        GCM enc |     16.76 ns/B     56.90 MiB/s     72.50 c/B      4325
        GCM dec |     18.01 ns/B     52.94 MiB/s     78.37 c/B      4350
       GCM auth |     0.120 ns/B      7975 MiB/s     0.517 c/B      4325
        OCB enc |     18.19 ns/B     52.43 MiB/s     78.68 c/B      4325
        OCB dec |     18.15 ns/B     52.54 MiB/s     78.51 c/B      4325
       OCB auth |     16.87 ns/B     56.54 MiB/s     72.95 c/B      4325

After:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      8.32 ns/B     114.6 MiB/s     36.01 c/B      4325
        ECB dec |      8.31 ns/B     114.7 MiB/s     35.75 c/B      4300
        CBC enc |      8.94 ns/B     106.7 MiB/s     38.67 c/B      4325
        CBC dec |     0.984 ns/B     969.2 MiB/s      4.23 c/B      4300
        CFB enc |      8.92 ns/B     107.0 MiB/s     38.57 c/B      4325
        CFB dec |     0.989 ns/B     964.1 MiB/s      4.23 c/B      4275
        OFB enc |      8.45 ns/B     112.8 MiB/s     36.35 c/B      4300
        OFB dec |      8.40 ns/B     113.5 MiB/s     36.34 c/B      4325
        CTR enc |      1.00 ns/B     952.6 MiB/s      4.31 c/B      4300
        CTR dec |     0.999 ns/B     954.9 MiB/s      4.29 c/B      4300
        XTS enc |      8.81 ns/B     108.3 MiB/s     38.11 c/B      4326
        XTS dec |      8.81 ns/B     108.3 MiB/s     38.09 c/B      4325
        CCM enc |      9.93 ns/B     96.07 MiB/s     42.69 c/B      4300
        CCM dec |      9.91 ns/B     96.20 MiB/s     42.89 c/B      4326
       CCM auth |      8.89 ns/B     107.3 MiB/s     38.45 c/B      4326
        EAX enc |      9.91 ns/B     96.27 MiB/s     42.85 c/B      4325
        EAX dec |      9.91 ns/B     96.19 MiB/s     42.80 c/B      4317
       EAX auth |      8.95 ns/B     106.6 MiB/s     38.71 c/B      4325
        GCM enc |      1.11 ns/B     856.8 MiB/s      4.79 c/B      4300
        GCM dec |      1.12 ns/B     849.4 MiB/s      4.80 c/B      4275
       GCM auth |     0.117 ns/B      8154 MiB/s     0.509 c/B      4350
        OCB enc |     0.999 ns/B     954.8 MiB/s      4.29 c/B      4300
        OCB dec |      1.00 ns/B     952.1 MiB/s      4.31 c/B      4300
       OCB auth |     0.989 ns/B     964.4 MiB/s      4.25 c/B      4300


From jussi.kivilinna at iki.fi  Tue Jun 16 21:28:23 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 16 Jun 2020 22:28:23 +0300
Subject: [PATCH 1/3] Optimizations for SM4 cipher
In-Reply-To: <20200616192825.1584395-1-jussi.kivilinna@iki.fi>
References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
 <20200616192825.1584395-1-jussi.kivilinna@iki.fi>
Message-ID: <20200616192825.1584395-2-jussi.kivilinna@iki.fi>

* cipher/cipher.c (_gcry_cipher_open_internal): Add SM4 bulk
functions.
* cipher/sm4.c (ATTR_ALIGNED_64): New.
(sbox): Convert to ...
(sbox_table): ... this structure for sbox hardening as is done
for AES and GCM.
(prefetch_sbox_table): New.
(sm4_t_non_lin_sub): Make inline; Optimize sbox access pattern.
(sm4_key_lin_sub): Make inline; Tune slightly.
(sm4_key_sub, sm4_enc_sub): Make inline.
(sm4_round): Make inline; Take 'x' as separate parameters instead
of array.
(sm4_expand_key): Return void; Drop keylen; Unroll loops by 4;
Wipe sensitive variables at end; Move key-length check to
'sm4_setkey'.
(sm4_setkey): Add initial self-test step; Add key-length check;
Remove burn stack (as variables wiped in 'sm4_expand_key').
(sm4_do_crypt): Return burn stack depth; Unroll loops by 4.
(sm4_encrypt, sm4_decrypt): Prefetch sbox table; Return burn
stack from 'sm4_do_crypt', as allows tail-call optimization
by compiler.
(sm4_do_crypt_blks2): New two parallel block function for greater
instruction level parallelism.
(sm4_crypt_blocks, _gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec)
(_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New
bulk processing functions.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): New
bulk processing self-tests.
(sm4_selftest): Clear SM4 context before use; Use 'sm4_expand_key'
instead of 'sm4_setkey'; Call bulk processing self-tests.
* src/cipher.h (_gcry_sm4_ctr_enc, _gcry_sm4_ctr_dec)
(_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New.
* tests/basic.c (check_ocb_cipher): Add SM4-OCB test vector.
--

Benchmark on AMD Ryzen 7 3700X (x86-64):

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     17.69 ns/B     53.92 MiB/s     76.50 c/B      4326
        ECB dec |     17.74 ns/B     53.77 MiB/s     76.72 c/B      4325
        CBC enc |     18.14 ns/B     52.56 MiB/s     78.47 c/B      4325
        CBC dec |     18.05 ns/B     52.83 MiB/s     78.09 c/B      4326
        CFB enc |     18.19 ns/B     52.44 MiB/s     78.67 c/B      4326
        CFB dec |     18.16 ns/B     52.53 MiB/s     78.53 c/B      4326
        OFB enc |     16.82 ns/B     56.70 MiB/s     72.96 c/B      4338
        OFB dec |     16.87 ns/B     56.53 MiB/s     72.96 c/B      4325
        CTR enc |     18.17 ns/B     52.47 MiB/s     78.62 c/B      4326
        CTR dec |     18.02 ns/B     52.94 MiB/s     77.92 c/B      4325
        XTS enc |     17.70 ns/B     53.87 MiB/s     76.11 c/B      4300
        XTS dec |     17.65 ns/B     54.04 MiB/s     76.28 c/B      4323?1
        CCM enc |     33.76 ns/B     28.25 MiB/s     146.9 c/B      4350
        CCM dec |     34.07 ns/B     27.99 MiB/s     147.4 c/B      4326
       CCM auth |     16.97 ns/B     56.19 MiB/s     73.41 c/B      4325
        EAX enc |     34.02 ns/B     28.03 MiB/s     147.1 c/B      4325
        EAX dec |     36.56 ns/B     26.08 MiB/s     159.1 c/B      4350
       EAX auth |     17.02 ns/B     56.03 MiB/s     73.62 c/B      4325
        GCM enc |     16.76 ns/B     56.90 MiB/s     72.50 c/B      4325
        GCM dec |     18.01 ns/B     52.94 MiB/s     78.37 c/B      4350
       GCM auth |     0.120 ns/B      7975 MiB/s     0.517 c/B      4325
        OCB enc |     18.19 ns/B     52.43 MiB/s     78.68 c/B      4325
        OCB dec |     18.15 ns/B     52.54 MiB/s     78.51 c/B      4325
       OCB auth |     16.87 ns/B     56.54 MiB/s     72.95 c/B      4325

After (non-parallalizeble modes ~2.0x faster, parallel modes ~3.8x):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      8.28 ns/B     115.1 MiB/s     35.84 c/B      4327?1
        ECB dec |      8.33 ns/B     114.4 MiB/s     36.13 c/B      4336?1
        CBC enc |      8.94 ns/B     106.7 MiB/s     38.66 c/B      4325
        CBC dec |      4.78 ns/B     199.7 MiB/s     20.42 c/B      4275
        CFB enc |      8.95 ns/B     106.5 MiB/s     38.72 c/B      4325
        CFB dec |      4.81 ns/B     198.2 MiB/s     20.57 c/B      4275
        OFB enc |      8.48 ns/B     112.5 MiB/s     36.66 c/B      4325
        OFB dec |      8.42 ns/B     113.3 MiB/s     36.41 c/B      4325
        CTR enc |      4.81 ns/B     198.2 MiB/s     20.69 c/B      4300
        CTR dec |      4.80 ns/B     198.8 MiB/s     20.63 c/B      4300
        XTS enc |      8.75 ns/B     109.0 MiB/s     37.83 c/B      4325
        XTS dec |      8.86 ns/B     107.7 MiB/s     38.30 c/B      4326
        CCM enc |     13.74 ns/B     69.42 MiB/s     59.42 c/B      4325
        CCM dec |     13.77 ns/B     69.25 MiB/s     59.57 c/B      4326
       CCM auth |      8.87 ns/B     107.5 MiB/s     38.36 c/B      4325
        EAX enc |     13.76 ns/B     69.29 MiB/s     59.54 c/B      4326
        EAX dec |     13.77 ns/B     69.25 MiB/s     59.57 c/B      4325
       EAX auth |      8.89 ns/B     107.3 MiB/s     38.44 c/B      4325
        GCM enc |      4.96 ns/B     192.3 MiB/s     21.20 c/B      4275
        GCM dec |      4.91 ns/B     194.4 MiB/s     21.10 c/B      4300
       GCM auth |     0.116 ns/B      8232 MiB/s     0.504 c/B      4351
        OCB enc |      4.88 ns/B     195.5 MiB/s     20.86 c/B      4275
        OCB dec |      4.85 ns/B     196.6 MiB/s     20.86 c/B      4301
       OCB auth |      4.80 ns/B     198.9 MiB/s     20.62 c/B      4301

Benchmark on ARM Cortex-A53 (aarch64):

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     84.08 ns/B     11.34 MiB/s     54.48 c/B     648.0
        ECB dec |     84.07 ns/B     11.34 MiB/s     54.47 c/B     648.0
        CBC enc |     84.90 ns/B     11.23 MiB/s     55.01 c/B     647.9
        CBC dec |     84.69 ns/B     11.26 MiB/s     54.87 c/B     648.0
        CFB enc |     84.55 ns/B     11.28 MiB/s     54.79 c/B     648.0
        CFB dec |     84.55 ns/B     11.28 MiB/s     54.78 c/B     648.0
        OFB enc |     84.45 ns/B     11.29 MiB/s     54.72 c/B     647.9
        OFB dec |     84.45 ns/B     11.29 MiB/s     54.72 c/B     648.0
        CTR enc |     85.42 ns/B     11.16 MiB/s     55.35 c/B     648.0
        CTR dec |     85.42 ns/B     11.16 MiB/s     55.35 c/B     648.0
        XTS enc |     88.72 ns/B     10.75 MiB/s     57.49 c/B     648.0
        XTS dec |     88.71 ns/B     10.75 MiB/s     57.48 c/B     648.0
        CCM enc |     170.2 ns/B      5.60 MiB/s     110.3 c/B     647.9
        CCM dec |     170.2 ns/B      5.60 MiB/s     110.3 c/B     648.0
       CCM auth |     84.27 ns/B     11.32 MiB/s     54.60 c/B     648.0
        EAX enc |     170.6 ns/B      5.59 MiB/s     110.5 c/B     648.0
        EAX dec |     170.6 ns/B      5.59 MiB/s     110.5 c/B     648.0
       EAX auth |     84.51 ns/B     11.29 MiB/s     54.76 c/B     648.0
        GCM enc |     86.99 ns/B     10.96 MiB/s     56.36 c/B     648.0
        GCM dec |     87.00 ns/B     10.96 MiB/s     56.37 c/B     648.0
       GCM auth |      1.56 ns/B     609.9 MiB/s      1.01 c/B     648.0
        OCB enc |     86.77 ns/B     10.99 MiB/s     56.22 c/B     648.0
        OCB dec |     86.77 ns/B     10.99 MiB/s     56.22 c/B     648.0
       OCB auth |     86.20 ns/B     11.06 MiB/s     55.85 c/B     648.0

After (non-parallalizable modes ~30% faster, parallel modes ~80%):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     64.85 ns/B     14.71 MiB/s     42.02 c/B     648.0
        ECB dec |     64.78 ns/B     14.72 MiB/s     41.98 c/B     648.0
        CBC enc |     64.53 ns/B     14.78 MiB/s     41.81 c/B     647.9
        CBC dec |     45.09 ns/B     21.15 MiB/s     29.21 c/B     648.0
        CFB enc |     64.56 ns/B     14.77 MiB/s     41.84 c/B     648.0
        CFB dec |     45.52 ns/B     20.95 MiB/s     29.49 c/B     647.9
        OFB enc |     64.14 ns/B     14.87 MiB/s     41.56 c/B     648.0
        OFB dec |     64.14 ns/B     14.87 MiB/s     41.56 c/B     648.0
        CTR enc |     45.54 ns/B     20.94 MiB/s     29.51 c/B     648.0
        CTR dec |     45.53 ns/B     20.95 MiB/s     29.50 c/B     648.0
        XTS enc |     67.88 ns/B     14.05 MiB/s     43.98 c/B     648.0
        XTS dec |     67.69 ns/B     14.09 MiB/s     43.86 c/B     648.0
        CCM enc |     110.6 ns/B      8.62 MiB/s     71.66 c/B     648.0
        CCM dec |     110.2 ns/B      8.65 MiB/s     71.42 c/B     648.0
       CCM auth |     64.87 ns/B     14.70 MiB/s     42.04 c/B     648.0
        EAX enc |     109.9 ns/B      8.68 MiB/s     71.22 c/B     648.0
        EAX dec |     109.9 ns/B      8.68 MiB/s     71.22 c/B     648.0
       EAX auth |     64.37 ns/B     14.81 MiB/s     41.71 c/B     648.0
        GCM enc |     47.07 ns/B     20.26 MiB/s     30.51 c/B     648.0
        GCM dec |     47.08 ns/B     20.26 MiB/s     30.51 c/B     648.0
       GCM auth |      1.55 ns/B     614.7 MiB/s      1.01 c/B     648.0
        OCB enc |     48.38 ns/B     19.71 MiB/s     31.35 c/B     648.0
        OCB dec |     48.11 ns/B     19.82 MiB/s     31.17 c/B     648.0
       OCB auth |     46.71 ns/B     20.42 MiB/s     30.27 c/B     648.0

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/cipher.c |   9 +
 cipher/sm4.c    | 709 ++++++++++++++++++++++++++++++++++++++++++------
 src/cipher.h    |  16 ++
 tests/basic.c   |   2 +
 4 files changed, 648 insertions(+), 88 deletions(-)

diff --git a/cipher/cipher.c b/cipher/cipher.c
index dfb083a0..c77c9682 100644
--- a/cipher/cipher.c
+++ b/cipher/cipher.c
@@ -707,6 +707,15 @@ _gcry_cipher_open_internal (gcry_cipher_hd_t *handle,
               h->bulk.ocb_auth  = _gcry_serpent_ocb_auth;
               break;
 #endif /*USE_SERPENT*/
+#ifdef USE_SM4
+	    case GCRY_CIPHER_SM4:
+              h->bulk.cbc_dec = _gcry_sm4_cbc_dec;
+              h->bulk.cfb_dec = _gcry_sm4_cfb_dec;
+              h->bulk.ctr_enc = _gcry_sm4_ctr_enc;
+              h->bulk.ocb_crypt = _gcry_sm4_ocb_crypt;
+              h->bulk.ocb_auth  = _gcry_sm4_ocb_auth;
+              break;
+#endif /*USE_SM4*/
 #ifdef USE_TWOFISH
 	    case GCRY_CIPHER_TWOFISH:
 	    case GCRY_CIPHER_TWOFISH128:
diff --git a/cipher/sm4.c b/cipher/sm4.c
index 061ee26e..621532fa 100644
--- a/cipher/sm4.c
+++ b/cipher/sm4.c
@@ -1,6 +1,7 @@
 /* sm4.c  -  SM4 Cipher Algorithm
  * Copyright (C) 2020 Alibaba Group.
  * Copyright (C) 2020 Tianjia Zhang <tianjia.zhang at linux.alibaba.com>
+ * Copyright (C) 2020 Jussi Kivilinna <jussi.kivilinna at iki.fi>
  *
  * This file is part of Libgcrypt.
  *
@@ -27,6 +28,17 @@
 #include "g10lib.h"
 #include "cipher.h"
 #include "bufhelp.h"
+#include "cipher-internal.h"
+#include "cipher-selftest.h"
+
+/* Helper macro to force alignment to 64 bytes.  */
+#ifdef HAVE_GCC_ATTRIBUTE_ALIGNED
+# define ATTR_ALIGNED_64  __attribute__ ((aligned (64)))
+#else
+# define ATTR_ALIGNED_64
+#endif
+
+static const char *sm4_selftest (void);
 
 typedef struct
 {
@@ -34,46 +46,60 @@ typedef struct
   u32 rkey_dec[32];
 } SM4_context;
 
-static const u32 fk[4] = {
+static const u32 fk[4] =
+{
   0xa3b1bac6, 0x56aa3350, 0x677d9197, 0xb27022dc
 };
 
-static const byte sbox[256] = {
-  0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7,
-  0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05,
-  0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3,
-  0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
-  0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a,
-  0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62,
-  0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95,
-  0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6,
-  0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba,
-  0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8,
-  0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b,
-  0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35,
-  0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2,
-  0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87,
-  0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52,
-  0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e,
-  0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5,
-  0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1,
-  0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55,
-  0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3,
-  0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60,
-  0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f,
-  0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f,
-  0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51,
-  0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f,
-  0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8,
-  0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd,
-  0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0,
-  0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e,
-  0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84,
-  0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20,
-  0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48
-};
+static struct
+{
+  volatile u32 counter_head;
+  u32 cacheline_align[64 / 4 - 1];
+  byte S[256];
+  volatile u32 counter_tail;
+} sbox_table ATTR_ALIGNED_64 =
+  {
+    0,
+    { 0, },
+    {
+      0xd6, 0x90, 0xe9, 0xfe, 0xcc, 0xe1, 0x3d, 0xb7,
+      0x16, 0xb6, 0x14, 0xc2, 0x28, 0xfb, 0x2c, 0x05,
+      0x2b, 0x67, 0x9a, 0x76, 0x2a, 0xbe, 0x04, 0xc3,
+      0xaa, 0x44, 0x13, 0x26, 0x49, 0x86, 0x06, 0x99,
+      0x9c, 0x42, 0x50, 0xf4, 0x91, 0xef, 0x98, 0x7a,
+      0x33, 0x54, 0x0b, 0x43, 0xed, 0xcf, 0xac, 0x62,
+      0xe4, 0xb3, 0x1c, 0xa9, 0xc9, 0x08, 0xe8, 0x95,
+      0x80, 0xdf, 0x94, 0xfa, 0x75, 0x8f, 0x3f, 0xa6,
+      0x47, 0x07, 0xa7, 0xfc, 0xf3, 0x73, 0x17, 0xba,
+      0x83, 0x59, 0x3c, 0x19, 0xe6, 0x85, 0x4f, 0xa8,
+      0x68, 0x6b, 0x81, 0xb2, 0x71, 0x64, 0xda, 0x8b,
+      0xf8, 0xeb, 0x0f, 0x4b, 0x70, 0x56, 0x9d, 0x35,
+      0x1e, 0x24, 0x0e, 0x5e, 0x63, 0x58, 0xd1, 0xa2,
+      0x25, 0x22, 0x7c, 0x3b, 0x01, 0x21, 0x78, 0x87,
+      0xd4, 0x00, 0x46, 0x57, 0x9f, 0xd3, 0x27, 0x52,
+      0x4c, 0x36, 0x02, 0xe7, 0xa0, 0xc4, 0xc8, 0x9e,
+      0xea, 0xbf, 0x8a, 0xd2, 0x40, 0xc7, 0x38, 0xb5,
+      0xa3, 0xf7, 0xf2, 0xce, 0xf9, 0x61, 0x15, 0xa1,
+      0xe0, 0xae, 0x5d, 0xa4, 0x9b, 0x34, 0x1a, 0x55,
+      0xad, 0x93, 0x32, 0x30, 0xf5, 0x8c, 0xb1, 0xe3,
+      0x1d, 0xf6, 0xe2, 0x2e, 0x82, 0x66, 0xca, 0x60,
+      0xc0, 0x29, 0x23, 0xab, 0x0d, 0x53, 0x4e, 0x6f,
+      0xd5, 0xdb, 0x37, 0x45, 0xde, 0xfd, 0x8e, 0x2f,
+      0x03, 0xff, 0x6a, 0x72, 0x6d, 0x6c, 0x5b, 0x51,
+      0x8d, 0x1b, 0xaf, 0x92, 0xbb, 0xdd, 0xbc, 0x7f,
+      0x11, 0xd9, 0x5c, 0x41, 0x1f, 0x10, 0x5a, 0xd8,
+      0x0a, 0xc1, 0x31, 0x88, 0xa5, 0xcd, 0x7b, 0xbd,
+      0x2d, 0x74, 0xd0, 0x12, 0xb8, 0xe5, 0xb4, 0xb0,
+      0x89, 0x69, 0x97, 0x4a, 0x0c, 0x96, 0x77, 0x7e,
+      0x65, 0xb9, 0xf1, 0x09, 0xc5, 0x6e, 0xc6, 0x84,
+      0x18, 0xf0, 0x7d, 0xec, 0x3a, 0xdc, 0x4d, 0x20,
+      0x79, 0xee, 0x5f, 0x3e, 0xd7, 0xcb, 0x39, 0x48
+    },
+    0
+  };
 
-static const u32 ck[] = {
+static const u32 ck[] =
+{
   0x00070e15, 0x1c232a31, 0x383f464d, 0x545b6269,
   0x70777e85, 0x8c939aa1, 0xa8afb6bd, 0xc4cbd2d9,
   0xe0e7eef5, 0xfc030a11, 0x181f262d, 0x343b4249,
@@ -84,68 +110,96 @@ static const u32 ck[] = {
   0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279
 };
 
-static u32 sm4_t_non_lin_sub(u32 x)
+static inline void prefetch_sbox_table(void)
 {
-  int i;
-  byte *b = (byte *)&x;
+  const volatile byte *vtab = (void *)&sbox_table;
+
+  /* Modify counters to trigger copy-on-write and unsharing if physical pages
+   * of look-up table are shared between processes.  Modifying counters also
+   * causes checksums for pages to change and hint same-page merging algorithm
+   * that these pages are frequently changing.  */
+  sbox_table.counter_head++;
+  sbox_table.counter_tail++;
+
+  /* Prefetch look-up table to cache.  */
+  (void)vtab[0 * 32];
+  (void)vtab[1 * 32];
+  (void)vtab[2 * 32];
+  (void)vtab[3 * 32];
+  (void)vtab[4 * 32];
+  (void)vtab[5 * 32];
+  (void)vtab[6 * 32];
+  (void)vtab[7 * 32];
+  (void)vtab[8 * 32 - 1];
+}
 
-  for (i = 0; i < 4; ++i)
-    b[i] = sbox[b[i]];
+static inline u32 sm4_t_non_lin_sub(u32 x)
+{
+  u32 out;
 
-  return x;
+  out  = (u32)sbox_table.S[(x >> 0) & 0xff] << 0;
+  out |= (u32)sbox_table.S[(x >> 8) & 0xff] << 8;
+  out |= (u32)sbox_table.S[(x >> 16) & 0xff] << 16;
+  out |= (u32)sbox_table.S[(x >> 24) & 0xff] << 24;
+
+  return out;
 }
 
-static u32 sm4_key_lin_sub(u32 x)
+static inline u32 sm4_key_lin_sub(u32 x)
 {
   return x ^ rol(x, 13) ^ rol(x, 23);
 }
 
-static u32 sm4_enc_lin_sub(u32 x)
+static inline u32 sm4_enc_lin_sub(u32 x)
 {
-  return x ^ rol(x, 2) ^ rol(x, 10) ^ rol(x, 18) ^ rol(x, 24);
+  u32 xrol2 = rol(x, 2);
+  return x ^ xrol2 ^ rol(xrol2, 8) ^ rol(xrol2, 16) ^ rol(x, 24);
 }
 
-static u32 sm4_key_sub(u32 x)
+static inline u32 sm4_key_sub(u32 x)
 {
   return sm4_key_lin_sub(sm4_t_non_lin_sub(x));
 }
 
-static u32 sm4_enc_sub(u32 x)
+static inline u32 sm4_enc_sub(u32 x)
 {
   return sm4_enc_lin_sub(sm4_t_non_lin_sub(x));
 }
 
-static u32 sm4_round(const u32 *x, const u32 rk)
+static inline u32
+sm4_round(const u32 x0, const u32 x1, const u32 x2, const u32 x3, const u32 rk)
 {
-  return x[0] ^ sm4_enc_sub(x[1] ^ x[2] ^ x[3] ^ rk);
+  return x0 ^ sm4_enc_sub(x1 ^ x2 ^ x3 ^ rk);
 }
 
-static gcry_err_code_t
-sm4_expand_key (SM4_context *ctx, const byte *key, const unsigned keylen)
+static void
+sm4_expand_key (SM4_context *ctx, const byte *key)
 {
-  u32 rk[4], t;
+  u32 rk[4];
   int i;
 
-  if (keylen != 16)
-    return GPG_ERR_INV_KEYLEN;
+  rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0];
+  rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1];
+  rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2];
+  rk[3] = buf_get_be32(key + 4 * 3) ^ fk[3];
 
-  for (i = 0; i < 4; ++i)
-    rk[i] = buf_get_be32(&key[i*4]) ^ fk[i];
-
-  for (i = 0; i < 32; ++i)
+  for (i = 0; i < 32; i += 4)
     {
-      t = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i]);
-      ctx->rkey_enc[i] = t;
-      rk[0] = rk[1];
-      rk[1] = rk[2];
-      rk[2] = rk[3];
-      rk[3] = t;
+      rk[0] = rk[0] ^ sm4_key_sub(rk[1] ^ rk[2] ^ rk[3] ^ ck[i + 0]);
+      rk[1] = rk[1] ^ sm4_key_sub(rk[2] ^ rk[3] ^ rk[0] ^ ck[i + 1]);
+      rk[2] = rk[2] ^ sm4_key_sub(rk[3] ^ rk[0] ^ rk[1] ^ ck[i + 2]);
+      rk[3] = rk[3] ^ sm4_key_sub(rk[0] ^ rk[1] ^ rk[2] ^ ck[i + 3]);
+      ctx->rkey_enc[i + 0] = rk[0];
+      ctx->rkey_enc[i + 1] = rk[1];
+      ctx->rkey_enc[i + 2] = rk[2];
+      ctx->rkey_enc[i + 3] = rk[3];
+      ctx->rkey_dec[31 - i - 0] = rk[0];
+      ctx->rkey_dec[31 - i - 1] = rk[1];
+      ctx->rkey_dec[31 - i - 2] = rk[2];
+      ctx->rkey_dec[31 - i - 3] = rk[3];
     }
 
-  for (i = 0; i < 32; ++i)
-    ctx->rkey_dec[i] = ctx->rkey_enc[31 - i];
-
-  return 0;
+  wipememory (rk, sizeof(rk));
 }
 
 static gcry_err_code_t
@@ -153,32 +207,53 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen,
             gcry_cipher_hd_t hd)
 {
   SM4_context *ctx = context;
-  int rc = sm4_expand_key (ctx, key, keylen);
+  static int init = 0;
+  static const char *selftest_failed = NULL;
+
   (void)hd;
-  _gcry_burn_stack (4*5 + sizeof(int)*2);
-  return rc;
+
+  if (!init)
+    {
+      init = 1;
+      selftest_failed = sm4_selftest();
+      if (selftest_failed)
+	log_error("%s\n", selftest_failed);
+    }
+  if (selftest_failed)
+    return GPG_ERR_SELFTEST_FAILED;
+
+  if (keylen != 16)
+    return GPG_ERR_INV_KEYLEN;
+
+  sm4_expand_key (ctx, key);
+  return 0;
 }
 
-static void
+static unsigned int
 sm4_do_crypt (const u32 *rk, byte *out, const byte *in)
 {
-  u32 x[4], t;
+  u32 x[4];
   int i;
 
-  for (i = 0; i < 4; ++i)
-    x[i] = buf_get_be32(&in[i*4]);
+  x[0] = buf_get_be32(in + 0 * 4);
+  x[1] = buf_get_be32(in + 1 * 4);
+  x[2] = buf_get_be32(in + 2 * 4);
+  x[3] = buf_get_be32(in + 3 * 4);
 
-  for (i = 0; i < 32; ++i)
+  for (i = 0; i < 32; i += 4)
     {
-      t = sm4_round(x, rk[i]);
-      x[0] = x[1];
-      x[1] = x[2];
-      x[2] = x[3];
-      x[3] = t;
+      x[0] = sm4_round(x[0], x[1], x[2], x[3], rk[i + 0]);
+      x[1] = sm4_round(x[1], x[2], x[3], x[0], rk[i + 1]);
+      x[2] = sm4_round(x[2], x[3], x[0], x[1], rk[i + 2]);
+      x[3] = sm4_round(x[3], x[0], x[1], x[2], rk[i + 3]);
     }
 
-  for (i = 0; i < 4; ++i)
-    buf_put_be32(&out[i*4], x[3 - i]);
+  buf_put_be32(out + 0 * 4, x[3 - 0]);
+  buf_put_be32(out + 1 * 4, x[3 - 1]);
+  buf_put_be32(out + 2 * 4, x[3 - 2]);
+  buf_put_be32(out + 3 * 4, x[3 - 3]);
+
+  return /*burn_stack*/ 4*6+sizeof(void*)*4;
 }
 
 static unsigned int
@@ -186,8 +261,9 @@ sm4_encrypt (void *context, byte *outbuf, const byte *inbuf)
 {
   SM4_context *ctx = context;
 
-  sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf);
-  return /*burn_stack*/ 4*6+sizeof(void*)*4;
+  prefetch_sbox_table ();
+
+  return sm4_do_crypt (ctx->rkey_enc, outbuf, inbuf);
 }
 
 static unsigned int
@@ -195,8 +271,453 @@ sm4_decrypt (void *context, byte *outbuf, const byte *inbuf)
 {
   SM4_context *ctx = context;
 
-  sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf);
-  return /*burn_stack*/ 4*6+sizeof(void*)*4;
+  prefetch_sbox_table ();
+
+  return sm4_do_crypt (ctx->rkey_dec, outbuf, inbuf);
+}
+
+static unsigned int
+sm4_do_crypt_blks2 (const u32 *rk, byte *out, const byte *in)
+{
+  u32 x[4];
+  u32 y[4];
+  u32 k;
+  int i;
+
+  /* Encrypts/Decrypts two blocks for higher instruction level
+   * parallelism. */
+
+  x[0] = buf_get_be32(in + 0 * 4);
+  x[1] = buf_get_be32(in + 1 * 4);
+  x[2] = buf_get_be32(in + 2 * 4);
+  x[3] = buf_get_be32(in + 3 * 4);
+  y[0] = buf_get_be32(in + 4 * 4);
+  y[1] = buf_get_be32(in + 5 * 4);
+  y[2] = buf_get_be32(in + 6 * 4);
+  y[3] = buf_get_be32(in + 7 * 4);
+
+  for (i = 0; i < 32; i += 4)
+    {
+      k = rk[i + 0];
+      x[0] = sm4_round(x[0], x[1], x[2], x[3], k);
+      y[0] = sm4_round(y[0], y[1], y[2], y[3], k);
+      k = rk[i + 1];
+      x[1] = sm4_round(x[1], x[2], x[3], x[0], k);
+      y[1] = sm4_round(y[1], y[2], y[3], y[0], k);
+      k = rk[i + 2];
+      x[2] = sm4_round(x[2], x[3], x[0], x[1], k);
+      y[2] = sm4_round(y[2], y[3], y[0], y[1], k);
+      k = rk[i + 3];
+      x[3] = sm4_round(x[3], x[0], x[1], x[2], k);
+      y[3] = sm4_round(y[3], y[0], y[1], y[2], k);
+    }
+
+  buf_put_be32(out + 0 * 4, x[3 - 0]);
+  buf_put_be32(out + 1 * 4, x[3 - 1]);
+  buf_put_be32(out + 2 * 4, x[3 - 2]);
+  buf_put_be32(out + 3 * 4, x[3 - 3]);
+  buf_put_be32(out + 4 * 4, y[3 - 0]);
+  buf_put_be32(out + 5 * 4, y[3 - 1]);
+  buf_put_be32(out + 6 * 4, y[3 - 2]);
+  buf_put_be32(out + 7 * 4, y[3 - 3]);
+
+  return /*burn_stack*/ 4*10+sizeof(void*)*4;
+}
+
+static unsigned int
+sm4_crypt_blocks (const u32 *rk, byte *out, const byte *in,
+		  unsigned int num_blks)
+{
+  unsigned int burn_depth = 0;
+  unsigned int nburn;
+
+  while (num_blks >= 2)
+    {
+      nburn = sm4_do_crypt_blks2 (rk, out, in);
+      burn_depth = nburn > burn_depth ? nburn : burn_depth;
+      out += 2 * 16;
+      in += 2 * 16;
+      num_blks -= 2;
+    }
+
+  while (num_blks)
+    {
+      nburn = sm4_do_crypt (rk, out, in);
+      burn_depth = nburn > burn_depth ? nburn : burn_depth;
+      out += 16;
+      in += 16;
+      num_blks--;
+    }
+
+  if (burn_depth)
+    burn_depth += sizeof(void *) * 5;
+  return burn_depth;
+}
+
+/* Bulk encryption of complete blocks in CTR mode.  This function is only
+   intended for the bulk encryption feature of cipher.c.  CTR is expected to be
+   of size 16. */
+void
+_gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
+                  void *outbuf_arg, const void *inbuf_arg,
+                  size_t nblocks)
+{
+  SM4_context *ctx = context;
+  byte *outbuf = outbuf_arg;
+  const byte *inbuf = inbuf_arg;
+  int burn_stack_depth = 0;
+
+  /* Process remaining blocks. */
+  if (nblocks)
+    {
+      unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in,
+				   unsigned int num_blks);
+      byte tmpbuf[16 * 8];
+      unsigned int tmp_used = 16;
+
+      if (0)
+	;
+      else
+	{
+	  prefetch_sbox_table ();
+	  crypt_blk1_8 = sm4_crypt_blocks;
+	}
+
+      /* Process remaining blocks. */
+      while (nblocks)
+	{
+	  size_t curr_blks = nblocks > 8 ? 8 : nblocks;
+	  size_t i;
+
+	  if (curr_blks * 16 > tmp_used)
+	    tmp_used = curr_blks * 16;
+
+	  cipher_block_cpy (tmpbuf + 0 * 16, ctr, 16);
+	  for (i = 1; i < curr_blks; i++)
+	    {
+	      cipher_block_cpy (&tmpbuf[i * 16], ctr, 16);
+	      cipher_block_add (&tmpbuf[i * 16], i, 16);
+	    }
+	  cipher_block_add (ctr, curr_blks, 16);
+
+	  burn_stack_depth = crypt_blk1_8 (ctx->rkey_enc, tmpbuf, tmpbuf,
+					   curr_blks);
+
+	  for (i = 0; i < curr_blks; i++)
+	    {
+	      cipher_block_xor (outbuf, &tmpbuf[i * 16], inbuf, 16);
+	      outbuf += 16;
+	      inbuf += 16;
+	    }
+
+	  nblocks -= curr_blks;
+	}
+
+      wipememory(tmpbuf, tmp_used);
+    }
+
+  if (burn_stack_depth)
+    _gcry_burn_stack(burn_stack_depth);
+}
+
+/* Bulk decryption of complete blocks in CBC mode.  This function is only
+   intended for the bulk encryption feature of cipher.c. */
+void
+_gcry_sm4_cbc_dec(void *context, unsigned char *iv,
+                  void *outbuf_arg, const void *inbuf_arg,
+                  size_t nblocks)
+{
+  SM4_context *ctx = context;
+  unsigned char *outbuf = outbuf_arg;
+  const unsigned char *inbuf = inbuf_arg;
+  int burn_stack_depth = 0;
+
+  /* Process remaining blocks. */
+  if (nblocks)
+    {
+      unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in,
+				   unsigned int num_blks);
+      unsigned char savebuf[16 * 8];
+      unsigned int tmp_used = 16;
+
+      if (0)
+	;
+      else
+	{
+	  prefetch_sbox_table ();
+	  crypt_blk1_8 = sm4_crypt_blocks;
+	}
+
+      /* Process remaining blocks. */
+      while (nblocks)
+	{
+	  size_t curr_blks = nblocks > 8 ? 8 : nblocks;
+	  size_t i;
+
+	  if (curr_blks * 16 > tmp_used)
+	    tmp_used = curr_blks * 16;
+
+	  burn_stack_depth = crypt_blk1_8 (ctx->rkey_dec, savebuf, inbuf,
+					   curr_blks);
+
+	  for (i = 0; i < curr_blks; i++)
+	    {
+	      cipher_block_xor_n_copy_2(outbuf, &savebuf[i * 16], iv, inbuf,
+					16);
+	      outbuf += 16;
+	      inbuf += 16;
+	    }
+
+	  nblocks -= curr_blks;
+	}
+
+      wipememory(savebuf, tmp_used);
+    }
+
+  if (burn_stack_depth)
+    _gcry_burn_stack(burn_stack_depth);
+}
+
+/* Bulk decryption of complete blocks in CFB mode.  This function is only
+   intended for the bulk encryption feature of cipher.c. */
+void
+_gcry_sm4_cfb_dec(void *context, unsigned char *iv,
+                  void *outbuf_arg, const void *inbuf_arg,
+                  size_t nblocks)
+{
+  SM4_context *ctx = context;
+  unsigned char *outbuf = outbuf_arg;
+  const unsigned char *inbuf = inbuf_arg;
+  int burn_stack_depth = 0;
+
+  /* Process remaining blocks. */
+  if (nblocks)
+    {
+      unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in,
+				   unsigned int num_blks);
+      unsigned char ivbuf[16 * 8];
+      unsigned int tmp_used = 16;
+
+      if (0)
+	;
+      else
+	{
+	  prefetch_sbox_table ();
+	  crypt_blk1_8 = sm4_crypt_blocks;
+	}
+
+      /* Process remaining blocks. */
+      while (nblocks)
+	{
+	  size_t curr_blks = nblocks > 8 ? 8 : nblocks;
+	  size_t i;
+
+	  if (curr_blks * 16 > tmp_used)
+	    tmp_used = curr_blks * 16;
+
+	  cipher_block_cpy (&ivbuf[0 * 16], iv, 16);
+	  for (i = 1; i < curr_blks; i++)
+	    cipher_block_cpy (&ivbuf[i * 16], &inbuf[(i - 1) * 16], 16);
+	  cipher_block_cpy (iv, &inbuf[(i - 1) * 16], 16);
+
+	  burn_stack_depth = crypt_blk1_8 (ctx->rkey_enc, ivbuf, ivbuf,
+					   curr_blks);
+
+	  for (i = 0; i < curr_blks; i++)
+	    {
+	      cipher_block_xor (outbuf, inbuf, &ivbuf[i * 16], 16);
+	      outbuf += 16;
+	      inbuf += 16;
+	    }
+
+	  nblocks -= curr_blks;
+	}
+
+      wipememory(ivbuf, tmp_used);
+    }
+
+  if (burn_stack_depth)
+    _gcry_burn_stack(burn_stack_depth);
+}
+
+/* Bulk encryption/decryption of complete blocks in OCB mode. */
+size_t
+_gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
+		     const void *inbuf_arg, size_t nblocks, int encrypt)
+{
+  SM4_context *ctx = (void *)&c->context.c;
+  unsigned char *outbuf = outbuf_arg;
+  const unsigned char *inbuf = inbuf_arg;
+  u64 blkn = c->u_mode.ocb.data_nblocks;
+  int burn_stack_depth = 0;
+
+  if (nblocks)
+    {
+      unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in,
+				   unsigned int num_blks);
+      const u32 *rk = encrypt ? ctx->rkey_enc : ctx->rkey_dec;
+      unsigned char tmpbuf[16 * 8];
+      unsigned int tmp_used = 16;
+
+      if (0)
+	;
+      else
+	{
+	  prefetch_sbox_table ();
+	  crypt_blk1_8 = sm4_crypt_blocks;
+	}
+
+      while (nblocks)
+	{
+	  size_t curr_blks = nblocks > 8 ? 8 : nblocks;
+	  size_t i;
+
+	  if (curr_blks * 16 > tmp_used)
+	    tmp_used = curr_blks * 16;
+
+	  for (i = 0; i < curr_blks; i++)
+	    {
+	      const unsigned char *l = ocb_get_l(c, ++blkn);
+
+	      /* Checksum_i = Checksum_{i-1} xor P_i  */
+	      if (encrypt)
+		cipher_block_xor_1(c->u_ctr.ctr, &inbuf[i * 16], 16);
+
+	      /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	      cipher_block_xor_2dst (&tmpbuf[i * 16], c->u_iv.iv, l, 16);
+	      cipher_block_xor (&outbuf[i * 16], &inbuf[i * 16],
+				c->u_iv.iv, 16);
+	    }
+
+	  /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+	  crypt_blk1_8 (rk, outbuf, outbuf, curr_blks);
+
+	  for (i = 0; i < curr_blks; i++)
+	    {
+	      cipher_block_xor_1 (&outbuf[i * 16], &tmpbuf[i * 16], 16);
+
+	      /* Checksum_i = Checksum_{i-1} xor P_i  */
+	      if (!encrypt)
+		  cipher_block_xor_1(c->u_ctr.ctr, &outbuf[i * 16], 16);
+	    }
+
+	  outbuf += curr_blks * 16;
+	  inbuf  += curr_blks * 16;
+	  nblocks -= curr_blks;
+	}
+
+      wipememory(tmpbuf, tmp_used);
+    }
+
+  c->u_mode.ocb.data_nblocks = blkn;
+
+  if (burn_stack_depth)
+    _gcry_burn_stack(burn_stack_depth);
+
+  return 0;
+}
+
+/* Bulk authentication of complete blocks in OCB mode. */
+size_t
+_gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks)
+{
+  SM4_context *ctx = (void *)&c->context.c;
+  const unsigned char *abuf = abuf_arg;
+  u64 blkn = c->u_mode.ocb.aad_nblocks;
+
+  if (nblocks)
+    {
+      unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in,
+				   unsigned int num_blks);
+      unsigned char tmpbuf[16 * 8];
+      unsigned int tmp_used = 16;
+
+      if (0)
+	;
+      else
+	{
+	  prefetch_sbox_table ();
+	  crypt_blk1_8 = sm4_crypt_blocks;
+	}
+
+      while (nblocks)
+	{
+	  size_t curr_blks = nblocks > 8 ? 8 : nblocks;
+	  size_t i;
+
+	  if (curr_blks * 16 > tmp_used)
+	    tmp_used = curr_blks * 16;
+
+	  for (i = 0; i < curr_blks; i++)
+	    {
+	      const unsigned char *l = ocb_get_l(c, ++blkn);
+
+	      /* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	      cipher_block_xor_2dst (&tmpbuf[i * 16],
+				     c->u_mode.ocb.aad_offset, l, 16);
+	      cipher_block_xor_1 (&tmpbuf[i * 16], &abuf[i * 16], 16);
+	    }
+
+	  /* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+	  crypt_blk1_8 (ctx->rkey_enc, tmpbuf, tmpbuf, curr_blks);
+
+	  for (i = 0; i < curr_blks; i++)
+	    {
+	      cipher_block_xor_1 (c->u_mode.ocb.aad_sum, &tmpbuf[i * 16], 16);
+	    }
+
+	  abuf += curr_blks * 16;
+	  nblocks -= curr_blks;
+	}
+
+      wipememory(tmpbuf, tmp_used);
+    }
+
+  c->u_mode.ocb.aad_nblocks = blkn;
+
+  return 0;
+}
+
+/* Run the self-tests for SM4-CTR, tests IV increment of bulk CTR
+   encryption.  Returns NULL on success. */
+static const char*
+selftest_ctr_128 (void)
+{
+  const int nblocks = 16 - 1;
+  const int blocksize = 16;
+  const int context_size = sizeof(SM4_context);
+
+  return _gcry_selftest_helper_ctr("SM4", &sm4_setkey,
+           &sm4_encrypt, &_gcry_sm4_ctr_enc, nblocks, blocksize,
+	   context_size);
+}
+
+/* Run the self-tests for SM4-CBC, tests bulk CBC decryption.
+   Returns NULL on success. */
+static const char*
+selftest_cbc_128 (void)
+{
+  const int nblocks = 16 - 1;
+  const int blocksize = 16;
+  const int context_size = sizeof(SM4_context);
+
+  return _gcry_selftest_helper_cbc("SM4", &sm4_setkey,
+           &sm4_encrypt, &_gcry_sm4_cbc_dec, nblocks, blocksize,
+	   context_size);
+}
+
+/* Run the self-tests for SM4-CFB, tests bulk CFB decryption.
+   Returns NULL on success. */
+static const char*
+selftest_cfb_128 (void)
+{
+  const int nblocks = 16 - 1;
+  const int blocksize = 16;
+  const int context_size = sizeof(SM4_context);
+
+  return _gcry_selftest_helper_cfb("SM4", &sm4_setkey,
+           &sm4_encrypt, &_gcry_sm4_cfb_dec, nblocks, blocksize,
+	   context_size);
 }
 
 static const char *
@@ -204,6 +725,7 @@ sm4_selftest (void)
 {
   SM4_context ctx;
   byte scratch[16];
+  const char *r;
 
   static const byte plaintext[16] = {
     0x01, 0x23, 0x45, 0x67, 0x89, 0xAB, 0xCD, 0xEF,
@@ -218,7 +740,9 @@ sm4_selftest (void)
     0x86, 0xB3, 0xE9, 0x4F, 0x53, 0x6E, 0x42, 0x46
   };
 
-  sm4_setkey (&ctx, key, sizeof (key), NULL);
+  memset (&ctx, 0, sizeof(ctx));
+
+  sm4_expand_key (&ctx, key);
   sm4_encrypt (&ctx, scratch, plaintext);
   if (memcmp (scratch, ciphertext, sizeof (ciphertext)))
     return "SM4 test encryption failed.";
@@ -226,6 +750,15 @@ sm4_selftest (void)
   if (memcmp (scratch, plaintext, sizeof (plaintext)))
     return "SM4 test decryption failed.";
 
+  if ( (r = selftest_ctr_128 ()) )
+    return r;
+
+  if ( (r = selftest_cbc_128 ()) )
+    return r;
+
+  if ( (r = selftest_cfb_128 ()) )
+    return r;
+
   return NULL;
 }
 
diff --git a/src/cipher.h b/src/cipher.h
index c49bbda5..decdc4d1 100644
--- a/src/cipher.h
+++ b/src/cipher.h
@@ -241,6 +241,22 @@ size_t _gcry_serpent_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
 size_t _gcry_serpent_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
 			       size_t nblocks);
 
+/*-- sm4.c --*/
+void _gcry_sm4_ctr_enc (void *context, unsigned char *ctr,
+			void *outbuf_arg, const void *inbuf_arg,
+			size_t nblocks);
+void _gcry_sm4_cbc_dec (void *context, unsigned char *iv,
+                        void *outbuf_arg, const void *inbuf_arg,
+			size_t nblocks);
+void _gcry_sm4_cfb_dec (void *context, unsigned char *iv,
+			void *outbuf_arg, const void *inbuf_arg,
+			size_t nblocks);
+size_t _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
+			    const void *inbuf_arg, size_t nblocks,
+			    int encrypt);
+size_t _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg,
+			   size_t nblocks);
+
 /*-- twofish.c --*/
 void _gcry_twofish_ctr_enc (void *context, unsigned char *ctr,
                             void *outbuf_arg, const void *inbuf_arg,
diff --git a/tests/basic.c b/tests/basic.c
index 5acbab84..8ccb9c66 100644
--- a/tests/basic.c
+++ b/tests/basic.c
@@ -7035,6 +7035,8 @@ check_ocb_cipher (void)
     "\x99\xeb\x35\xb0\x62\x4e\x7b\xf1\x5e\x9f\xed\x32\x78\x90\x0b\xd0");
   check_ocb_cipher_largebuf(GCRY_CIPHER_SERPENT256, 32,
     "\x71\x66\x2f\x68\xbf\xdd\xcc\xb1\xbf\x81\x56\x5f\x01\x73\xeb\x44");
+  check_ocb_cipher_largebuf(GCRY_CIPHER_SM4, 16,
+    "\x2c\x0b\x31\x0b\xf4\x71\x9b\x01\xf4\x18\x5d\xf1\xe9\x3d\xed\x6b");
 
   /* Check that the AAD data is correctly buffered.  */
   check_ocb_cipher_splitaad ();
-- 
2.25.1


From jussi.kivilinna at iki.fi  Tue Jun 16 21:28:24 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 16 Jun 2020 22:28:24 +0300
Subject: [PATCH 2/3] Add SM4 x86-64/AES-NI/AVX implementation
In-Reply-To: <20200616192825.1584395-1-jussi.kivilinna@iki.fi>
References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
 <20200616192825.1584395-1-jussi.kivilinna@iki.fi>
Message-ID: <20200616192825.1584395-3-jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add 'sm4-aesni-avx-amd64.S'.
* cipher/sm4-aesni-avx-amd64.S: New.
* cipher/sm4.c (USE_AESNI_AVX, ASM_FUNC_ABI): New.
(SM4_context) [USE_AESNI_AVX]: Add 'use_aesni_avx'.
[USE_AESNI_AVX] (_gcry_sm4_aesni_avx_expand_key)
(_gcry_sm4_aesni_avx_crypt_blk1_8, _gcry_sm4_aesni_avx_ctr_enc)
(_gcry_sm4_aesni_avx_cbc_dec, _gcry_sm4_aesni_avx_cfb_dec)
(_gcry_sm4_aesni_avx_ocb_enc, _gcry_sm4_aesni_avx_ocb_dec)
(_gcry_sm4_aesni_avx_ocb_auth): New.
(sm4_expand_key) [USE_AESNI_AVX]: Use AES-NI/AVX key setup.
(sm4_setkey): Enable AES-NI/AVX if supported by HW.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX]: Add
AES-NI/AVX bulk functions.
* configure.ac: Add ''sm4-aesni-avx-amd64.lo'.
--

This patch adds x86-64/AES-NI/AVX bulk encryption/decryption and key
setup for SM4 cipher. Bulk functions process eight blocks in parallel.

Benchmark on AMD Ryzen 7 3700X:

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      8.94 ns/B     106.7 MiB/s     38.66 c/B      4325
        CBC dec |      4.78 ns/B     199.7 MiB/s     20.42 c/B      4275
        CFB enc |      8.95 ns/B     106.5 MiB/s     38.72 c/B      4325
        CFB dec |      4.81 ns/B     198.2 MiB/s     20.57 c/B      4275
        CTR enc |      4.81 ns/B     198.2 MiB/s     20.69 c/B      4300
        CTR dec |      4.80 ns/B     198.8 MiB/s     20.63 c/B      4300
       GCM auth |     0.116 ns/B      8232 MiB/s     0.504 c/B      4351
        OCB enc |      4.88 ns/B     195.5 MiB/s     20.86 c/B      4275
        OCB dec |      4.85 ns/B     196.6 MiB/s     20.86 c/B      4301
       OCB auth |      4.80 ns/B     198.9 MiB/s     20.62 c/B      4301

After (~3.0x faster):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      8.98 ns/B     106.2 MiB/s     38.62 c/B      4300
        CBC dec |      1.55 ns/B     613.7 MiB/s      6.64 c/B      4275
        CFB enc |      8.96 ns/B     106.4 MiB/s     38.52 c/B      4300
        CFB dec |      1.54 ns/B     617.4 MiB/s      6.60 c/B      4275
        CTR enc |      1.57 ns/B     607.8 MiB/s      6.75 c/B      4300
        CTR dec |      1.57 ns/B     608.9 MiB/s      6.74 c/B      4300
        OCB enc |      1.58 ns/B     603.8 MiB/s      6.75 c/B      4275
        OCB dec |      1.57 ns/B     605.7 MiB/s      6.73 c/B      4275
       OCB auth |      1.53 ns/B     624.5 MiB/s      6.57 c/B      4300

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/Makefile.am           |   2 +-
 cipher/sm4-aesni-avx-amd64.S | 987 +++++++++++++++++++++++++++++++++++
 cipher/sm4.c                 | 232 ++++++++
 configure.ac                 |   7 +
 4 files changed, 1227 insertions(+), 1 deletion(-)
 create mode 100644 cipher/sm4-aesni-avx-amd64.S

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 56661dcd..427922c6 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -107,7 +107,7 @@ EXTRA_libcipher_la_SOURCES = \
 	scrypt.c \
 	seed.c \
 	serpent.c serpent-sse2-amd64.S \
-	sm4.c \
+	sm4.c sm4-aesni-avx-amd64.S \
 	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
diff --git a/cipher/sm4-aesni-avx-amd64.S b/cipher/sm4-aesni-avx-amd64.S
new file mode 100644
index 00000000..3610b98c
--- /dev/null
+++ b/cipher/sm4-aesni-avx-amd64.S
@@ -0,0 +1,987 @@
+/* sm4-avx-aesni-amd64.S  -  AES-NI/AVX implementation of SM4 cipher
+ *
+ * Copyright (C) 2020 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* Based on SM4 AES-NI work by Markku-Juhani O. Saarinen at:
+ *  https://github.com/mjosaarinen/sm4ni
+ */
+
+#include <config.h>
+
+#ifdef __x86_64
+#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
+    defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT)
+
+#include "asm-common-amd64.h"
+
+/* vector registers */
+#define RX0          %xmm0
+#define RX1          %xmm1
+#define MASK_4BIT    %xmm2
+#define RTMP0        %xmm3
+#define RTMP1        %xmm4
+#define RTMP2        %xmm5
+#define RTMP3        %xmm6
+#define RTMP4        %xmm7
+
+#define RA0          %xmm8
+#define RA1          %xmm9
+#define RA2          %xmm10
+#define RA3          %xmm11
+
+#define RB0          %xmm12
+#define RB1          %xmm13
+#define RB2          %xmm14
+#define RB3          %xmm15
+
+#define RNOT         %xmm0
+#define RBSWAP       %xmm1
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+
+/* Transpose four 32-bit words between 128-bit vectors. */
+#define transpose_4x4(x0, x1, x2, x3, t1, t2) \
+	vpunpckhdq x1, x0, t2; \
+	vpunpckldq x1, x0, x0; \
+	\
+	vpunpckldq x3, x2, t1; \
+	vpunpckhdq x3, x2, x2; \
+	\
+	vpunpckhqdq t1, x0, x1; \
+	vpunpcklqdq t1, x0, x0; \
+	\
+	vpunpckhqdq x2, t2, x3; \
+	vpunpcklqdq x2, t2, x2;
+
+/* post-SubByte transform. */
+#define transform_pre(x, lo_t, hi_t, mask4bit, tmp0) \
+	vpand x, mask4bit, tmp0; \
+	vpandn x, mask4bit, x; \
+	vpsrld $4, x, x; \
+	\
+	vpshufb tmp0, lo_t, tmp0; \
+	vpshufb x, hi_t, x; \
+	vpxor tmp0, x, x;
+
+/* post-SubByte transform. Note: x has been XOR'ed with mask4bit by
+ * 'vaeslastenc' instruction. */
+#define transform_post(x, lo_t, hi_t, mask4bit, tmp0) \
+	vpandn mask4bit, x, tmp0; \
+	vpsrld $4, x, x; \
+	vpand x, mask4bit, x; \
+	\
+	vpshufb tmp0, lo_t, tmp0; \
+	vpshufb x, hi_t, x; \
+	vpxor tmp0, x, x;
+
+/**********************************************************************
+  4-way && 8-way SM4 with AES-NI and AVX
+ **********************************************************************/
+
+.text
+.align 16
+
+/*
+ * Following four affine transform look-up tables are from work by
+ * Markku-Juhani O. Saarinen, at https://github.com/mjosaarinen/sm4ni
+ *
+ * These allow exposing SM4 S-Box from AES SubByte.
+ */
+
+/* pre-SubByte affine transform, from SM4 field to AES field. */
+.Lpre_tf_lo_s:
+	.quad 0x9197E2E474720701, 0xC7C1B4B222245157
+.Lpre_tf_hi_s:
+	.quad 0xE240AB09EB49A200, 0xF052B91BF95BB012
+
+/* post-SubByte affine transform, from AES field to SM4 field. */
+.Lpost_tf_lo_s:
+	.quad 0x5B67F2CEA19D0834, 0xEDD14478172BBE82
+.Lpost_tf_hi_s:
+	.quad 0xAE7201DD73AFDC00, 0x11CDBE62CC1063BF
+
+/* For isolating SubBytes from AESENCLAST, inverse shift row */
+.Linv_shift_row:
+	.byte 0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b
+	.byte 0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03
+
+/* Inverse shift row + Rotate left by 8 bits on 32-bit words with vpshufb */
+.Linv_shift_row_rol_8:
+	.byte 0x07, 0x00, 0x0d, 0x0a, 0x0b, 0x04, 0x01, 0x0e
+	.byte 0x0f, 0x08, 0x05, 0x02, 0x03, 0x0c, 0x09, 0x06
+
+/* Inverse shift row + Rotate left by 16 bits on 32-bit words with vpshufb */
+.Linv_shift_row_rol_16:
+	.byte 0x0a, 0x07, 0x00, 0x0d, 0x0e, 0x0b, 0x04, 0x01
+	.byte 0x02, 0x0f, 0x08, 0x05, 0x06, 0x03, 0x0c, 0x09
+
+/* Inverse shift row + Rotate left by 24 bits on 32-bit words with vpshufb */
+.Linv_shift_row_rol_24:
+	.byte 0x0d, 0x0a, 0x07, 0x00, 0x01, 0x0e, 0x0b, 0x04
+	.byte 0x05, 0x02, 0x0f, 0x08, 0x09, 0x06, 0x03, 0x0c
+
+/* For CTR-mode IV byteswap */
+.Lbswap128_mask:
+	.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+
+/* For input word byte-swap */
+.Lbswap32_mask:
+	.byte 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12
+
+.align 4
+/* 4-bit mask */
+.L0f0f0f0f:
+	.long 0x0f0f0f0f
+
+.align 8
+.globl _gcry_sm4_aesni_avx_expand_key
+ELF(.type   _gcry_sm4_aesni_avx_expand_key, at function;)
+_gcry_sm4_aesni_avx_expand_key:
+	/* input:
+	 *	%rdi: 128-bit key
+	 *	%rsi: rkey_enc
+	 *	%rdx: rkey_dec
+	 *	%rcx: fk array
+	 *	%r8: ck array
+	 */
+	CFI_STARTPROC();
+
+	vmovd 0*4(%rdi), RA0;
+	vmovd 1*4(%rdi), RA1;
+	vmovd 2*4(%rdi), RA2;
+	vmovd 3*4(%rdi), RA3;
+
+	vmovdqa .Lbswap32_mask rRIP, RTMP2;
+	vpshufb RTMP2, RA0, RA0;
+	vpshufb RTMP2, RA1, RA1;
+	vpshufb RTMP2, RA2, RA2;
+	vpshufb RTMP2, RA3, RA3;
+
+	vmovd 0*4(%rcx), RB0;
+	vmovd 1*4(%rcx), RB1;
+	vmovd 2*4(%rcx), RB2;
+	vmovd 3*4(%rcx), RB3;
+	vpxor RB0, RA0, RA0;
+	vpxor RB1, RA1, RA1;
+	vpxor RB2, RA2, RA2;
+	vpxor RB3, RA3, RA3;
+
+	vbroadcastss .L0f0f0f0f rRIP, MASK_4BIT;
+	vmovdqa .Lpre_tf_lo_s rRIP, RTMP4;
+	vmovdqa .Lpre_tf_hi_s rRIP, RB0;
+	vmovdqa .Lpost_tf_lo_s rRIP, RB1;
+	vmovdqa .Lpost_tf_hi_s rRIP, RB2;
+	vmovdqa .Linv_shift_row rRIP, RB3;
+
+#define ROUND(round, s0, s1, s2, s3) \
+	vbroadcastss (4*(round))(%r8), RX0; \
+	vpxor s1, RX0, RX0; \
+	vpxor s2, RX0, RX0; \
+	vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \
+	\
+	/* sbox, non-linear part */ \
+	transform_pre(RX0, RTMP4, RB0, MASK_4BIT, RTMP0); \
+	vaesenclast MASK_4BIT, RX0, RX0; \
+	transform_post(RX0, RB1, RB2, MASK_4BIT, RTMP0); \
+	\
+	/* linear part */ \
+	vpshufb RB3, RX0, RX0; \
+	vpxor RX0, s0, s0; /* s0 ^ x */ \
+	vpslld $13, RX0, RTMP0; \
+	vpsrld $19, RX0, RTMP1; \
+	vpslld $23, RX0, RTMP2; \
+	vpsrld $9, RX0, RTMP3; \
+	vpxor RTMP0, RTMP1, RTMP1;  \
+	vpxor RTMP2, RTMP3, RTMP3;  \
+	vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,13) */ \
+	vpxor RTMP3, s0, s0; /* s0 ^ x ^ rol(x,13) ^ rol(x,23) */
+
+	leaq (32*4)(%r8), %rax;
+	leaq (32*4)(%rdx), %rdx;
+.align 16
+.Lroundloop_expand_key:
+	leaq (-4*4)(%rdx), %rdx;
+	ROUND(0, RA0, RA1, RA2, RA3);
+	ROUND(1, RA1, RA2, RA3, RA0);
+	ROUND(2, RA2, RA3, RA0, RA1);
+	ROUND(3, RA3, RA0, RA1, RA2);
+	leaq (4*4)(%r8), %r8;
+	vmovd RA0, (0*4)(%rsi);
+	vmovd RA1, (1*4)(%rsi);
+	vmovd RA2, (2*4)(%rsi);
+	vmovd RA3, (3*4)(%rsi);
+	vmovd RA0, (3*4)(%rdx);
+	vmovd RA1, (2*4)(%rdx);
+	vmovd RA2, (1*4)(%rdx);
+	vmovd RA3, (0*4)(%rdx);
+	leaq (4*4)(%rsi), %rsi;
+	cmpq %rax, %r8;
+	jne .Lroundloop_expand_key;
+
+#undef ROUND
+
+	vzeroall;
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_expand_key,.-_gcry_sm4_aesni_avx_expand_key;)
+
+.align 8
+ELF(.type   sm4_aesni_avx_crypt_blk1_4, at function;)
+sm4_aesni_avx_crypt_blk1_4:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: dst (1..4 blocks)
+	 *	%rdx: src (1..4 blocks)
+	 *	%rcx: num blocks (1..4)
+	 */
+	CFI_STARTPROC();
+
+	vmovdqu 0*16(%rdx), RA0;
+	vmovdqa RA0, RA1;
+	vmovdqa RA0, RA2;
+	vmovdqa RA0, RA3;
+	cmpq $2, %rcx;
+	jb .Lblk4_load_input_done;
+	vmovdqu 1*16(%rdx), RA1;
+	je .Lblk4_load_input_done;
+	vmovdqu 2*16(%rdx), RA2;
+	cmpq $3, %rcx;
+	je .Lblk4_load_input_done;
+	vmovdqu 3*16(%rdx), RA3;
+
+.Lblk4_load_input_done:
+
+	vmovdqa .Lbswap32_mask rRIP, RTMP2;
+	vpshufb RTMP2, RA0, RA0;
+	vpshufb RTMP2, RA1, RA1;
+	vpshufb RTMP2, RA2, RA2;
+	vpshufb RTMP2, RA3, RA3;
+
+	vbroadcastss .L0f0f0f0f rRIP, MASK_4BIT;
+	vmovdqa .Lpre_tf_lo_s rRIP, RTMP4;
+	vmovdqa .Lpre_tf_hi_s rRIP, RB0;
+	vmovdqa .Lpost_tf_lo_s rRIP, RB1;
+	vmovdqa .Lpost_tf_hi_s rRIP, RB2;
+	vmovdqa .Linv_shift_row rRIP, RB3;
+	vmovdqa .Linv_shift_row_rol_8 rRIP, RTMP2;
+	vmovdqa .Linv_shift_row_rol_16 rRIP, RTMP3;
+	transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1);
+
+#define ROUND(round, s0, s1, s2, s3) \
+	vbroadcastss (4*(round))(%rdi), RX0; \
+	vpxor s1, RX0, RX0; \
+	vpxor s2, RX0, RX0; \
+	vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \
+	\
+	/* sbox, non-linear part */ \
+	transform_pre(RX0, RTMP4, RB0, MASK_4BIT, RTMP0); \
+	vaesenclast MASK_4BIT, RX0, RX0; \
+	transform_post(RX0, RB1, RB2, MASK_4BIT, RTMP0); \
+	\
+	/* linear part */ \
+	vpshufb RB3, RX0, RTMP0; \
+	vpxor RTMP0, s0, s0; /* s0 ^ x */ \
+	vpshufb RTMP2, RX0, RTMP1; \
+	vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) */ \
+	vpshufb RTMP3, RX0, RTMP1; \
+	vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) ^ rol(x,16) */ \
+	vpshufb .Linv_shift_row_rol_24 rRIP, RX0, RTMP1; \
+	vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,24) */ \
+	vpslld $2, RTMP0, RTMP1; \
+	vpsrld $30, RTMP0, RTMP0; \
+	vpxor RTMP0, s0, s0;  \
+	vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */
+
+	leaq (32*4)(%rdi), %rax;
+.align 16
+.Lroundloop_blk4:
+	ROUND(0, RA0, RA1, RA2, RA3);
+	ROUND(1, RA1, RA2, RA3, RA0);
+	ROUND(2, RA2, RA3, RA0, RA1);
+	ROUND(3, RA3, RA0, RA1, RA2);
+	leaq (4*4)(%rdi), %rdi;
+	cmpq %rax, %rdi;
+	jne .Lroundloop_blk4;
+
+#undef ROUND
+
+	vmovdqa .Lbswap128_mask rRIP, RTMP2;
+
+	transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1);
+	vpshufb RTMP2, RA0, RA0;
+	vpshufb RTMP2, RA1, RA1;
+	vpshufb RTMP2, RA2, RA2;
+	vpshufb RTMP2, RA3, RA3;
+
+	vmovdqu RA0, 0*16(%rsi);
+	cmpq $2, %rcx;
+	jb .Lblk4_store_output_done;
+	vmovdqu RA1, 1*16(%rsi);
+	je .Lblk4_store_output_done;
+	vmovdqu RA2, 2*16(%rsi);
+	cmpq $3, %rcx;
+	je .Lblk4_store_output_done;
+	vmovdqu RA3, 3*16(%rsi);
+
+.Lblk4_store_output_done:
+	vzeroall;
+	xorl %eax, %eax;
+	ret;
+	CFI_ENDPROC();
+ELF(.size sm4_aesni_avx_crypt_blk1_4,.-sm4_aesni_avx_crypt_blk1_4;)
+
+.align 8
+ELF(.type __sm4_crypt_blk8, at function;)
+__sm4_crypt_blk8:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel
+	 * 						ciphertext blocks
+	 * output:
+	 *	RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: eight parallel plaintext
+	 *						blocks
+	 */
+	CFI_STARTPROC();
+
+	vmovdqa .Lbswap32_mask rRIP, RTMP2;
+	vpshufb RTMP2, RA0, RA0;
+	vpshufb RTMP2, RA1, RA1;
+	vpshufb RTMP2, RA2, RA2;
+	vpshufb RTMP2, RA3, RA3;
+	vpshufb RTMP2, RB0, RB0;
+	vpshufb RTMP2, RB1, RB1;
+	vpshufb RTMP2, RB2, RB2;
+	vpshufb RTMP2, RB3, RB3;
+
+	vbroadcastss .L0f0f0f0f rRIP, MASK_4BIT;
+	transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1);
+	transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1);
+
+#define ROUND(round, s0, s1, s2, s3, r0, r1, r2, r3) \
+	vbroadcastss (4*(round))(%rdi), RX0; \
+	vmovdqa .Lpre_tf_lo_s rRIP, RTMP4; \
+	vmovdqa .Lpre_tf_hi_s rRIP, RTMP1; \
+	vmovdqa RX0, RX1; \
+	vpxor s1, RX0, RX0; \
+	vpxor s2, RX0, RX0; \
+	vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \
+	    vmovdqa .Lpost_tf_lo_s rRIP, RTMP2; \
+	    vmovdqa .Lpost_tf_hi_s rRIP, RTMP3; \
+	    vpxor r1, RX1, RX1; \
+	    vpxor r2, RX1, RX1; \
+	    vpxor r3, RX1, RX1; /* r1 ^ r2 ^ r3 ^ rk */ \
+	\
+	/* sbox, non-linear part */ \
+	transform_pre(RX0, RTMP4, RTMP1, MASK_4BIT, RTMP0); \
+	    transform_pre(RX1, RTMP4, RTMP1, MASK_4BIT, RTMP0); \
+	    vmovdqa .Linv_shift_row rRIP, RTMP4; \
+	vaesenclast MASK_4BIT, RX0, RX0; \
+	    vaesenclast MASK_4BIT, RX1, RX1; \
+	transform_post(RX0, RTMP2, RTMP3, MASK_4BIT, RTMP0); \
+	    transform_post(RX1, RTMP2, RTMP3, MASK_4BIT, RTMP0); \
+	\
+	/* linear part */ \
+	vpshufb RTMP4, RX0, RTMP0; \
+	vpxor RTMP0, s0, s0; /* s0 ^ x */ \
+	    vpshufb RTMP4, RX1, RTMP2; \
+	    vmovdqa .Linv_shift_row_rol_8 rRIP, RTMP4; \
+	    vpxor RTMP2, r0, r0; /* r0 ^ x */ \
+	vpshufb RTMP4, RX0, RTMP1; \
+	vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) */ \
+	    vpshufb RTMP4, RX1, RTMP3; \
+	    vmovdqa .Linv_shift_row_rol_16 rRIP, RTMP4; \
+	    vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) */ \
+	vpshufb RTMP4, RX0, RTMP1; \
+	vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) ^ rol(x,16) */ \
+	    vpshufb RTMP4, RX1, RTMP3; \
+	    vmovdqa .Linv_shift_row_rol_24 rRIP, RTMP4; \
+	    vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) ^ rol(x,16) */ \
+	vpshufb RTMP4, RX0, RTMP1; \
+	vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,24) */ \
+	vpslld $2, RTMP0, RTMP1; \
+	vpsrld $30, RTMP0, RTMP0; \
+	vpxor RTMP0, s0, s0;  \
+	vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ \
+	    vpshufb RTMP4, RX1, RTMP3; \
+	    vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,24) */ \
+	    vpslld $2, RTMP2, RTMP3; \
+	    vpsrld $30, RTMP2, RTMP2; \
+	    vpxor RTMP2, r0, r0;  \
+	    vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */
+
+	leaq (32*4)(%rdi), %rax;
+.align 16
+.Lroundloop_blk8:
+	ROUND(0, RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3);
+	ROUND(1, RA1, RA2, RA3, RA0, RB1, RB2, RB3, RB0);
+	ROUND(2, RA2, RA3, RA0, RA1, RB2, RB3, RB0, RB1);
+	ROUND(3, RA3, RA0, RA1, RA2, RB3, RB0, RB1, RB2);
+	leaq (4*4)(%rdi), %rdi;
+	cmpq %rax, %rdi;
+	jne .Lroundloop_blk8;
+
+#undef ROUND
+
+	vmovdqa .Lbswap128_mask rRIP, RTMP2;
+
+	transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1);
+	transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1);
+	vpshufb RTMP2, RA0, RA0;
+	vpshufb RTMP2, RA1, RA1;
+	vpshufb RTMP2, RA2, RA2;
+	vpshufb RTMP2, RA3, RA3;
+	vpshufb RTMP2, RB0, RB0;
+	vpshufb RTMP2, RB1, RB1;
+	vpshufb RTMP2, RB2, RB2;
+	vpshufb RTMP2, RB3, RB3;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size __sm4_crypt_blk8,.-__sm4_crypt_blk8;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx_crypt_blk1_8
+ELF(.type   _gcry_sm4_aesni_avx_crypt_blk1_8, at function;)
+_gcry_sm4_aesni_avx_crypt_blk1_8:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: dst (1..8 blocks)
+	 *	%rdx: src (1..8 blocks)
+	 *	%rcx: num blocks (1..8)
+	 */
+	CFI_STARTPROC();
+
+	cmpq $5, %rcx;
+	jb sm4_aesni_avx_crypt_blk1_4;
+	vmovdqu (0 * 16)(%rdx), RA0;
+	vmovdqu (1 * 16)(%rdx), RA1;
+	vmovdqu (2 * 16)(%rdx), RA2;
+	vmovdqu (3 * 16)(%rdx), RA3;
+	vmovdqu (4 * 16)(%rdx), RB0;
+	vmovdqa RB0, RB1;
+	vmovdqa RB0, RB2;
+	vmovdqa RB0, RB3;
+	je .Lblk8_load_input_done;
+	vmovdqu (5 * 16)(%rdx), RB1;
+	cmpq $7, %rcx;
+	jb .Lblk8_load_input_done;
+	vmovdqu (6 * 16)(%rdx), RB2;
+	je .Lblk8_load_input_done;
+	vmovdqu (7 * 16)(%rdx), RB3;
+
+.Lblk8_load_input_done:
+	call __sm4_crypt_blk8;
+
+	cmpq $6, %rcx;
+	vmovdqu RA0, (0 * 16)(%rsi);
+	vmovdqu RA1, (1 * 16)(%rsi);
+	vmovdqu RA2, (2 * 16)(%rsi);
+	vmovdqu RA3, (3 * 16)(%rsi);
+	vmovdqu RB0, (4 * 16)(%rsi);
+	jb .Lblk8_store_output_done;
+	vmovdqu RB1, (5 * 16)(%rsi);
+	je .Lblk8_store_output_done;
+	vmovdqu RB2, (6 * 16)(%rsi);
+	cmpq $7, %rcx;
+	je .Lblk8_store_output_done;
+	vmovdqu RB3, (7 * 16)(%rsi);
+
+.Lblk8_store_output_done:
+	vzeroall;
+	xorl %eax, %eax;
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_crypt_blk1_8,.-_gcry_sm4_aesni_avx_crypt_blk1_8;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx_ctr_enc
+ELF(.type   _gcry_sm4_aesni_avx_ctr_enc, at function;)
+_gcry_sm4_aesni_avx_ctr_enc:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: dst (8 blocks)
+	 *	%rdx: src (8 blocks)
+	 *	%rcx: iv (big endian, 128bit)
+	 */
+	CFI_STARTPROC();
+
+	/* load IV and byteswap */
+	vmovdqu (%rcx), RA0;
+
+	vmovdqa .Lbswap128_mask rRIP, RBSWAP;
+	vpshufb RBSWAP, RA0, RTMP0; /* be => le */
+
+	vpcmpeqd RNOT, RNOT, RNOT;
+	vpsrldq $8, RNOT, RNOT; /* low: -1, high: 0 */
+
+#define inc_le128(x, minus_one, tmp) \
+	vpcmpeqq minus_one, x, tmp; \
+	vpsubq minus_one, x, x; \
+	vpslldq $8, tmp, tmp; \
+	vpsubq tmp, x, x;
+
+	/* construct IVs */
+	inc_le128(RTMP0, RNOT, RTMP2); /* +1 */
+	vpshufb RBSWAP, RTMP0, RA1;
+	inc_le128(RTMP0, RNOT, RTMP2); /* +2 */
+	vpshufb RBSWAP, RTMP0, RA2;
+	inc_le128(RTMP0, RNOT, RTMP2); /* +3 */
+	vpshufb RBSWAP, RTMP0, RA3;
+	inc_le128(RTMP0, RNOT, RTMP2); /* +4 */
+	vpshufb RBSWAP, RTMP0, RB0;
+	inc_le128(RTMP0, RNOT, RTMP2); /* +5 */
+	vpshufb RBSWAP, RTMP0, RB1;
+	inc_le128(RTMP0, RNOT, RTMP2); /* +6 */
+	vpshufb RBSWAP, RTMP0, RB2;
+	inc_le128(RTMP0, RNOT, RTMP2); /* +7 */
+	vpshufb RBSWAP, RTMP0, RB3;
+	inc_le128(RTMP0, RNOT, RTMP2); /* +8 */
+	vpshufb RBSWAP, RTMP0, RTMP1;
+
+	/* store new IV */
+	vmovdqu RTMP1, (%rcx);
+
+	call __sm4_crypt_blk8;
+
+	vpxor (0 * 16)(%rdx), RA0, RA0;
+	vpxor (1 * 16)(%rdx), RA1, RA1;
+	vpxor (2 * 16)(%rdx), RA2, RA2;
+	vpxor (3 * 16)(%rdx), RA3, RA3;
+	vpxor (4 * 16)(%rdx), RB0, RB0;
+	vpxor (5 * 16)(%rdx), RB1, RB1;
+	vpxor (6 * 16)(%rdx), RB2, RB2;
+	vpxor (7 * 16)(%rdx), RB3, RB3;
+
+	vmovdqu RA0, (0 * 16)(%rsi);
+	vmovdqu RA1, (1 * 16)(%rsi);
+	vmovdqu RA2, (2 * 16)(%rsi);
+	vmovdqu RA3, (3 * 16)(%rsi);
+	vmovdqu RB0, (4 * 16)(%rsi);
+	vmovdqu RB1, (5 * 16)(%rsi);
+	vmovdqu RB2, (6 * 16)(%rsi);
+	vmovdqu RB3, (7 * 16)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_ctr_enc,.-_gcry_sm4_aesni_avx_ctr_enc;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx_cbc_dec
+ELF(.type   _gcry_sm4_aesni_avx_cbc_dec, at function;)
+_gcry_sm4_aesni_avx_cbc_dec:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: dst (8 blocks)
+	 *	%rdx: src (8 blocks)
+	 *	%rcx: iv
+	 */
+	CFI_STARTPROC();
+
+	vmovdqu (0 * 16)(%rdx), RA0;
+	vmovdqu (1 * 16)(%rdx), RA1;
+	vmovdqu (2 * 16)(%rdx), RA2;
+	vmovdqu (3 * 16)(%rdx), RA3;
+	vmovdqu (4 * 16)(%rdx), RB0;
+	vmovdqu (5 * 16)(%rdx), RB1;
+	vmovdqu (6 * 16)(%rdx), RB2;
+	vmovdqu (7 * 16)(%rdx), RB3;
+
+	call __sm4_crypt_blk8;
+
+	vmovdqu (7 * 16)(%rdx), RNOT;
+	vpxor (%rcx), RA0, RA0;
+	vpxor (0 * 16)(%rdx), RA1, RA1;
+	vpxor (1 * 16)(%rdx), RA2, RA2;
+	vpxor (2 * 16)(%rdx), RA3, RA3;
+	vpxor (3 * 16)(%rdx), RB0, RB0;
+	vpxor (4 * 16)(%rdx), RB1, RB1;
+	vpxor (5 * 16)(%rdx), RB2, RB2;
+	vpxor (6 * 16)(%rdx), RB3, RB3;
+	vmovdqu RNOT, (%rcx); /* store new IV */
+
+	vmovdqu RA0, (0 * 16)(%rsi);
+	vmovdqu RA1, (1 * 16)(%rsi);
+	vmovdqu RA2, (2 * 16)(%rsi);
+	vmovdqu RA3, (3 * 16)(%rsi);
+	vmovdqu RB0, (4 * 16)(%rsi);
+	vmovdqu RB1, (5 * 16)(%rsi);
+	vmovdqu RB2, (6 * 16)(%rsi);
+	vmovdqu RB3, (7 * 16)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_cbc_dec,.-_gcry_sm4_aesni_avx_cbc_dec;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx_cfb_dec
+ELF(.type   _gcry_sm4_aesni_avx_cfb_dec, at function;)
+_gcry_sm4_aesni_avx_cfb_dec:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: dst (8 blocks)
+	 *	%rdx: src (8 blocks)
+	 *	%rcx: iv
+	 */
+	CFI_STARTPROC();
+
+	/* Load input */
+	vmovdqu (%rcx), RA0;
+	vmovdqu 0 * 16(%rdx), RA1;
+	vmovdqu 1 * 16(%rdx), RA2;
+	vmovdqu 2 * 16(%rdx), RA3;
+	vmovdqu 3 * 16(%rdx), RB0;
+	vmovdqu 4 * 16(%rdx), RB1;
+	vmovdqu 5 * 16(%rdx), RB2;
+	vmovdqu 6 * 16(%rdx), RB3;
+
+	/* Update IV */
+	vmovdqu 7 * 16(%rdx), RNOT;
+	vmovdqu RNOT, (%rcx);
+
+	call __sm4_crypt_blk8;
+
+	vpxor (0 * 16)(%rdx), RA0, RA0;
+	vpxor (1 * 16)(%rdx), RA1, RA1;
+	vpxor (2 * 16)(%rdx), RA2, RA2;
+	vpxor (3 * 16)(%rdx), RA3, RA3;
+	vpxor (4 * 16)(%rdx), RB0, RB0;
+	vpxor (5 * 16)(%rdx), RB1, RB1;
+	vpxor (6 * 16)(%rdx), RB2, RB2;
+	vpxor (7 * 16)(%rdx), RB3, RB3;
+
+	vmovdqu RA0, (0 * 16)(%rsi);
+	vmovdqu RA1, (1 * 16)(%rsi);
+	vmovdqu RA2, (2 * 16)(%rsi);
+	vmovdqu RA3, (3 * 16)(%rsi);
+	vmovdqu RB0, (4 * 16)(%rsi);
+	vmovdqu RB1, (5 * 16)(%rsi);
+	vmovdqu RB2, (6 * 16)(%rsi);
+	vmovdqu RB3, (7 * 16)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_cfb_dec,.-_gcry_sm4_aesni_avx_cfb_dec;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx_ocb_enc
+ELF(.type _gcry_sm4_aesni_avx_ocb_enc, at function;)
+
+_gcry_sm4_aesni_avx_ocb_enc:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: dst (8 blocks)
+	 *	%rdx: src (8 blocks)
+	 *	%rcx: offset
+	 *	%r8 : checksum
+	 *	%r9 : L pointers (void *L[8])
+	 */
+	CFI_STARTPROC();
+
+	subq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(4 * 8);
+
+	movq %r10, (0 * 8)(%rsp);
+	movq %r11, (1 * 8)(%rsp);
+	movq %r12, (2 * 8)(%rsp);
+	movq %r13, (3 * 8)(%rsp);
+	CFI_REL_OFFSET(%r10, 0 * 8);
+	CFI_REL_OFFSET(%r11, 1 * 8);
+	CFI_REL_OFFSET(%r12, 2 * 8);
+	CFI_REL_OFFSET(%r13, 3 * 8);
+
+	vmovdqu (%rcx), RTMP0;
+	vmovdqu (%r8), RTMP1;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* Checksum_i = Checksum_{i-1} xor P_i  */
+	/* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+
+#define OCB_INPUT(n, lreg, xreg) \
+	  vmovdqu (n * 16)(%rdx), xreg; \
+	  vpxor (lreg), RTMP0, RTMP0; \
+	  vpxor xreg, RTMP1, RTMP1; \
+	  vpxor RTMP0, xreg, xreg; \
+	  vmovdqu RTMP0, (n * 16)(%rsi);
+	movq (0 * 8)(%r9), %r10;
+	movq (1 * 8)(%r9), %r11;
+	movq (2 * 8)(%r9), %r12;
+	movq (3 * 8)(%r9), %r13;
+	OCB_INPUT(0, %r10, RA0);
+	OCB_INPUT(1, %r11, RA1);
+	OCB_INPUT(2, %r12, RA2);
+	OCB_INPUT(3, %r13, RA3);
+	movq (4 * 8)(%r9), %r10;
+	movq (5 * 8)(%r9), %r11;
+	movq (6 * 8)(%r9), %r12;
+	movq (7 * 8)(%r9), %r13;
+	OCB_INPUT(4, %r10, RB0);
+	OCB_INPUT(5, %r11, RB1);
+	OCB_INPUT(6, %r12, RB2);
+	OCB_INPUT(7, %r13, RB3);
+#undef OCB_INPUT
+
+	vmovdqu RTMP0, (%rcx);
+	vmovdqu RTMP1, (%r8);
+
+	movq (0 * 8)(%rsp), %r10;
+	CFI_RESTORE(%r10);
+	movq (1 * 8)(%rsp), %r11;
+	CFI_RESTORE(%r11);
+	movq (2 * 8)(%rsp), %r12;
+	CFI_RESTORE(%r12);
+	movq (3 * 8)(%rsp), %r13;
+	CFI_RESTORE(%r13);
+
+	call __sm4_crypt_blk8;
+
+	addq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(-4 * 8);
+
+	vpxor (0 * 16)(%rsi), RA0, RA0;
+	vpxor (1 * 16)(%rsi), RA1, RA1;
+	vpxor (2 * 16)(%rsi), RA2, RA2;
+	vpxor (3 * 16)(%rsi), RA3, RA3;
+	vpxor (4 * 16)(%rsi), RB0, RB0;
+	vpxor (5 * 16)(%rsi), RB1, RB1;
+	vpxor (6 * 16)(%rsi), RB2, RB2;
+	vpxor (7 * 16)(%rsi), RB3, RB3;
+
+	vmovdqu RA0, (0 * 16)(%rsi);
+	vmovdqu RA1, (1 * 16)(%rsi);
+	vmovdqu RA2, (2 * 16)(%rsi);
+	vmovdqu RA3, (3 * 16)(%rsi);
+	vmovdqu RB0, (4 * 16)(%rsi);
+	vmovdqu RB1, (5 * 16)(%rsi);
+	vmovdqu RB2, (6 * 16)(%rsi);
+	vmovdqu RB3, (7 * 16)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_ocb_enc,.-_gcry_sm4_aesni_avx_ocb_enc;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx_ocb_dec
+ELF(.type _gcry_sm4_aesni_avx_ocb_dec, at function;)
+
+_gcry_sm4_aesni_avx_ocb_dec:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: dst (8 blocks)
+	 *	%rdx: src (8 blocks)
+	 *	%rcx: offset
+	 *	%r8 : checksum
+	 *	%r9 : L pointers (void *L[8])
+	 */
+	CFI_STARTPROC();
+
+	subq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(4 * 8);
+
+	movq %r10, (0 * 8)(%rsp);
+	movq %r11, (1 * 8)(%rsp);
+	movq %r12, (2 * 8)(%rsp);
+	movq %r13, (3 * 8)(%rsp);
+	CFI_REL_OFFSET(%r10, 0 * 8);
+	CFI_REL_OFFSET(%r11, 1 * 8);
+	CFI_REL_OFFSET(%r12, 2 * 8);
+	CFI_REL_OFFSET(%r13, 3 * 8);
+
+	movdqu (%rcx), RTMP0;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* P_i = Offset_i xor DECIPHER(K, C_i xor Offset_i)  */
+
+#define OCB_INPUT(n, lreg, xreg) \
+	  vmovdqu (n * 16)(%rdx), xreg; \
+	  vpxor (lreg), RTMP0, RTMP0; \
+	  vpxor RTMP0, xreg, xreg; \
+	  vmovdqu RTMP0, (n * 16)(%rsi);
+	movq (0 * 8)(%r9), %r10;
+	movq (1 * 8)(%r9), %r11;
+	movq (2 * 8)(%r9), %r12;
+	movq (3 * 8)(%r9), %r13;
+	OCB_INPUT(0, %r10, RA0);
+	OCB_INPUT(1, %r11, RA1);
+	OCB_INPUT(2, %r12, RA2);
+	OCB_INPUT(3, %r13, RA3);
+	movq (4 * 8)(%r9), %r10;
+	movq (5 * 8)(%r9), %r11;
+	movq (6 * 8)(%r9), %r12;
+	movq (7 * 8)(%r9), %r13;
+	OCB_INPUT(4, %r10, RB0);
+	OCB_INPUT(5, %r11, RB1);
+	OCB_INPUT(6, %r12, RB2);
+	OCB_INPUT(7, %r13, RB3);
+#undef OCB_INPUT
+
+	vmovdqu RTMP0, (%rcx);
+
+	movq (0 * 8)(%rsp), %r10;
+	CFI_RESTORE(%r10);
+	movq (1 * 8)(%rsp), %r11;
+	CFI_RESTORE(%r11);
+	movq (2 * 8)(%rsp), %r12;
+	CFI_RESTORE(%r12);
+	movq (3 * 8)(%rsp), %r13;
+	CFI_RESTORE(%r13);
+
+	call __sm4_crypt_blk8;
+
+	addq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(-4 * 8);
+
+	vmovdqu (%r8), RTMP0;
+
+	vpxor (0 * 16)(%rsi), RA0, RA0;
+	vpxor (1 * 16)(%rsi), RA1, RA1;
+	vpxor (2 * 16)(%rsi), RA2, RA2;
+	vpxor (3 * 16)(%rsi), RA3, RA3;
+	vpxor (4 * 16)(%rsi), RB0, RB0;
+	vpxor (5 * 16)(%rsi), RB1, RB1;
+	vpxor (6 * 16)(%rsi), RB2, RB2;
+	vpxor (7 * 16)(%rsi), RB3, RB3;
+
+	/* Checksum_i = Checksum_{i-1} xor P_i  */
+
+	vmovdqu RA0, (0 * 16)(%rsi);
+	vpxor RA0, RTMP0, RTMP0;
+	vmovdqu RA1, (1 * 16)(%rsi);
+	vpxor RA1, RTMP0, RTMP0;
+	vmovdqu RA2, (2 * 16)(%rsi);
+	vpxor RA2, RTMP0, RTMP0;
+	vmovdqu RA3, (3 * 16)(%rsi);
+	vpxor RA3, RTMP0, RTMP0;
+	vmovdqu RB0, (4 * 16)(%rsi);
+	vpxor RB0, RTMP0, RTMP0;
+	vmovdqu RB1, (5 * 16)(%rsi);
+	vpxor RB1, RTMP0, RTMP0;
+	vmovdqu RB2, (6 * 16)(%rsi);
+	vpxor RB2, RTMP0, RTMP0;
+	vmovdqu RB3, (7 * 16)(%rsi);
+	vpxor RB3, RTMP0, RTMP0;
+
+	vmovdqu RTMP0, (%r8);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_ocb_dec,.-_gcry_sm4_aesni_avx_ocb_dec;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx_ocb_auth
+ELF(.type _gcry_sm4_aesni_avx_ocb_auth, at function;)
+
+_gcry_sm4_aesni_avx_ocb_auth:
+	/* input:
+	 *	%rdi: round key array, CTX
+	 *	%rsi: abuf (8 blocks)
+	 *	%rdx: offset
+	 *	%rcx: checksum
+	 *	%r8 : L pointers (void *L[8])
+	 */
+	CFI_STARTPROC();
+
+	subq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(4 * 8);
+
+	movq %r10, (0 * 8)(%rsp);
+	movq %r11, (1 * 8)(%rsp);
+	movq %r12, (2 * 8)(%rsp);
+	movq %r13, (3 * 8)(%rsp);
+	CFI_REL_OFFSET(%r10, 0 * 8);
+	CFI_REL_OFFSET(%r11, 1 * 8);
+	CFI_REL_OFFSET(%r12, 2 * 8);
+	CFI_REL_OFFSET(%r13, 3 * 8);
+
+	vmovdqu (%rdx), RTMP0;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i)  */
+
+#define OCB_INPUT(n, lreg, xreg) \
+	  vmovdqu (n * 16)(%rsi), xreg; \
+	  vpxor (lreg), RTMP0, RTMP0; \
+	  vpxor RTMP0, xreg, xreg;
+	movq (0 * 8)(%r8), %r10;
+	movq (1 * 8)(%r8), %r11;
+	movq (2 * 8)(%r8), %r12;
+	movq (3 * 8)(%r8), %r13;
+	OCB_INPUT(0, %r10, RA0);
+	OCB_INPUT(1, %r11, RA1);
+	OCB_INPUT(2, %r12, RA2);
+	OCB_INPUT(3, %r13, RA3);
+	movq (4 * 8)(%r8), %r10;
+	movq (5 * 8)(%r8), %r11;
+	movq (6 * 8)(%r8), %r12;
+	movq (7 * 8)(%r8), %r13;
+	OCB_INPUT(4, %r10, RB0);
+	OCB_INPUT(5, %r11, RB1);
+	OCB_INPUT(6, %r12, RB2);
+	OCB_INPUT(7, %r13, RB3);
+#undef OCB_INPUT
+
+	vmovdqu RTMP0, (%rdx);
+
+	movq (0 * 8)(%rsp), %r10;
+	CFI_RESTORE(%r10);
+	movq (1 * 8)(%rsp), %r11;
+	CFI_RESTORE(%r11);
+	movq (2 * 8)(%rsp), %r12;
+	CFI_RESTORE(%r12);
+	movq (3 * 8)(%rsp), %r13;
+	CFI_RESTORE(%r13);
+
+	call __sm4_crypt_blk8;
+
+	addq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(-4 * 8);
+
+	vmovdqu (%rcx), RTMP0;
+	vpxor RB0, RA0, RA0;
+	vpxor RB1, RA1, RA1;
+	vpxor RB2, RA2, RA2;
+	vpxor RB3, RA3, RA3;
+
+	vpxor RTMP0, RA3, RA3;
+	vpxor RA2, RA0, RA0;
+	vpxor RA3, RA1, RA1;
+
+	vpxor RA1, RA0, RA0;
+	vmovdqu RA0, (%rcx);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx_ocb_auth,.-_gcry_sm4_aesni_avx_ocb_auth;)
+
+#endif /*defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT)*/
+#endif /*__x86_64*/
diff --git a/cipher/sm4.c b/cipher/sm4.c
index 621532fa..da75cf87 100644
--- a/cipher/sm4.c
+++ b/cipher/sm4.c
@@ -38,12 +38,35 @@
 # define ATTR_ALIGNED_64
 #endif
 
+/* USE_AESNI_AVX inidicates whether to compile with Intel AES-NI/AVX code. */
+#undef USE_AESNI_AVX
+#if defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT)
+# if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
+#  define USE_AESNI_AVX 1
+# endif
+#endif
+
+/* Assembly implementations use SystemV ABI, ABI conversion and additional
+ * stack to store XMM6-XMM15 needed on Win64. */
+#undef ASM_FUNC_ABI
+#if defined(USE_AESNI_AVX)
+# ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
+#  define ASM_FUNC_ABI __attribute__((sysv_abi))
+# else
+#  define ASM_FUNC_ABI
+# endif
+#endif
+
 static const char *sm4_selftest (void);
 
 typedef struct
 {
   u32 rkey_enc[32];
   u32 rkey_dec[32];
+#ifdef USE_AESNI_AVX
+  unsigned int use_aesni_avx:1;
+#endif
 } SM4_context;
 
 static const u32 fk[4] =
@@ -110,6 +133,45 @@ static const u32 ck[] =
   0x10171e25, 0x2c333a41, 0x484f565d, 0x646b7279
 };
 
+#ifdef USE_AESNI_AVX
+extern void _gcry_sm4_aesni_avx_expand_key(const byte *key, u32 *rk_enc,
+					   u32 *rk_dec, const u32 *fk,
+					   const u32 *ck) ASM_FUNC_ABI;
+
+extern unsigned int
+_gcry_sm4_aesni_avx_crypt_blk1_8(const u32 *rk, byte *out, const byte *in,
+				 unsigned int num_blks) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx_ctr_enc(const u32 *rk_enc, byte *out,
+					const byte *in, byte *ctr) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx_cbc_dec(const u32 *rk_dec, byte *out,
+					const byte *in, byte *iv) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx_cfb_dec(const u32 *rk_enc, byte *out,
+					const byte *in, byte *iv) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx_ocb_enc(const u32 *rk_enc,
+					unsigned char *out,
+					const unsigned char *in,
+					unsigned char *offset,
+					unsigned char *checksum,
+					const u64 Ls[8]) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx_ocb_dec(const u32 *rk_dec,
+					unsigned char *out,
+					const unsigned char *in,
+					unsigned char *offset,
+					unsigned char *checksum,
+					const u64 Ls[8]) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx_ocb_auth(const u32 *rk_enc,
+					 const unsigned char *abuf,
+					 unsigned char *offset,
+					 unsigned char *checksum,
+					 const u64 Ls[8]) ASM_FUNC_ABI;
+#endif /* USE_AESNI_AVX */
+
 static inline void prefetch_sbox_table(void)
 {
   const volatile byte *vtab = (void *)&sbox_table;
@@ -178,6 +240,15 @@ sm4_expand_key (SM4_context *ctx, const byte *key)
   u32 rk[4];
   int i;
 
+#ifdef USE_AESNI_AVX
+  if (ctx->use_aesni_avx)
+    {
+      _gcry_sm4_aesni_avx_expand_key (key, ctx->rkey_enc, ctx->rkey_dec,
+				      fk, ck);
+      return;
+    }
+#endif
+
   rk[0] = buf_get_be32(key + 4 * 0) ^ fk[0];
   rk[1] = buf_get_be32(key + 4 * 1) ^ fk[1];
   rk[2] = buf_get_be32(key + 4 * 2) ^ fk[2];
@@ -209,8 +280,10 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen,
   SM4_context *ctx = context;
   static int init = 0;
   static const char *selftest_failed = NULL;
+  unsigned int hwf = _gcry_get_hw_features ();
 
   (void)hd;
+  (void)hwf;
 
   if (!init)
     {
@@ -225,6 +298,10 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen,
   if (keylen != 16)
     return GPG_ERR_INV_KEYLEN;
 
+#ifdef USE_AESNI_AVX
+  ctx->use_aesni_avx = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX);
+#endif
+
   sm4_expand_key (ctx, key);
   return 0;
 }
@@ -367,6 +444,21 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
   const byte *inbuf = inbuf_arg;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX
+  if (ctx->use_aesni_avx)
+    {
+      /* Process data in 8 block chunks. */
+      while (nblocks >= 8)
+        {
+          _gcry_sm4_aesni_avx_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr);
+
+          nblocks -= 8;
+          outbuf += 8 * 16;
+          inbuf += 8 * 16;
+        }
+    }
+#endif
+
   /* Process remaining blocks. */
   if (nblocks)
     {
@@ -377,6 +469,12 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
 
       if (0)
 	;
+#ifdef USE_AESNI_AVX
+      else if (ctx->use_aesni_avx)
+	{
+	  crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8;
+	}
+#endif
       else
 	{
 	  prefetch_sbox_table ();
@@ -432,6 +530,21 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv,
   const unsigned char *inbuf = inbuf_arg;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX
+  if (ctx->use_aesni_avx)
+    {
+      /* Process data in 8 block chunks. */
+      while (nblocks >= 8)
+        {
+          _gcry_sm4_aesni_avx_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv);
+
+          nblocks -= 8;
+          outbuf += 8 * 16;
+          inbuf += 8 * 16;
+        }
+    }
+#endif
+
   /* Process remaining blocks. */
   if (nblocks)
     {
@@ -442,6 +555,12 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv,
 
       if (0)
 	;
+#ifdef USE_AESNI_AVX
+      else if (ctx->use_aesni_avx)
+	{
+	  crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8;
+	}
+#endif
       else
 	{
 	  prefetch_sbox_table ();
@@ -490,6 +609,21 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv,
   const unsigned char *inbuf = inbuf_arg;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX
+  if (ctx->use_aesni_avx)
+    {
+      /* Process data in 8 block chunks. */
+      while (nblocks >= 8)
+        {
+          _gcry_sm4_aesni_avx_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv);
+
+          nblocks -= 8;
+          outbuf += 8 * 16;
+          inbuf += 8 * 16;
+        }
+    }
+#endif
+
   /* Process remaining blocks. */
   if (nblocks)
     {
@@ -500,6 +634,12 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv,
 
       if (0)
 	;
+#ifdef USE_AESNI_AVX
+      else if (ctx->use_aesni_avx)
+	{
+	  crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8;
+	}
+#endif
       else
 	{
 	  prefetch_sbox_table ();
@@ -551,6 +691,48 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
   u64 blkn = c->u_mode.ocb.data_nblocks;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX
+  if (ctx->use_aesni_avx)
+    {
+      u64 Ls[8];
+      unsigned int n = 8 - (blkn % 8);
+      u64 *l;
+
+      if (nblocks >= 8)
+	{
+	  /* Use u64 to store pointers for x32 support (assembly function
+	   * assumes 64-bit pointers). */
+	  Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	  Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+	  Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	  Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(7 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+	  l = &Ls[(7 + n) % 8];
+
+	  /* Process data in 8 block chunks. */
+	  while (nblocks >= 8)
+	    {
+	      blkn += 8;
+	      *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 8);
+
+	      if (encrypt)
+		_gcry_sm4_aesni_avx_ocb_enc(ctx->rkey_enc, outbuf, inbuf,
+					    c->u_iv.iv, c->u_ctr.ctr, Ls);
+	      else
+		_gcry_sm4_aesni_avx_ocb_dec(ctx->rkey_dec, outbuf, inbuf,
+					    c->u_iv.iv, c->u_ctr.ctr, Ls);
+
+	      nblocks -= 8;
+	      outbuf += 8 * 16;
+	      inbuf += 8 * 16;
+	    }
+	}
+    }
+#endif
+
   if (nblocks)
     {
       unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in,
@@ -561,6 +743,12 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
 
       if (0)
 	;
+#ifdef USE_AESNI_AVX
+      else if (ctx->use_aesni_avx)
+	{
+	  crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8;
+	}
+#endif
       else
 	{
 	  prefetch_sbox_table ();
@@ -625,6 +813,44 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks)
   const unsigned char *abuf = abuf_arg;
   u64 blkn = c->u_mode.ocb.aad_nblocks;
 
+#ifdef USE_AESNI_AVX
+  if (ctx->use_aesni_avx)
+    {
+      u64 Ls[8];
+      unsigned int n = 8 - (blkn % 8);
+      u64 *l;
+
+      if (nblocks >= 8)
+	{
+	  /* Use u64 to store pointers for x32 support (assembly function
+	    * assumes 64-bit pointers). */
+	  Ls[(0 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(1 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	  Ls[(2 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(3 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+	  Ls[(4 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(5 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	  Ls[(6 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	  Ls[(7 + n) % 8] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+	  l = &Ls[(7 + n) % 8];
+
+	  /* Process data in 8 block chunks. */
+	  while (nblocks >= 8)
+	    {
+	      blkn += 8;
+	      *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 8);
+
+	      _gcry_sm4_aesni_avx_ocb_auth(ctx->rkey_enc, abuf,
+					   c->u_mode.ocb.aad_offset,
+					   c->u_mode.ocb.aad_sum, Ls);
+
+	      nblocks -= 8;
+	      abuf += 8 * 16;
+	    }
+	}
+    }
+#endif
+
   if (nblocks)
     {
       unsigned int (*crypt_blk1_8)(const u32 *rk, byte *out, const byte *in,
@@ -634,6 +860,12 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks)
 
       if (0)
 	;
+#ifdef USE_AESNI_AVX
+      else if (ctx->use_aesni_avx)
+	{
+	  crypt_blk1_8 = _gcry_sm4_aesni_avx_crypt_blk1_8;
+	}
+#endif
       else
 	{
 	  prefetch_sbox_table ();
diff --git a/configure.ac b/configure.ac
index f77476e0..2458acfc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2564,6 +2564,13 @@ LIST_MEMBER(sm4, $enabled_ciphers)
 if test "$found" = "1" ; then
    GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4.lo"
    AC_DEFINE(USE_SM4, 1, [Defined if this module should be included])
+
+   case "${host}" in
+      x86_64-*-*)
+         # Build with the assembly implementation
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4-aesni-avx-amd64.lo"
+      ;;
+   esac
 fi
 
 LIST_MEMBER(dsa, $enabled_pubkey_ciphers)
-- 
2.25.1


From jussi.kivilinna at iki.fi  Tue Jun 16 21:28:25 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 16 Jun 2020 22:28:25 +0300
Subject: [PATCH 3/3] Add SM4 x86-64/AES-NI/AVX2 implementation
In-Reply-To: <20200616192825.1584395-1-jussi.kivilinna@iki.fi>
References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
 <20200616192825.1584395-1-jussi.kivilinna@iki.fi>
Message-ID: <20200616192825.1584395-4-jussi.kivilinna@iki.fi>

* cipher/Makefile.am: Add 'sm4-aesni-avx2-amd64.S'.
* cipher/sm4-aesni-avx2-amd64.S: New.
* cipher/sm4.c (USE_AESNI_AVX2): New.
(SM4_context) [USE_AESNI_AVX2]: Add 'use_aesni_avx2'.
[USE_AESNI_AVX2] (_gcry_sm4_aesni_avx2_ctr_enc)
(_gcry_sm4_aesni_avx2_cbc_dec, _gcry_sm4_aesni_avx2_cfb_dec)
(_gcry_sm4_aesni_avx2_ocb_enc, _gcry_sm4_aesni_avx2_ocb_dec)
(_gcry_sm4_aesni_avx_ocb_auth): New.
(sm4_setkey): Enable AES-NI/AVX2 if supported by HW.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth) [USE_AESNI_AVX2]: Add
AES-NI/AVX2 bulk functions.
* configure.ac: Add ''sm4-aesni-avx2-amd64.lo'.
--

This patch adds x86-64/AES-NI/AVX2 bulk encryption/decryption. Bulk
functions process 16 blocks in parallel.

Benchmark on AMD Ryzen 7 3700X:

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      8.98 ns/B     106.2 MiB/s     38.62 c/B      4300
        CBC dec |      1.55 ns/B     613.7 MiB/s      6.64 c/B      4275
        CFB enc |      8.96 ns/B     106.4 MiB/s     38.52 c/B      4300
        CFB dec |      1.54 ns/B     617.4 MiB/s      6.60 c/B      4275
        CTR enc |      1.57 ns/B     607.8 MiB/s      6.75 c/B      4300
        CTR dec |      1.57 ns/B     608.9 MiB/s      6.74 c/B      4300
        OCB enc |      1.58 ns/B     603.8 MiB/s      6.75 c/B      4275
        OCB dec |      1.57 ns/B     605.7 MiB/s      6.73 c/B      4275
       OCB auth |      1.53 ns/B     624.5 MiB/s      6.57 c/B      4300

After (~56% faster than AES-NI/AVX impl.):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      8.93 ns/B     106.8 MiB/s     38.61 c/B      4326
        CBC dec |     0.984 ns/B     969.5 MiB/s      4.23 c/B      4300
        CFB enc |      8.93 ns/B     106.8 MiB/s     38.62 c/B      4325
        CFB dec |     0.983 ns/B     970.3 MiB/s      4.23 c/B      4300
        CTR enc |     0.998 ns/B     955.1 MiB/s      4.29 c/B      4300
        CTR dec |     0.996 ns/B     957.4 MiB/s      4.28 c/B      4300
        OCB enc |      1.00 ns/B     951.8 MiB/s      4.31 c/B      4300
        OCB dec |      1.00 ns/B     951.8 MiB/s      4.31 c/B      4300
       OCB auth |     0.993 ns/B     960.2 MiB/s      4.28 c/B      4304?2

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/Makefile.am            |   2 +-
 cipher/sm4-aesni-avx2-amd64.S | 851 ++++++++++++++++++++++++++++++++++
 cipher/sm4.c                  | 186 +++++++-
 configure.ac                  |   1 +
 4 files changed, 1038 insertions(+), 2 deletions(-)
 create mode 100644 cipher/sm4-aesni-avx2-amd64.S

diff --git a/cipher/Makefile.am b/cipher/Makefile.am
index 427922c6..4798d456 100644
--- a/cipher/Makefile.am
+++ b/cipher/Makefile.am
@@ -107,7 +107,7 @@ EXTRA_libcipher_la_SOURCES = \
 	scrypt.c \
 	seed.c \
 	serpent.c serpent-sse2-amd64.S \
-	sm4.c sm4-aesni-avx-amd64.S \
+	sm4.c sm4-aesni-avx-amd64.S sm4-aesni-avx2-amd64.S \
 	serpent-avx2-amd64.S serpent-armv7-neon.S \
 	sha1.c sha1-ssse3-amd64.S sha1-avx-amd64.S sha1-avx-bmi2-amd64.S \
 	sha1-avx2-bmi2-amd64.S sha1-armv7-neon.S sha1-armv8-aarch32-ce.S \
diff --git a/cipher/sm4-aesni-avx2-amd64.S b/cipher/sm4-aesni-avx2-amd64.S
new file mode 100644
index 00000000..6e46c0dc
--- /dev/null
+++ b/cipher/sm4-aesni-avx2-amd64.S
@@ -0,0 +1,851 @@
+/* sm4-avx2-amd64.S  -  AVX2 implementation of SM4 cipher
+ *
+ * Copyright (C) 2020 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ *
+ * This file is part of Libgcrypt.
+ *
+ * Libgcrypt is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as
+ * published by the Free Software Foundation; either version 2.1 of
+ * the License, or (at your option) any later version.
+ *
+ * Libgcrypt is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* Based on SM4 AES-NI work by Markku-Juhani O. Saarinen at:
+ *  https://github.com/mjosaarinen/sm4ni
+ */
+
+#include <config.h>
+
+#ifdef __x86_64
+#if (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS)) && \
+    defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT)
+
+#include "asm-common-amd64.h"
+
+/* vector registers */
+#define RX0          %ymm0
+#define RX1          %ymm1
+#define MASK_4BIT    %ymm2
+#define RTMP0        %ymm3
+#define RTMP1        %ymm4
+#define RTMP2        %ymm5
+#define RTMP3        %ymm6
+#define RTMP4        %ymm7
+
+#define RA0          %ymm8
+#define RA1          %ymm9
+#define RA2          %ymm10
+#define RA3          %ymm11
+
+#define RB0          %ymm12
+#define RB1          %ymm13
+#define RB2          %ymm14
+#define RB3          %ymm15
+
+#define RNOT         %ymm0
+#define RBSWAP       %ymm1
+
+#define RX0x         %xmm0
+#define RX1x         %xmm1
+#define MASK_4BITx   %xmm2
+
+#define RNOTx        %xmm0
+#define RBSWAPx      %xmm1
+
+#define RTMP0x       %xmm3
+#define RTMP1x       %xmm4
+#define RTMP2x       %xmm5
+#define RTMP3x       %xmm6
+#define RTMP4x       %xmm7
+
+/**********************************************************************
+  helper macros
+ **********************************************************************/
+
+/* Transpose four 32-bit words between 128-bit vector lanes. */
+#define transpose_4x4(x0, x1, x2, x3, t1, t2) \
+	vpunpckhdq x1, x0, t2; \
+	vpunpckldq x1, x0, x0; \
+	\
+	vpunpckldq x3, x2, t1; \
+	vpunpckhdq x3, x2, x2; \
+	\
+	vpunpckhqdq t1, x0, x1; \
+	vpunpcklqdq t1, x0, x0; \
+	\
+	vpunpckhqdq x2, t2, x3; \
+	vpunpcklqdq x2, t2, x2;
+
+/* post-SubByte transform. */
+#define transform_pre(x, lo_t, hi_t, mask4bit, tmp0) \
+	vpand x, mask4bit, tmp0; \
+	vpandn x, mask4bit, x; \
+	vpsrld $4, x, x; \
+	\
+	vpshufb tmp0, lo_t, tmp0; \
+	vpshufb x, hi_t, x; \
+	vpxor tmp0, x, x;
+
+/* post-SubByte transform. Note: x has been XOR'ed with mask4bit by
+ * 'vaeslastenc' instruction. */
+#define transform_post(x, lo_t, hi_t, mask4bit, tmp0) \
+	vpandn mask4bit, x, tmp0; \
+	vpsrld $4, x, x; \
+	vpand x, mask4bit, x; \
+	\
+	vpshufb tmp0, lo_t, tmp0; \
+	vpshufb x, hi_t, x; \
+	vpxor tmp0, x, x;
+
+/**********************************************************************
+  16-way SM4 with AES-NI and AVX
+ **********************************************************************/
+
+.text
+.align 16
+
+/*
+ * Following four affine transform look-up tables are from work by
+ * Markku-Juhani O. Saarinen, at https://github.com/mjosaarinen/sm4ni
+ *
+ * These allow exposing SM4 S-Box from AES SubByte.
+ */
+
+/* pre-SubByte affine transform, from SM4 field to AES field. */
+.Lpre_tf_lo_s:
+	.quad 0x9197E2E474720701, 0xC7C1B4B222245157
+.Lpre_tf_hi_s:
+	.quad 0xE240AB09EB49A200, 0xF052B91BF95BB012
+
+/* post-SubByte affine transform, from AES field to SM4 field. */
+.Lpost_tf_lo_s:
+	.quad 0x5B67F2CEA19D0834, 0xEDD14478172BBE82
+.Lpost_tf_hi_s:
+	.quad 0xAE7201DD73AFDC00, 0x11CDBE62CC1063BF
+
+/* For isolating SubBytes from AESENCLAST, inverse shift row */
+.Linv_shift_row:
+	.byte 0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b
+	.byte 0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03
+
+/* Inverse shift row + Rotate left by 8 bits on 32-bit words with vpshufb */
+.Linv_shift_row_rol_8:
+	.byte 0x07, 0x00, 0x0d, 0x0a, 0x0b, 0x04, 0x01, 0x0e
+	.byte 0x0f, 0x08, 0x05, 0x02, 0x03, 0x0c, 0x09, 0x06
+
+/* Inverse shift row + Rotate left by 16 bits on 32-bit words with vpshufb */
+.Linv_shift_row_rol_16:
+	.byte 0x0a, 0x07, 0x00, 0x0d, 0x0e, 0x0b, 0x04, 0x01
+	.byte 0x02, 0x0f, 0x08, 0x05, 0x06, 0x03, 0x0c, 0x09
+
+/* Inverse shift row + Rotate left by 24 bits on 32-bit words with vpshufb */
+.Linv_shift_row_rol_24:
+	.byte 0x0d, 0x0a, 0x07, 0x00, 0x01, 0x0e, 0x0b, 0x04
+	.byte 0x05, 0x02, 0x0f, 0x08, 0x09, 0x06, 0x03, 0x0c
+
+/* For CTR-mode IV byteswap */
+.Lbswap128_mask:
+	.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+
+/* For input word byte-swap */
+.Lbswap32_mask:
+	.byte 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12
+
+.align 4
+/* 4-bit mask */
+.L0f0f0f0f:
+	.long 0x0f0f0f0f
+
+.align 8
+ELF(.type   __sm4_crypt_blk16, at function;)
+__sm4_crypt_blk16:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: sixteen parallel
+	 *						plaintext blocks
+	 * output:
+	 *	RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3: sixteen parallel
+	 * 						ciphertext blocks
+	 */
+	CFI_STARTPROC();
+
+	vbroadcasti128 .Lbswap32_mask rRIP, RTMP2;
+	vpshufb RTMP2, RA0, RA0;
+	vpshufb RTMP2, RA1, RA1;
+	vpshufb RTMP2, RA2, RA2;
+	vpshufb RTMP2, RA3, RA3;
+	vpshufb RTMP2, RB0, RB0;
+	vpshufb RTMP2, RB1, RB1;
+	vpshufb RTMP2, RB2, RB2;
+	vpshufb RTMP2, RB3, RB3;
+
+	vpbroadcastd .L0f0f0f0f rRIP, MASK_4BIT;
+	transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1);
+	transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1);
+
+#define ROUND(round, s0, s1, s2, s3, r0, r1, r2, r3) \
+	vpbroadcastd (4*(round))(%rdi), RX0; \
+	vbroadcasti128 .Lpre_tf_lo_s rRIP, RTMP4; \
+	vbroadcasti128 .Lpre_tf_hi_s rRIP, RTMP1; \
+	vmovdqa RX0, RX1; \
+	vpxor s1, RX0, RX0; \
+	vpxor s2, RX0, RX0; \
+	vpxor s3, RX0, RX0; /* s1 ^ s2 ^ s3 ^ rk */ \
+	    vbroadcasti128 .Lpost_tf_lo_s rRIP, RTMP2; \
+	    vbroadcasti128 .Lpost_tf_hi_s rRIP, RTMP3; \
+	    vpxor r1, RX1, RX1; \
+	    vpxor r2, RX1, RX1; \
+	    vpxor r3, RX1, RX1; /* r1 ^ r2 ^ r3 ^ rk */ \
+	\
+	/* sbox, non-linear part */ \
+	transform_pre(RX0, RTMP4, RTMP1, MASK_4BIT, RTMP0); \
+	    transform_pre(RX1, RTMP4, RTMP1, MASK_4BIT, RTMP0); \
+	vextracti128 $1, RX0, RTMP4x; \
+	    vextracti128 $1, RX1, RTMP0x; \
+	vaesenclast MASK_4BITx, RX0x, RX0x; \
+	vaesenclast MASK_4BITx, RTMP4x, RTMP4x; \
+	    vaesenclast MASK_4BITx, RX1x, RX1x; \
+	    vaesenclast MASK_4BITx, RTMP0x, RTMP0x; \
+	vinserti128 $1, RTMP4x, RX0, RX0; \
+	vbroadcasti128 .Linv_shift_row rRIP, RTMP4; \
+	    vinserti128 $1, RTMP0x, RX1, RX1; \
+	transform_post(RX0, RTMP2, RTMP3, MASK_4BIT, RTMP0); \
+	    transform_post(RX1, RTMP2, RTMP3, MASK_4BIT, RTMP0); \
+	\
+	/* linear part */ \
+	vpshufb RTMP4, RX0, RTMP0; \
+	vpxor RTMP0, s0, s0; /* s0 ^ x */ \
+	    vpshufb RTMP4, RX1, RTMP2; \
+	    vbroadcasti128 .Linv_shift_row_rol_8 rRIP, RTMP4; \
+	    vpxor RTMP2, r0, r0; /* r0 ^ x */ \
+	vpshufb RTMP4, RX0, RTMP1; \
+	vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) */ \
+	    vpshufb RTMP4, RX1, RTMP3; \
+	    vbroadcasti128 .Linv_shift_row_rol_16 rRIP, RTMP4; \
+	    vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) */ \
+	vpshufb RTMP4, RX0, RTMP1; \
+	vpxor RTMP1, RTMP0, RTMP0; /* x ^ rol(x,8) ^ rol(x,16) */ \
+	    vpshufb RTMP4, RX1, RTMP3; \
+	    vbroadcasti128 .Linv_shift_row_rol_24 rRIP, RTMP4; \
+	    vpxor RTMP3, RTMP2, RTMP2; /* x ^ rol(x,8) ^ rol(x,16) */ \
+	vpshufb RTMP4, RX0, RTMP1; \
+	vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,24) */ \
+	vpslld $2, RTMP0, RTMP1; \
+	vpsrld $30, RTMP0, RTMP0; \
+	vpxor RTMP0, s0, s0;  \
+	vpxor RTMP1, s0, s0; /* s0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */ \
+	    vpshufb RTMP4, RX1, RTMP3; \
+	    vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,24) */ \
+	    vpslld $2, RTMP2, RTMP3; \
+	    vpsrld $30, RTMP2, RTMP2; \
+	    vpxor RTMP2, r0, r0;  \
+	    vpxor RTMP3, r0, r0; /* r0 ^ x ^ rol(x,2) ^ rol(x,10) ^ rol(x,18) ^ rol(x,24) */
+
+	leaq (32*4)(%rdi), %rax;
+.align 16
+.Lroundloop_blk8:
+	ROUND(0, RA0, RA1, RA2, RA3, RB0, RB1, RB2, RB3);
+	ROUND(1, RA1, RA2, RA3, RA0, RB1, RB2, RB3, RB0);
+	ROUND(2, RA2, RA3, RA0, RA1, RB2, RB3, RB0, RB1);
+	ROUND(3, RA3, RA0, RA1, RA2, RB3, RB0, RB1, RB2);
+	leaq (4*4)(%rdi), %rdi;
+	cmpq %rax, %rdi;
+	jne .Lroundloop_blk8;
+
+#undef ROUND
+
+	vbroadcasti128 .Lbswap128_mask rRIP, RTMP2;
+
+	transpose_4x4(RA0, RA1, RA2, RA3, RTMP0, RTMP1);
+	transpose_4x4(RB0, RB1, RB2, RB3, RTMP0, RTMP1);
+	vpshufb RTMP2, RA0, RA0;
+	vpshufb RTMP2, RA1, RA1;
+	vpshufb RTMP2, RA2, RA2;
+	vpshufb RTMP2, RA3, RA3;
+	vpshufb RTMP2, RB0, RB0;
+	vpshufb RTMP2, RB1, RB1;
+	vpshufb RTMP2, RB2, RB2;
+	vpshufb RTMP2, RB3, RB3;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size __sm4_crypt_blk16,.-__sm4_crypt_blk16;)
+
+#define inc_le128(x, minus_one, tmp) \
+	vpcmpeqq minus_one, x, tmp; \
+	vpsubq minus_one, x, x; \
+	vpslldq $8, tmp, tmp; \
+	vpsubq tmp, x, x;
+
+.align 8
+.globl _gcry_sm4_aesni_avx2_ctr_enc
+ELF(.type   _gcry_sm4_aesni_avx2_ctr_enc, at function;)
+_gcry_sm4_aesni_avx2_ctr_enc:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (16 blocks)
+	 *	%rdx: src (16 blocks)
+	 *	%rcx: iv (big endian, 128bit)
+	 */
+	CFI_STARTPROC();
+
+	movq 8(%rcx), %rax;
+	bswapq %rax;
+
+	vzeroupper;
+
+	vbroadcasti128 .Lbswap128_mask rRIP, RTMP3;
+	vpcmpeqd RNOT, RNOT, RNOT;
+	vpsrldq $8, RNOT, RNOT;   /* ab: -1:0 ; cd: -1:0 */
+	vpaddq RNOT, RNOT, RTMP2; /* ab: -2:0 ; cd: -2:0 */
+
+	/* load IV and byteswap */
+	vmovdqu (%rcx), RTMP4x;
+	vpshufb RTMP3x, RTMP4x, RTMP4x;
+	vmovdqa RTMP4x, RTMP0x;
+	inc_le128(RTMP4x, RNOTx, RTMP1x);
+	vinserti128 $1, RTMP4x, RTMP0, RTMP0;
+	vpshufb RTMP3, RTMP0, RA0; /* +1 ; +0 */
+
+	/* check need for handling 64-bit overflow and carry */
+	cmpq $(0xffffffffffffffff - 16), %rax;
+	ja .Lhandle_ctr_carry;
+
+	/* construct IVs */
+	vpsubq RTMP2, RTMP0, RTMP0; /* +3 ; +2 */
+	vpshufb RTMP3, RTMP0, RA1;
+	vpsubq RTMP2, RTMP0, RTMP0; /* +5 ; +4 */
+	vpshufb RTMP3, RTMP0, RA2;
+	vpsubq RTMP2, RTMP0, RTMP0; /* +7 ; +6 */
+	vpshufb RTMP3, RTMP0, RA3;
+	vpsubq RTMP2, RTMP0, RTMP0; /* +9 ; +8 */
+	vpshufb RTMP3, RTMP0, RB0;
+	vpsubq RTMP2, RTMP0, RTMP0; /* +11 ; +10 */
+	vpshufb RTMP3, RTMP0, RB1;
+	vpsubq RTMP2, RTMP0, RTMP0; /* +13 ; +12 */
+	vpshufb RTMP3, RTMP0, RB2;
+	vpsubq RTMP2, RTMP0, RTMP0; /* +15 ; +14 */
+	vpshufb RTMP3, RTMP0, RB3;
+	vpsubq RTMP2, RTMP0, RTMP0; /* +16 */
+	vpshufb RTMP3x, RTMP0x, RTMP0x;
+
+	jmp .Lctr_carry_done;
+
+.Lhandle_ctr_carry:
+	/* construct IVs */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vpshufb RTMP3, RTMP0, RA1; /* +3 ; +2 */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vpshufb RTMP3, RTMP0, RA2; /* +5 ; +4 */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vpshufb RTMP3, RTMP0, RA3; /* +7 ; +6 */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vpshufb RTMP3, RTMP0, RB0; /* +9 ; +8 */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vpshufb RTMP3, RTMP0, RB1; /* +11 ; +10 */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vpshufb RTMP3, RTMP0, RB2; /* +13 ; +12 */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vpshufb RTMP3, RTMP0, RB3; /* +15 ; +14 */
+	inc_le128(RTMP0, RNOT, RTMP1);
+	vextracti128 $1, RTMP0, RTMP0x;
+	vpshufb RTMP3x, RTMP0x, RTMP0x; /* +16 */
+
+.align 4
+.Lctr_carry_done:
+	/* store new IV */
+	vmovdqu RTMP0x, (%rcx);
+
+	call __sm4_crypt_blk16;
+
+	vpxor (0 * 32)(%rdx), RA0, RA0;
+	vpxor (1 * 32)(%rdx), RA1, RA1;
+	vpxor (2 * 32)(%rdx), RA2, RA2;
+	vpxor (3 * 32)(%rdx), RA3, RA3;
+	vpxor (4 * 32)(%rdx), RB0, RB0;
+	vpxor (5 * 32)(%rdx), RB1, RB1;
+	vpxor (6 * 32)(%rdx), RB2, RB2;
+	vpxor (7 * 32)(%rdx), RB3, RB3;
+
+	vmovdqu RA0, (0 * 32)(%rsi);
+	vmovdqu RA1, (1 * 32)(%rsi);
+	vmovdqu RA2, (2 * 32)(%rsi);
+	vmovdqu RA3, (3 * 32)(%rsi);
+	vmovdqu RB0, (4 * 32)(%rsi);
+	vmovdqu RB1, (5 * 32)(%rsi);
+	vmovdqu RB2, (6 * 32)(%rsi);
+	vmovdqu RB3, (7 * 32)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx2_ctr_enc,.-_gcry_sm4_aesni_avx2_ctr_enc;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx2_cbc_dec
+ELF(.type   _gcry_sm4_aesni_avx2_cbc_dec, at function;)
+_gcry_sm4_aesni_avx2_cbc_dec:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (16 blocks)
+	 *	%rdx: src (16 blocks)
+	 *	%rcx: iv
+	 */
+	CFI_STARTPROC();
+
+	vzeroupper;
+
+	vmovdqu (0 * 32)(%rdx), RA0;
+	vmovdqu (1 * 32)(%rdx), RA1;
+	vmovdqu (2 * 32)(%rdx), RA2;
+	vmovdqu (3 * 32)(%rdx), RA3;
+	vmovdqu (4 * 32)(%rdx), RB0;
+	vmovdqu (5 * 32)(%rdx), RB1;
+	vmovdqu (6 * 32)(%rdx), RB2;
+	vmovdqu (7 * 32)(%rdx), RB3;
+
+	call __sm4_crypt_blk16;
+
+	vmovdqu (%rcx), RNOTx;
+	vinserti128 $1, (%rdx), RNOT, RNOT;
+	vpxor RNOT, RA0, RA0;
+	vpxor (0 * 32 + 16)(%rdx), RA1, RA1;
+	vpxor (1 * 32 + 16)(%rdx), RA2, RA2;
+	vpxor (2 * 32 + 16)(%rdx), RA3, RA3;
+	vpxor (3 * 32 + 16)(%rdx), RB0, RB0;
+	vpxor (4 * 32 + 16)(%rdx), RB1, RB1;
+	vpxor (5 * 32 + 16)(%rdx), RB2, RB2;
+	vpxor (6 * 32 + 16)(%rdx), RB3, RB3;
+	vmovdqu (7 * 32 + 16)(%rdx), RNOTx;
+	vmovdqu RNOTx, (%rcx); /* store new IV */
+
+	vmovdqu RA0, (0 * 32)(%rsi);
+	vmovdqu RA1, (1 * 32)(%rsi);
+	vmovdqu RA2, (2 * 32)(%rsi);
+	vmovdqu RA3, (3 * 32)(%rsi);
+	vmovdqu RB0, (4 * 32)(%rsi);
+	vmovdqu RB1, (5 * 32)(%rsi);
+	vmovdqu RB2, (6 * 32)(%rsi);
+	vmovdqu RB3, (7 * 32)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx2_cbc_dec,.-_gcry_sm4_aesni_avx2_cbc_dec;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx2_cfb_dec
+ELF(.type   _gcry_sm4_aesni_avx2_cfb_dec, at function;)
+_gcry_sm4_aesni_avx2_cfb_dec:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (16 blocks)
+	 *	%rdx: src (16 blocks)
+	 *	%rcx: iv
+	 */
+	CFI_STARTPROC();
+
+	vzeroupper;
+
+	/* Load input */
+	vmovdqu (%rcx), RNOTx;
+	vinserti128 $1, (%rdx), RNOT, RA0;
+	vmovdqu (0 * 32 + 16)(%rdx), RA1;
+	vmovdqu (1 * 32 + 16)(%rdx), RA2;
+	vmovdqu (2 * 32 + 16)(%rdx), RA3;
+	vmovdqu (3 * 32 + 16)(%rdx), RB0;
+	vmovdqu (4 * 32 + 16)(%rdx), RB1;
+	vmovdqu (5 * 32 + 16)(%rdx), RB2;
+	vmovdqu (6 * 32 + 16)(%rdx), RB3;
+
+	/* Update IV */
+	vmovdqu (7 * 32 + 16)(%rdx), RNOTx;
+	vmovdqu RNOTx, (%rcx);
+
+	call __sm4_crypt_blk16;
+
+	vpxor (0 * 32)(%rdx), RA0, RA0;
+	vpxor (1 * 32)(%rdx), RA1, RA1;
+	vpxor (2 * 32)(%rdx), RA2, RA2;
+	vpxor (3 * 32)(%rdx), RA3, RA3;
+	vpxor (4 * 32)(%rdx), RB0, RB0;
+	vpxor (5 * 32)(%rdx), RB1, RB1;
+	vpxor (6 * 32)(%rdx), RB2, RB2;
+	vpxor (7 * 32)(%rdx), RB3, RB3;
+
+	vmovdqu RA0, (0 * 32)(%rsi);
+	vmovdqu RA1, (1 * 32)(%rsi);
+	vmovdqu RA2, (2 * 32)(%rsi);
+	vmovdqu RA3, (3 * 32)(%rsi);
+	vmovdqu RB0, (4 * 32)(%rsi);
+	vmovdqu RB1, (5 * 32)(%rsi);
+	vmovdqu RB2, (6 * 32)(%rsi);
+	vmovdqu RB3, (7 * 32)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx2_cfb_dec,.-_gcry_sm4_aesni_avx2_cfb_dec;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx2_ocb_enc
+ELF(.type _gcry_sm4_aesni_avx2_ocb_enc, at function;)
+
+_gcry_sm4_aesni_avx2_ocb_enc:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (16 blocks)
+	 *	%rdx: src (16 blocks)
+	 *	%rcx: offset
+	 *	%r8 : checksum
+	 *	%r9 : L pointers (void *L[16])
+	 */
+	CFI_STARTPROC();
+
+	vzeroupper;
+
+	subq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(4 * 8);
+
+	movq %r10, (0 * 8)(%rsp);
+	movq %r11, (1 * 8)(%rsp);
+	movq %r12, (2 * 8)(%rsp);
+	movq %r13, (3 * 8)(%rsp);
+	CFI_REL_OFFSET(%r10, 0 * 8);
+	CFI_REL_OFFSET(%r11, 1 * 8);
+	CFI_REL_OFFSET(%r12, 2 * 8);
+	CFI_REL_OFFSET(%r13, 3 * 8);
+
+	vmovdqu (%rcx), RTMP0x;
+	vmovdqu (%r8), RTMP1x;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* Checksum_i = Checksum_{i-1} xor P_i  */
+	/* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+
+#define OCB_INPUT(n, l0reg, l1reg, yreg) \
+	  vmovdqu (n * 32)(%rdx), yreg; \
+	  vpxor (l0reg), RTMP0x, RNOTx; \
+	  vpxor (l1reg), RNOTx, RTMP0x; \
+	  vinserti128 $1, RTMP0x, RNOT, RNOT; \
+	  vpxor yreg, RTMP1, RTMP1; \
+	  vpxor yreg, RNOT, yreg; \
+	  vmovdqu RNOT, (n * 32)(%rsi);
+
+	movq (0 * 8)(%r9), %r10;
+	movq (1 * 8)(%r9), %r11;
+	movq (2 * 8)(%r9), %r12;
+	movq (3 * 8)(%r9), %r13;
+	OCB_INPUT(0, %r10, %r11, RA0);
+	OCB_INPUT(1, %r12, %r13, RA1);
+	movq (4 * 8)(%r9), %r10;
+	movq (5 * 8)(%r9), %r11;
+	movq (6 * 8)(%r9), %r12;
+	movq (7 * 8)(%r9), %r13;
+	OCB_INPUT(2, %r10, %r11, RA2);
+	OCB_INPUT(3, %r12, %r13, RA3);
+	movq (8 * 8)(%r9), %r10;
+	movq (9 * 8)(%r9), %r11;
+	movq (10 * 8)(%r9), %r12;
+	movq (11 * 8)(%r9), %r13;
+	OCB_INPUT(4, %r10, %r11, RB0);
+	OCB_INPUT(5, %r12, %r13, RB1);
+	movq (12 * 8)(%r9), %r10;
+	movq (13 * 8)(%r9), %r11;
+	movq (14 * 8)(%r9), %r12;
+	movq (15 * 8)(%r9), %r13;
+	OCB_INPUT(6, %r10, %r11, RB2);
+	OCB_INPUT(7, %r12, %r13, RB3);
+#undef OCB_INPUT
+
+	vextracti128 $1, RTMP1, RNOTx;
+	vmovdqu RTMP0x, (%rcx);
+	vpxor RNOTx, RTMP1x, RTMP1x;
+	vmovdqu RTMP1x, (%r8);
+
+	movq (0 * 8)(%rsp), %r10;
+	movq (1 * 8)(%rsp), %r11;
+	movq (2 * 8)(%rsp), %r12;
+	movq (3 * 8)(%rsp), %r13;
+	CFI_RESTORE(%r10);
+	CFI_RESTORE(%r11);
+	CFI_RESTORE(%r12);
+	CFI_RESTORE(%r13);
+
+	call __sm4_crypt_blk16;
+
+	addq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(-4 * 8);
+
+	vpxor (0 * 32)(%rsi), RA0, RA0;
+	vpxor (1 * 32)(%rsi), RA1, RA1;
+	vpxor (2 * 32)(%rsi), RA2, RA2;
+	vpxor (3 * 32)(%rsi), RA3, RA3;
+	vpxor (4 * 32)(%rsi), RB0, RB0;
+	vpxor (5 * 32)(%rsi), RB1, RB1;
+	vpxor (6 * 32)(%rsi), RB2, RB2;
+	vpxor (7 * 32)(%rsi), RB3, RB3;
+
+	vmovdqu RA0, (0 * 32)(%rsi);
+	vmovdqu RA1, (1 * 32)(%rsi);
+	vmovdqu RA2, (2 * 32)(%rsi);
+	vmovdqu RA3, (3 * 32)(%rsi);
+	vmovdqu RB0, (4 * 32)(%rsi);
+	vmovdqu RB1, (5 * 32)(%rsi);
+	vmovdqu RB2, (6 * 32)(%rsi);
+	vmovdqu RB3, (7 * 32)(%rsi);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx2_ocb_enc,.-_gcry_sm4_aesni_avx2_ocb_enc;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx2_ocb_dec
+ELF(.type _gcry_sm4_aesni_avx2_ocb_dec, at function;)
+
+_gcry_sm4_aesni_avx2_ocb_dec:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: dst (16 blocks)
+	 *	%rdx: src (16 blocks)
+	 *	%rcx: offset
+	 *	%r8 : checksum
+	 *	%r9 : L pointers (void *L[16])
+	 */
+	CFI_STARTPROC();
+
+	vzeroupper;
+
+	subq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(4 * 8);
+
+	movq %r10, (0 * 8)(%rsp);
+	movq %r11, (1 * 8)(%rsp);
+	movq %r12, (2 * 8)(%rsp);
+	movq %r13, (3 * 8)(%rsp);
+	CFI_REL_OFFSET(%r10, 0 * 8);
+	CFI_REL_OFFSET(%r11, 1 * 8);
+	CFI_REL_OFFSET(%r12, 2 * 8);
+	CFI_REL_OFFSET(%r13, 3 * 8);
+
+	vmovdqu (%rcx), RTMP0x;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* C_i = Offset_i xor ENCIPHER(K, P_i xor Offset_i)  */
+
+#define OCB_INPUT(n, l0reg, l1reg, yreg) \
+	  vmovdqu (n * 32)(%rdx), yreg; \
+	  vpxor (l0reg), RTMP0x, RNOTx; \
+	  vpxor (l1reg), RNOTx, RTMP0x; \
+	  vinserti128 $1, RTMP0x, RNOT, RNOT; \
+	  vpxor yreg, RNOT, yreg; \
+	  vmovdqu RNOT, (n * 32)(%rsi);
+
+	movq (0 * 8)(%r9), %r10;
+	movq (1 * 8)(%r9), %r11;
+	movq (2 * 8)(%r9), %r12;
+	movq (3 * 8)(%r9), %r13;
+	OCB_INPUT(0, %r10, %r11, RA0);
+	OCB_INPUT(1, %r12, %r13, RA1);
+	movq (4 * 8)(%r9), %r10;
+	movq (5 * 8)(%r9), %r11;
+	movq (6 * 8)(%r9), %r12;
+	movq (7 * 8)(%r9), %r13;
+	OCB_INPUT(2, %r10, %r11, RA2);
+	OCB_INPUT(3, %r12, %r13, RA3);
+	movq (8 * 8)(%r9), %r10;
+	movq (9 * 8)(%r9), %r11;
+	movq (10 * 8)(%r9), %r12;
+	movq (11 * 8)(%r9), %r13;
+	OCB_INPUT(4, %r10, %r11, RB0);
+	OCB_INPUT(5, %r12, %r13, RB1);
+	movq (12 * 8)(%r9), %r10;
+	movq (13 * 8)(%r9), %r11;
+	movq (14 * 8)(%r9), %r12;
+	movq (15 * 8)(%r9), %r13;
+	OCB_INPUT(6, %r10, %r11, RB2);
+	OCB_INPUT(7, %r12, %r13, RB3);
+#undef OCB_INPUT
+
+	vmovdqu RTMP0x, (%rcx);
+
+	movq (0 * 8)(%rsp), %r10;
+	movq (1 * 8)(%rsp), %r11;
+	movq (2 * 8)(%rsp), %r12;
+	movq (3 * 8)(%rsp), %r13;
+	CFI_RESTORE(%r10);
+	CFI_RESTORE(%r11);
+	CFI_RESTORE(%r12);
+	CFI_RESTORE(%r13);
+
+	call __sm4_crypt_blk16;
+
+	addq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(-4 * 8);
+
+	vmovdqu (%r8), RTMP1x;
+
+	vpxor (0 * 32)(%rsi), RA0, RA0;
+	vpxor (1 * 32)(%rsi), RA1, RA1;
+	vpxor (2 * 32)(%rsi), RA2, RA2;
+	vpxor (3 * 32)(%rsi), RA3, RA3;
+	vpxor (4 * 32)(%rsi), RB0, RB0;
+	vpxor (5 * 32)(%rsi), RB1, RB1;
+	vpxor (6 * 32)(%rsi), RB2, RB2;
+	vpxor (7 * 32)(%rsi), RB3, RB3;
+
+	/* Checksum_i = Checksum_{i-1} xor P_i  */
+
+	vmovdqu RA0, (0 * 32)(%rsi);
+	vpxor RA0, RTMP1, RTMP1;
+	vmovdqu RA1, (1 * 32)(%rsi);
+	vpxor RA1, RTMP1, RTMP1;
+	vmovdqu RA2, (2 * 32)(%rsi);
+	vpxor RA2, RTMP1, RTMP1;
+	vmovdqu RA3, (3 * 32)(%rsi);
+	vpxor RA3, RTMP1, RTMP1;
+	vmovdqu RB0, (4 * 32)(%rsi);
+	vpxor RB0, RTMP1, RTMP1;
+	vmovdqu RB1, (5 * 32)(%rsi);
+	vpxor RB1, RTMP1, RTMP1;
+	vmovdqu RB2, (6 * 32)(%rsi);
+	vpxor RB2, RTMP1, RTMP1;
+	vmovdqu RB3, (7 * 32)(%rsi);
+	vpxor RB3, RTMP1, RTMP1;
+
+	vextracti128 $1, RTMP1, RNOTx;
+	vpxor RNOTx, RTMP1x, RTMP1x;
+	vmovdqu RTMP1x, (%r8);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx2_ocb_dec,.-_gcry_sm4_aesni_avx2_ocb_dec;)
+
+.align 8
+.globl _gcry_sm4_aesni_avx2_ocb_auth
+ELF(.type _gcry_sm4_aesni_avx2_ocb_auth, at function;)
+
+_gcry_sm4_aesni_avx2_ocb_auth:
+	/* input:
+	 *	%rdi: ctx, CTX
+	 *	%rsi: abuf (16 blocks)
+	 *	%rdx: offset
+	 *	%rcx: checksum
+	 *	%r8 : L pointers (void *L[16])
+	 */
+	CFI_STARTPROC();
+
+	vzeroupper;
+
+	subq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(4 * 8);
+
+	movq %r10, (0 * 8)(%rsp);
+	movq %r11, (1 * 8)(%rsp);
+	movq %r12, (2 * 8)(%rsp);
+	movq %r13, (3 * 8)(%rsp);
+	CFI_REL_OFFSET(%r10, 0 * 8);
+	CFI_REL_OFFSET(%r11, 1 * 8);
+	CFI_REL_OFFSET(%r12, 2 * 8);
+	CFI_REL_OFFSET(%r13, 3 * 8);
+
+	vmovdqu (%rdx), RTMP0x;
+
+	/* Offset_i = Offset_{i-1} xor L_{ntz(i)} */
+	/* Sum_i = Sum_{i-1} xor ENCIPHER(K, A_i xor Offset_i)  */
+
+#define OCB_INPUT(n, l0reg, l1reg, yreg) \
+	  vmovdqu (n * 32)(%rsi), yreg; \
+	  vpxor (l0reg), RTMP0x, RNOTx; \
+	  vpxor (l1reg), RNOTx, RTMP0x; \
+	  vinserti128 $1, RTMP0x, RNOT, RNOT; \
+	  vpxor yreg, RNOT, yreg;
+
+	movq (0 * 8)(%r8), %r10;
+	movq (1 * 8)(%r8), %r11;
+	movq (2 * 8)(%r8), %r12;
+	movq (3 * 8)(%r8), %r13;
+	OCB_INPUT(0, %r10, %r11, RA0);
+	OCB_INPUT(1, %r12, %r13, RA1);
+	movq (4 * 8)(%r8), %r10;
+	movq (5 * 8)(%r8), %r11;
+	movq (6 * 8)(%r8), %r12;
+	movq (7 * 8)(%r8), %r13;
+	OCB_INPUT(2, %r10, %r11, RA2);
+	OCB_INPUT(3, %r12, %r13, RA3);
+	movq (8 * 8)(%r8), %r10;
+	movq (9 * 8)(%r8), %r11;
+	movq (10 * 8)(%r8), %r12;
+	movq (11 * 8)(%r8), %r13;
+	OCB_INPUT(4, %r10, %r11, RB0);
+	OCB_INPUT(5, %r12, %r13, RB1);
+	movq (12 * 8)(%r8), %r10;
+	movq (13 * 8)(%r8), %r11;
+	movq (14 * 8)(%r8), %r12;
+	movq (15 * 8)(%r8), %r13;
+	OCB_INPUT(6, %r10, %r11, RB2);
+	OCB_INPUT(7, %r12, %r13, RB3);
+#undef OCB_INPUT
+
+	vmovdqu RTMP0x, (%rdx);
+
+	movq (0 * 8)(%rsp), %r10;
+	movq (1 * 8)(%rsp), %r11;
+	movq (2 * 8)(%rsp), %r12;
+	movq (3 * 8)(%rsp), %r13;
+	CFI_RESTORE(%r10);
+	CFI_RESTORE(%r11);
+	CFI_RESTORE(%r12);
+	CFI_RESTORE(%r13);
+
+	call __sm4_crypt_blk16;
+
+	addq $(4 * 8), %rsp;
+	CFI_ADJUST_CFA_OFFSET(-4 * 8);
+
+	vpxor RA0, RB0, RA0;
+	vpxor RA1, RB1, RA1;
+	vpxor RA2, RB2, RA2;
+	vpxor RA3, RB3, RA3;
+
+	vpxor RA1, RA0, RA0;
+	vpxor RA3, RA2, RA2;
+
+	vpxor RA2, RA0, RTMP1;
+
+	vextracti128 $1, RTMP1, RNOTx;
+	vpxor (%rcx), RTMP1x, RTMP1x;
+	vpxor RNOTx, RTMP1x, RTMP1x;
+	vmovdqu RTMP1x, (%rcx);
+
+	vzeroall;
+
+	ret;
+	CFI_ENDPROC();
+ELF(.size _gcry_sm4_aesni_avx2_ocb_auth,.-_gcry_sm4_aesni_avx2_ocb_auth;)
+
+#endif /*defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX_SUPPORT)*/
+#endif /*__x86_64*/
diff --git a/cipher/sm4.c b/cipher/sm4.c
index da75cf87..0da095a5 100644
--- a/cipher/sm4.c
+++ b/cipher/sm4.c
@@ -47,10 +47,19 @@
 # endif
 #endif
 
+/* USE_AESNI_AVX inidicates whether to compile with Intel AES-NI/AVX2 code. */
+#undef USE_AESNI_AVX2
+#if defined(ENABLE_AESNI_SUPPORT) && defined(ENABLE_AVX2_SUPPORT)
+# if defined(__x86_64__) && (defined(HAVE_COMPATIBLE_GCC_AMD64_PLATFORM_AS) || \
+     defined(HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS))
+#  define USE_AESNI_AVX2 1
+# endif
+#endif
+
 /* Assembly implementations use SystemV ABI, ABI conversion and additional
  * stack to store XMM6-XMM15 needed on Win64. */
 #undef ASM_FUNC_ABI
-#if defined(USE_AESNI_AVX)
+#if defined(USE_AESNI_AVX) || defined(USE_AESNI_AVX2)
 # ifdef HAVE_COMPATIBLE_GCC_WIN64_PLATFORM_AS
 #  define ASM_FUNC_ABI __attribute__((sysv_abi))
 # else
@@ -67,6 +76,9 @@ typedef struct
 #ifdef USE_AESNI_AVX
   unsigned int use_aesni_avx:1;
 #endif
+#ifdef USE_AESNI_AVX2
+  unsigned int use_aesni_avx2:1;
+#endif
 } SM4_context;
 
 static const u32 fk[4] =
@@ -172,6 +184,40 @@ extern void _gcry_sm4_aesni_avx_ocb_auth(const u32 *rk_enc,
 					 const u64 Ls[8]) ASM_FUNC_ABI;
 #endif /* USE_AESNI_AVX */
 
+#ifdef USE_AESNI_AVX2
+extern void _gcry_sm4_aesni_avx2_ctr_enc(const u32 *rk_enc, byte *out,
+					 const byte *in,
+					 byte *ctr) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx2_cbc_dec(const u32 *rk_dec, byte *out,
+					 const byte *in,
+					 byte *iv) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx2_cfb_dec(const u32 *rk_enc, byte *out,
+					 const byte *in,
+					 byte *iv) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx2_ocb_enc(const u32 *rk_enc,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *offset,
+					 unsigned char *checksum,
+					 const u64 Ls[16]) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx2_ocb_dec(const u32 *rk_dec,
+					 unsigned char *out,
+					 const unsigned char *in,
+					 unsigned char *offset,
+					 unsigned char *checksum,
+					 const u64 Ls[16]) ASM_FUNC_ABI;
+
+extern void _gcry_sm4_aesni_avx2_ocb_auth(const u32 *rk_enc,
+					  const unsigned char *abuf,
+					  unsigned char *offset,
+					  unsigned char *checksum,
+					  const u64 Ls[16]) ASM_FUNC_ABI;
+#endif /* USE_AESNI_AVX2 */
+
 static inline void prefetch_sbox_table(void)
 {
   const volatile byte *vtab = (void *)&sbox_table;
@@ -301,6 +347,9 @@ sm4_setkey (void *context, const byte *key, const unsigned keylen,
 #ifdef USE_AESNI_AVX
   ctx->use_aesni_avx = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX);
 #endif
+#ifdef USE_AESNI_AVX2
+  ctx->use_aesni_avx2 = (hwf & HWF_INTEL_AESNI) && (hwf & HWF_INTEL_AVX2);
+#endif
 
   sm4_expand_key (ctx, key);
   return 0;
@@ -444,6 +493,21 @@ _gcry_sm4_ctr_enc(void *context, unsigned char *ctr,
   const byte *inbuf = inbuf_arg;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX2
+  if (ctx->use_aesni_avx2)
+    {
+      /* Process data in 16 block chunks. */
+      while (nblocks >= 16)
+        {
+          _gcry_sm4_aesni_avx2_ctr_enc(ctx->rkey_enc, outbuf, inbuf, ctr);
+
+          nblocks -= 16;
+          outbuf += 16 * 16;
+          inbuf += 16 * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AESNI_AVX
   if (ctx->use_aesni_avx)
     {
@@ -530,6 +594,21 @@ _gcry_sm4_cbc_dec(void *context, unsigned char *iv,
   const unsigned char *inbuf = inbuf_arg;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX2
+  if (ctx->use_aesni_avx2)
+    {
+      /* Process data in 16 block chunks. */
+      while (nblocks >= 16)
+        {
+          _gcry_sm4_aesni_avx2_cbc_dec(ctx->rkey_dec, outbuf, inbuf, iv);
+
+          nblocks -= 16;
+          outbuf += 16 * 16;
+          inbuf += 16 * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AESNI_AVX
   if (ctx->use_aesni_avx)
     {
@@ -609,6 +688,21 @@ _gcry_sm4_cfb_dec(void *context, unsigned char *iv,
   const unsigned char *inbuf = inbuf_arg;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX2
+  if (ctx->use_aesni_avx2)
+    {
+      /* Process data in 16 block chunks. */
+      while (nblocks >= 16)
+        {
+          _gcry_sm4_aesni_avx2_cfb_dec(ctx->rkey_enc, outbuf, inbuf, iv);
+
+          nblocks -= 16;
+          outbuf += 16 * 16;
+          inbuf += 16 * 16;
+        }
+    }
+#endif
+
 #ifdef USE_AESNI_AVX
   if (ctx->use_aesni_avx)
     {
@@ -691,6 +785,53 @@ _gcry_sm4_ocb_crypt (gcry_cipher_hd_t c, void *outbuf_arg,
   u64 blkn = c->u_mode.ocb.data_nblocks;
   int burn_stack_depth = 0;
 
+#ifdef USE_AESNI_AVX2
+  if (ctx->use_aesni_avx2)
+    {
+      u64 Ls[16];
+      unsigned int n = 16 - (blkn % 16);
+      u64 *l;
+      int i;
+
+      if (nblocks >= 16)
+	{
+	  for (i = 0; i < 16; i += 8)
+	    {
+	      /* Use u64 to store pointers for x32 support (assembly function
+	       * assumes 64-bit pointers). */
+	      Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	      Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	      Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	      Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+	      Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	      Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	      Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	    }
+
+	  Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+	  l = &Ls[(15 + n) % 16];
+
+	  /* Process data in 16 block chunks. */
+	  while (nblocks >= 16)
+	    {
+	      blkn += 16;
+	      *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 16);
+
+	      if (encrypt)
+		_gcry_sm4_aesni_avx2_ocb_enc(ctx->rkey_enc, outbuf, inbuf,
+					     c->u_iv.iv, c->u_ctr.ctr, Ls);
+	      else
+		_gcry_sm4_aesni_avx2_ocb_dec(ctx->rkey_dec, outbuf, inbuf,
+					     c->u_iv.iv, c->u_ctr.ctr, Ls);
+
+	      nblocks -= 16;
+	      outbuf += 16 * 16;
+	      inbuf += 16 * 16;
+	    }
+	}
+    }
+#endif
+
 #ifdef USE_AESNI_AVX
   if (ctx->use_aesni_avx)
     {
@@ -813,6 +954,49 @@ _gcry_sm4_ocb_auth (gcry_cipher_hd_t c, const void *abuf_arg, size_t nblocks)
   const unsigned char *abuf = abuf_arg;
   u64 blkn = c->u_mode.ocb.aad_nblocks;
 
+#ifdef USE_AESNI_AVX2
+  if (ctx->use_aesni_avx2)
+    {
+      u64 Ls[16];
+      unsigned int n = 16 - (blkn % 16);
+      u64 *l;
+      int i;
+
+      if (nblocks >= 16)
+	{
+	  for (i = 0; i < 16; i += 8)
+	    {
+	      /* Use u64 to store pointers for x32 support (assembly function
+	       * assumes 64-bit pointers). */
+	      Ls[(i + 0 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	      Ls[(i + 1 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	      Ls[(i + 2 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	      Ls[(i + 3 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[2];
+	      Ls[(i + 4 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	      Ls[(i + 5 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[1];
+	      Ls[(i + 6 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[0];
+	    }
+
+	  Ls[(7 + n) % 16] = (uintptr_t)(void *)c->u_mode.ocb.L[3];
+	  l = &Ls[(15 + n) % 16];
+
+	  /* Process data in 16 block chunks. */
+	  while (nblocks >= 16)
+	    {
+	      blkn += 16;
+	      *l = (uintptr_t)(void *)ocb_get_l(c, blkn - blkn % 16);
+
+	      _gcry_sm4_aesni_avx2_ocb_auth(ctx->rkey_enc, abuf,
+					    c->u_mode.ocb.aad_offset,
+					    c->u_mode.ocb.aad_sum, Ls);
+
+	      nblocks -= 16;
+	      abuf += 16 * 16;
+	    }
+	}
+    }
+#endif
+
 #ifdef USE_AESNI_AVX
   if (ctx->use_aesni_avx)
     {
diff --git a/configure.ac b/configure.ac
index 2458acfc..1f03e79f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2569,6 +2569,7 @@ if test "$found" = "1" ; then
       x86_64-*-*)
          # Build with the assembly implementation
          GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4-aesni-avx-amd64.lo"
+         GCRYPT_CIPHERS="$GCRYPT_CIPHERS sm4-aesni-avx2-amd64.lo"
       ;;
    esac
 fi
-- 
2.25.1


From jussi.kivilinna at iki.fi  Tue Jun 16 21:35:49 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Tue, 16 Jun 2020 22:35:49 +0300
Subject: [PATCH v2 0/2] Add SM4 symmetric cipher algorithm
In-Reply-To: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
References: <20200616090929.102931-1-tianjia.zhang@linux.alibaba.com>
Message-ID: <a8bb9252-1d15-9a34-b671-70a6363a5476@iki.fi>

Hello,

On 16.6.2020 12.09, Tianjia Zhang via Gcrypt-devel wrote:
> SM4 (GBT.32907-2016) is a cryptographic standard issued by the
> Organization of State Commercial Administration of China (OSCCA)
> as an authorized cryptographic algorithms for the use within China.
> 
> SMS4 was originally created for use in protecting wireless
> networks, and is mandated in the Chinese National Standard for
> Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure)
> (GB.15629.11-2003).
> 
> Tianjia Zhang (2):
>   Add SM4 symmetric cipher algorithm
>   tests: Add basic test-vectors for SM4
> 

Thanks, pushed to master with small fixes.

-Jussi


From jussi.kivilinna at iki.fi  Sat Jun 20 14:08:28 2020
From: jussi.kivilinna at iki.fi (Jussi Kivilinna)
Date: Sat, 20 Jun 2020 15:08:28 +0300
Subject: [PATCH] Camellia AES-NI/AVX/AVX2 size optimization
Message-ID: <20200620120828.2892006-1-jussi.kivilinna@iki.fi>

* cipher/camellia-aesni-avx-amd64.S: Use loop for handling repeating
'(enc|dec)_rounds16/fls16' portions of encryption/decryption.
* cipher/camellia-aesni-avx2-amd64.S: Use loop for handling repeating
'(enc|dec)_rounds32/fls32' portions of encryption/decryption.
--

Use rounds+fls loop to reduce binary size of Camellia AES-NI/AVX/AVX2
implementations. This also gives small performance boost on AMD Zen2.

Before:
   text    data     bss     dec     hex filename
  63877       0       0   63877    f985 cipher/.libs/camellia-aesni-avx2-amd64.o
  59623       0       0   59623    e8e7 cipher/.libs/camellia-aesni-avx-amd64.o

After:
   text    data     bss     dec     hex filename
  22999       0       0   22999    59d7 cipher/.libs/camellia-aesni-avx2-amd64.o
  25047       0       0   25047    61d7 cipher/.libs/camellia-aesni-avx-amd64.o

Benchmark on AMD Ryzen 7 3700X:

Before:
Cipher:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.670 ns/B      1424 MiB/s      2.88 c/B      4300
        CFB dec |     0.667 ns/B      1430 MiB/s      2.87 c/B      4300
        CTR enc |     0.677 ns/B      1410 MiB/s      2.91 c/B      4300
        CTR dec |     0.676 ns/B      1412 MiB/s      2.90 c/B      4300
        OCB enc |     0.696 ns/B      1370 MiB/s      2.98 c/B      4275
        OCB dec |     0.698 ns/B      1367 MiB/s      2.98 c/B      4275
       OCB auth |     0.683 ns/B      1395 MiB/s      2.94 c/B      4300

After (~8% faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.611 ns/B      1561 MiB/s      2.64 c/B      4313
        CFB dec |     0.616 ns/B      1549 MiB/s      2.65 c/B      4312
        CTR enc |     0.625 ns/B      1525 MiB/s      2.69 c/B      4300
        CTR dec |     0.625 ns/B      1526 MiB/s      2.69 c/B      4299
        OCB enc |     0.639 ns/B      1493 MiB/s      2.75 c/B      4307
        OCB dec |     0.642 ns/B      1485 MiB/s      2.76 c/B      4301
       OCB auth |     0.631 ns/B      1512 MiB/s      2.71 c/B      4300

Signed-off-by: Jussi Kivilinna <jussi.kivilinna at iki.fi>
---
 cipher/camellia-aesni-avx-amd64.S  | 136 +++++++++++------------------
 cipher/camellia-aesni-avx2-amd64.S | 135 +++++++++++-----------------
 2 files changed, 106 insertions(+), 165 deletions(-)

diff --git a/cipher/camellia-aesni-avx-amd64.S b/cipher/camellia-aesni-avx-amd64.S
index 4671bcfe..64cabaa5 100644
--- a/cipher/camellia-aesni-avx-amd64.S
+++ b/cipher/camellia-aesni-avx-amd64.S
@@ -1,6 +1,6 @@
 /* camellia-avx-aesni-amd64.S  -  AES-NI/AVX implementation of Camellia cipher
  *
- * Copyright (C) 2013-2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ * Copyright (C) 2013-2015,2020 Jussi Kivilinna <jussi.kivilinna at iki.fi>
  *
  * This file is part of Libgcrypt.
  *
@@ -35,7 +35,6 @@
 
 /* register macros */
 #define CTX %rdi
-#define RIO %r8
 
 /**********************************************************************
   helper macros
@@ -772,6 +771,7 @@ __camellia_enc_blk16:
 	/* input:
 	 *	%rdi: ctx, CTX
 	 *	%rax: temporary storage, 256 bytes
+	 *	%r8d: 24 for 16 byte key, 32 for larger
 	 *	%xmm0..%xmm15: 16 plaintext blocks
 	 * output:
 	 *	%xmm0..%xmm15: 16 encrypted blocks, order swapped:
@@ -781,42 +781,32 @@ __camellia_enc_blk16:
 
 	leaq 8 * 16(%rax), %rcx;
 
+	leaq (-8 * 8)(CTX, %r8, 8), %r8;
+
 	inpack16_post(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
 		      %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
 		      %xmm15, %rax, %rcx);
 
+.align 8
+.Lenc_loop:
 	enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
 		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
 		     %xmm15, %rax, %rcx, 0);
 
-	fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-	      %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-	      %xmm15,
-	      ((key_table + (8) * 8) + 0)(CTX),
-	      ((key_table + (8) * 8) + 4)(CTX),
-	      ((key_table + (8) * 8) + 8)(CTX),
-	      ((key_table + (8) * 8) + 12)(CTX));
-
-	enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		     %xmm15, %rax, %rcx, 8);
+	cmpq %r8, CTX;
+	je .Lenc_done;
+	leaq (8 * 8)(CTX), CTX;
 
 	fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
 	      %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
 	      %xmm15,
-	      ((key_table + (16) * 8) + 0)(CTX),
-	      ((key_table + (16) * 8) + 4)(CTX),
-	      ((key_table + (16) * 8) + 8)(CTX),
-	      ((key_table + (16) * 8) + 12)(CTX));
-
-	enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		     %xmm15, %rax, %rcx, 16);
-
-	movl $24, %r8d;
-	cmpl $128, key_bitlength(CTX);
-	jne .Lenc_max32;
+	      ((key_table) + 0)(CTX),
+	      ((key_table) + 4)(CTX),
+	      ((key_table) + 8)(CTX),
+	      ((key_table) + 12)(CTX));
+	jmp .Lenc_loop;
 
+.align 8
 .Lenc_done:
 	/* load CD for output */
 	vmovdqu 0 * 16(%rcx), %xmm8;
@@ -830,27 +820,9 @@ __camellia_enc_blk16:
 
 	outunpack16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
 		    %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		    %xmm15, (key_table)(CTX, %r8, 8), (%rax), 1 * 16(%rax));
+		    %xmm15, ((key_table) + 8 * 8)(%r8), (%rax), 1 * 16(%rax));
 
 	ret;
-
-.align 8
-.Lenc_max32:
-	movl $32, %r8d;
-
-	fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-	      %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-	      %xmm15,
-	      ((key_table + (24) * 8) + 0)(CTX),
-	      ((key_table + (24) * 8) + 4)(CTX),
-	      ((key_table + (24) * 8) + 8)(CTX),
-	      ((key_table + (24) * 8) + 12)(CTX));
-
-	enc_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		     %xmm15, %rax, %rcx, 24);
-
-	jmp .Lenc_done;
 	CFI_ENDPROC();
 ELF(.size __camellia_enc_blk16,.-__camellia_enc_blk16;)
 
@@ -869,44 +841,38 @@ __camellia_dec_blk16:
 	 */
 	CFI_STARTPROC();
 
+	movq %r8, %rcx;
+	movq CTX, %r8
+	leaq (-8 * 8)(CTX, %rcx, 8), CTX;
+
 	leaq 8 * 16(%rax), %rcx;
 
 	inpack16_post(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
 		      %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
 		      %xmm15, %rax, %rcx);
 
-	cmpl $32, %r8d;
-	je .Ldec_max32;
-
-.Ldec_max24:
+.align 8
+.Ldec_loop:
 	dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
 		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		     %xmm15, %rax, %rcx, 16);
-
-	fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-	      %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-	      %xmm15,
-	      ((key_table + (16) * 8) + 8)(CTX),
-	      ((key_table + (16) * 8) + 12)(CTX),
-	      ((key_table + (16) * 8) + 0)(CTX),
-	      ((key_table + (16) * 8) + 4)(CTX));
+		     %xmm15, %rax, %rcx, 0);
 
-	dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		     %xmm15, %rax, %rcx, 8);
+	cmpq %r8, CTX;
+	je .Ldec_done;
 
 	fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
 	      %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
 	      %xmm15,
-	      ((key_table + (8) * 8) + 8)(CTX),
-	      ((key_table + (8) * 8) + 12)(CTX),
-	      ((key_table + (8) * 8) + 0)(CTX),
-	      ((key_table + (8) * 8) + 4)(CTX));
+	      ((key_table) + 8)(CTX),
+	      ((key_table) + 12)(CTX),
+	      ((key_table) + 0)(CTX),
+	      ((key_table) + 4)(CTX));
 
-	dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		     %xmm15, %rax, %rcx, 0);
+	leaq (-8 * 8)(CTX), CTX;
+	jmp .Ldec_loop;
 
+.align 8
+.Ldec_done:
 	/* load CD for output */
 	vmovdqu 0 * 16(%rcx), %xmm8;
 	vmovdqu 1 * 16(%rcx), %xmm9;
@@ -922,22 +888,6 @@ __camellia_dec_blk16:
 		    %xmm15, (key_table)(CTX), (%rax), 1 * 16(%rax));
 
 	ret;
-
-.align 8
-.Ldec_max32:
-	dec_rounds16(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-		     %xmm15, %rax, %rcx, 24);
-
-	fls16(%rax, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
-	      %rcx, %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
-	      %xmm15,
-	      ((key_table + (24) * 8) + 8)(CTX),
-	      ((key_table + (24) * 8) + 12)(CTX),
-	      ((key_table + (24) * 8) + 0)(CTX),
-	      ((key_table + (24) * 8) + 4)(CTX));
-
-	jmp .Ldec_max24;
 	CFI_ENDPROC();
 ELF(.size __camellia_dec_blk16,.-__camellia_dec_blk16;)
 
@@ -967,6 +917,11 @@ _gcry_camellia_aesni_avx_ctr_enc:
 
 	vzeroupper;
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
 	subq $(16 * 16), %rsp;
 	andq $~31, %rsp;
 	movq %rsp, %rax;
@@ -1163,6 +1118,11 @@ _gcry_camellia_aesni_avx_cfb_dec:
 
 	vzeroupper;
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
 	subq $(16 * 16), %rsp;
 	andq $~31, %rsp;
 	movq %rsp, %rax;
@@ -1307,6 +1267,11 @@ _gcry_camellia_aesni_avx_ocb_enc:
 	vmovdqu %xmm14, (%rcx);
 	vmovdqu %xmm15, (%r8);
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %r10d;
+	cmovel %r10d, %r8d; /* max */
+
 	/* inpack16_pre: */
 	vmovq (key_table)(CTX), %xmm15;
 	vpshufb .Lpack_bswap rRIP, %xmm15, %xmm15;
@@ -1617,6 +1582,11 @@ _gcry_camellia_aesni_avx_ocb_auth:
 	OCB_INPUT(15, %r13, %xmm0);
 #undef OCB_INPUT
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %r10d;
+	cmovel %r10d, %r8d; /* max */
+
 	vmovdqu %xmm15, (%rdx);
 
 	movq %rcx, %r10;
diff --git a/cipher/camellia-aesni-avx2-amd64.S b/cipher/camellia-aesni-avx2-amd64.S
index 517e6880..f620f040 100644
--- a/cipher/camellia-aesni-avx2-amd64.S
+++ b/cipher/camellia-aesni-avx2-amd64.S
@@ -1,6 +1,6 @@
 /* camellia-avx2-aesni-amd64.S  -  AES-NI/AVX2 implementation of Camellia cipher
  *
- * Copyright (C) 2013-2015 Jussi Kivilinna <jussi.kivilinna at iki.fi>
+ * Copyright (C) 2013-2015,2020 Jussi Kivilinna <jussi.kivilinna at iki.fi>
  *
  * This file is part of Libgcrypt.
  *
@@ -751,6 +751,7 @@ __camellia_enc_blk32:
 	/* input:
 	 *	%rdi: ctx, CTX
 	 *	%rax: temporary storage, 512 bytes
+	 *	%r8d: 24 for 16 byte key, 32 for larger
 	 *	%ymm0..%ymm15: 32 plaintext blocks
 	 * output:
 	 *	%ymm0..%ymm15: 32 encrypted blocks, order swapped:
@@ -760,42 +761,32 @@ __camellia_enc_blk32:
 
 	leaq 8 * 32(%rax), %rcx;
 
+	leaq (-8 * 8)(CTX, %r8, 8), %r8;
+
 	inpack32_post(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 		      %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
 		      %ymm15, %rax, %rcx);
 
+.align 8
+.Lenc_loop:
 	enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
 		     %ymm15, %rax, %rcx, 0);
 
-	fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-	      %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-	      %ymm15,
-	      ((key_table + (8) * 8) + 0)(CTX),
-	      ((key_table + (8) * 8) + 4)(CTX),
-	      ((key_table + (8) * 8) + 8)(CTX),
-	      ((key_table + (8) * 8) + 12)(CTX));
-
-	enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		     %ymm15, %rax, %rcx, 8);
+	cmpq %r8, CTX;
+	je .Lenc_done;
+	leaq (8 * 8)(CTX), CTX;
 
 	fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 	      %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
 	      %ymm15,
-	      ((key_table + (16) * 8) + 0)(CTX),
-	      ((key_table + (16) * 8) + 4)(CTX),
-	      ((key_table + (16) * 8) + 8)(CTX),
-	      ((key_table + (16) * 8) + 12)(CTX));
-
-	enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		     %ymm15, %rax, %rcx, 16);
-
-	movl $24, %r8d;
-	cmpl $128, key_bitlength(CTX);
-	jne .Lenc_max32;
+	      ((key_table) + 0)(CTX),
+	      ((key_table) + 4)(CTX),
+	      ((key_table) + 8)(CTX),
+	      ((key_table) + 12)(CTX));
+	jmp .Lenc_loop;
 
+.align 8
 .Lenc_done:
 	/* load CD for output */
 	vmovdqu 0 * 32(%rcx), %ymm8;
@@ -809,27 +800,9 @@ __camellia_enc_blk32:
 
 	outunpack32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 		    %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		    %ymm15, (key_table)(CTX, %r8, 8), (%rax), 1 * 32(%rax));
+		    %ymm15, ((key_table) + 8 * 8)(%r8), (%rax), 1 * 32(%rax));
 
 	ret;
-
-.align 8
-.Lenc_max32:
-	movl $32, %r8d;
-
-	fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-	      %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-	      %ymm15,
-	      ((key_table + (24) * 8) + 0)(CTX),
-	      ((key_table + (24) * 8) + 4)(CTX),
-	      ((key_table + (24) * 8) + 8)(CTX),
-	      ((key_table + (24) * 8) + 12)(CTX));
-
-	enc_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		     %ymm15, %rax, %rcx, 24);
-
-	jmp .Lenc_done;
 	CFI_ENDPROC();
 ELF(.size __camellia_enc_blk32,.-__camellia_enc_blk32;)
 
@@ -848,44 +821,38 @@ __camellia_dec_blk32:
 	 */
 	CFI_STARTPROC();
 
+	movq %r8, %rcx;
+	movq CTX, %r8
+	leaq (-8 * 8)(CTX, %rcx, 8), CTX;
+
 	leaq 8 * 32(%rax), %rcx;
 
 	inpack32_post(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 		      %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
 		      %ymm15, %rax, %rcx);
 
-	cmpl $32, %r8d;
-	je .Ldec_max32;
-
-.Ldec_max24:
+.align 8
+.Ldec_loop:
 	dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		     %ymm15, %rax, %rcx, 16);
-
-	fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-	      %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-	      %ymm15,
-	      ((key_table + (16) * 8) + 8)(CTX),
-	      ((key_table + (16) * 8) + 12)(CTX),
-	      ((key_table + (16) * 8) + 0)(CTX),
-	      ((key_table + (16) * 8) + 4)(CTX));
+		     %ymm15, %rax, %rcx, 0);
 
-	dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		     %ymm15, %rax, %rcx, 8);
+	cmpq %r8, CTX;
+	je .Ldec_done;
 
 	fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
 	      %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
 	      %ymm15,
-	      ((key_table + (8) * 8) + 8)(CTX),
-	      ((key_table + (8) * 8) + 12)(CTX),
-	      ((key_table + (8) * 8) + 0)(CTX),
-	      ((key_table + (8) * 8) + 4)(CTX));
+	      ((key_table) + 8)(CTX),
+	      ((key_table) + 12)(CTX),
+	      ((key_table) + 0)(CTX),
+	      ((key_table) + 4)(CTX));
 
-	dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		     %ymm15, %rax, %rcx, 0);
+	leaq (-8 * 8)(CTX), CTX;
+	jmp .Ldec_loop;
 
+.align 8
+.Ldec_done:
 	/* load CD for output */
 	vmovdqu 0 * 32(%rcx), %ymm8;
 	vmovdqu 1 * 32(%rcx), %ymm9;
@@ -901,22 +868,6 @@ __camellia_dec_blk32:
 		    %ymm15, (key_table)(CTX), (%rax), 1 * 32(%rax));
 
 	ret;
-
-.align 8
-.Ldec_max32:
-	dec_rounds32(%ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-		     %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-		     %ymm15, %rax, %rcx, 24);
-
-	fls32(%rax, %ymm0, %ymm1, %ymm2, %ymm3, %ymm4, %ymm5, %ymm6, %ymm7,
-	      %rcx, %ymm8, %ymm9, %ymm10, %ymm11, %ymm12, %ymm13, %ymm14,
-	      %ymm15,
-	      ((key_table + (24) * 8) + 8)(CTX),
-	      ((key_table + (24) * 8) + 12)(CTX),
-	      ((key_table + (24) * 8) + 0)(CTX),
-	      ((key_table + (24) * 8) + 4)(CTX));
-
-	jmp .Ldec_max24;
 	CFI_ENDPROC();
 ELF(.size __camellia_dec_blk32,.-__camellia_dec_blk32;)
 
@@ -949,6 +900,11 @@ _gcry_camellia_aesni_avx2_ctr_enc:
 
 	vzeroupper;
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
 	subq $(16 * 32), %rsp;
 	andq $~63, %rsp;
 	movq %rsp, %rax;
@@ -1216,6 +1172,11 @@ _gcry_camellia_aesni_avx2_cfb_dec:
 
 	vzeroupper;
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %eax;
+	cmovel %eax, %r8d; /* max */
+
 	subq $(16 * 32), %rsp;
 	andq $~63, %rsp;
 	movq %rsp, %rax;
@@ -1384,6 +1345,11 @@ _gcry_camellia_aesni_avx2_ocb_enc:
 	vpxor %xmm13, %xmm15, %xmm15;
 	vmovdqu %xmm15, (%r8);
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %r10d;
+	cmovel %r10d, %r8d; /* max */
+
 	/* inpack16_pre: */
 	vpbroadcastq (key_table)(CTX), %ymm15;
 	vpshufb .Lpack_bswap rRIP, %ymm15, %ymm15;
@@ -1742,6 +1708,11 @@ _gcry_camellia_aesni_avx2_ocb_auth:
 
 	vmovdqu %xmm14, (%rdx);
 
+	cmpl $128, key_bitlength(CTX);
+	movl $32, %r8d;
+	movl $24, %r10d;
+	cmovel %r10d, %r8d; /* max */
+
 	movq %rcx, %r10;
 
 	/* inpack16_pre: */
-- 
2.25.1


From stanermetin at gmail.com  Sun Jun 21 23:59:29 2020
From: stanermetin at gmail.com (Taner)
Date: Sun, 21 Jun 2020 23:59:29 +0200
Subject: Leave
Message-ID: <CAGdqjjEdrVuzeLZcG4pcnVEXg8-meexYRYMuEF8c3gwMWCLUKA@mail.gmail.com>

Hello,

I would like to get out from mailing group.

Thank you,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.gnupg.org/pipermail/gcrypt-devel/attachments/20200621/936cbfbc/attachment-0001.html>

From mandar.apte409 at gmail.com  Mon Jun 22 14:44:31 2020
From: mandar.apte409 at gmail.com (mandar.apte409 at gmail.com)
Date: Mon, 22 Jun 2020 05:44:31 -0700 (MST)
Subject: Generate ECDH shared key - NIST P256
Message-ID: <518964648.31228.1592829871223.JavaMail.administrator@n7.nabble.com>

Hi all, 
   I am trying to generate ECC shared secret key using Libgcrypt 1.8.5. Based on documentation of Libgcrypt, I used gcry_pk_genkey() to generate public-private key pair on server and client. The S-Expression I used is "(genkey(ecdh(curve NIST-P256)(use-fips186)))" to generate public-private key pair based on ECC NIST P256 curve. 
 Now I need to generate shared secret key (ECDH agreement) using Local private key and remote public key. I see that, there is no single function like "ECDH_compute_key()" in openssl, to generate secret shared key.

After browsing lot of websites, white paper etc, I figured out that gcry_pk_encrypt() is suppose to be used to generate shared secret. But when I tried to use that function, it yielded me different shared secret. On client side, I am using client's private key and server's public key and on server side I am using Server's private key and client's public key.

Can anyone help me in this please, to generate shared secret key using Libgcrypt 1.8.5 version ?
Any help is highly appreciated.

Thank you in advance.
Best Regards,
Mandar

_____________________________________
Sent from http://gnupg.10057.n7.nabble.com