From ametzler at downhill.at.eu.org Sat Nov 3 18:29:22 2012 From: ametzler at downhill.at.eu.org (Andreas Metzler) Date: Sat, 3 Nov 2012 18:29:22 +0100 Subject: Bug#566351: libgcrypt11: should not change user id as a side effect In-Reply-To: <20100123134725.GA3309@downhill.g.la> References: <20100123045523.5883.159.reportbug@marvin.43-1.org> <20100123134725.GA3309@downhill.g.la> Message-ID: <20121103172922.GD3104@downhill.g.la> On 2010-01-23 Andreas Metzler wrote: > On 2010-01-23 Ansgar Burchardt wrote: > > the function lock_pool from src/secmem.c has the side effect of changing > > user ids if real uid != effective uid. This causes strange behaviour in > > other programs: > > A program using libnss-ldap for querying group membership with SSL > > enabled, but without nscd might suddenly change the user id when calling > > getgroups (or initgroups). An example for this is the atd daemon[1]. There is very long Ubuntu bug about the issue , this comment sums it up: Ubuntu is now shipping libgcrypt with this patch -------------------------------- +--- a/src/global.c ++++ b/src/global.c +@@ -445,8 +445,6 @@ + + case GCRYCTL_SET_THREAD_CBS: + err = ath_install (va_arg (arg_ptr, void *), any_init_done); +- if (! err) +- global_init (); + break; + + case GCRYCTL_FAST_POLL: -------------------------------- which might be replaced by the following one to fix . ------------------------------ --- libgcrypt11-1.5.0.orig/src/global.c +++ libgcrypt11-1.5.0/src/global.c @@ -370,11 +370,13 @@ _gcry_vcontrol (enum gcry_ctl_cmds cmd, break; case GCRYCTL_DISABLE_SECMEM_WARN: + global_init (); _gcry_secmem_set_flags ((_gcry_secmem_get_flags () | GCRY_SECMEM_FLAG_NO_WARNING)); break; case GCRYCTL_SUSPEND_SECMEM_WARN: + global_init (); _gcry_secmem_set_flags ((_gcry_secmem_get_flags () | GCRY_SECMEM_FLAG_SUSPEND_WARNING)); break; @@ -445,8 +447,6 @@ _gcry_vcontrol (enum gcry_ctl_cmds cmd, case GCRYCTL_SET_THREAD_CBS: err = ath_install (va_arg (arg_ptr, void *), any_init_done); - if (! err) - global_init (); break; case GCRYCTL_FAST_POLL: ------------------------------ cu andreas -- `What a good friend you are to him, Dr. Maturin. His other friends are so grateful to you.' `I sew his ears on from time to time, sure' From wk at gnupg.org Wed Nov 7 10:31:16 2012 From: wk at gnupg.org (Werner Koch) Date: Wed, 07 Nov 2012 10:31:16 +0100 Subject: Bug#566351: libgcrypt11: should not change user id as a side effect In-Reply-To: <20121103172922.GD3104@downhill.g.la> (Andreas Metzler's message of "Sat, 3 Nov 2012 18:29:22 +0100") References: <20100123045523.5883.159.reportbug@marvin.43-1.org> <20100123134725.GA3309@downhill.g.la> <20121103172922.GD3104@downhill.g.la> Message-ID: <87k3txwyhn.fsf@vigenere.g10code.de> On Sat, 3 Nov 2012 18:29, ametzler at downhill.at.eu.org said: > comment sums it up: > Well, it is the usual problem with inter-library dependencies. We will never be able to get this right. The DSO is just not designed to work with completely independent libraries. I don't like to say, but in this regard Windows DLLs are a better solution. Although we can't solve all the problems we will be able to solve the thread initialization problem. Libgcrypt 1.6 will ignore the thread callbacks and assume pthread. Semaphores are then used for locking and provide a way to do thread-safe initialization. The hopefully minor drawback is that one needs to link against librt. > + case GCRYCTL_SET_THREAD_CBS: > + err = ath_install (va_arg (arg_ptr, void *), any_init_done); > +- if (! 
err) > +- global_init (); Okay, if that works, fine. It might break other things; I don't know. There are enough selftests to hopefully detect such a break (in particular in FIPS mode). Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From stadtkind2 at gmx.de Thu Nov 8 19:44:10 2012 From: stadtkind2 at gmx.de (Stefan =?iso-8859-1?Q?Kr=FCger?=) Date: Thu, 8 Nov 2012 19:44:10 +0100 Subject: hardware crypto (padlock, aesni) support for 64bit cpus Message-ID: <20121108184410.GM936@web.de> Hi, I'm using libgcrypt on a 64bit Padlock CPU and noticed hw crypto support does not kick in. After some digging I found that src/hwfeatures.c only works when __i386__ is defined (and SIZEOF_UNSIGNED_LONG is 4, which happens to be 8 on __amd64), which is not the case with 64bit AES-NI CPUs from Intel (and nowadays even AMD) and newer chips from Via. Sad thing is, I'm not a programmer but I could test a patch on a 64bit Via Nano CPU (and maybe even 64bit AMD CPUs with AES support) if someone else feels free to do it. Regards, Stefan From funman at videolan.org Fri Nov 9 11:53:52 2012 From: funman at videolan.org (=?ISO-8859-1?Q?Rafa=EBl_Carr=E9?=) Date: Fri, 09 Nov 2012 11:53:52 +0100 Subject: hardware crypto (padlock, aesni) support for 64bit cpus In-Reply-To: <20121108184410.GM936@web.de> References: <20121108184410.GM936@web.de> Message-ID: <509CE0C0.70206@videolan.org> Hello, Le 2012-11-08 19:44, Stefan Kr?ger a ?crit : > Hi, > > I'm using libgcrypt on a 64bit Padlock CPU and noticed hw crypto support does > not kick in. > > After some digging I found that src/hwfeatures.c only works when __i386__ is > defined (and SIZEOF_UNSIGNED_LONG is 4, which happens to be 8 on __amd64), > which is not the case with 64bit AES-NI CPUs from Intel (and nowadays even > AMD) and newer chips from Via. > > Sad thing is, I'm not a programmer but I could test a patch on a 64bit Via > Nano CPU (and maybe even 64bit AMD CPUs with AES support) if someone else > feels free to do it. http://lists.gnupg.org/pipermail/gcrypt-devel/2012-April/001944.html should work, I tested it on a Nano as well. From wk at gnupg.org Fri Nov 9 13:10:40 2012 From: wk at gnupg.org (Werner Koch) Date: Fri, 09 Nov 2012 13:10:40 +0100 Subject: hardware crypto (padlock, aesni) support for 64bit cpus In-Reply-To: <509CE0C0.70206@videolan.org> (=?utf-8?Q?=22Rafa=C3=ABl_Carr?= =?utf-8?Q?=C3=A9=22's?= message of "Fri, 09 Nov 2012 11:53:52 +0100") References: <20121108184410.GM936@web.de> <509CE0C0.70206@videolan.org> Message-ID: <87d2znrn7j.fsf@vigenere.g10code.de> On Fri, 9 Nov 2012 11:53, funman at videolan.org said: > http://lists.gnupg.org/pipermail/gcrypt-devel/2012-April/001944.html > should work, I tested it on a Nano as well. I plan to do a 1.5 maintenance release. Shall I backport your fix to 1.5.1? Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From funman at videolan.org Fri Nov 9 15:05:28 2012 From: funman at videolan.org (=?UTF-8?B?UmFmYcOrbCBDYXJyw6k=?=) Date: Fri, 9 Nov 2012 15:05:28 +0100 Subject: hardware crypto (padlock, aesni) support for 64bit cpus In-Reply-To: <87d2znrn7j.fsf@vigenere.g10code.de> References: <20121108184410.GM936@web.de> <509CE0C0.70206@videolan.org> <87d2znrn7j.fsf@vigenere.g10code.de> Message-ID: 2012/11/9 Werner Koch > On Fri, 9 Nov 2012 11:53, funman at videolan.org said: > > > http://lists.gnupg.org/pipermail/gcrypt-devel/2012-April/001944.html > > should work, I tested it on a Nano as well. > > I plan to do a 1.5 maintenance release. 
Shall I backport your fix to > 1.5.1? Why not, it would be useful. -- Rafa?l Carr? -------------- next part -------------- An HTML attachment was scrubbed... URL: From stadtkind2 at gmx.de Fri Nov 9 16:47:19 2012 From: stadtkind2 at gmx.de (Stefan =?iso-8859-1?Q?Kr=FCger?=) Date: Fri, 9 Nov 2012 16:47:19 +0100 Subject: hardware crypto (padlock, aesni) support for 64bit cpus In-Reply-To: <20121108184410.GM936@web.de> References: <20121108184410.GM936@web.de> Message-ID: <20121109154719.GP936@gmx.de> On Thu, 08 Nov 2012, Stefan Kr?ger wrote: > Hi, > > I'm using libgcrypt on a 64bit Padlock CPU and noticed hw crypto support does > not kick in. > > After some digging I found that src/hwfeatures.c only works when __i386__ is > defined (and SIZEOF_UNSIGNED_LONG is 4, which happens to be 8 on __amd64), > which is not the case with 64bit AES-NI CPUs from Intel (and nowadays even > AMD) and newer chips from Via. > > Sad thing is, I'm not a programmer but I could test a patch on a 64bit Via > Nano CPU (and maybe even 64bit AMD CPUs with AES support) if someone else > feels free to do it. > > Regards, > > Stefan Hi, sorry for replying directly to my mail. I'm not subscribed that's why I could only read your answers via mail archive. Anyway, I applied your patch from http://lists.gnupg.org/pipermail/gcrypt-devel/2012-April/001944.html and, well?, shouldn't these numbers be diffrent somehow? $ ./tests/benchmark --disable-hwf padlock-aes --cipher-repetitions 100 --alignment 16 cipher aes128 Running each test 100 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 2080ms 2090ms 1220ms 1170ms 1170ms 1160ms 2540ms 2530ms 2890ms 2900ms $ ./tests//benchmark --cipher-repetitions 100 --alignment 16 cipher aes128 Running each test 100 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 2080ms 2090ms 1220ms 1160ms 1170ms 1160ms 2530ms 2530ms 3010ms 3000ms Regards, Stefan From jussi.kivilinna at mbnet.fi Wed Nov 14 23:22:32 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 15 Nov 2012 00:22:32 +0200 Subject: DCO signature Message-ID: <20121115002232.17963o3vc5so53ok@www.dalek.fi> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Libgcrypt Developer's Certificate of Origin. Version 1.0 ========================================================= By making a contribution to the Libgcrypt project, I certify that: (a) The contribution was created in whole or in part by me and I have the right to submit it under the free software license indicated in the file; or (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate free software license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same free software license (unless I am permitted to submit under a different license), as indicated in the file; or (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the free software license(s) involved. 
Signed-off-by: Jussi Kivilinna -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQGcBAEBAgAGBQJQpAWxAAoJEAaL+yOpMWaGexoL/jdsMTiY+mh+9b5CmCLsLkiG 5R+TnjB7Ll4Ou32Uoq/LikqnISJ574lpgBGxGnwsOUQfJNauR/NtcqiAHvdXliuu U8a1BsDK/inu9hKQkPx2eQY2Sde1jW+MMNvMn0HDUm9Q1BRfJtX/vmp5WEcghWmf 7vKj5Fi4pff2oMoLvvDt/drTCJpmTM+qleflxd2SWhFbKjpY+yhClMM40xBxgC/9 9JZL5c/VzKuY9fyxpfXQ65C9gT7CN4uMaEriRb2JpUO6sQngEpzfEd62i7DVe5UJ vD/psxHbbMUAvE6+EzzHeXQxuVjG8NcgM6Q1bbUodEIwcYchi3ixtN657WSOGcFh 4jV8NEDWAQlH53/5guhBfPIqY0o6VoTnEVRim5YrDnbCTQkbPtLJosq+BuSuzZcR wDRD46+fHWrZghq136Nvft21LYma0VWRSCtet9GKrv8iYCCUE0kiFTOYHNKbeXdo eE5sS0SUZiMd1EqTDW5s/JEEy+rHt6YfwsIvqPcZkw== =cua9 -----END PGP SIGNATURE----- From jussi.kivilinna at mbnet.fi Wed Nov 14 22:11:08 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Wed, 14 Nov 2012 23:11:08 +0200 Subject: DCO signature Message-ID: <20121114231108.21383kagcjwldps0@www.dalek.fi> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Libgcrypt Developer's Certificate of Origin. Version 1.0 ========================================================= By making a contribution to the Libgcrypt project, I certify that: (a) The contribution was created in whole or in part by me and I have the right to submit it under the free software license indicated in the file; or (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate free software license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same free software license (unless I am permitted to submit under a different license), as indicated in the file; or (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the free software license(s) involved. Signed-off-by: Jussi Kivilinna -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQGcBAEBAgAGBQJQpAWxAAoJEAaL+yOpMWaGexoL/jdsMTiY+mh+9b5CmCLsLkiG 5R+TnjB7Ll4Ou32Uoq/LikqnISJ574lpgBGxGnwsOUQfJNauR/NtcqiAHvdXliuu U8a1BsDK/inu9hKQkPx2eQY2Sde1jW+MMNvMn0HDUm9Q1BRfJtX/vmp5WEcghWmf 7vKj5Fi4pff2oMoLvvDt/drTCJpmTM+qleflxd2SWhFbKjpY+yhClMM40xBxgC/9 9JZL5c/VzKuY9fyxpfXQ65C9gT7CN4uMaEriRb2JpUO6sQngEpzfEd62i7DVe5UJ vD/psxHbbMUAvE6+EzzHeXQxuVjG8NcgM6Q1bbUodEIwcYchi3ixtN657WSOGcFh 4jV8NEDWAQlH53/5guhBfPIqY0o6VoTnEVRim5YrDnbCTQkbPtLJosq+BuSuzZcR wDRD46+fHWrZghq136Nvft21LYma0VWRSCtet9GKrv8iYCCUE0kiFTOYHNKbeXdo eE5sS0SUZiMd1EqTDW5s/JEEy+rHt6YfwsIvqPcZkw== =cua9 -----END PGP SIGNATURE----- From jussi.kivilinna at mbnet.fi Wed Nov 14 23:30:07 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 15 Nov 2012 00:30:07 +0200 Subject: [PATCH 1/3] Fix hwdetect assembler clobbers Message-ID: <20121114223007.24626.617.stgit@localhost6.localdomain6> detect_x86_64_gnuc() and detect_ia32_gnuc() have missing clobbers in assembler statements. "%ebx" is missing in x86-64, probably because copy-paste error (i386 code saves and restores %ebx to/from stack). "%ecx" is missing from PadLock detection. 
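For illustration only (this sketch is not part of the patch, and the helper name is made up): cpuid overwrites EAX, EBX, ECX and EDX, so each of those registers that is not an output operand has to be listed as a clobber, otherwise the compiler may keep a live value there across the asm statement.

static unsigned int
cpuid_leaf1_ecx (void)
{
  unsigned int eax = 1, ecx;

  /* EAX is an in/out operand and ECX an output; EBX and EDX are
     written by CPUID but unused, so they are declared as clobbers.
     (On i386 PIC code EBX holds the GOT pointer, which is why the
     real detection code saves and restores it instead.)  */
  asm volatile ("cpuid"
                : "+a" (eax), "=c" (ecx)
                :
                : "%ebx", "%edx", "cc");
  return ecx;
}
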
Signed-off-by: Jussi Kivilinna --- src/hwfeatures.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/hwfeatures.c b/src/hwfeatures.c index cf80fe0..456c07a 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -56,7 +56,7 @@ detect_x86_64_gnuc (void) "movl %%ecx, 8(%0)\n\t" : : "S" (&vendor_id[0]) - : "%eax", "%ecx", "%edx", "cc" + : "%eax", "%ebx", "%ecx", "%edx", "cc" ); vendor_id[12] = 0; @@ -105,7 +105,7 @@ detect_x86_64_gnuc (void) ".Lready%=:\n" : "+r" (hw_features) : - : "%eax", "%edx", "cc" + : "%eax", "%ebx", "%ecx", "%edx", "cc" ); } #endif /*ENABLE_PADLOCK_SUPPORT*/ @@ -122,7 +122,7 @@ detect_x86_64_gnuc (void) ".Lno_aes%=:\n" : "+r" (hw_features) : - : "%eax", "%ecx", "%edx", "cc" + : "%eax", "%ebx", "%ecx", "%edx", "cc" ); } else if (!strcmp (vendor_id, "AuthenticAMD")) @@ -230,7 +230,7 @@ detect_ia32_gnuc (void) ".Lready%=:\n" : "+r" (hw_features) : - : "%eax", "%edx", "cc" + : "%eax", "%ecx", "%edx", "cc" ); } #endif /*ENABLE_PADLOCK_SUPPORT*/ From jussi.kivilinna at mbnet.fi Wed Nov 14 23:30:12 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 15 Nov 2012 00:30:12 +0200 Subject: [PATCH 2/3] Fix cpuid vendor-id check for i386 and x86-64 In-Reply-To: <20121114223007.24626.617.stgit@localhost6.localdomain6> References: <20121114223007.24626.617.stgit@localhost6.localdomain6> Message-ID: <20121114223012.24626.95147.stgit@localhost6.localdomain6> detect_x86_64_gnuc() and detect_ia32_gnuc() incorrectly exclude Intel features on all other vendor CPUs. What we want here, is to detect if CPU from any vendor support said Intel feature (in this case AES-NI). Signed-off-by: Jussi Kivilinna --- src/hwfeatures.c | 59 +++++++++++++++++++++++++++++------------------------- 1 file changed, 32 insertions(+), 27 deletions(-) diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 456c07a..606f3e7 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -112,24 +112,26 @@ detect_x86_64_gnuc (void) else if (!strcmp (vendor_id, "GenuineIntel")) { /* This is an Intel CPU. */ - asm volatile - ("movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ - "cpuid\n" - "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ - "jz .Lno_aes%=\n\t" /* No AES support. */ - "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ - - ".Lno_aes%=:\n" - : "+r" (hw_features) - : - : "%eax", "%ebx", "%ecx", "%edx", "cc" - ); } else if (!strcmp (vendor_id, "AuthenticAMD")) { /* This is an AMD CPU. */ - } + + /* Detect Intel features, that might be supported also by other vendors + * also. */ + asm volatile + ("movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ + "cpuid\n" + "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ + "jz .Lno_aes%=\n\t" /* No AES support. */ + "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ + + ".Lno_aes%=:\n" + : "+r" (hw_features) + : + : "%eax", "%ebx", "%ecx", "%edx", "cc" + ); } #endif /* __x86_64__ && __GNUC__ */ @@ -237,26 +239,29 @@ detect_ia32_gnuc (void) else if (!strcmp (vendor_id, "GenuineIntel")) { /* This is an Intel CPU. */ - asm volatile - ("pushl %%ebx\n\t" /* Save GOT register. */ - "movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ - "cpuid\n" - "popl %%ebx\n\t" /* Restore GOT register. */ - "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ - "jz .Lno_aes%=\n\t" /* No AES support. */ - "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ - - ".Lno_aes%=:\n" - : "+r" (hw_features) - : - : "%eax", "%ecx", "%edx", "cc" - ); } else if (!strcmp (vendor_id, "AuthenticAMD")) { /* This is an AMD CPU. 
*/ } + + /* Detect Intel features, that might be supported also by other vendors + * also. */ + asm volatile + ("pushl %%ebx\n\t" /* Save GOT register. */ + "movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ + "cpuid\n" + "popl %%ebx\n\t" /* Restore GOT register. */ + "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ + "jz .Lno_aes%=\n\t" /* No AES support. */ + "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ + + ".Lno_aes%=:\n" + : "+r" (hw_features) + : + : "%eax", "%ecx", "%edx", "cc" + ); } #endif /* __i386__ && SIZEOF_UNSIGNED_LONG == 4 && __GNUC__ */ From jussi.kivilinna at mbnet.fi Wed Nov 14 23:30:17 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 15 Nov 2012 00:30:17 +0200 Subject: [PATCH 3/3] Add x86_64 support for AES-NI In-Reply-To: <20121114223007.24626.617.stgit@localhost6.localdomain6> References: <20121114223007.24626.617.stgit@localhost6.localdomain6> Message-ID: <20121114223017.24626.14695.stgit@localhost6.localdomain6> AES-NI assembler uses %%esi for key-material pointer register. However %[key] is already register and on x86-64 is 64bit and on i386 is 32bit. So use %[key] for pointer register instead of %esi and that way make same AES-NI code work on both x86-64 and i386. Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 191 ++++++++++++++++++++++++++--------------------------- src/hwfeatures.c | 3 - 2 files changed, 92 insertions(+), 102 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index d9a95cb..15645d7 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -75,7 +75,7 @@ gcc 3. However, to be on the safe side we require at least gcc 4. */ #undef USE_AESNI #ifdef ENABLE_AESNI_SUPPORT -# if defined (__i386__) && SIZEOF_UNSIGNED_LONG == 4 && __GNUC__ >= 4 +# if ((defined (__i386__) && SIZEOF_UNSIGNED_LONG == 4) || defined(__x86_64__)) && __GNUC__ >= 4 # define USE_AESNI 1 # endif #endif /* ENABLE_AESNI_SUPPORT */ @@ -297,40 +297,38 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen) than using the standard key schedule. We disable it for now and don't put any effort into implementing this for AES-192 and AES-256. 
*/ - asm volatile ("movl %[key], %%esi\n\t" - "movdqu (%%esi), %%xmm1\n\t" /* xmm1 := key */ - "movl %[ksch], %%esi\n\t" - "movdqa %%xmm1, (%%esi)\n\t" /* ksch[0] := xmm1 */ + asm volatile ("movdqu (%[key]), %%xmm1\n\t" /* xmm1 := key */ + "movdqa %%xmm1, (%[key])\n\t" /* ksch[0] := xmm1 */ "aeskeygenassist $0x01, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x10(%%esi)\n\t" /* ksch[1] := xmm1 */ + "movdqa %%xmm1, 0x10(%[key])\n\t" /* ksch[1] := xmm1 */ "aeskeygenassist $0x02, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x20(%%esi)\n\t" /* ksch[2] := xmm1 */ + "movdqa %%xmm1, 0x20(%[key])\n\t" /* ksch[2] := xmm1 */ "aeskeygenassist $0x04, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x30(%%esi)\n\t" /* ksch[3] := xmm1 */ + "movdqa %%xmm1, 0x30(%[key])\n\t" /* ksch[3] := xmm1 */ "aeskeygenassist $0x08, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x40(%%esi)\n\t" /* ksch[4] := xmm1 */ + "movdqa %%xmm1, 0x40(%[key])\n\t" /* ksch[4] := xmm1 */ "aeskeygenassist $0x10, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x50(%%esi)\n\t" /* ksch[5] := xmm1 */ + "movdqa %%xmm1, 0x50(%[key])\n\t" /* ksch[5] := xmm1 */ "aeskeygenassist $0x20, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x60(%%esi)\n\t" /* ksch[6] := xmm1 */ + "movdqa %%xmm1, 0x60(%[key])\n\t" /* ksch[6] := xmm1 */ "aeskeygenassist $0x40, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x70(%%esi)\n\t" /* ksch[7] := xmm1 */ + "movdqa %%xmm1, 0x70(%[key])\n\t" /* ksch[7] := xmm1 */ "aeskeygenassist $0x80, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x80(%%esi)\n\t" /* ksch[8] := xmm1 */ + "movdqa %%xmm1, 0x80(%[key])\n\t" /* ksch[8] := xmm1 */ "aeskeygenassist $0x1b, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x90(%%esi)\n\t" /* ksch[9] := xmm1 */ + "movdqa %%xmm1, 0x90(%[key])\n\t" /* ksch[9] := xmm1 */ "aeskeygenassist $0x36, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0xa0(%%esi)\n\t" /* ksch[10] := xmm1 */ + "movdqa %%xmm1, 0xa0(%[key])\n\t" /* ksch[10] := xmm1 */ "jmp .Lleave%=\n" ".Lexpand128_%=:\n\t" @@ -351,7 +349,7 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen) "pxor %%xmm3, %%xmm3\n" : : [key] "g" (key), [ksch] "g" (ctx->keyschenc) - : "%esi", "cc", "memory" ); + : "cc", "memory" ); } #endif /*USE_AESNI*/ else @@ -722,40 +720,39 @@ do_aesni_enc_aligned (const RIJNDAEL_context *ctx, aligned but that is a special case. We should better implement CFB direct in asm. 
*/ asm volatile ("movdqu %[src], %%xmm0\n\t" /* xmm0 := *a */ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 @@ -764,7 +761,7 @@ do_aesni_enc_aligned (const RIJNDAEL_context *ctx, : [src] "m" (*a), [key] "r" (ctx->keyschenc), [rounds] "r" (ctx->rounds) - : "%esi", "cc", "memory"); + : "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenclast_xmm1_xmm0 } @@ -777,40 +774,39 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, #define aesdec_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xde, 0xc1\n\t" #define aesdeclast_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xdf, 0xc1\n\t" asm volatile ("movdqu %[src], %%xmm0\n\t" /* xmm0 := *a */ - "movl %[key], %%esi\n\t" - "movdqa (%%esi), %%xmm1\n\t" + "movdqa (%[key]), %%xmm1\n\t" "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), 
%%xmm1\n" ".Ldeclast%=:\n\t" aesdeclast_xmm1_xmm0 @@ -819,7 +815,7 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, : [src] "m" (*a), [key] "r" (ctx->keyschdec), [rounds] "r" (ctx->rounds) - : "%esi", "cc", "memory"); + : "cc", "memory"); #undef aesdec_xmm1_xmm0 #undef aesdeclast_xmm1_xmm0 } @@ -836,40 +832,39 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, #define aesenc_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xdc, 0xc1\n\t" #define aesenclast_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xdd, 0xc1\n\t" asm volatile ("movdqa %[iv], %%xmm0\n\t" /* xmm0 := IV */ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 @@ -889,7 +884,7 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, [key] "g" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [decrypt] "m" (decrypt_flag) - : "%esi", "cc", "memory"); + : "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenclast_xmm1_xmm0 } @@ -915,40 +910,39 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, "pshufb %[mask], %%xmm2\n\t" "movdqa %%xmm2, %[ctr]\n" /* Update CTR. 
*/ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 @@ -1012,82 +1006,81 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, "pshufb %[mask], %%xmm5\n\t" /* xmm5 := be(xmm5) */ "movdqa %%xmm5, %[ctr]\n" /* Update CTR. */ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ "pxor %%xmm1, %%xmm2\n\t" /* xmm2 ^= key[0] */ "pxor %%xmm1, %%xmm3\n\t" /* xmm3 ^= key[0] */ "pxor %%xmm1, %%xmm4\n\t" /* xmm4 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), 
%%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 606f3e7..89d7685 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -292,9 +292,6 @@ _gcry_detect_hw_features (unsigned int disabled_features) # ifdef __GNUC__ { detect_x86_64_gnuc (); - /* We don't have AESNI support for 64 bit yet. Thus we should not - announce it. */ - hw_features &= ~HWF_INTEL_AESNI; } # endif #endif From jussi.kivilinna at mbnet.fi Wed Nov 14 23:30:22 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 15 Nov 2012 00:30:22 +0200 Subject: [PATCH] Fix too large burn_stack in camellia-glue.c Message-ID: <20121114223021.24692.74702.stgit@localhost6.localdomain6> KEY_TABLE_TYPE is array type, and sizeof(KEY_TABLE_TYPE) gives full size of array. However what is wanted here is size of array argument in stack, so change sizeof(KEY_TABLE_TYPE) to sizeof(void*). This gives boost in speed for camellia cipher. On AMD Phenom II, x86-64: Before: $ tests/benchmark --cipher-repetitions 10 cipher camellia128 Running each test 10 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- CAMELLIA128 250ms 240ms 270ms 260ms 250ms 250ms 260ms 250ms 340ms 330ms After: $ tests/benchmark --cipher-repetitions 10 cipher camellia128 Running each test 10 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- CAMELLIA128 140ms 130ms 150ms 160ms 150ms 150ms 150ms 140ms 220ms 220ms Signed-off-by: Jussi Kivilinna --- cipher/camellia-glue.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index a263621..c5019d0 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -111,7 +111,7 @@ camellia_encrypt(void *c, byte *outbuf, const byte *inbuf) Camellia_EncryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); _gcry_burn_stack - (sizeof(int)+2*sizeof(unsigned char *)+sizeof(KEY_TABLE_TYPE) + (sizeof(int)+2*sizeof(unsigned char *)+sizeof(void*/*KEY_TABLE_TYPE*/) +4*sizeof(u32) +2*sizeof(u32*)+4*sizeof(u32) +2*2*sizeof(void*) /* Function calls. */ @@ -125,7 +125,7 @@ camellia_decrypt(void *c, byte *outbuf, const byte *inbuf) Camellia_DecryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); _gcry_burn_stack - (sizeof(int)+2*sizeof(unsigned char *)+sizeof(KEY_TABLE_TYPE) + (sizeof(int)+2*sizeof(unsigned char *)+sizeof(void*/*KEY_TABLE_TYPE*/) +4*sizeof(u32) +2*sizeof(u32*)+4*sizeof(u32) +2*2*sizeof(void*) /* Function calls. */ From wk at gnupg.org Thu Nov 15 14:08:03 2012 From: wk at gnupg.org (Werner Koch) Date: Thu, 15 Nov 2012 14:08:03 +0100 Subject: DCO signature In-Reply-To: <20121115002232.17963o3vc5so53ok@www.dalek.fi> (Jussi Kivilinna's message of "Thu, 15 Nov 2012 00:22:32 +0200") References: <20121115002232.17963o3vc5so53ok@www.dalek.fi> Message-ID: <871ufvc8uk.fsf@vigenere.g10code.de> Hi, I will look at your patches soon. One problem I identified is are the missing ChangeLog entries. 
I can write them this time but for future stuff please come up with your own. see doc/HACKING on how to do it. The reason we want them is to continue the history in the style of the ChangeLogs. Other comments may be written int the commit log after a tear-off line and won't show up in the ChangeLog. On Wed, 14 Nov 2012 23:22, jussi.kivilinna at mbnet.fi said: > Libgcrypt Developer's Certificate of Origin. Version 1.0 > ========================================================= I get a bad signature, does your MUA mangle them? Please try sending as attachment. Thanks, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From jussi.kivilinna at mbnet.fi Thu Nov 15 16:23:31 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 15 Nov 2012 17:23:31 +0200 Subject: DCO signature In-Reply-To: <871ufvc8uk.fsf@vigenere.g10code.de> References: <20121115002232.17963o3vc5so53ok@www.dalek.fi> <871ufvc8uk.fsf@vigenere.g10code.de> Message-ID: <20121115172331.150537dzb5i6jmy8@www.dalek.fi> Quoting Werner Koch : > Hi, > > I will look at your patches soon. One problem I identified is are the > missing ChangeLog entries. I can write them this time but for future > stuff please come up with your own. see doc/HACKING on how to do it. > The reason we want them is to continue the history in the style of the > ChangeLogs. Other comments may be written int the commit log after a > tear-off line and won't show up in the ChangeLog. Ah, I can fix them and resubmit. Also I noticed small problem with the "AES-NI support for x86-64" patch, as it breaks the currently unused AES-NI keysched part. > > On Wed, 14 Nov 2012 23:22, jussi.kivilinna at mbnet.fi said: >> Libgcrypt Developer's Certificate of Origin. Version 1.0 >> ========================================================= > > I get a bad signature, does your MUA mangle them? Please try sending as > attachment. Ok, attached. Hm, I haven't used gpg alot. Strangely, when I copy-paste this text from email client to gpg gui-tool, signature is ok. However if I copy-paste from raw-email, signature fails. -Jussi > > > Thanks, > > Werner > > -- > Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. > > > -------------- next part -------------- -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Libgcrypt Developer's Certificate of Origin. Version 1.0 ========================================================= By making a contribution to the Libgcrypt project, I certify that: (a) The contribution was created in whole or in part by me and I have the right to submit it under the free software license indicated in the file; or (b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate free software license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same free software license (unless I am permitted to submit under a different license), as indicated in the file; or (c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it. (d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the free software license(s) involved. 
Signed-off-by: Jussi Kivilinna -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQGcBAEBAgAGBQJQpAWxAAoJEAaL+yOpMWaGexoL/jdsMTiY+mh+9b5CmCLsLkiG 5R+TnjB7Ll4Ou32Uoq/LikqnISJ574lpgBGxGnwsOUQfJNauR/NtcqiAHvdXliuu U8a1BsDK/inu9hKQkPx2eQY2Sde1jW+MMNvMn0HDUm9Q1BRfJtX/vmp5WEcghWmf 7vKj5Fi4pff2oMoLvvDt/drTCJpmTM+qleflxd2SWhFbKjpY+yhClMM40xBxgC/9 9JZL5c/VzKuY9fyxpfXQ65C9gT7CN4uMaEriRb2JpUO6sQngEpzfEd62i7DVe5UJ vD/psxHbbMUAvE6+EzzHeXQxuVjG8NcgM6Q1bbUodEIwcYchi3ixtN657WSOGcFh 4jV8NEDWAQlH53/5guhBfPIqY0o6VoTnEVRim5YrDnbCTQkbPtLJosq+BuSuzZcR wDRD46+fHWrZghq136Nvft21LYma0VWRSCtet9GKrv8iYCCUE0kiFTOYHNKbeXdo eE5sS0SUZiMd1EqTDW5s/JEEy+rHt6YfwsIvqPcZkw== =cua9 -----END PGP SIGNATURE----- From jussi.kivilinna at mbnet.fi Fri Nov 16 09:44:44 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 16 Nov 2012 10:44:44 +0200 Subject: [PATCH v2 1/3] Fix hwdetect assembler clobbers Message-ID: <20121116084444.17558.55230.stgit@localhost6.localdomain6> * src/hwfeatures.c (detect_x86_64_gnuc): Add missing %ebx assembler clobbers. (detect_x86_64_gnuc, detect_ia32_gnuc) [ENABLE_PADLOCK_SUPPORT]: Add missing %ecx assembler clobbers. -- detect_x86_64_gnuc() and detect_ia32_gnuc() have missing clobbers in assembler statements. "%ebx" is missing in x86-64, probably because copy-paste error (i386 code saves and restores %ebx to/from stack). "%ecx" is missing from PadLock detection. [v2] - add GNU style changelog Signed-off-by: Jussi Kivilinna --- src/hwfeatures.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/hwfeatures.c b/src/hwfeatures.c index cf80fe0..456c07a 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -56,7 +56,7 @@ detect_x86_64_gnuc (void) "movl %%ecx, 8(%0)\n\t" : : "S" (&vendor_id[0]) - : "%eax", "%ecx", "%edx", "cc" + : "%eax", "%ebx", "%ecx", "%edx", "cc" ); vendor_id[12] = 0; @@ -105,7 +105,7 @@ detect_x86_64_gnuc (void) ".Lready%=:\n" : "+r" (hw_features) : - : "%eax", "%edx", "cc" + : "%eax", "%ebx", "%ecx", "%edx", "cc" ); } #endif /*ENABLE_PADLOCK_SUPPORT*/ @@ -122,7 +122,7 @@ detect_x86_64_gnuc (void) ".Lno_aes%=:\n" : "+r" (hw_features) : - : "%eax", "%ecx", "%edx", "cc" + : "%eax", "%ebx", "%ecx", "%edx", "cc" ); } else if (!strcmp (vendor_id, "AuthenticAMD")) @@ -230,7 +230,7 @@ detect_ia32_gnuc (void) ".Lready%=:\n" : "+r" (hw_features) : - : "%eax", "%edx", "cc" + : "%eax", "%ecx", "%edx", "cc" ); } #endif /*ENABLE_PADLOCK_SUPPORT*/ From jussi.kivilinna at mbnet.fi Fri Nov 16 09:44:49 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 16 Nov 2012 10:44:49 +0200 Subject: [PATCH v2 2/3] Fix cpuid vendor-id check for i386 and x86-64 In-Reply-To: <20121116084444.17558.55230.stgit@localhost6.localdomain6> References: <20121116084444.17558.55230.stgit@localhost6.localdomain6> Message-ID: <20121116084449.17558.40397.stgit@localhost6.localdomain6> * src/hwfeatures.c (detect_x86_64_gnuc, detect_ia32_gnuc): Allow Intel features be detect from CPU by other vendors too. -- detect_x86_64_gnuc() and detect_ia32_gnuc() incorrectly exclude Intel features on all other vendor CPUs. What we want here, is to detect if CPU from any vendor support said Intel feature (in this case AES-NI). 
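As a stand-alone sketch of the intended behaviour (not part of the patch; it assumes a GCC-compatible compiler that provides <cpuid.h>), the test reduces to reading bit 25 of ECX from CPUID leaf 1 and ignoring the vendor string altogether:

#include <cpuid.h>

/* Return non-zero if the CPU advertises AES-NI, whatever the vendor.  */
static int
has_aesni (void)
{
  unsigned int eax, ebx, ecx, edx;

  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return 0;                   /* CPUID leaf 1 not available.  */
  return (ecx & (1U << 25)) != 0;
}
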
[v2] - Add GNU style changelog Signed-off-by: Jussi Kivilinna --- src/hwfeatures.c | 59 +++++++++++++++++++++++++++++------------------------- 1 file changed, 32 insertions(+), 27 deletions(-) diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 456c07a..606f3e7 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -112,24 +112,26 @@ detect_x86_64_gnuc (void) else if (!strcmp (vendor_id, "GenuineIntel")) { /* This is an Intel CPU. */ - asm volatile - ("movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ - "cpuid\n" - "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ - "jz .Lno_aes%=\n\t" /* No AES support. */ - "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ - - ".Lno_aes%=:\n" - : "+r" (hw_features) - : - : "%eax", "%ebx", "%ecx", "%edx", "cc" - ); } else if (!strcmp (vendor_id, "AuthenticAMD")) { /* This is an AMD CPU. */ - } + + /* Detect Intel features, that might be supported also by other vendors + * also. */ + asm volatile + ("movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ + "cpuid\n" + "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ + "jz .Lno_aes%=\n\t" /* No AES support. */ + "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ + + ".Lno_aes%=:\n" + : "+r" (hw_features) + : + : "%eax", "%ebx", "%ecx", "%edx", "cc" + ); } #endif /* __x86_64__ && __GNUC__ */ @@ -237,26 +239,29 @@ detect_ia32_gnuc (void) else if (!strcmp (vendor_id, "GenuineIntel")) { /* This is an Intel CPU. */ - asm volatile - ("pushl %%ebx\n\t" /* Save GOT register. */ - "movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ - "cpuid\n" - "popl %%ebx\n\t" /* Restore GOT register. */ - "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ - "jz .Lno_aes%=\n\t" /* No AES support. */ - "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ - - ".Lno_aes%=:\n" - : "+r" (hw_features) - : - : "%eax", "%ecx", "%edx", "cc" - ); } else if (!strcmp (vendor_id, "AuthenticAMD")) { /* This is an AMD CPU. */ } + + /* Detect Intel features, that might be supported also by other vendors + * also. */ + asm volatile + ("pushl %%ebx\n\t" /* Save GOT register. */ + "movl $1, %%eax\n\t" /* Get CPU info and feature flags. */ + "cpuid\n" + "popl %%ebx\n\t" /* Restore GOT register. */ + "testl $0x02000000, %%ecx\n\t" /* Test bit 25. */ + "jz .Lno_aes%=\n\t" /* No AES support. */ + "orl $256, %0\n" /* Set our HWF_INTEL_AES bit. */ + + ".Lno_aes%=:\n" + : "+r" (hw_features) + : + : "%eax", "%ecx", "%edx", "cc" + ); } #endif /* __i386__ && SIZEOF_UNSIGNED_LONG == 4 && __GNUC__ */ From jussi.kivilinna at mbnet.fi Fri Nov 16 09:44:54 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 16 Nov 2012 10:44:54 +0200 Subject: [PATCH v2 3/3] Add x86_64 support for AES-NI In-Reply-To: <20121116084444.17558.55230.stgit@localhost6.localdomain6> References: <20121116084444.17558.55230.stgit@localhost6.localdomain6> Message-ID: <20121116084454.17558.47724.stgit@localhost6.localdomain6> * cipher/rijndael.c [ENABLE_AESNI_SUPPORT]: Enable USE_AESNI on x86-64. (do_setkey) [USE_AESNI_is_disabled_here]: Use %[key] and %[ksch] directly as registers instead of using temporary register %%esi. [USE_AESNI] (do_aesni_enc_aligned, do_aesni_dec_aligned, do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Use %[key] directly as register instead of using temporary register %%esi. [USE_AESNI] (do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Change %[key] from generic "g" type to register "r". * src/hwfeatures.c (_gcry_detect_hw_features) [__x86_64__]: Do not clear AES-NI feature flag. 
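A minimal sketch of the technique described below (not taken from the patch; the function and operand names are made up): with an "r" constraint the compiler picks a register of natural pointer width, so the same asm text works with a 64-bit register on x86-64 and a 32-bit one on i386, without hard-coding %esi.

/* XOR one 16-byte block in place with the first round key; illustrative only.  */
static void
xor_block_with_key0 (unsigned char *blk, const unsigned char *keysched)
{
  asm volatile ("movdqu (%[src]), %%xmm0\n\t"
                "movdqu (%[key]), %%xmm1\n\t"
                "pxor   %%xmm1, %%xmm0\n\t"
                "movdqu %%xmm0, (%[src])\n\t"
                : /* output is written through memory */
                : [src] "r" (blk), [key] "r" (keysched)
                : "%xmm0", "%xmm1", "memory");
}
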
-- AES-NI assembler uses %%esi for key-material pointer register. However %[key] can be marked as "r" (register) and automatically be 64bit on x86-64 and be 32bit on i386. So use %[key] for pointer register instead of %esi and that way make same AES-NI code work on both x86-64 and i386. [v2] - Add GNU style changelog - Fixed do_setkey changes, use %[ksch] for output instead of %[key] - Changed [key] assembler arguments from "g" to "r" to force use of registers in all cases (when tested v1, "g" did work as indented and %[key] mapped to register on i386 and x86-64, but that might not happen always). Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 199 ++++++++++++++++++++++++++--------------------------- src/hwfeatures.c | 3 - 2 files changed, 96 insertions(+), 106 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index d9a95cb..1d2681c 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -75,7 +75,7 @@ gcc 3. However, to be on the safe side we require at least gcc 4. */ #undef USE_AESNI #ifdef ENABLE_AESNI_SUPPORT -# if defined (__i386__) && SIZEOF_UNSIGNED_LONG == 4 && __GNUC__ >= 4 +# if ((defined (__i386__) && SIZEOF_UNSIGNED_LONG == 4) || defined(__x86_64__)) && __GNUC__ >= 4 # define USE_AESNI 1 # endif #endif /* ENABLE_AESNI_SUPPORT */ @@ -297,40 +297,38 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen) than using the standard key schedule. We disable it for now and don't put any effort into implementing this for AES-192 and AES-256. */ - asm volatile ("movl %[key], %%esi\n\t" - "movdqu (%%esi), %%xmm1\n\t" /* xmm1 := key */ - "movl %[ksch], %%esi\n\t" - "movdqa %%xmm1, (%%esi)\n\t" /* ksch[0] := xmm1 */ + asm volatile ("movdqu (%[key]), %%xmm1\n\t" /* xmm1 := key */ + "movdqa %%xmm1, (%[ksch])\n\t" /* ksch[0] := xmm1 */ "aeskeygenassist $0x01, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x10(%%esi)\n\t" /* ksch[1] := xmm1 */ + "movdqa %%xmm1, 0x10(%[ksch])\n\t" /* ksch[1] := xmm1 */ "aeskeygenassist $0x02, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x20(%%esi)\n\t" /* ksch[2] := xmm1 */ + "movdqa %%xmm1, 0x20(%[ksch])\n\t" /* ksch[2] := xmm1 */ "aeskeygenassist $0x04, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x30(%%esi)\n\t" /* ksch[3] := xmm1 */ + "movdqa %%xmm1, 0x30(%[ksch])\n\t" /* ksch[3] := xmm1 */ "aeskeygenassist $0x08, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x40(%%esi)\n\t" /* ksch[4] := xmm1 */ + "movdqa %%xmm1, 0x40(%[ksch])\n\t" /* ksch[4] := xmm1 */ "aeskeygenassist $0x10, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x50(%%esi)\n\t" /* ksch[5] := xmm1 */ + "movdqa %%xmm1, 0x50(%[ksch])\n\t" /* ksch[5] := xmm1 */ "aeskeygenassist $0x20, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x60(%%esi)\n\t" /* ksch[6] := xmm1 */ + "movdqa %%xmm1, 0x60(%[ksch])\n\t" /* ksch[6] := xmm1 */ "aeskeygenassist $0x40, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x70(%%esi)\n\t" /* ksch[7] := xmm1 */ + "movdqa %%xmm1, 0x70(%[ksch])\n\t" /* ksch[7] := xmm1 */ "aeskeygenassist $0x80, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x80(%%esi)\n\t" /* ksch[8] := xmm1 */ + "movdqa %%xmm1, 0x80(%[ksch])\n\t" /* ksch[8] := xmm1 */ "aeskeygenassist $0x1b, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - "movdqa %%xmm1, 0x90(%%esi)\n\t" /* ksch[9] := xmm1 */ + "movdqa %%xmm1, 0x90(%[ksch])\n\t" /* ksch[9] := xmm1 */ "aeskeygenassist $0x36, %%xmm1, %%xmm2\n\t" "call .Lexpand128_%=\n\t" - 
"movdqa %%xmm1, 0xa0(%%esi)\n\t" /* ksch[10] := xmm1 */ + "movdqa %%xmm1, 0xa0(%[ksch])\n\t" /* ksch[10] := xmm1 */ "jmp .Lleave%=\n" ".Lexpand128_%=:\n\t" @@ -350,8 +348,8 @@ do_setkey (RIJNDAEL_context *ctx, const byte *key, const unsigned keylen) "pxor %%xmm2, %%xmm2\n\t" "pxor %%xmm3, %%xmm3\n" : - : [key] "g" (key), [ksch] "g" (ctx->keyschenc) - : "%esi", "cc", "memory" ); + : [key] "r" (key), [ksch] "r" (ctx->keyschenc) + : "cc", "memory" ); } #endif /*USE_AESNI*/ else @@ -722,40 +720,39 @@ do_aesni_enc_aligned (const RIJNDAEL_context *ctx, aligned but that is a special case. We should better implement CFB direct in asm. */ asm volatile ("movdqu %[src], %%xmm0\n\t" /* xmm0 := *a */ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 @@ -764,7 +761,7 @@ do_aesni_enc_aligned (const RIJNDAEL_context *ctx, : [src] "m" (*a), [key] "r" (ctx->keyschenc), [rounds] "r" (ctx->rounds) - : "%esi", "cc", "memory"); + : "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenclast_xmm1_xmm0 } @@ -777,40 +774,39 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, #define aesdec_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xde, 0xc1\n\t" #define aesdeclast_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xdf, 0xc1\n\t" asm volatile ("movdqu %[src], %%xmm0\n\t" /* xmm0 := *a */ - "movl %[key], %%esi\n\t" - "movdqa (%%esi), %%xmm1\n\t" + "movdqa (%[key]), %%xmm1\n\t" "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + 
"movdqa 0x80(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Ldeclast%=:\n\t" aesdeclast_xmm1_xmm0 @@ -819,7 +815,7 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, : [src] "m" (*a), [key] "r" (ctx->keyschdec), [rounds] "r" (ctx->rounds) - : "%esi", "cc", "memory"); + : "cc", "memory"); #undef aesdec_xmm1_xmm0 #undef aesdeclast_xmm1_xmm0 } @@ -836,40 +832,39 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, #define aesenc_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xdc, 0xc1\n\t" #define aesenclast_xmm1_xmm0 ".byte 0x66, 0x0f, 0x38, 0xdd, 0xc1\n\t" asm volatile ("movdqa %[iv], %%xmm0\n\t" /* xmm0 := IV */ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 @@ -886,10 +881,10 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, "movdqu %%xmm0, %[dst]\n" /* Store output. */ : [iv] "+m" (*iv), [dst] "=m" (*b) : [src] "m" (*a), - [key] "g" (ctx->keyschenc), + [key] "r" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [decrypt] "m" (decrypt_flag) - : "%esi", "cc", "memory"); + : "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenclast_xmm1_xmm0 } @@ -915,40 +910,39 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, "pshufb %[mask], %%xmm2\n\t" "movdqa %%xmm2, %[ctr]\n" /* Update CTR. 
*/ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 @@ -958,7 +952,7 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, : [ctr] "+m" (*ctr), [dst] "=m" (*b) : [src] "m" (*a), - [key] "g" (ctx->keyschenc), + [key] "r" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [mask] "m" (*be_mask) : "%esi", "cc", "memory"); @@ -1012,82 +1006,81 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, "pshufb %[mask], %%xmm5\n\t" /* xmm5 := be(xmm5) */ "movdqa %%xmm5, %[ctr]\n" /* Update CTR. 
*/ - "movl %[key], %%esi\n\t" /* esi := keyschenc */ - "movdqa (%%esi), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ "pxor %%xmm1, %%xmm2\n\t" /* xmm2 ^= key[0] */ "pxor %%xmm1, %%xmm3\n\t" /* xmm3 ^= key[0] */ "pxor %%xmm1, %%xmm4\n\t" /* xmm4 ^= key[0] */ - "movdqa 0x10(%%esi), %%xmm1\n\t" + "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x20(%%esi), %%xmm1\n\t" + "movdqa 0x20(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x30(%%esi), %%xmm1\n\t" + "movdqa 0x30(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x40(%%esi), %%xmm1\n\t" + "movdqa 0x40(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x50(%%esi), %%xmm1\n\t" + "movdqa 0x50(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x60(%%esi), %%xmm1\n\t" + "movdqa 0x60(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x70(%%esi), %%xmm1\n\t" + "movdqa 0x70(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x80(%%esi), %%xmm1\n\t" + "movdqa 0x80(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0x90(%%esi), %%xmm1\n\t" + "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xa0(%%esi), %%xmm1\n\t" + "movdqa 0xa0(%[key]), %%xmm1\n\t" "cmp $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xb0(%%esi), %%xmm1\n\t" + "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xc0(%%esi), %%xmm1\n\t" + "movdqa 0xc0(%[key]), %%xmm1\n\t" "cmp $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xd0(%%esi), %%xmm1\n\t" + "movdqa 0xd0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 - "movdqa 0xe0(%%esi), %%xmm1\n" + "movdqa 0xe0(%[key]), %%xmm1\n" ".Lenclast%=:\n\t" aesenclast_xmm1_xmm0 @@ -1113,7 +1106,7 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, : [ctr] "+m" (*ctr), [dst] "=m" (*b) : [src] "m" (*a), - [key] "g" (ctx->keyschenc), + [key] "r" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [mask] "m" (*be_mask) : "%esi", "cc", "memory"); diff --git a/src/hwfeatures.c b/src/hwfeatures.c index 606f3e7..89d7685 100644 --- a/src/hwfeatures.c +++ b/src/hwfeatures.c @@ -292,9 +292,6 @@ _gcry_detect_hw_features (unsigned int disabled_features) # ifdef __GNUC__ { detect_x86_64_gnuc (); - /* We don't have AESNI support for 64 bit yet. Thus we should not - announce it. */ - hw_features &= ~HWF_INTEL_AESNI; } # endif #endif From jussi.kivilinna at mbnet.fi Fri Nov 16 09:45:43 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 16 Nov 2012 10:45:43 +0200 Subject: [PATCH v2] Fix too large burn_stack in camellia-glue.c Message-ID: <20121116084543.18010.1650.stgit@localhost6.localdomain6> * cipher/camellia-glue.c (camellia_encrypt, camellia_decrypt): Do not take full array size of KEY_TABLE_TYPE, but argument size instead. -- KEY_TABLE_TYPE is array type, and sizeof(KEY_TABLE_TYPE) gives full size of array. 
However what is wanted here is size of array argument in stack, so change sizeof(KEY_TABLE_TYPE) to sizeof(void*). This gives boost in speed for camellia cipher. On AMD Phenom II, x86-64: Before: $ tests/benchmark --cipher-repetitions 10 cipher camellia128 Running each test 10 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- CAMELLIA128 250ms 240ms 270ms 260ms 250ms 250ms 260ms 250ms 340ms 330ms After: $ tests/benchmark --cipher-repetitions 10 cipher camellia128 Running each test 10 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- CAMELLIA128 140ms 130ms 150ms 160ms 150ms 150ms 150ms 140ms 220ms 220ms [v2] - Add GNU style changelog Signed-off-by: Jussi Kivilinna --- cipher/camellia-glue.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/cipher/camellia-glue.c b/cipher/camellia-glue.c index a263621..c5019d0 100644 --- a/cipher/camellia-glue.c +++ b/cipher/camellia-glue.c @@ -111,7 +111,7 @@ camellia_encrypt(void *c, byte *outbuf, const byte *inbuf) Camellia_EncryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); _gcry_burn_stack - (sizeof(int)+2*sizeof(unsigned char *)+sizeof(KEY_TABLE_TYPE) + (sizeof(int)+2*sizeof(unsigned char *)+sizeof(void*/*KEY_TABLE_TYPE*/) +4*sizeof(u32) +2*sizeof(u32*)+4*sizeof(u32) +2*2*sizeof(void*) /* Function calls. */ @@ -125,7 +125,7 @@ camellia_decrypt(void *c, byte *outbuf, const byte *inbuf) Camellia_DecryptBlock(ctx->keybitlength,inbuf,ctx->keytable,outbuf); _gcry_burn_stack - (sizeof(int)+2*sizeof(unsigned char *)+sizeof(KEY_TABLE_TYPE) + (sizeof(int)+2*sizeof(unsigned char *)+sizeof(void*/*KEY_TABLE_TYPE*/) +4*sizeof(u32) +2*sizeof(u32*)+4*sizeof(u32) +2*2*sizeof(void*) /* Function calls. */ From chris.adamson at mcri.edu.au Mon Nov 19 03:50:03 2012 From: chris.adamson at mcri.edu.au (Chris Adamson) Date: Mon, 19 Nov 2012 13:50:03 +1100 Subject: [PATCH v2 3/3] Add x86_64 support for AES-NI Message-ID: <1353293403.24127.3.camel@mcri-i0054.rch.unimelb.edu.au> Hi Jussi, Has the AESNI patch gone into the git version yet? Chris. -- Dr Christopher Adamson, PhD (Melb.), B Software Engineering (Hons., Monash) Research Officer Developmental Imaging, Critical Care and Neurosciences Murdoch Childrens Research Institute The Royal Children?s Hospital Flemington Road Parkville Victoria 3052 Australia T 9906 6780 M XXXX XXX XXX E chris.adamson at mcri.edu.au www.mcri.edu.au ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ From wk at gnupg.org Wed Nov 21 13:02:35 2012 From: wk at gnupg.org (Werner Koch) Date: Wed, 21 Nov 2012 13:02:35 +0100 Subject: DCO signature In-Reply-To: <20121115172331.150537dzb5i6jmy8@www.dalek.fi> (Jussi Kivilinna's message of "Thu, 15 Nov 2012 17:23:31 +0200") References: <20121115002232.17963o3vc5so53ok@www.dalek.fi> <871ufvc8uk.fsf@vigenere.g10code.de> <20121115172331.150537dzb5i6jmy8@www.dalek.fi> Message-ID: <87d2z7i2p0.fsf@vigenere.g10code.de> On Thu, 15 Nov 2012 16:23, jussi.kivilinna at mbnet.fi said: > Ah, I can fix them and resubmit. Also I noticed small problem with the > "AES-NI support for x86-64" patch, as it breaks the currently unused > AES-NI keysched part. Thanks. I pushed all your patches. 
One modification I did is to disable the AES_NI detection if the support is disabled by configure. Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From chris.adamson at mcri.edu.au Fri Nov 23 00:35:10 2012 From: chris.adamson at mcri.edu.au (Chris Adamson) Date: Fri, 23 Nov 2012 10:35:10 +1100 Subject: AES-NI + compression Message-ID: <1353627310.27796.15.camel@mcri-i0054.rch.unimelb.edu.au> Hello list, I am very keen for when AES-NI is implemented in a released 64-bit version of libgcrypt. I decided to test the i386 AES-NI and compare it to the i386 software implementation as well as the x86_64 software implementation. I also tested the effect of adding compression, which is important to me since I'm using gpg for backup. I took 895M of fairly compressible DICOM data in a tar file (bz2 compresses to 168M) and ran gpg2 on a i7-980. The table below shows seconds of CPU time. I included some non AES ciphers as well. The 64-bit software version performs better than the i386 software version in all cases of no compression, not surprising. But for AES the 64-bit software version outperforms the i386 AES-NI implementation when compression is used. At high levels of compression AES-NI has little effect for the i386 version. i.e. the compression part must also be faster in the 64-bit version, which is unexpected. My immediate questions: i386 AES-NI gives a 50% reduction when compared to the i386 software version, is this expected or should it be a greater reduction? I did see some x86_64 AES-NI patches released on the list, will these be put into a released version or backported? Thank you. 32-bit 32-bit 64-bit AES-NI ON AES-NI OFF AES-NI NOT SUPPORTED no compression no compression no compression CAST5 18.01 CAST5 17.80 CAST5 16.46 BLOWFISH 18.42 BLOWFISH 18.28 BLOWFISH 17.59 AES 6.60 AES 14.35 AES 9.89 AES192 6.81 AES192 16.10 AES192 11.22 AES256 6.96 AES256 17.16 AES256 12.54 TWOFISH 15.52 TWOFISH 15.57 TWOFISH 11.47 zlib 1 zlib 1 zlib 1 CAST5 20.34 CAST5 20.08 CAST5 17.04 BLOWFISH 20.65 BLOWFISH 20.26 BLOWFISH 17.42 AES 16.98 AES 18.95 AES 15.01 AES192 16.78 AES192 19.76 AES192 15.38 AES256 16.85 AES256 20.08 AES256 15.79 TWOFISH 19.55 TWOFISH 19.69 TWOFISH 15.45 zlib 3 zlib 3 zlib 3 CAST5 25.62 CAST5 25.64 CAST5 21.27 BLOWFISH 25.70 BLOWFISH 25.44 BLOWFISH 21.92 AES 22.10 AES 24.06 AES 19.51 AES192 22.01 AES192 24.84 AES192 19.87 AES256 22.21 AES256 25.41 AES256 20.40 TWOFISH 25.04 TWOFISH 24.88 TWOFISH 19.96 zlib 6 zlib 6 zlib 6 CAST5 52.85 CAST5 52.84 CAST5 45.97 BLOWFISH 53.37 BLOWFISH 52.75 BLOWFISH 44.66 AES 50.57 AES 51.54 AES 43.27 AES192 49.92 AES192 51.95 AES192 42.88 AES256 49.76 AES256 53.19 AES256 43.11 TWOFISH 52.29 TWOFISH 52.08 TWOFISH 42.82 zlib 9 zlib 9 zlib 9 CAST5 181.82 CAST5 180.99 CAST5 146.81 BLOWFISH 181.42 BLOWFISH 180.89 BLOWFISH 147.51 AES 177.44 AES 182.09 AES 146.40 AES192 178.21 AES192 181.00 AES192 156.41 AES256 182.53 AES256 180.27 AES256 147.48 TWOFISH 179.54 TWOFISH 179.92 TWOFISH 147.64 -- Dr Christopher Adamson, PhD (Melb.), B Software Engineering (Hons., Monash) Research Officer Developmental Imaging, Critical Care and Neurosciences Murdoch Childrens Research Institute The Royal Children?s Hospital Flemington Road Parkville Victoria 3052 Australia T 9906 6780 M XXXX XXX XXX E chris.adamson at mcri.edu.au www.mcri.edu.au ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. 
For more information please visit http://www.symanteccloud.com ______________________________________________________________________ From jussi.kivilinna at mbnet.fi Fri Nov 23 18:21:54 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:21:54 +0200 Subject: [PATCH 01/10] Extend test of chained modes for 128bit ciphers Message-ID: <20121123172154.1410.35771.stgit@localhost6.localdomain6> * tests/basic.c (check_one_cipher_core, check_one_cipher): Increase input and output buffer sizes from 16 bytes to 1024+16=1040 bytes. (check_one_cipher_core): Add asserts to verify sizes of temporary buffers. -- Currently check_one_cipher() has buffer size of 16 bytes, which is one block with 128bit cipher. As result chained modes for 128bit ciphers are not well tested. Increase buffer size to 1040 bytes, so that iterations of chained modes and parallellized code paths (AES-NI CTR, etc) are also tested. Extra 16 bytes after 1024 bytes to ensure that loop transision from parallelized code paths to serialized code paths get tested too. Signed-off-by: Jussi Kivilinna --- tests/basic.c | 45 ++++++++++++++++++++++++++++----------------- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/tests/basic.c b/tests/basic.c index 8001e86..656d76c 100644 --- a/tests/basic.c +++ b/tests/basic.c @@ -1367,13 +1367,15 @@ check_one_cipher_core (int algo, int mode, int flags, int bufshift, int pass) { gcry_cipher_hd_t hd; - unsigned char in_buffer[17], out_buffer[17]; + unsigned char in_buffer[1040+1], out_buffer[1040+1]; unsigned char *in, *out; int keylen; gcry_error_t err = 0; assert (nkey == 32); - assert (nplain == 16); + assert (nplain == 1040); + assert (sizeof(in_buffer) == nplain + 1); + assert (sizeof(out_buffer) == sizeof(in_buffer)); if (!bufshift) { @@ -1427,7 +1429,7 @@ check_one_cipher_core (int algo, int mode, int flags, return -1; } - err = gcry_cipher_encrypt (hd, out, 16, plain, 16); + err = gcry_cipher_encrypt (hd, out, nplain, plain, nplain); if (err) { fail ("pass %d, algo %d, mode %d, gcry_cipher_encrypt failed: %s\n", @@ -1438,7 +1440,7 @@ check_one_cipher_core (int algo, int mode, int flags, gcry_cipher_reset (hd); - err = gcry_cipher_decrypt (hd, in, 16, out, 16); + err = gcry_cipher_decrypt (hd, in, nplain, out, nplain); if (err) { fail ("pass %d, algo %d, mode %d, gcry_cipher_decrypt failed: %s\n", @@ -1447,15 +1449,15 @@ check_one_cipher_core (int algo, int mode, int flags, return -1; } - if (memcmp (plain, in, 16)) + if (memcmp (plain, in, nplain)) fail ("pass %d, algo %d, mode %d, encrypt-decrypt mismatch\n", pass, algo, mode); /* Again, using in-place encryption. 
*/ gcry_cipher_reset (hd); - memcpy (out, plain, 16); - err = gcry_cipher_encrypt (hd, out, 16, NULL, 0); + memcpy (out, plain, nplain); + err = gcry_cipher_encrypt (hd, out, nplain, NULL, 0); if (err) { fail ("pass %d, algo %d, mode %d, in-place, gcry_cipher_encrypt failed:" @@ -1467,7 +1469,7 @@ check_one_cipher_core (int algo, int mode, int flags, gcry_cipher_reset (hd); - err = gcry_cipher_decrypt (hd, out, 16, NULL, 0); + err = gcry_cipher_decrypt (hd, out, nplain, NULL, 0); if (err) { fail ("pass %d, algo %d, mode %d, in-place, gcry_cipher_decrypt failed:" @@ -1477,7 +1479,7 @@ check_one_cipher_core (int algo, int mode, int flags, return -1; } - if (memcmp (plain, out, 16)) + if (memcmp (plain, out, nplain)) fail ("pass %d, algo %d, mode %d, in-place, encrypt-decrypt mismatch\n", pass, algo, mode); @@ -1492,34 +1494,43 @@ check_one_cipher_core (int algo, int mode, int flags, static void check_one_cipher (int algo, int mode, int flags) { - char key[33]; - unsigned char plain[17]; - int bufshift; + char key[32+1]; + unsigned char plain[1040+1]; + int bufshift, i; for (bufshift=0; bufshift < 4; bufshift++) { /* Pass 0: Standard test. */ memcpy (key, "0123456789abcdef.,;/[]{}-=ABCDEF", 32); memcpy (plain, "foobar42FOOBAR17", 16); - if (check_one_cipher_core (algo, mode, flags, key, 32, plain, 16, + for (i = 16; i < 1040; i += 16) + { + memcpy (&plain[i], &plain[i-16], 16); + if (!++plain[i+7]) + plain[i+6]++; + if (!++plain[i+15]) + plain[i+14]++; + } + + if (check_one_cipher_core (algo, mode, flags, key, 32, plain, 1040, bufshift, 0+10*bufshift)) return; /* Pass 1: Key not aligned. */ memmove (key+1, key, 32); - if (check_one_cipher_core (algo, mode, flags, key+1, 32, plain, 16, + if (check_one_cipher_core (algo, mode, flags, key+1, 32, plain, 1040, bufshift, 1+10*bufshift)) return; /* Pass 2: Key not aligned and data not aligned. */ - memmove (plain+1, plain, 16); - if (check_one_cipher_core (algo, mode, flags, key+1, 32, plain+1, 16, + memmove (plain+1, plain, 1024); + if (check_one_cipher_core (algo, mode, flags, key+1, 32, plain+1, 1040, bufshift, 2+10*bufshift)) return; /* Pass 3: Key aligned and data not aligned. */ memmove (key, key+1, 32); - if (check_one_cipher_core (algo, mode, flags, key, 32, plain+1, 16, + if (check_one_cipher_core (algo, mode, flags, key, 32, plain+1, 1040, bufshift, 3+10*bufshift)) return; } From jussi.kivilinna at mbnet.fi Fri Nov 23 18:21:59 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:21:59 +0200 Subject: [PATCH 02/10] Improve parallelizability of CBC decryption for AES-NI In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172159.1410.6927.stgit@localhost6.localdomain6> * cipher/rijndael.c (_gcry_aes_cbc_dec) [USE_AESNI]: Add AES-NI specific CBC mode loop with temporary block and IV stored in free SSE registers. -- Benchmark results on Intel Core i5-2450M (x86-64) show ~2.5x improvement: Before: $ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256 Running each test 1000 times. 
ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 690ms 780ms 2940ms 2110ms 1880ms 670ms 2250ms 2250ms 490ms 500ms AES192 890ms 930ms 3260ms 2390ms 2220ms 820ms 2580ms 2590ms 560ms 570ms AES256 1040ms 1070ms 3590ms 2640ms 2540ms 970ms 2880ms 2890ms 650ms 650ms After: $ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256 Running each test 1000 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 670ms 770ms 2920ms 720ms 1900ms 660ms 2260ms 2250ms 480ms 500ms AES192 860ms 930ms 3250ms 870ms 2210ms 830ms 2580ms 2580ms 570ms 570ms AES256 1020ms 1080ms 3580ms 1030ms 2550ms 970ms 2880ms 2870ms 660ms 660ms Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 97 +++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 75 insertions(+), 22 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index d081b42..104f869 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -1582,33 +1582,86 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, int i; unsigned char savebuf[BLOCKSIZE]; - aesni_prepare (); - for ( ;nblocks; nblocks-- ) + if (0) + ; +#ifdef USE_AESNI + else if (ctx->use_aesni) { - /* We need to save INBUF away because it may be identical to - OUTBUF. */ - memcpy (savebuf, inbuf, BLOCKSIZE); + aesni_prepare (); - if (0) - ; + if (!ctx->decryption_prepared ) + { + prepare_decryption ( ctx ); + ctx->decryption_prepared = 1; + } + + /* As we avoid memcpy to/from stack by using xmm2 and xmm3 for temporary + storage, out-of-order CPUs see parallellism even over loop iterations + and see 2.5x to 2.9x speed up on Intel Sandy-Bridge. Further + improvements are possible with do_aesni_cbc_dec_4() when implemented. + */ + asm volatile + ("movdqu %[iv], %%xmm3\n\t" /* use xmm3 as fast IV storage */ + : /* No output */ + : [iv] "m" (*iv) + : "memory"); + + for ( ;nblocks; nblocks-- ) + { + asm volatile + ("movdqu %[inbuf], %%xmm2\n\t" /* use xmm2 as savebuf */ + : /* No output */ + : [inbuf] "m" (*inbuf) + : "memory"); + + /* uses only xmm0 and xmm1 */ + do_aesni_dec_aligned (ctx, outbuf, inbuf); + + asm volatile + ("movdqu %[outbuf], %%xmm0\n\t" + "pxor %%xmm3, %%xmm0\n\t" /* xor IV with output */ + "movdqu %%xmm0, %[outbuf]\n\t" + "movdqu %%xmm2, %%xmm3\n\t" /* store savebuf as new IV */ + : /* No output */ + : [outbuf] "m" (*outbuf) + : "memory"); + + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; + } + + asm volatile + ("movdqu %%xmm3, %[iv]\n\t" /* store IV */ + : /* No output */ + : [iv] "m" (*iv) + : "memory"); + + aesni_cleanup (); + aesni_cleanup_2_4 (); + } +#endif /*USE_AESNI*/ + else + for ( ;nblocks; nblocks-- ) + { + /* We need to save INBUF away because it may be identical to + OUTBUF. 
*/ + memcpy (savebuf, inbuf, BLOCKSIZE); + + if (0) + ; #ifdef USE_PADLOCK - else if (ctx->use_padlock) - do_padlock (ctx, 1, outbuf, inbuf); + else if (ctx->use_padlock) + do_padlock (ctx, 1, outbuf, inbuf); #endif /*USE_PADLOCK*/ -#ifdef USE_AESNI - else if (ctx->use_aesni) - do_aesni (ctx, 1, outbuf, inbuf); -#endif /*USE_AESNI*/ - else - do_decrypt (ctx, outbuf, inbuf); + else + do_decrypt (ctx, outbuf, inbuf); - for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) - outbuf[i] ^= *ivp++; - memcpy (iv, savebuf, BLOCKSIZE); - inbuf += BLOCKSIZE; - outbuf += BLOCKSIZE; - } - aesni_cleanup (); + for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) + outbuf[i] ^= *ivp++; + memcpy (iv, savebuf, BLOCKSIZE); + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } _gcry_burn_stack (48 + 2*sizeof(int) + BLOCKSIZE + 4*sizeof (char*)); } From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:04 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:04 +0200 Subject: [PATCH 03/10] Optimize AES-NI CBC encryption In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172204.1410.75614.stgit@localhost6.localdomain6> * cipher/rijndeal.c (_gcry_aes_cbc_enc) [USE_AESNI]: Add AES-NI spesific loop and use SSE2 assembler for xoring and copying of blocks. -- This gives ~35% improvement in 'tests/benchmark cipher aes' on Sandy-Bridge CPU (x86-64). Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 47 +++++++++++++++++++++++++++++++++++++---------- 1 file changed, 37 insertions(+), 10 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 104f869..982c54e 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -1249,23 +1249,50 @@ _gcry_aes_cbc_enc (void *context, unsigned char *iv, aesni_prepare (); for ( ;nblocks; nblocks-- ) { - for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) - outbuf[i] = inbuf[i] ^ *ivp++; - if (0) ; -#ifdef USE_PADLOCK - else if (ctx->use_padlock) - do_padlock (ctx, 0, outbuf, outbuf); -#endif /*USE_PADLOCK*/ #ifdef USE_AESNI else if (ctx->use_aesni) - do_aesni (ctx, 0, outbuf, outbuf); + { + /* ~35% speed up on Sandy-Bridge when doing xoring and copying with + SSE registers. 
*/ + asm volatile ("movdqu %[iv], %%xmm0\n\t" + "movdqu %[inbuf], %%xmm1\n\t" + "pxor %%xmm0, %%xmm1\n\t" + "movdqu %%xmm1, %[outbuf]\n\t" + : /* No output */ + : [iv] "m" (*iv), + [inbuf] "m" (*inbuf), + [outbuf] "m" (*outbuf) + : "memory" ); + + do_aesni (ctx, 0, outbuf, outbuf); + + asm volatile ("movdqu %[outbuf], %%xmm0\n\t" + "movdqu %%xmm0, %[iv]\n\t" + : /* No output */ + : [outbuf] "m" (*outbuf), + [iv] "m" (*iv) + : "memory" ); + } #endif /*USE_AESNI*/ else - do_encrypt (ctx, outbuf, outbuf ); + { + for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) + outbuf[i] = inbuf[i] ^ *ivp++; + + if (0) + ; +#ifdef USE_PADLOCK + else if (ctx->use_padlock) + do_padlock (ctx, 0, outbuf, outbuf); +#endif /*USE_PADLOCK*/ + else + do_encrypt (ctx, outbuf, outbuf ); + + memcpy (iv, outbuf, BLOCKSIZE); + } - memcpy (iv, outbuf, BLOCKSIZE); inbuf += BLOCKSIZE; if (!cbc_mac) outbuf += BLOCKSIZE; From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:14 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:14 +0200 Subject: [PATCH 05/10] Add parallelized AES-NI CBC decryption In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172214.1410.20520.stgit@localhost6.localdomain6> * cipher/rijndael.c [USE_AESNI] (aesni_cleanup_5): New macro. [USE_AESNI] (do_aesni_dec_vec4): New function. (_gcry_aes_cbc_dec) [USE_AESNI]: Add parallelized CBC loop. (_gcry_aes_cbc_dec) [USE_AESNI]: Change IV storage register from xmm3 to xmm5. -- This gives ~60% improvement in CBC decryption speed on sandy-bridge (x86-64). Overall speed improvement with this and previous CBC patches is over 400%. Before: $ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256 Running each test 1000 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 670ms 770ms 2920ms 720ms 1900ms 660ms 2260ms 2250ms 480ms 500ms AES192 860ms 930ms 3250ms 870ms 2210ms 830ms 2580ms 2580ms 570ms 570ms AES256 1020ms 1080ms 3580ms 1030ms 2550ms 970ms 2880ms 2870ms 660ms 660ms After: $ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256 Running each test 1000 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 670ms 770ms 2130ms 450ms 1880ms 670ms 2250ms 2280ms 490ms 490ms AES192 880ms 920ms 2460ms 540ms 2210ms 830ms 2580ms 2570ms 580ms 570ms AES256 1020ms 1070ms 2800ms 620ms 2560ms 970ms 2880ms 2880ms 660ms 650ms Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 152 insertions(+), 9 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 69e1df1..34a0f8c 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -822,6 +822,115 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, } +/* Decrypt four blocks using the Intel AES-NI instructions. Blocks are input + * and output through SSE registers xmm1 to xmm4. 
*/ +static void +do_aesni_dec_vec4 (const RIJNDAEL_context *ctx) +{ +#define aesdec_xmm0_xmm1 ".byte 0x66, 0x0f, 0x38, 0xde, 0xc8\n\t" +#define aesdec_xmm0_xmm2 ".byte 0x66, 0x0f, 0x38, 0xde, 0xd0\n\t" +#define aesdec_xmm0_xmm3 ".byte 0x66, 0x0f, 0x38, 0xde, 0xd8\n\t" +#define aesdec_xmm0_xmm4 ".byte 0x66, 0x0f, 0x38, 0xde, 0xe0\n\t" +#define aesdeclast_xmm0_xmm1 ".byte 0x66, 0x0f, 0x38, 0xdf, 0xc8\n\t" +#define aesdeclast_xmm0_xmm2 ".byte 0x66, 0x0f, 0x38, 0xdf, 0xd0\n\t" +#define aesdeclast_xmm0_xmm3 ".byte 0x66, 0x0f, 0x38, 0xdf, 0xd8\n\t" +#define aesdeclast_xmm0_xmm4 ".byte 0x66, 0x0f, 0x38, 0xdf, 0xe0\n\t" + asm volatile ("movdqa (%[key]), %%xmm0\n\t" + "pxor %%xmm0, %%xmm1\n\t" /* xmm1 ^= key[0] */ + "pxor %%xmm0, %%xmm2\n\t" /* xmm2 ^= key[0] */ + "pxor %%xmm0, %%xmm3\n\t" /* xmm3 ^= key[0] */ + "pxor %%xmm0, %%xmm4\n\t" /* xmm4 ^= key[0] */ + "movdqa 0x10(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x20(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x30(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x40(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x50(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x60(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x70(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x80(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0x90(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0xa0(%[key]), %%xmm0\n\t" + "cmp $10, %[rounds]\n\t" + "jz .Ldeclast%=\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0xb0(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0xc0(%[key]), %%xmm0\n\t" + "cmp $12, %[rounds]\n\t" + "jz .Ldeclast%=\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0xd0(%[key]), %%xmm0\n\t" + aesdec_xmm0_xmm1 + aesdec_xmm0_xmm2 + aesdec_xmm0_xmm3 + aesdec_xmm0_xmm4 + "movdqa 0xe0(%[key]), %%xmm0\n" + + ".Ldeclast%=:\n\t" + aesdeclast_xmm0_xmm1 + aesdeclast_xmm0_xmm2 + aesdeclast_xmm0_xmm3 + aesdeclast_xmm0_xmm4 + : /* no output */ + : [key] "r" (ctx->keyschdec), + [rounds] "r" (ctx->rounds) + : "cc", "memory"); +#undef aesdec_xmm0_xmm1 +#undef aesdec_xmm0_xmm2 +#undef aesdec_xmm0_xmm3 +#undef aesdec_xmm0_xmm4 +#undef aesdeclast_xmm0_xmm1 +#undef aesdeclast_xmm0_xmm2 +#undef aesdeclast_xmm0_xmm3 +#undef aesdeclast_xmm0_xmm4 +} + + /* Perform a CFB encryption or decryption round using the initialization vector IV and the input block A. Write the result to the output block B and update IV. IV needs to be 16 byte @@ -1623,17 +1732,51 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, ctx->decryption_prepared = 1; } - /* As we avoid memcpy to/from stack by using xmm2 and xmm3 for temporary - storage, out-of-order CPUs see parallellism even over loop iterations - and see 2.5x to 2.9x speed up on Intel Sandy-Bridge. Further - improvements are possible with do_aesni_cbc_dec_4() when implemented. 
- */ asm volatile - ("movdqu %[iv], %%xmm3\n\t" /* use xmm3 as fast IV storage */ + ("movdqu %[iv], %%xmm5\n\t" /* use xmm5 as fast IV storage */ : /* No output */ : [iv] "m" (*iv) : "memory"); + for ( ;nblocks > 3 ; nblocks -= 4 ) + { + asm volatile + ("movdqu 0*16(%[inbuf]), %%xmm1\n\t" /* load input blocks */ + "movdqu 1*16(%[inbuf]), %%xmm2\n\t" + "movdqu 2*16(%[inbuf]), %%xmm3\n\t" + "movdqu 3*16(%[inbuf]), %%xmm4\n\t" + : /* No output */ + : [inbuf] "r" (inbuf) + : "memory"); + + do_aesni_dec_vec4 (ctx); + + asm volatile + ("pxor %%xmm5, %%xmm1\n\t" /* xor IV with output */ + "movdqu 0*16(%[inbuf]), %%xmm5\n\t" /* load new IV */ + "movdqu %%xmm1, 0*16(%[outbuf])\n\t" + + "pxor %%xmm5, %%xmm2\n\t" /* xor IV with output */ + "movdqu 1*16(%[inbuf]), %%xmm5\n\t" /* load new IV */ + "movdqu %%xmm2, 1*16(%[outbuf])\n\t" + + "pxor %%xmm5, %%xmm3\n\t" /* xor IV with output */ + "movdqu 2*16(%[inbuf]), %%xmm5\n\t" /* load new IV */ + "movdqu %%xmm3, 2*16(%[outbuf])\n\t" + + "pxor %%xmm5, %%xmm4\n\t" /* xor IV with output */ + "movdqu 3*16(%[inbuf]), %%xmm5\n\t" /* load new IV */ + "movdqu %%xmm4, 3*16(%[outbuf])\n\t" + + : /* No output */ + : [inbuf] "r" (inbuf), + [outbuf] "r" (outbuf) + : "memory"); + + outbuf += 4*BLOCKSIZE; + inbuf += 4*BLOCKSIZE; + } + for ( ;nblocks; nblocks-- ) { asm volatile @@ -1647,9 +1790,9 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, asm volatile ("movdqu %[outbuf], %%xmm0\n\t" - "pxor %%xmm3, %%xmm0\n\t" /* xor IV with output */ + "pxor %%xmm5, %%xmm0\n\t" /* xor IV with output */ "movdqu %%xmm0, %[outbuf]\n\t" - "movdqu %%xmm2, %%xmm3\n\t" /* store savebuf as new IV */ + "movdqu %%xmm2, %%xmm5\n\t" /* store savebuf as new IV */ : /* No output */ : [outbuf] "m" (*outbuf) : "memory"); @@ -1659,7 +1802,7 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, } asm volatile - ("movdqu %%xmm3, %[iv]\n\t" /* store IV */ + ("movdqu %%xmm5, %[iv]\n\t" /* store IV */ : /* No output */ : [iv] "m" (*iv) : "memory"); From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:09 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:09 +0200 Subject: [PATCH 04/10] Clear xmm5 after use in AES-NI CTR mode In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172209.1410.891.stgit@localhost6.localdomain6> * cipher/rijndael.c [USE_AESNI]: Rename aesni_cleanup_2_4 to aesni_cleanup_2_5. [USE_AESNI] (aesni_cleanup_2_5): Clear xmm5 register. (_gcry_aes_ctr_enc, _gcry_aes_cbc_dec) [USE_AESNI]: Use aesni_cleanup_2_5 instead of aesni_cleanup_2_4. -- xmm5 register is used by parallelized AES-NI CTR mode, so it should be cleaned up after use too. 
Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 982c54e..69e1df1 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -144,10 +144,11 @@ typedef struct do { asm volatile ("pxor %%xmm0, %%xmm0\n\t" \ "pxor %%xmm1, %%xmm1\n" :: ); \ } while (0) -# define aesni_cleanup_2_4() \ +# define aesni_cleanup_2_5() \ do { asm volatile ("pxor %%xmm2, %%xmm2\n\t" \ "pxor %%xmm3, %%xmm3\n" \ - "pxor %%xmm4, %%xmm4\n":: ); \ + "pxor %%xmm4, %%xmm4\n" \ + "pxor %%xmm5, %%xmm5\n":: ); \ } while (0) #else # define aesni_prepare() do { } while (0) @@ -1338,7 +1339,7 @@ _gcry_aes_ctr_enc (void *context, unsigned char *ctr, inbuf += BLOCKSIZE; } aesni_cleanup (); - aesni_cleanup_2_4 (); + aesni_cleanup_2_5 (); } #endif /*USE_AESNI*/ else @@ -1664,7 +1665,7 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, : "memory"); aesni_cleanup (); - aesni_cleanup_2_4 (); + aesni_cleanup_2_5 (); } #endif /*USE_AESNI*/ else From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:25 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:25 +0200 Subject: [PATCH 07/10] Add parallelized AES-NI ECB encryption In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172225.1410.75072.stgit@localhost6.localdomain6> * cipher/cipher-internal.h (struct gcry_cipher_handle): Add bulk.ecb_enc. * cipher/cipher.c (gcry_cipher_open) [USE_AES]: Set bulk.ecb_enc to _gcry_aes_ecb_enc. (do_ecb_encrypt): Redirect call into bulk.ecb_enc if non-null. * src/cipher.h (_gcry_aes_ecb_enc): Add new function prototype. * cipher/rijndeal.c (_gcry_aes_ecb_enc): Add new function. [USE_AESNI] (do_aesni_enc_vec4): Add new function. -- Parallelized ECB encryption is ~2.0x faster on Intel Sandy-Bridge (x86-64). Before: Running each test 1000 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 690ms 350ms 2130ms 470ms 1890ms 670ms 2220ms 2240ms 490ms 490ms AES192 900ms 440ms 2460ms 560ms 2210ms 840ms 2550ms 2560ms 570ms 570ms AES256 1040ms 520ms 2800ms 640ms 2550ms 970ms 2840ms 2850ms 660ms 650ms After: $ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256 Running each test 1000 times. 
ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 340ms 360ms 2130ms 470ms 1870ms 690ms 2200ms 2250ms 500ms 490ms AES192 430ms 440ms 2460ms 550ms 2210ms 820ms 2540ms 2560ms 570ms 570ms AES256 500ms 520ms 2790ms 640ms 2540ms 960ms 2830ms 2840ms 650ms 650ms Signed-off-by: Jussi Kivilinna --- cipher/cipher-internal.h | 3 + cipher/cipher.c | 8 ++ cipher/rijndael.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++ src/cipher.h | 2 + 4 files changed, 187 insertions(+) diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h index dcce708..edd8e17 100644 --- a/cipher/cipher-internal.h +++ b/cipher/cipher-internal.h @@ -89,6 +89,9 @@ struct gcry_cipher_handle void (*ctr_enc)(void *context, unsigned char *iv, void *outbuf_arg, const void *inbuf_arg, unsigned int nblocks); + void (*ecb_enc)(void *context, void *outbuf_arg, + const void *inbuf_arg, + unsigned int nblocks); void (*ecb_dec)(void *context, void *outbuf_arg, const void *inbuf_arg, unsigned int nblocks); diff --git a/cipher/cipher.c b/cipher/cipher.c index b0f9773..edc84f7 100644 --- a/cipher/cipher.c +++ b/cipher/cipher.c @@ -716,6 +716,7 @@ gcry_cipher_open (gcry_cipher_hd_t *handle, h->bulk.cbc_enc = _gcry_aes_cbc_enc; h->bulk.cbc_dec = _gcry_aes_cbc_dec; h->bulk.ctr_enc = _gcry_aes_ctr_enc; + h->bulk.ecb_enc = _gcry_aes_ecb_enc; h->bulk.ecb_dec = _gcry_aes_ecb_dec; break; #endif /*USE_AES*/ @@ -859,6 +860,13 @@ do_ecb_encrypt (gcry_cipher_hd_t c, nblocks = inbuflen / c->cipher->blocksize; + if (nblocks && c->bulk.ecb_enc) + { + c->bulk.ecb_enc (&c->context.c, outbuf, inbuf, nblocks); + + return 0; + } + for (n=0; n < nblocks; n++ ) { c->cipher->encrypt (&c->context.c, outbuf, (byte*)/*arggg*/inbuf); diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 421b159..5110c72 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -822,6 +822,115 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, } +/* Encrypt four blocks using the Intel AES-NI instructions. Blocks are input + * and output through SSE registers xmm1 to xmm4. 
*/ +static void +do_aesni_enc_vec4 (const RIJNDAEL_context *ctx) +{ +#define aesenc_xmm0_xmm1 ".byte 0x66, 0x0f, 0x38, 0xdc, 0xc8\n\t" +#define aesenc_xmm0_xmm2 ".byte 0x66, 0x0f, 0x38, 0xdc, 0xd0\n\t" +#define aesenc_xmm0_xmm3 ".byte 0x66, 0x0f, 0x38, 0xdc, 0xd8\n\t" +#define aesenc_xmm0_xmm4 ".byte 0x66, 0x0f, 0x38, 0xdc, 0xe0\n\t" +#define aesenclast_xmm0_xmm1 ".byte 0x66, 0x0f, 0x38, 0xdd, 0xc8\n\t" +#define aesenclast_xmm0_xmm2 ".byte 0x66, 0x0f, 0x38, 0xdd, 0xd0\n\t" +#define aesenclast_xmm0_xmm3 ".byte 0x66, 0x0f, 0x38, 0xdd, 0xd8\n\t" +#define aesenclast_xmm0_xmm4 ".byte 0x66, 0x0f, 0x38, 0xdd, 0xe0\n\t" + asm volatile ("movdqa (%[key]), %%xmm0\n\t" + "pxor %%xmm0, %%xmm1\n\t" /* xmm1 ^= key[0] */ + "pxor %%xmm0, %%xmm2\n\t" /* xmm2 ^= key[0] */ + "pxor %%xmm0, %%xmm3\n\t" /* xmm3 ^= key[0] */ + "pxor %%xmm0, %%xmm4\n\t" /* xmm4 ^= key[0] */ + "movdqa 0x10(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x20(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x30(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x40(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x50(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x60(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x70(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x80(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0x90(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0xa0(%[key]), %%xmm0\n\t" + "cmp $10, %[rounds]\n\t" + "jz .Ldeclast%=\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0xb0(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0xc0(%[key]), %%xmm0\n\t" + "cmp $12, %[rounds]\n\t" + "jz .Ldeclast%=\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0xd0(%[key]), %%xmm0\n\t" + aesenc_xmm0_xmm1 + aesenc_xmm0_xmm2 + aesenc_xmm0_xmm3 + aesenc_xmm0_xmm4 + "movdqa 0xe0(%[key]), %%xmm0\n" + + ".Ldeclast%=:\n\t" + aesenclast_xmm0_xmm1 + aesenclast_xmm0_xmm2 + aesenclast_xmm0_xmm3 + aesenclast_xmm0_xmm4 + : /* no output */ + : [key] "r" (ctx->keyschenc), + [rounds] "r" (ctx->rounds) + : "cc", "memory"); +#undef aesenc_xmm0_xmm1 +#undef aesenc_xmm0_xmm2 +#undef aesenc_xmm0_xmm3 +#undef aesenc_xmm0_xmm4 +#undef aesenclast_xmm0_xmm1 +#undef aesenclast_xmm0_xmm2 +#undef aesenclast_xmm0_xmm3 +#undef aesenclast_xmm0_xmm4 +} + + /* Decrypt four blocks using the Intel AES-NI instructions. Blocks are input * and output through SSE registers xmm1 to xmm4. */ static void @@ -1476,6 +1585,71 @@ _gcry_aes_ctr_enc (void *context, unsigned char *ctr, } +/* Bulk encryption of complete blocks in ECB mode. This function is only + * intended for the bulk encryption feature of cipher.c. 
*/ +void +_gcry_aes_ecb_enc (void *context, void *outbuf_arg, + const void *inbuf_arg, unsigned int nblocks) +{ + RIJNDAEL_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + + if (0) + ; +#ifdef USE_AESNI + else if (ctx->use_aesni) + { + aesni_prepare (); + + for ( ;nblocks > 3 ; nblocks -= 4 ) + { + asm volatile + ("movdqu 0*16(%[inbuf]), %%xmm1\n\t" /* load input blocks */ + "movdqu 1*16(%[inbuf]), %%xmm2\n\t" + "movdqu 2*16(%[inbuf]), %%xmm3\n\t" + "movdqu 3*16(%[inbuf]), %%xmm4\n\t" + : /* No output */ + : [inbuf] "r" (inbuf) + : "memory"); + + do_aesni_enc_vec4 (ctx); + + asm volatile + ("movdqu %%xmm1, 0*16(%[outbuf])\n\t" /* store output blocks */ + "movdqu %%xmm2, 1*16(%[outbuf])\n\t" + "movdqu %%xmm3, 2*16(%[outbuf])\n\t" + "movdqu %%xmm4, 3*16(%[outbuf])\n\t" + : /* No output */ + : [outbuf] "r" (outbuf) + : "memory"); + + outbuf += 4*BLOCKSIZE; + inbuf += 4*BLOCKSIZE; + } + + for ( ;nblocks; nblocks-- ) + { + do_aesni_enc_aligned (ctx, outbuf, inbuf); + + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } + + aesni_cleanup (); + aesni_cleanup_2_5 (); + } +#endif + else + for ( ;nblocks; nblocks-- ) + { + rijndael_encrypt(context, outbuf, inbuf); + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } +} + + /* Decrypt one block. A and B need to be aligned on a 4 byte boundary and the decryption must have been prepared. A and B may be the diff --git a/src/cipher.h b/src/cipher.h index 6b34e90..66367c1 100644 --- a/src/cipher.h +++ b/src/cipher.h @@ -94,6 +94,8 @@ void _gcry_aes_cbc_dec (void *context, unsigned char *iv, void _gcry_aes_ctr_enc (void *context, unsigned char *ctr, void *outbuf_arg, const void *inbuf_arg, unsigned int nblocks); +void _gcry_aes_ecb_enc (void *context, void *outbuf_arg, + const void *inbuf_arg, unsigned int nblocks); void _gcry_aes_ecb_dec (void *context, void *outbuf_arg, const void *inbuf_arg, unsigned int nblocks); From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:35 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:35 +0200 Subject: [PATCH 09/10] Optimize wipememory2 for i386 and x86-64 In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172235.1410.40871.stgit@localhost6.localdomain6> * src/g10lib.h (wipememory2): Add call to fast_wipememory2. (fast_wipememory2): New macros for i386 and x86-64 architectures. Empty macro provided for other architectures. -- Optimizing wipememory2 give broad range of speed improvements, as seen below. 
Cipher speed ratios, old-vs-new (AMD Phenom II, x86-64): ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- IDEA 1.32x 1.35x 1.29x 1.25x 1.30x 1.33x 1.33x 1.33x 1.22x 1.22x 3DES 1.13x 1.10x 1.11x 1.12x 1.13x 1.16x 1.13x 1.13x 1.10x 1.12x CAST5 1.57x 1.51x 1.56x 1.43x 1.48x 1.50x 1.49x 1.51x 1.28x 1.27x BLOWFISH 1.53x 1.52x 1.56x 1.42x 1.50x 1.51x 1.49x 1.52x 1.27x 1.28x AES 1.33x 1.33x 1.00x 1.02x 1.04x 1.02x 1.26x 1.26x 1.00x 0.98x AES192 1.33x 1.36x 1.05x 1.00x 1.04x 1.00x 1.28x 1.24x 1.02x 1.00x AES256 1.22x 1.33x 0.98x 1.00x 1.03x 1.02x 1.28x 1.25x 1.00x 1.00x TWOFISH 1.34x 1.34x 1.44x 1.25x 1.35x 1.28x 1.37x 1.37x 1.14x 1.16x ARCFOUR 1.00x 1.00x DES 1.31x 1.30x 1.34x 1.25x 1.28x 1.28x 1.34x 1.26x 1.22x 1.24x TWOFISH128 1.41x 1.45x 1.46x 1.28x 1.32x 1.37x 1.34x 1.28x 1.16x 1.16x SERPENT128 1.16x 1.20x 1.22x 1.16x 1.16x 1.16x 1.18x 1.18x 1.14x 1.11x SERPENT192 1.16x 1.20x 1.23x 1.16x 1.19x 1.18x 1.16x 1.16x 1.10x 1.10x SERPENT256 1.18x 1.23x 1.23x 1.13x 1.18x 1.16x 1.18x 1.16x 1.11x 1.11x RFC2268_40 1.00x 1.00x 1.03x 0.96x 0.98x 1.00x 0.99x 1.00x 0.99x 0.98x SEED 1.20x 1.24x 1.25x 1.18x 1.19x 1.18x 1.21x 1.22x 1.14x 1.12x CAMELLIA128 1.60x 1.69x 1.56x 1.50x 1.60x 1.53x 1.64x 1.63x 1.29x 1.32x CAMELLIA192 1.55x 1.46x 1.44x 1.34x 1.42x 1.50x 1.46x 1.51x 1.26x 1.28x CAMELLIA256 1.52x 1.50x 1.47x 1.40x 1.51x 1.44x 1.41x 1.50x 1.28x 1.28x Cipher speed ratios, old-vs-new (AMD Phenom II, i386): ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- IDEA 1.15x 1.11x 1.10x 1.08x 1.09x 1.13x 1.16x 1.07x 1.10x 1.14x 3DES 1.08x 1.08x 1.08x 1.07x 1.06x 1.06x 1.06x 1.05x 1.05x 1.05x CAST5 1.23x 1.25x 1.18x 1.17x 1.25x 1.21x 1.22x 1.17x 1.14x 1.12x BLOWFISH 1.25x 1.22x 1.21x 1.11x 1.23x 1.23x 1.24x 1.17x 1.14x 1.14x AES 1.13x 1.13x 1.02x 1.02x 0.98x 0.98x 1.16x 1.03x 1.02x 0.98x AES192 1.11x 1.12x 1.02x 0.99x 1.02x 0.95x 1.06x 1.00x 0.94x 0.91x AES256 1.05x 1.05x 0.97x 1.00x 1.00x 0.99x 1.11x 1.01x 0.99x 1.00x TWOFISH 1.11x 1.15x 1.16x 1.13x 1.12x 1.14x 1.13x 1.05x 1.07x 1.08x ARCFOUR 1.00x 0.97x DES 1.14x 1.14x 1.10x 1.07x 1.11x 1.12x 1.14x 1.08x 1.11x 1.17x TWOFISH128 1.16x 1.23x 1.18x 1.15x 1.14x 1.20x 1.15x 1.05x 1.08x 1.08x SERPENT128 1.08x 1.08x 1.08x 1.05x 1.06x 1.05x 1.09x 1.04x 1.05x 1.05x SERPENT192 1.07x 1.08x 1.08x 1.04x 1.04x 1.06x 1.08x 1.04x 1.01x 1.05x SERPENT256 1.06x 1.08x 1.05x 1.04x 1.05x 1.08x 1.07x 1.03x 1.06x 1.06x RFC2268_40 1.00x 0.99x 1.02x 1.01x 1.01x 1.00x 1.02x 0.99x 0.98x 0.99x SEED 1.12x 1.07x 1.12x 1.07x 1.09x 1.10x 1.10x 1.03x 1.07x 1.05x CAMELLIA128 1.24x 1.21x 1.16x 1.17x 1.16x 1.16x 1.21x 1.16x 1.13x 1.12x CAMELLIA192 1.19x 1.20x 1.14x 1.19x 1.20x 1.20x 1.18x 1.13x 1.13x 1.15x CAMELLIA256 1.21x 1.19x 1.14x 1.17x 1.17x 1.16x 1.17x 1.11x 1.12x 1.14x Hash speed ratios, old-vs-new (Intel Sandy-Bridge, x86-64): MD5 1.00x 1.47x 1.07x 1.00x 1.00x SHA1 1.06x 1.27x 1.06x 1.00x 1.00x RIPEMD160 1.04x 1.32x 1.11x 1.00x 1.00x TIGER192 1.05x 1.50x 1.15x 1.03x 1.05x SHA256 1.05x 1.38x 1.21x 1.04x 1.03x SHA384 1.15x 1.76x 1.25x 1.10x 1.04x SHA512 1.15x 1.76x 1.27x 1.08x 1.04x SHA224 1.05x 1.38x 1.21x 1.06x 1.00x MD4 1.17x 1.55x 1.06x 1.06x 1.00x CRC32 1.00x 1.00x 0.99x 1.04x 1.00x CRC32RFC1510 0.93x 1.00x 1.01x 1.00x 1.00x CRC24RFC2440 1.00x 1.00x 1.00x 0.99x 1.00x WHIRLPOOL 1.02x 1.00x 0.99x 1.00x 1.00x TIGER 1.05x 1.50x 1.15x 1.09x 1.05x TIGER2 1.05x 1.48x 1.16x 1.06x 0.95x Signed-off-by: Jussi Kivilinna --- src/g10lib.h | 43 ++++++++++++++++++++++++++++++++++++++++++- 1 
file changed, 42 insertions(+), 1 deletion(-) diff --git a/src/g10lib.h b/src/g10lib.h index c580c08..f1af399 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -238,11 +238,52 @@ void _gcry_burn_stack (int bytes); #define wipememory2(_ptr,_set,_len) do { \ volatile char *_vptr=(volatile char *)(_ptr); \ size_t _vlen=(_len); \ - while(_vlen) { *_vptr=(_set); _vptr++; _vlen--; } \ + unsigned char _vset=(_set); \ + fast_wipememory2(_vptr,_vset,_vlen); \ + while(_vlen) { *_vptr=(_vset); _vptr++; _vlen--; } \ } while(0) #define wipememory(_ptr,_len) wipememory2(_ptr,0,_len) +/* Optimized fast_wipememory2 for i386 and x86-64 architechtures. Maybe leave + tail bytes unhandled, in which case tail bytes are handled by wipememory2. + */ +#if defined(__x86_64__) && __GNUC__ >= 4 +#define fast_wipememory2(_vptr,_vset,_vlen) do { \ + unsigned long long int _vset8 = _vset; \ + if (_vlen < 8) \ + break; \ + _vset8 *= 0x0101010101010101ULL; \ + do { \ + asm volatile("movq %[set], %[ptr]\n\t" \ + : /**/ \ + : [set] "Cr" (_vset8), \ + [ptr] "m" (*_vptr) \ + : "memory"); \ + _vlen -= 8; \ + _vptr += 8; \ + } while (_vlen >= 8); \ + } while (0) +#elif defined (__i386__) && SIZEOF_UNSIGNED_LONG == 4 && __GNUC__ >= 4 +#define fast_wipememory2(_ptr,_set,_len) do { \ + unsigned long _vset4 = _vset; \ + if (_vlen < 4) \ + break; \ + _vset4 *= 0x01010101; \ + do { \ + asm volatile("movl %[set], %[ptr]\n\t" \ + : /**/ \ + : [set] "Cr" (_vset4), \ + [ptr] "m" (*_vptr) \ + : "memory"); \ + _vlen -= 4; \ + _vptr += 4; \ + } while (_vlen >= 4); \ + } while (0) +#else +#define fast_wipememory2(_ptr,_set,_len) +#endif + /* Digit predicates. */ From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:30 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:30 +0200 Subject: [PATCH 08/10] Fix missing 64bit carry handling in AES-NI CTR mode In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172230.1410.86214.stgit@localhost6.localdomain6> * cipher/rijndael.c [USE_AESNI] (do_aesni_ctr, do_aesni_ctr_4): Add carry handling to 64-bit addition. (selftest_ctr_128): New function for testing IV handling in bulk CTR function. (selftest): Add call to selftest_ctr_128. -- Carry handling checks if lower 64-bit part of SSE register was overflowed and if it was, increment upper parts since that point. Also add selftests to verify correct operation. 
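In plain C terms, the counter update this patch makes correct is roughly the
following sketch (illustration only, not part of the patch; the bulk code does
the same thing on byte-swapped SSE halves and applies the carry with paddq):

#include <stdint.h>

/* Increment a 128-bit big-endian CTR block by one, propagating the
   carry from the low 64 bits into the high 64 bits.  Without the
   carry, a counter whose low eight bytes are all 0xff wraps the low
   half and never touches the high half. */
static void
ctr_increment (unsigned char ctr[16])
{
  uint64_t lo = 0, hi = 0;
  int i;

  for (i = 0; i < 8; i++)
    hi = (hi << 8) | ctr[i];
  for (i = 8; i < 16; i++)
    lo = (lo << 8) | ctr[i];

  lo++;
  if (lo == 0)          /* low 64 bits overflowed */
    hi++;

  for (i = 15; i >= 8; i--, lo >>= 8)
    ctr[i] = lo & 0xff;
  for (i = 7; i >= 0; i--, hi >>= 8)
    ctr[i] = hi & 0xff;
}

The assembly below detects the overflow with bswapl/addl/adcl on the low
quadword and only then adds the carry into the upper quadword.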
Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 189 ++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 171 insertions(+), 18 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 5110c72..bdb2e51 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -1120,16 +1120,33 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, static unsigned char be_mask[16] __attribute__ ((aligned (16))) = { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 }; - asm volatile ("movdqa %[ctr], %%xmm0\n\t" /* xmm0, xmm2 := CTR */ + asm volatile ("movdqa (%[ctr]), %%xmm0\n\t" /* xmm0, xmm2 := CTR */ "movaps %%xmm0, %%xmm2\n\t" "mov $1, %%esi\n\t" /* xmm2++ (big-endian) */ "movd %%esi, %%xmm1\n\t" + + "movl 12(%[ctr]), %%esi\n\t" /* load lower parts of CTR */ + "bswapl %%esi\n\t" + "movl 8(%[ctr]), %%edi\n\t" + "bswapl %%edi\n\t" + "pshufb %[mask], %%xmm2\n\t" "paddq %%xmm1, %%xmm2\n\t" + + "addl $1, %%esi\n\t" + "adcl $0, %%edi\n\t" /* detect 64bit overflow */ + "jnc .Lno_carry%=\n\t" + + /* swap upper and lower halfs */ + "pshufd $0x4e, %%xmm1, %%xmm1\n\t" + "paddq %%xmm1, %%xmm2\n\t" /* add carry to upper 64bits */ + + ".Lno_carry%=:\n\t" + "pshufb %[mask], %%xmm2\n\t" - "movdqa %%xmm2, %[ctr]\n" /* Update CTR. */ + "movdqa %%xmm2, (%[ctr])\n" /* Update CTR. */ - "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ "movdqa 0x10(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 @@ -1169,12 +1186,13 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, "pxor %%xmm1, %%xmm0\n\t" /* EncCTR ^= input */ "movdqu %%xmm0, %[dst]" /* Store EncCTR. */ - : [ctr] "+m" (*ctr), [dst] "=m" (*b) + : [dst] "=m" (*b) : [src] "m" (*a), + [ctr] "r" (ctr), [key] "r" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [mask] "m" (*be_mask) - : "%esi", "cc", "memory"); + : "%esi", "%edi", "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenclast_xmm1_xmm0 } @@ -1207,10 +1225,16 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, xmm5 temp */ - asm volatile ("movdqa %[ctr], %%xmm0\n\t" /* xmm0, xmm2 := CTR */ + asm volatile ("movdqa (%[ctr]), %%xmm0\n\t" /* xmm0, xmm2 := CTR */ "movaps %%xmm0, %%xmm2\n\t" "mov $1, %%esi\n\t" /* xmm1 := 1 */ "movd %%esi, %%xmm1\n\t" + + "movl 12(%[ctr]), %%esi\n\t" /* load lower parts of CTR */ + "bswapl %%esi\n\t" + "movl 8(%[ctr]), %%edi\n\t" + "bswapl %%edi\n\t" + "pshufb %[mask], %%xmm2\n\t" /* xmm2 := le(xmm2) */ "paddq %%xmm1, %%xmm2\n\t" /* xmm2++ */ "movaps %%xmm2, %%xmm3\n\t" /* xmm3 := xmm2 */ @@ -1219,11 +1243,39 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, "paddq %%xmm1, %%xmm4\n\t" /* xmm4++ */ "movaps %%xmm4, %%xmm5\n\t" /* xmm5 := xmm4 */ "paddq %%xmm1, %%xmm5\n\t" /* xmm5++ */ + + /* swap upper and lower halfs */ + "pshufd $0x4e, %%xmm1, %%xmm1\n\t" + + "addl $1, %%esi\n\t" + "adcl $0, %%edi\n\t" /* detect 64bit overflow */ + "jc .Lcarry_xmm2%=\n\t" + "addl $1, %%esi\n\t" + "adcl $0, %%edi\n\t" /* detect 64bit overflow */ + "jc .Lcarry_xmm3%=\n\t" + "addl $1, %%esi\n\t" + "adcl $0, %%edi\n\t" /* detect 64bit overflow */ + "jc .Lcarry_xmm4%=\n\t" + "addl $1, %%esi\n\t" + "adcl $0, %%edi\n\t" /* detect 64bit overflow */ + "jc .Lcarry_xmm5%=\n\t" + "jmp .Lno_carry%=\n\t" + + ".Lcarry_xmm2%=:\n\t" + "paddq %%xmm1, %%xmm2\n\t" + ".Lcarry_xmm3%=:\n\t" + "paddq %%xmm1, %%xmm3\n\t" + ".Lcarry_xmm4%=:\n\t" + "paddq %%xmm1, %%xmm4\n\t" + ".Lcarry_xmm5%=:\n\t" + "paddq %%xmm1, %%xmm5\n\t" + + ".Lno_carry%=:\n\t" "pshufb %[mask], %%xmm2\n\t" /* xmm2 := be(xmm2) */ "pshufb %[mask], 
%%xmm3\n\t" /* xmm3 := be(xmm3) */ "pshufb %[mask], %%xmm4\n\t" /* xmm4 := be(xmm4) */ "pshufb %[mask], %%xmm5\n\t" /* xmm5 := be(xmm5) */ - "movdqa %%xmm5, %[ctr]\n" /* Update CTR. */ + "movdqa %%xmm5, (%[ctr])\n" /* Update CTR. */ "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ @@ -1307,28 +1359,30 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, aesenclast_xmm1_xmm3 aesenclast_xmm1_xmm4 - "movdqu %[src], %%xmm1\n\t" /* Get block 1. */ + "movdqu (%[src]), %%xmm1\n\t" /* Get block 1. */ "pxor %%xmm1, %%xmm0\n\t" /* EncCTR-1 ^= input */ - "movdqu %%xmm0, %[dst]\n\t" /* Store block 1 */ + "movdqu %%xmm0, (%[dst])\n\t" /* Store block 1 */ - "movdqu (16)%[src], %%xmm1\n\t" /* Get block 2. */ + "movdqu 16(%[src]), %%xmm1\n\t" /* Get block 2. */ "pxor %%xmm1, %%xmm2\n\t" /* EncCTR-2 ^= input */ - "movdqu %%xmm2, (16)%[dst]\n\t" /* Store block 2. */ + "movdqu %%xmm2, 16(%[dst])\n\t" /* Store block 2. */ - "movdqu (32)%[src], %%xmm1\n\t" /* Get block 3. */ + "movdqu 32(%[src]), %%xmm1\n\t" /* Get block 3. */ "pxor %%xmm1, %%xmm3\n\t" /* EncCTR-3 ^= input */ - "movdqu %%xmm3, (32)%[dst]\n\t" /* Store block 3. */ + "movdqu %%xmm3, 32(%[dst])\n\t" /* Store block 3. */ - "movdqu (48)%[src], %%xmm1\n\t" /* Get block 4. */ + "movdqu 48(%[src]), %%xmm1\n\t" /* Get block 4. */ "pxor %%xmm1, %%xmm4\n\t" /* EncCTR-4 ^= input */ - "movdqu %%xmm4, (48)%[dst]" /* Store block 4. */ + "movdqu %%xmm4, 48(%[dst])" /* Store block 4. */ - : [ctr] "+m" (*ctr), [dst] "=m" (*b) - : [src] "m" (*a), + : + : [ctr] "r" (ctr), + [src] "r" (a), + [dst] "r" (b), [key] "r" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [mask] "m" (*be_mask) - : "%esi", "cc", "memory"); + : "%esi", "%edi", "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenc_xmm1_xmm2 #undef aesenc_xmm1_xmm3 @@ -2214,6 +2268,102 @@ selftest_basic_256 (void) return NULL; } + +/* Run the self-tests for AES-CTR-128, tests IV increment of bulk CTR + encryption. Returns NULL on success. */ +static const char* +selftest_ctr_128 (void) +{ + RIJNDAEL_context ctx ATTR_ALIGNED_16; + unsigned char plaintext[7*16] ATTR_ALIGNED_16; + unsigned char ciphertext[7*16] ATTR_ALIGNED_16; + unsigned char plaintext2[7*16] ATTR_ALIGNED_16; + unsigned char iv[16] ATTR_ALIGNED_16; + unsigned char iv2[16] ATTR_ALIGNED_16; + int i, j, diff; + + static const unsigned char key[16] ATTR_ALIGNED_16 = { + 0x06,0x9A,0x00,0x7F,0xC7,0x6A,0x45,0x9F, + 0x98,0xBA,0xF9,0x17,0xFE,0xDF,0x95,0x21 + }; + static char error_str[128]; + + rijndael_setkey (&ctx, key, sizeof (key)); + + /* Test single block code path */ + memset(iv, 0xff, sizeof(iv)); + for (i = 0; i < 16; i++) + plaintext[i] = i; + + /* CTR manually. */ + rijndael_encrypt (&ctx, ciphertext, iv); + for (i = 0; i < 16; i++) + ciphertext[i] ^= plaintext[i]; + for (i = 16; i > 0; i--) + { + iv[i-1]++; + if (iv[i-1]) + break; + } + + memset(iv2, 0xff, sizeof(iv2)); + _gcry_aes_ctr_enc (&ctx, iv2, plaintext2, ciphertext, 1); + + if (memcmp(plaintext2, plaintext, 16)) + return "AES-128-CTR test failed (plaintext mismatch)"; + + if (memcmp(iv2, iv, 16)) + return "AES-128-CTR test failed (IV mismatch)"; + + /* Test parallelized code paths */ + for (diff = 0; diff < 7; diff++) { + memset(iv, 0xff, sizeof(iv)); + iv[15] -= diff; + + for (i = 0; i < sizeof(plaintext); i++) + plaintext[i] = i; + + /* Create CTR ciphertext manually. 
*/ + for (i = 0; i < sizeof(plaintext); i+=16) + { + rijndael_encrypt (&ctx, &ciphertext[i], iv); + for (j = 0; j < 16; j++) + ciphertext[i+j] ^= plaintext[i+j]; + for (j = 16; j > 0; j--) + { + iv[j-1]++; + if (iv[j-1]) + break; + } + } + + /* Decrypt using bulk CTR and compare result. */ + memset(iv2, 0xff, sizeof(iv2)); + iv2[15] -= diff; + + _gcry_aes_ctr_enc (&ctx, iv2, plaintext2, ciphertext, + sizeof(ciphertext) / BLOCKSIZE); + + if (memcmp(plaintext2, plaintext, sizeof(plaintext))) + { + snprintf(error_str, sizeof(error_str), + "AES-128-CTR test failed (plaintext mismatch, diff: %d)", + diff); + return error_str; + } + if (memcmp(iv2, iv, sizeof(iv))) + { + snprintf(error_str, sizeof(error_str), + "AES-128-CTR test failed (IV mismatch, diff: %d)", + diff); + return error_str; + } + } + + return NULL; +} + + /* Run all the self-tests and return NULL on success. This function is used for the on-the-fly self-tests. */ static const char * @@ -2226,6 +2376,9 @@ selftest (void) || (r = selftest_basic_256 ()) ) return r; + if ( (r = selftest_ctr_128 ()) ) + return r; + return r; } From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:20 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:20 +0200 Subject: [PATCH 06/10] Add parallelized AES-NI ECB decryption In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172219.1410.33080.stgit@localhost6.localdomain6> * cipher/cipher-internal.h (struct gcry_cipher_handle): Add bulk.ecb_dec. * cipher/cipher.c (gcry_cipher_open) [USE_AES]: Set bulk.ecb_dec to _gcry_aes_ecb_dec. (do_ecb_decrypt): Redirect call into bulk.ecb_dec if non-null. * src/cipher.h (_gcry_aes_ecb_dec): Add new function prototype. * cipher/rijndeal.c (_gcry_aes_ecb_dec): Add new function. -- Parallelized ECB decryption is ~2.0x faster on Intel Sandy-Bridge (x86-64). Before: $ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256 Running each test 1000 times. ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 670ms 770ms 2130ms 450ms 1880ms 670ms 2250ms 2280ms 490ms 490ms AES192 880ms 920ms 2460ms 540ms 2210ms 830ms 2580ms 2570ms 580ms 570ms AES256 1020ms 1070ms 2800ms 620ms 2560ms 970ms 2880ms 2880ms 660ms 650ms After: Running each test 1000 times. 
ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- AES 690ms 350ms 2130ms 470ms 1890ms 670ms 2220ms 2240ms 490ms 490ms AES192 900ms 440ms 2460ms 560ms 2210ms 840ms 2550ms 2560ms 570ms 570ms AES256 1040ms 520ms 2800ms 640ms 2550ms 970ms 2840ms 2850ms 660ms 650ms Signed-off-by: Jussi Kivilinna --- cipher/cipher-internal.h | 3 ++ cipher/cipher.c | 8 +++++ cipher/rijndael.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++ src/cipher.h | 2 + 4 files changed, 83 insertions(+) diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h index 025bf2e..dcce708 100644 --- a/cipher/cipher-internal.h +++ b/cipher/cipher-internal.h @@ -89,6 +89,9 @@ struct gcry_cipher_handle void (*ctr_enc)(void *context, unsigned char *iv, void *outbuf_arg, const void *inbuf_arg, unsigned int nblocks); + void (*ecb_dec)(void *context, void *outbuf_arg, + const void *inbuf_arg, + unsigned int nblocks); } bulk; diff --git a/cipher/cipher.c b/cipher/cipher.c index 389bf7a..b0f9773 100644 --- a/cipher/cipher.c +++ b/cipher/cipher.c @@ -716,6 +716,7 @@ gcry_cipher_open (gcry_cipher_hd_t *handle, h->bulk.cbc_enc = _gcry_aes_cbc_enc; h->bulk.cbc_dec = _gcry_aes_cbc_dec; h->bulk.ctr_enc = _gcry_aes_ctr_enc; + h->bulk.ecb_dec = _gcry_aes_ecb_dec; break; #endif /*USE_AES*/ @@ -881,6 +882,13 @@ do_ecb_decrypt (gcry_cipher_hd_t c, return GPG_ERR_INV_LENGTH; nblocks = inbuflen / c->cipher->blocksize; + if (nblocks && c->bulk.ecb_dec) + { + c->bulk.ecb_dec (&c->context.c, outbuf, inbuf, nblocks); + + return 0; + } + for (n=0; n < nblocks; n++ ) { c->cipher->decrypt (&c->context.c, outbuf, (byte*)/*arggg*/inbuf ); diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 34a0f8c..421b159 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -1838,6 +1838,76 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, } +/* Bulk decryption of complete blocks in ECB mode. This function is only + * intended for the bulk encryption feature of cipher.c. */ +void +_gcry_aes_ecb_dec (void *context, void *outbuf_arg, + const void *inbuf_arg, unsigned int nblocks) +{ + RIJNDAEL_context *ctx = context; + unsigned char *outbuf = outbuf_arg; + const unsigned char *inbuf = inbuf_arg; + + if (0) + ; +#ifdef USE_AESNI + else if (ctx->use_aesni) + { + aesni_prepare (); + + if (!ctx->decryption_prepared ) + { + prepare_decryption ( ctx ); + ctx->decryption_prepared = 1; + } + + for ( ;nblocks > 3 ; nblocks -= 4 ) + { + asm volatile + ("movdqu 0*16(%[inbuf]), %%xmm1\n\t" /* load input blocks */ + "movdqu 1*16(%[inbuf]), %%xmm2\n\t" + "movdqu 2*16(%[inbuf]), %%xmm3\n\t" + "movdqu 3*16(%[inbuf]), %%xmm4\n\t" + : /* No output */ + : [inbuf] "r" (inbuf) + : "memory"); + + do_aesni_dec_vec4 (ctx); + + asm volatile + ("movdqu %%xmm1, 0*16(%[outbuf])\n\t" /* store output blocks */ + "movdqu %%xmm2, 1*16(%[outbuf])\n\t" + "movdqu %%xmm3, 2*16(%[outbuf])\n\t" + "movdqu %%xmm4, 3*16(%[outbuf])\n\t" + : /* No output */ + : [outbuf] "r" (outbuf) + : "memory"); + + outbuf += 4*BLOCKSIZE; + inbuf += 4*BLOCKSIZE; + } + + for ( ;nblocks; nblocks-- ) + { + do_aesni_dec_aligned (ctx, outbuf, inbuf); + + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } + + aesni_cleanup (); + aesni_cleanup_2_5 (); + } +#endif + else + for ( ;nblocks; nblocks-- ) + { + rijndael_decrypt(context, outbuf, inbuf); + inbuf += BLOCKSIZE; + outbuf += BLOCKSIZE; + } +} + /* Run the self-tests for AES 128. Returns NULL on success. 
*/ diff --git a/src/cipher.h b/src/cipher.h index 48eeeda..6b34e90 100644 --- a/src/cipher.h +++ b/src/cipher.h @@ -94,6 +94,8 @@ void _gcry_aes_cbc_dec (void *context, unsigned char *iv, void _gcry_aes_ctr_enc (void *context, unsigned char *ctr, void *outbuf_arg, const void *inbuf_arg, unsigned int nblocks); +void _gcry_aes_ecb_dec (void *context, void *outbuf_arg, + const void *inbuf_arg, unsigned int nblocks); /*-- dsa.c --*/ From jussi.kivilinna at mbnet.fi Fri Nov 23 18:22:40 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Fri, 23 Nov 2012 19:22:40 +0200 Subject: [PATCH 10/10] Fix building with Clang on x86-64 and i386 In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <20121123172240.1410.97916.stgit@localhost6.localdomain6> * cipher/rijndael.c [USE_AESNI] (do_aesni_enc_aligned) (do_aesni_enc_vec4, do_aesni_dec_vec4, do_aesni_cfb, do_aesni_ctr) (do_aesni_ctr_4): Add explicit suffix to 'cmp' instructions. -- Clang throws errors on missing instruction suffixes, such as: rijndael.c:1091:39: error: ambiguous instructions require an explicit suffix (could be 'cmpb', 'cmpw', 'cmpl', or 'cmpq') :39:2: note: instantiated into assembly here cmp $1, -44(%rbp) With this patch building on x86-64 works fine. Other issues still exists on i386, namely with MPI, which can be overcome with 'clang -fheinous-gnu-extensions'. Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index bdb2e51..c8f1d40 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -742,13 +742,13 @@ do_aesni_enc_aligned (const RIJNDAEL_context *ctx, "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -796,13 +796,13 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, "movdqa 0x90(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -886,7 +886,7 @@ do_aesni_enc_vec4 (const RIJNDAEL_context *ctx) aesenc_xmm0_xmm3 aesenc_xmm0_xmm4 "movdqa 0xa0(%[key]), %%xmm0\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesenc_xmm0_xmm1 aesenc_xmm0_xmm2 @@ -898,7 +898,7 @@ do_aesni_enc_vec4 (const RIJNDAEL_context *ctx) aesenc_xmm0_xmm3 aesenc_xmm0_xmm4 "movdqa 0xc0(%[key]), %%xmm0\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesenc_xmm0_xmm1 aesenc_xmm0_xmm2 @@ -995,7 +995,7 @@ do_aesni_dec_vec4 (const RIJNDAEL_context *ctx) aesdec_xmm0_xmm3 aesdec_xmm0_xmm4 "movdqa 0xa0(%[key]), %%xmm0\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm0_xmm1 aesdec_xmm0_xmm2 @@ -1007,7 +1007,7 @@ do_aesni_dec_vec4 (const RIJNDAEL_context *ctx) aesdec_xmm0_xmm3 aesdec_xmm0_xmm4 "movdqa 0xc0(%[key]), %%xmm0\n\t" - "cmp $12, %[rounds]\n\t" + 
"cmpl $12, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm0_xmm1 aesdec_xmm0_xmm2 @@ -1072,13 +1072,13 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -1090,7 +1090,7 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, "movdqu %[src], %%xmm1\n\t" /* Save input. */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 = input ^ IV */ - "cmp $1, %[decrypt]\n\t" + "cmpl $1, %[decrypt]\n\t" "jz .Ldecrypt_%=\n\t" "movdqa %%xmm0, %[iv]\n\t" /* [encrypt] Store IV. */ "jmp .Lleave_%=\n" @@ -1167,13 +1167,13 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -1328,7 +1328,7 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 @@ -1340,7 +1340,7 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 From wk at gnupg.org Sat Nov 24 10:31:42 2012 From: wk at gnupg.org (Werner Koch) Date: Sat, 24 Nov 2012 10:31:42 +0100 Subject: [PATCH 01/10] Extend test of chained modes for 128bit ciphers In-Reply-To: <20121123172154.1410.35771.stgit@localhost6.localdomain6> (Jussi Kivilinna's message of "Fri, 23 Nov 2012 19:21:54 +0200") References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> Message-ID: <87wqxbe48x.fsf@vigenere.g10code.de> Hi, Wow, lots of patches. Thanks. I'll take care of them next week. BTW, at the time I added the AES-NI code, I was often trapped by gcc problems claiming that it was running out of free registers. Some stuff in my original code might be related to this. I can't remember which gcc version I used at that time, though. Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From jussi.kivilinna at mbnet.fi Sat Nov 24 16:38:57 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Sat, 24 Nov 2012 17:38:57 +0200 Subject: [PATCH 01/10] Extend test of chained modes for 128bit ciphers In-Reply-To: <87wqxbe48x.fsf@vigenere.g10code.de> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> <87wqxbe48x.fsf@vigenere.g10code.de> Message-ID: <20121124173857.19791t21vgnzy1a8@www.dalek.fi> Quoting Werner Koch : > Hi, > > Wow, lots of patches. Thanks. I'll take care of them next week. BTW, > at the time I added the AES-NI code, I was often trapped by gcc problems > claiming that it was running out of free registers. Some stuff in my > original code might be related to this. I can't remember which gcc > version I used at that time, though. Maybe later gcc versions have improved in matter. 
I have tested patches with gcc-4.7, gcc-4.6 and clang-3.0 on i386 and didn't run into register allocation problems. If there are problems, there is some room for improvement in the current asm code (use fewer "r" variables, reuse %%esi for pointers, etc., or split asm sequences so that gcc can reallocate input asm registers from/to the stack). -Jussi > > > Salam-Shalom, > > Werner > > -- > Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. > > >
From wk at gnupg.org Sat Nov 24 17:05:32 2012 From: wk at gnupg.org (Werner Koch) Date: Sat, 24 Nov 2012 17:05:32 +0100 Subject: [PATCH 01/10] Extend test of chained modes for 128bit ciphers In-Reply-To: <20121124173857.19791t21vgnzy1a8@www.dalek.fi> (Jussi Kivilinna's message of "Sat, 24 Nov 2012 17:38:57 +0200") References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> <87wqxbe48x.fsf@vigenere.g10code.de> <20121124173857.19791t21vgnzy1a8@www.dalek.fi> Message-ID: <87pq33dm0j.fsf@vigenere.g10code.de> On Sat, 24 Nov 2012 16:38, jussi.kivilinna at mbnet.fi said: > into register allocation problems. If there are problems, there is > some room for improvement in the current asm code (use fewer "r" variables, or use "configure --disable-aesni-support" ;-) Shalom-Salam, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.
From wk at gnupg.org Mon Nov 26 10:16:05 2012 From: wk at gnupg.org (Werner Koch) Date: Mon, 26 Nov 2012 10:16:05 +0100 Subject: AES-NI + compression In-Reply-To: <1353627310.27796.15.camel@mcri-i0054.rch.unimelb.edu.au> (Chris Adamson's message of "Fri, 23 Nov 2012 10:35:10 +1100") References: <1353627310.27796.15.camel@mcri-i0054.rch.unimelb.edu.au> Message-ID: <87boekenca.fsf@vigenere.g10code.de> On Fri, 23 Nov 2012 00:35, chris.adamson at mcri.edu.au said: > implementation. I also tested the effect of adding compression, which is > important to me since I'm using gpg for backup. I took 895M of fairly > compressible DICOM data in a tar file (bz2 compresses to 168M) and ran Note that gpg may not always be able to auto-detect already compressed files. To be safe, you should use "-z 0" to disable gpg's own compression layer. > My immediate questions: i386 AES-NI gives a 50% reduction when compared > to the i386 software version, is this expected or should it be a greater > reduction? I did see some x86_64 AES-NI patches released on the list, > will these be put into a released version or backported? GPG uses a quite complex internal pipeline to process the data, thus improvements in Libgcrypt's AES code won't have a full effect on GPG's encryption. In particular OpenPGP's use of the CFB mode does not allow parallelizing the encryption operation. Jussi's recent AES-NI improvements will go into the 1.6 version. I don't think that it makes sense to backport them. Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.
From wk at gnupg.org Mon Nov 26 10:19:37 2012 From: wk at gnupg.org (Werner Koch) Date: Mon, 26 Nov 2012 10:19:37 +0100 Subject: [PATCH 06/10] Add parallelized AES-NI ECB decryption In-Reply-To: <20121123172219.1410.33080.stgit@localhost6.localdomain6> (Jussi Kivilinna's message of "Fri, 23 Nov 2012 19:22:20 +0200") References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> <20121123172219.1410.33080.stgit@localhost6.localdomain6> Message-ID: <877gp8en6e.fsf@vigenere.g10code.de> On Fri, 23 Nov 2012 18:22, jussi.kivilinna at mbnet.fi said: > Parallelized ECB decryption is ~2.0x faster on Intel Sandy-Bridge (x86-64).
I do not see a reason to ever do more than one ECB block. Use cases for ECB are very limited and the use of multi-block ECB is a strong indication of a severe cryptographic flaw. Thus your parallelized ECB mode would only blow up the code. I applied all your other patches to master. Thanks, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.
From jussi.kivilinna at mbnet.fi Mon Nov 26 10:43:26 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Mon, 26 Nov 2012 11:43:26 +0200 Subject: [PATCH 06/10] Add parallelized AES-NI ECB decryption In-Reply-To: <877gp8en6e.fsf@vigenere.g10code.de> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> <20121123172219.1410.33080.stgit@localhost6.localdomain6> <877gp8en6e.fsf@vigenere.g10code.de> Message-ID: <20121126114326.11263dsujddbbk84@www.dalek.fi> Quoting Werner Koch : > On Fri, 23 Nov 2012 18:22, jussi.kivilinna at mbnet.fi said: > >> Parallelized ECB decryption is ~2.0x faster on Intel Sandy-Bridge (x86-64). > > I do not see a reason to ever do more than one ECB block. Use cases > for ECB are very limited and the use of multi-block ECB is a strong > indication of a severe cryptographic flaw. Thus your parallelized ECB > mode would only blow up the code. The benefit is that if someone implements another parallelisable mode (XTS, for example) on top of libgcrypt/ECB, they can benefit from the improved speed. Plain ECB might only be of interest for benchmarking. -Jussi > > I applied all your other patches to master. > > Thanks, > > Werner > > -- > Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. > > >
From wk at gnupg.org Mon Nov 26 13:48:43 2012 From: wk at gnupg.org (Werner Koch) Date: Mon, 26 Nov 2012 13:48:43 +0100 Subject: [PATCH 06/10] Add parallelized AES-NI ECB decryption In-Reply-To: <20121126114326.11263dsujddbbk84@www.dalek.fi> (Jussi Kivilinna's message of "Mon, 26 Nov 2012 11:43:26 +0200") References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> <20121123172219.1410.33080.stgit@localhost6.localdomain6> <877gp8en6e.fsf@vigenere.g10code.de> <20121126114326.11263dsujddbbk84@www.dalek.fi> Message-ID: <87txsccyxg.fsf@vigenere.g10code.de> On Mon, 26 Nov 2012 10:43, jussi.kivilinna at mbnet.fi said: > The benefit is that if someone implements another parallelisable mode (XTS, > for example) on top of libgcrypt/ECB, they can benefit from the improved > speed. Plain ECB might only be of interest for benchmarking. We can add it if there is good reason for it. Shalom-Salam, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.
From andre at amorim.me Mon Nov 26 14:02:25 2012 From: andre at amorim.me (Andre Amorim) Date: Mon, 26 Nov 2012 13:02:25 +0000 Subject: [PATCH 06/10] Add parallelized AES-NI ECB decryption In-Reply-To: <87txsccyxg.fsf@vigenere.g10code.de> References: <20121123172154.1410.35771.stgit@localhost6.localdomain6> <20121123172219.1410.33080.stgit@localhost6.localdomain6> <877gp8en6e.fsf@vigenere.g10code.de> <20121126114326.11263dsujddbbk84@www.dalek.fi> <87txsccyxg.fsf@vigenere.g10code.de> Message-ID: //Sorry if I disturb any parallel conversation, but I am thinking about "ethernet copper cables" and electric propagation.
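As an illustration of the kind of XTS-style caller discussed above, here is a minimal sketch that encrypts four consecutive blocks with a single multi-block ECB call, which is where a parallelized bulk ECB path could be exploited. It is only a sketch: the function names are invented, it is not part of the patch series, it assumes an AES handle opened with GCRY_CIPHER_MODE_ECB and whole 16-byte blocks, and the tweak update is the standard XTS multiplication by x in GF(2^128).

#include <string.h>
#include <gcrypt.h>

/* Multiply the XTS tweak by x in GF(2^128); byte 0 is the least
   significant byte, as in IEEE P1619. */
static void
xts_mul_x (unsigned char t[16])
{
  int i, carry = 0;

  for (i = 0; i < 16; i++)
    {
      int c = t[i] >> 7;
      t[i] = (t[i] << 1) | carry;
      carry = c;
    }
  if (carry)
    t[0] ^= 0x87;
}

/* Encrypt four consecutive blocks XTS-style with one multi-block ECB
   call.  HD is an AES handle opened with GCRY_CIPHER_MODE_ECB and set
   up with the data key; TWEAK holds the already encrypted sector tweak
   and is advanced past the four blocks. */
static gcry_error_t
xts_encrypt_4blocks (gcry_cipher_hd_t hd, unsigned char tweak[16],
                     unsigned char out[64], const unsigned char in[64])
{
  unsigned char tweaks[64];
  gcry_error_t err;
  int i, j;

  for (i = 0; i < 4; i++)
    {
      memcpy (tweaks + 16*i, tweak, 16);
      for (j = 0; j < 16; j++)
        out[16*i + j] = in[16*i + j] ^ tweak[j];   /* pre-whitening */
      xts_mul_x (tweak);
    }

  /* One ECB call covering all four blocks; a parallelized bulk ECB
     implementation processes them together. */
  err = gcry_cipher_encrypt (hd, out, 64, NULL, 0);
  if (err)
    return err;

  for (i = 0; i < 64; i++)
    out[i] ^= tweaks[i];                           /* post-whitening */
  return 0;
}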
On 26 November 2012 12:48, Werner Koch wrote: > jussi.kivilinna at mbnet.fi -- Gnupg key: 02375205 Fingerprint: F7CD D181 943B 0453 8668 AF16 84E9 7565 0237 5205 From jussi.kivilinna at mbnet.fi Tue Nov 27 22:44:13 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Tue, 27 Nov 2012 23:44:13 +0200 Subject: [PATCH] [v2] Fix building with Clang on x86-64 and i386 Message-ID: <20121127214413.1290.72897.stgit@localhost6.localdomain6> * cipher/rijndael.c [USE_AESNI] (do_aesni_enc_aligned) (do_aesni_dec_vec4, do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Add explicit suffix to 'cmp' instructions. -- Clang throws errors on missing instruction suffixes, such as: rijndael.c:1091:39: error: ambiguous instructions require an explicit suffix (could be 'cmpb', 'cmpw', 'cmpl', or 'cmpq') :39:2: note: instantiated into assembly here cmp $1, -44(%rbp) With this patch building on x86-64 works fine. Other issues still exists on i386, namely with MPI, which can be overcome with 'clang -fheinous-gnu-extensions'. [v2]: - remove do_aesni_enc_vec4 modification as that function didn't make it to upstream. Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 860dcf8..0f5e07c 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -742,13 +742,13 @@ do_aesni_enc_aligned (const RIJNDAEL_context *ctx, "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -796,13 +796,13 @@ do_aesni_dec_aligned (const RIJNDAEL_context *ctx, "movdqa 0x90(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesdec_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -886,7 +886,7 @@ do_aesni_dec_vec4 (const RIJNDAEL_context *ctx) aesdec_xmm0_xmm3 aesdec_xmm0_xmm4 "movdqa 0xa0(%[key]), %%xmm0\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm0_xmm1 aesdec_xmm0_xmm2 @@ -898,7 +898,7 @@ do_aesni_dec_vec4 (const RIJNDAEL_context *ctx) aesdec_xmm0_xmm3 aesdec_xmm0_xmm4 "movdqa 0xc0(%[key]), %%xmm0\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Ldeclast%=\n\t" aesdec_xmm0_xmm1 aesdec_xmm0_xmm2 @@ -963,13 +963,13 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -981,7 +981,7 @@ do_aesni_cfb (const RIJNDAEL_context *ctx, int decrypt_flag, "movdqu %[src], %%xmm1\n\t" /* Save input. 
*/ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 = input ^ IV */ - "cmp $1, %[decrypt]\n\t" + "cmpl $1, %[decrypt]\n\t" "jz .Ldecrypt_%=\n\t" "movdqa %%xmm0, %[iv]\n\t" /* [encrypt] Store IV. */ "jmp .Lleave_%=\n" @@ -1058,13 +1058,13 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, "movdqa 0x90(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xb0(%[key]), %%xmm1\n\t" aesenc_xmm1_xmm0 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 "movdqa 0xd0(%[key]), %%xmm1\n\t" @@ -1219,7 +1219,7 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 "movdqa 0xa0(%[key]), %%xmm1\n\t" - "cmp $10, %[rounds]\n\t" + "cmpl $10, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 @@ -1231,7 +1231,7 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, aesenc_xmm1_xmm3 aesenc_xmm1_xmm4 "movdqa 0xc0(%[key]), %%xmm1\n\t" - "cmp $12, %[rounds]\n\t" + "cmpl $12, %[rounds]\n\t" "jz .Lenclast%=\n\t" aesenc_xmm1_xmm0 aesenc_xmm1_xmm2 From wk at gnupg.org Wed Nov 28 09:43:39 2012 From: wk at gnupg.org (Werner Koch) Date: Wed, 28 Nov 2012 09:43:39 +0100 Subject: [PATCH] [v2] Fix building with Clang on x86-64 and i386 In-Reply-To: <20121127214413.1290.72897.stgit@localhost6.localdomain6> (Jussi Kivilinna's message of "Tue, 27 Nov 2012 23:44:13 +0200") References: <20121127214413.1290.72897.stgit@localhost6.localdomain6> Message-ID: <878v9m9kxw.fsf@vigenere.g10code.de> On Tue, 27 Nov 2012 22:44, jussi.kivilinna at mbnet.fi said: > * cipher/rijndael.c [USE_AESNI] (do_aesni_enc_aligned) > (do_aesni_dec_vec4, do_aesni_cfb, do_aesni_ctr, do_aesni_ctr_4): Add > explicit suffix to 'cmp' instructions. Pushed with the additional comment NB: I still believe it is a bad idea of clang to define __GNUC__ and not being 100% compatible to gcc. [wk] Thanks, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From bradh at frogmouth.net Wed Nov 28 10:40:30 2012 From: bradh at frogmouth.net (Brad Hards) Date: Wed, 28 Nov 2012 20:40:30 +1100 Subject: [PATCH] [v2] Fix building with Clang on x86-64 and i386 In-Reply-To: <878v9m9kxw.fsf@vigenere.g10code.de> References: <20121127214413.1290.72897.stgit@localhost6.localdomain6> <878v9m9kxw.fsf@vigenere.g10code.de> Message-ID: <201211282040.30253.bradh@frogmouth.net> On Wednesday 28 November 2012 19:43:39 Werner Koch wrote: > NB: I still believe it is a bad idea of clang to define __GNUC__ > and not being 100% compatible to gcc. [wk] Note that intel C compilers define the same macro. I think its trying to get the "good" headers. Not saying I disagree with your assessment, just pointing it out as reality. From wk at gnupg.org Wed Nov 28 13:17:09 2012 From: wk at gnupg.org (Werner Koch) Date: Wed, 28 Nov 2012 13:17:09 +0100 Subject: [PATCH] [v2] Fix building with Clang on x86-64 and i386 In-Reply-To: <201211282040.30253.bradh@frogmouth.net> (Brad Hards's message of "Wed, 28 Nov 2012 20:40:30 +1100") References: <20121127214413.1290.72897.stgit@localhost6.localdomain6> <878v9m9kxw.fsf@vigenere.g10code.de> <201211282040.30253.bradh@frogmouth.net> Message-ID: <87zk219b22.fsf@vigenere.g10code.de> On Wed, 28 Nov 2012 10:40, bradh at frogmouth.net said: > Note that intel C compilers define the same macro. I think its trying to get > the "good" headers. Not saying I disagree with your assessment, just pointing > it out as reality. 
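A common guard against such look-alike definitions, given here only as an illustrative sketch (the macro name is invented; this is not a quote from libgcrypt's headers):

#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
# define REAL_GCC 1   /* only reached by the genuine GCC */
#endif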
This reality will eventually force us to replace all conditionals involving __GNUC__ by configure generated macros and actual tests. Well, it is autoconf way of doing things. However, that does not work with public header files - there we would need to take all faked __GNUC__ usages into account :-( Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From jussi.kivilinna at mbnet.fi Thu Nov 29 16:31:03 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 29 Nov 2012 17:31:03 +0200 Subject: [PATCH] Optimize AES-NI CTR mode. Message-ID: <20121129153103.6234.78914.stgit@localhost6.localdomain6> * cipher/rijndael.c [USE_AESNI] (do_aesni_ctr, do_aesni_ctr_4): Make handling of 64-bit overflow and carry conditional. Avoid generic to vector register passing of value '1'. Generate and use '-1' instead. -- We only need to handle 64-bit carry in few special cases, that happen very rarely. So move carry handling to slow-path and only detect need for carry handling on fast-path. Also avoid moving '1' from generic register to vector register, as that might be slow on some CPUs. Instead generate '-1' with SSE2 instructions and use subtraction instead of addition to increase IV. Overall this gives ~8% improvement in speed for AES CTR mode on Intel Sandy-Bridge. Signed-off-by: Jussi Kivilinna --- cipher/rijndael.c | 90 +++++++++++++++++++++++------------------------------ 1 file changed, 39 insertions(+), 51 deletions(-) diff --git a/cipher/rijndael.c b/cipher/rijndael.c index cc7f8d6..6313ab2 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -1015,24 +1015,20 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, asm volatile ("movdqa (%[ctr]), %%xmm0\n\t" /* xmm0, xmm2 := CTR */ "movaps %%xmm0, %%xmm2\n\t" - "mov $1, %%esi\n\t" /* xmm2++ (big-endian) */ - "movd %%esi, %%xmm1\n\t" - - "movl 12(%[ctr]), %%esi\n\t" /* load lower parts of CTR */ - "bswapl %%esi\n\t" - "movl 8(%[ctr]), %%edi\n\t" - "bswapl %%edi\n\t" + "pcmpeqd %%xmm1, %%xmm1\n\t" + "psrldq $8, %%xmm1\n\t" /* xmm1 = -1 */ "pshufb %[mask], %%xmm2\n\t" - "paddq %%xmm1, %%xmm2\n\t" + "psubq %%xmm1, %%xmm2\n\t" /* xmm2++ (big endian) */ - "addl $1, %%esi\n\t" - "adcl $0, %%edi\n\t" /* detect 64bit overflow */ - "jnc .Lno_carry%=\n\t" + /* detect if 64-bit carry handling is needed */ + "cmpl $0xffffffff, 8(%[ctr])\n\t" + "jne .Lno_carry%=\n\t" + "cmpl $0xffffffff, 12(%[ctr])\n\t" + "jne .Lno_carry%=\n\t" - /* swap upper and lower halfs */ - "pshufd $0x4e, %%xmm1, %%xmm1\n\t" - "paddq %%xmm1, %%xmm2\n\t" /* add carry to upper 64bits */ + "pslldq $8, %%xmm1\n\t" /* move lower 64-bit to high */ + "psubq %%xmm1, %%xmm2\n\t" /* add carry to upper 64bits */ ".Lno_carry%=:\n\t" @@ -1085,7 +1081,7 @@ do_aesni_ctr (const RIJNDAEL_context *ctx, [key] "r" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [mask] "m" (*be_mask) - : "%esi", "%edi", "cc", "memory"); + : "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenclast_xmm1_xmm0 } @@ -1120,48 +1116,40 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, asm volatile ("movdqa (%[ctr]), %%xmm0\n\t" /* xmm0, xmm2 := CTR */ "movaps %%xmm0, %%xmm2\n\t" - "mov $1, %%esi\n\t" /* xmm1 := 1 */ - "movd %%esi, %%xmm1\n\t" - - "movl 12(%[ctr]), %%esi\n\t" /* load lower parts of CTR */ - "bswapl %%esi\n\t" - "movl 8(%[ctr]), %%edi\n\t" - "bswapl %%edi\n\t" + "pcmpeqd %%xmm1, %%xmm1\n\t" + "psrldq $8, %%xmm1\n\t" /* xmm1 = -1 */ "pshufb %[mask], %%xmm2\n\t" /* xmm2 := le(xmm2) */ - "paddq %%xmm1, %%xmm2\n\t" /* xmm2++ */ + "psubq %%xmm1, %%xmm2\n\t" /* xmm2++ */ "movaps %%xmm2, 
%%xmm3\n\t" /* xmm3 := xmm2 */ - "paddq %%xmm1, %%xmm3\n\t" /* xmm3++ */ + "psubq %%xmm1, %%xmm3\n\t" /* xmm3++ */ "movaps %%xmm3, %%xmm4\n\t" /* xmm4 := xmm3 */ - "paddq %%xmm1, %%xmm4\n\t" /* xmm4++ */ + "psubq %%xmm1, %%xmm4\n\t" /* xmm4++ */ "movaps %%xmm4, %%xmm5\n\t" /* xmm5 := xmm4 */ - "paddq %%xmm1, %%xmm5\n\t" /* xmm5++ */ - - /* swap upper and lower halfs */ - "pshufd $0x4e, %%xmm1, %%xmm1\n\t" - - "addl $1, %%esi\n\t" - "adcl $0, %%edi\n\t" /* detect 64bit overflow */ - "jc .Lcarry_xmm2%=\n\t" - "addl $1, %%esi\n\t" - "adcl $0, %%edi\n\t" /* detect 64bit overflow */ - "jc .Lcarry_xmm3%=\n\t" - "addl $1, %%esi\n\t" - "adcl $0, %%edi\n\t" /* detect 64bit overflow */ - "jc .Lcarry_xmm4%=\n\t" - "addl $1, %%esi\n\t" - "adcl $0, %%edi\n\t" /* detect 64bit overflow */ - "jc .Lcarry_xmm5%=\n\t" - "jmp .Lno_carry%=\n\t" - - ".Lcarry_xmm2%=:\n\t" - "paddq %%xmm1, %%xmm2\n\t" + "psubq %%xmm1, %%xmm5\n\t" /* xmm5++ */ + + /* detect if 64-bit carry handling is needed */ + "cmpl $0xffffffff, 8(%[ctr])\n\t" + "jne .Lno_carry%=\n\t" + "movl 12(%[ctr]), %%esi\n\t" + "bswapl %%esi\n\t" + "cmpl $0xfffffffc, %%esi\n\t" + "jb .Lno_carry%=\n\t" /* no carry */ + + "pslldq $8, %%xmm1\n\t" /* move lower 64-bit to high */ + "je .Lcarry_xmm5%=\n\t" /* esi == 0xfffffffc */ + "cmpl $0xfffffffe, %%esi\n\t" + "jb .Lcarry_xmm4%=\n\t" /* esi == 0xfffffffd */ + "je .Lcarry_xmm3%=\n\t" /* esi == 0xfffffffe */ + /* esi == 0xffffffff */ + + "psubq %%xmm1, %%xmm2\n\t" ".Lcarry_xmm3%=:\n\t" - "paddq %%xmm1, %%xmm3\n\t" + "psubq %%xmm1, %%xmm3\n\t" ".Lcarry_xmm4%=:\n\t" - "paddq %%xmm1, %%xmm4\n\t" + "psubq %%xmm1, %%xmm4\n\t" ".Lcarry_xmm5%=:\n\t" - "paddq %%xmm1, %%xmm5\n\t" + "psubq %%xmm1, %%xmm5\n\t" ".Lno_carry%=:\n\t" "pshufb %[mask], %%xmm2\n\t" /* xmm2 := be(xmm2) */ @@ -1170,7 +1158,7 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, "pshufb %[mask], %%xmm5\n\t" /* xmm5 := be(xmm5) */ "movdqa %%xmm5, (%[ctr])\n" /* Update CTR. */ - "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ + "movdqa (%[key]), %%xmm1\n\t" /* xmm1 := key[0] */ "pxor %%xmm1, %%xmm0\n\t" /* xmm0 ^= key[0] */ "pxor %%xmm1, %%xmm2\n\t" /* xmm2 ^= key[0] */ "pxor %%xmm1, %%xmm3\n\t" /* xmm3 ^= key[0] */ @@ -1275,7 +1263,7 @@ do_aesni_ctr_4 (const RIJNDAEL_context *ctx, [key] "r" (ctx->keyschenc), [rounds] "g" (ctx->rounds), [mask] "m" (*be_mask) - : "%esi", "%edi", "cc", "memory"); + : "%esi", "cc", "memory"); #undef aesenc_xmm1_xmm0 #undef aesenc_xmm1_xmm2 #undef aesenc_xmm1_xmm3 From jussi.kivilinna at mbnet.fi Thu Nov 29 16:37:09 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 29 Nov 2012 17:37:09 +0200 Subject: [PATCH 1/2] Optimize buffer xoring. Message-ID: <20121129153709.7099.62541.stgit@localhost6.localdomain6> * cipher/Makefile.am (libcipher_la_SOURCES): Add 'bufhelp.h'. * cipher/bufhelp.h: New. * cipher/cipher-aeswrap.c (_gcry_cipher_aeswrap_encrypt) (_gcry_cipher_aeswrap_decrypt): Use 'buf_xor' for buffer xoring. * cipher/cipher-cbc.c (_gcry_cipher_cbc_encrypt) (_gcry_cipher_cbc_decrypt): Use 'buf_xor' for buffer xoring and remove resulting unused variables. * cipher/cipher-cfb.c (_gcry_cipher_cfb_encrypt) Use 'buf_xor_2dst' for buffer xoring and remove resulting unused variables. (_gcry_cipher_cfb_decrypt): Use 'buf_xor_n_copy' for buffer xoring and remove resulting unused variables. * cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Use 'buf_xor' for buffer xoring and remove resulting unused variables. 
* cipher/cipher-ofb.c (_gcry_cipher_ofb_encrypt) (_gcry_cipher_ofb_decrypt): Use 'buf_xor' for buffer xoring and remove resulting used variables. * cipher/rijndael.c (_gry_aes_cfb_enc): Use 'buf_xor_2dst' for buffer xoring and remove resulting unused variables. (_gry_aes_cfb_dev): Use 'buf_xor_n_copy' for buffer xoring and remove resulting unused variables. (_gry_aes_cbc_enc, _gry_aes_ctr_enc, _gry_aes_cbc_dec): Use 'buf_xor' for buffer xoring and remove resulting unused variables. -- Add faster helper functions for buffer xoring and replace byte buffer xor loops. This give following speed up. Note that CTR speed up is from refactoring code to use buf_xor() and removal of integer division/modulo operations issued per each processed byte. This removal of div/mod most likely gives even greater speed increase on CPU architechtures that do not have hardware division unit. Benchmark ratios (old-vs-new, AMD Phenom II, x86-64): ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- IDEA 0.99x 1.01x 1.06x 1.02x 1.03x 1.06x 1.04x 1.02x 1.58x 1.58x 3DES 1.00x 1.00x 1.01x 1.01x 1.02x 1.02x 1.02x 1.01x 1.22x 1.23x CAST5 0.98x 1.00x 1.09x 1.03x 1.09x 1.09x 1.07x 1.07x 1.98x 1.95x BLOWFISH 1.00x 1.00x 1.18x 1.05x 1.07x 1.07x 1.05x 1.05x 1.93x 1.91x AES 1.00x 0.98x 1.18x 1.14x 1.13x 1.13x 1.14x 1.14x 1.18x 1.18x AES192 0.98x 1.00x 1.13x 1.14x 1.13x 1.10x 1.14x 1.16x 1.15x 1.15x AES256 0.97x 1.02x 1.09x 1.13x 1.13x 1.09x 1.10x 1.14x 1.11x 1.13x TWOFISH 1.00x 1.00x 1.15x 1.17x 1.18x 1.16x 1.18x 1.13x 2.37x 2.31x ARCFOUR 1.03x 0.97x DES 1.01x 1.00x 1.04x 1.04x 1.04x 1.05x 1.05x 1.02x 1.56x 1.55x TWOFISH128 0.97x 1.03x 1.18x 1.17x 1.18x 1.15x 1.15x 1.15x 2.37x 2.31x SERPENT128 1.00x 1.00x 1.10x 1.11x 1.08x 1.09x 1.08x 1.06x 1.66x 1.67x SERPENT192 1.00x 1.00x 1.07x 1.08x 1.08x 1.09x 1.08x 1.08x 1.65x 1.66x SERPENT256 1.00x 1.00x 1.09x 1.09x 1.08x 1.09x 1.08x 1.06x 1.66x 1.67x RFC2268_40 1.03x 0.99x 1.05x 1.02x 1.03x 1.03x 1.04x 1.03x 1.46x 1.46x SEED 1.00x 1.00x 1.10x 1.10x 1.09x 1.09x 1.10x 1.07x 1.80x 1.76x CAMELLIA128 1.00x 1.00x 1.23x 1.12x 1.15x 1.17x 1.15x 1.12x 2.15x 2.13x CAMELLIA192 1.05x 1.03x 1.23x 1.21x 1.21x 1.16x 1.12x 1.25x 1.90x 1.90x CAMELLIA256 1.03x 1.07x 1.10x 1.19x 1.08x 1.14x 1.12x 1.10x 1.90x 1.92x Benchmark ratios (old-vs-new, AMD Phenom II, i386): ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- IDEA 1.00x 1.00x 1.04x 1.05x 1.04x 1.02x 1.02x 1.02x 1.38x 1.40x 3DES 1.01x 1.00x 1.02x 1.04x 1.03x 1.01x 1.00x 1.02x 1.20x 1.20x CAST5 1.00x 1.00x 1.03x 1.09x 1.07x 1.04x 1.13x 1.00x 1.74x 1.74x BLOWFISH 1.04x 1.08x 1.03x 1.13x 1.07x 1.12x 1.03x 1.00x 1.78x 1.74x AES 0.96x 1.00x 1.09x 1.08x 1.14x 1.13x 1.07x 1.03x 1.14x 1.09x AES192 1.00x 1.03x 1.07x 1.03x 1.07x 1.07x 1.06x 1.03x 1.08x 1.11x AES256 1.00x 1.00x 1.06x 1.06x 1.10x 1.06x 1.05x 1.03x 1.10x 1.10x TWOFISH 0.95x 1.10x 1.13x 1.23x 1.05x 1.14x 1.09x 1.13x 1.95x 1.86x ARCFOUR 1.00x 1.00x DES 1.02x 0.98x 1.04x 1.04x 1.05x 1.02x 1.04x 1.00x 1.45x 1.48x TWOFISH128 0.95x 1.10x 1.26x 1.19x 1.09x 1.14x 1.17x 1.00x 2.00x 1.91x SERPENT128 1.02x 1.00x 1.08x 1.04x 1.10x 1.06x 1.08x 1.04x 1.42x 1.42x SERPENT192 1.02x 1.02x 1.06x 1.06x 1.10x 1.08x 1.04x 1.06x 1.42x 1.42x SERPENT256 1.02x 0.98x 1.06x 1.06x 1.10x 1.06x 1.04x 1.06x 1.42x 1.40x RFC2268_40 1.00x 1.00x 1.02x 1.06x 1.04x 1.02x 1.02x 1.02x 1.35x 1.35x SEED 1.00x 0.97x 1.11x 1.05x 1.06x 1.08x 1.08x 1.05x 1.56x 1.57x CAMELLIA128 1.03x 0.97x 1.12x 1.14x 1.06x 1.10x 1.06x 1.06x 1.73x 
1.59x CAMELLIA192 1.06x 1.00x 1.13x 1.10x 1.11x 1.11x 1.15x 1.08x 1.57x 1.58x CAMELLIA256 1.06x 1.03x 1.10x 1.10x 1.11x 1.11x 1.13x 1.08x 1.57x 1.62x Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 1 cipher/bufhelp.h | 174 +++++++++++++++++++++++++++++++++++++++++++++++ cipher/cipher-aeswrap.c | 7 +- cipher/cipher-cbc.c | 14 +--- cipher/cipher-cfb.c | 85 ++++++++++------------- cipher/cipher-ctr.c | 42 +++++------ cipher/cipher-ofb.c | 52 ++++++++------ cipher/rijndael.c | 50 +++++--------- 8 files changed, 289 insertions(+), 136 deletions(-) create mode 100644 cipher/bufhelp.h diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 473e3c8..e8050e3 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -40,6 +40,7 @@ cipher-cbc.c cipher-cfb.c cipher-ofb.c cipher-ctr.c cipher-aeswrap.c \ pubkey.c md.c kdf.c \ hmac-tests.c \ bithelp.h \ +bufhelp.h \ primegen.c \ hash-common.c hash-common.h \ rmd.h diff --git a/cipher/bufhelp.h b/cipher/bufhelp.h new file mode 100644 index 0000000..b4e34e4 --- /dev/null +++ b/cipher/bufhelp.h @@ -0,0 +1,174 @@ +/* bufhelp.h - Some buffer manipulation helpers + * Copyright ? 2012 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser general Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA + */ +#ifndef G10_BUFHELP_H +#define G10_BUFHELP_H + +#include + +#if defined(__i386__) || defined(__x86_64__) +/* These architechtures are able of unaligned memory accesses and can + handle those fast. + */ +# define BUFHELP_FAST_UNALIGNED_ACCESS 1 +#endif + + +/* Optimized function for buffer xoring */ +static inline void +buf_xor(void *_dst, const void *_src1, const void *_src2, size_t len) +{ + byte *dst = _dst; + const byte *src1 = _src1; + const byte *src2 = _src2; + long *ldst; + const long *lsrc1, *lsrc2; +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS + const unsigned int longmask = sizeof(long) - 1; + + /* Skip fast processing if alignment of buffers do not match. */ + if ((((intptr_t)dst ^ (intptr_t)src1) | + ((intptr_t)dst ^ (intptr_t)src2)) & longmask) + goto do_bytes; + + /* Handle unaligned head. */ + for (; len && ((intptr_t)dst & longmask); len--) + *dst++ = *src1++ ^ *src2++; +#endif + + ldst = (long *)dst; + lsrc1 = (const long *)src1; + lsrc2 = (const long *)src2; + + for (; len >= sizeof(long); len -= sizeof(long)) + *ldst++ = *lsrc1++ ^ *lsrc2++; + + dst = (byte *)ldst; + src1 = (const byte *)lsrc1; + src2 = (const byte *)lsrc2; + +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS +do_bytes: +#endif + /* Handle tail. */ + for (; len; len--) + *dst++ = *src1++ ^ *src2++; +} + + +/* Optimized function for buffer xoring with two destination buffers. Used + mainly by CFB mode encryption. 
*/ +static inline void +buf_xor_2dst(void *_dst1, void *_dst2, const void *_src, size_t len) +{ + byte *dst1 = _dst1; + byte *dst2 = _dst2; + const byte *src = _src; + long *ldst1, *ldst2; + const long *lsrc; +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS + const unsigned int longmask = sizeof(long) - 1; + + /* Skip fast processing if alignment of buffers do not match. */ + if ((((intptr_t)src ^ (intptr_t)dst1) | + ((intptr_t)src ^ (intptr_t)dst2)) & longmask) + goto do_bytes; + + /* Handle unaligned head. */ + for (; len && ((intptr_t)src & longmask); len--) + *dst1++ = (*dst2++ ^= *src++); +#endif + + ldst1 = (long *)dst1; + ldst2 = (long *)dst2; + lsrc = (const long *)src; + + for (; len >= sizeof(long); len -= sizeof(long)) + *ldst1++ = (*ldst2++ ^= *lsrc++); + + dst1 = (byte *)ldst1; + dst2 = (byte *)ldst2; + src = (const byte *)lsrc; + +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS +do_bytes: +#endif + /* Handle tail. */ + for (; len; len--) + *dst1++ = (*dst2++ ^= *src++); +} + + +/* Optimized function for combined buffer xoring and copying. Used by mainly + CFB mode decryption. */ +static inline void +buf_xor_n_copy(void *_dst_xor, void *_srcdst_cpy, const void *_src, size_t len) +{ + byte *dst_xor = _dst_xor; + byte *srcdst_cpy = _srcdst_cpy; + byte temp; + const byte *src = _src; + long *ldst_xor, *lsrcdst_cpy; + const long *lsrc; + long ltemp; +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS + const unsigned int longmask = sizeof(long) - 1; + + /* Skip fast processing if alignment of buffers do not match. */ + if ((((intptr_t)src ^ (intptr_t)dst_xor) | + ((intptr_t)src ^ (intptr_t)srcdst_cpy)) & longmask) + goto do_bytes; + + /* Handle unaligned head. */ + for (; len && ((intptr_t)src & longmask); len--) + { + temp = *src++; + *dst_xor++ = *srcdst_cpy ^ temp; + *srcdst_cpy++ = temp; + } +#endif + + ldst_xor = (long *)dst_xor; + lsrcdst_cpy = (long *)srcdst_cpy; + lsrc = (const long *)src; + + for (; len >= sizeof(long); len -= sizeof(long)) + { + ltemp = *lsrc++; + *ldst_xor++ = *lsrcdst_cpy ^ ltemp; + *lsrcdst_cpy++ = ltemp; + } + + dst_xor = (byte *)ldst_xor; + srcdst_cpy = (byte *)lsrcdst_cpy; + src = (const byte *)lsrc; + +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS +do_bytes: +#endif + /* Handle tail. 
*/ + for (; len; len--) + { + temp = *src++; + *dst_xor++ = *srcdst_cpy ^ temp; + *srcdst_cpy++ = temp; + } +} + +#endif /*G10_BITHELP_H*/ diff --git a/cipher/cipher-aeswrap.c b/cipher/cipher-aeswrap.c index b559e7f..8e117eb 100644 --- a/cipher/cipher-aeswrap.c +++ b/cipher/cipher-aeswrap.c @@ -26,6 +26,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -95,8 +96,7 @@ _gcry_cipher_aeswrap_encrypt (gcry_cipher_hd_t c, break; } /* A := MSB_64(B) ^ t */ - for (x=0; x < 8; x++) - a[x] = b[x] ^ t[x]; + buf_xor(a, b, t, 8); /* R[i] := LSB_64(B) */ memcpy (r+i*8, b+8, 8); } @@ -161,8 +161,7 @@ _gcry_cipher_aeswrap_decrypt (gcry_cipher_hd_t c, for (i = n; i >= 1; i--) { /* B := AES_k^1( (A ^ t)| R[i] ) */ - for (x = 0; x < 8; x++) - b[x] = a[x] ^ t[x]; + buf_xor(b, a, t, 8); memcpy (b+8, r+(i-1)*8, 8); c->cipher->decrypt (&c->context.c, b, b); /* t := t - 1 */ diff --git a/cipher/cipher-cbc.c b/cipher/cipher-cbc.c index b852589..0d30f63 100644 --- a/cipher/cipher-cbc.c +++ b/cipher/cipher-cbc.c @@ -28,6 +28,7 @@ #include "cipher.h" #include "ath.h" #include "./cipher-internal.h" +#include "bufhelp.h" @@ -68,8 +69,7 @@ _gcry_cipher_cbc_encrypt (gcry_cipher_hd_t c, { for (n=0; n < nblocks; n++ ) { - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - outbuf[i] = inbuf[i] ^ *ivp++; + buf_xor(outbuf, inbuf, c->u_iv.iv, blocksize); c->cipher->encrypt ( &c->context.c, outbuf, outbuf ); memcpy (c->u_iv.iv, outbuf, blocksize ); inbuf += blocksize; @@ -114,7 +114,6 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, const unsigned char *inbuf, unsigned int inbuflen) { unsigned int n; - unsigned char *ivp; int i; size_t blocksize = c->cipher->blocksize; unsigned int nblocks = inbuflen / blocksize; @@ -150,8 +149,7 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, * this here because it is not used otherwise. */ memcpy (c->lastiv, inbuf, blocksize); c->cipher->decrypt ( &c->context.c, outbuf, inbuf ); - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, c->u_iv.iv, blocksize); memcpy(c->u_iv.iv, c->lastiv, blocksize ); inbuf += c->cipher->blocksize; outbuf += c->cipher->blocksize; @@ -171,15 +169,13 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, memcpy (c->u_iv.iv, inbuf + blocksize, restbytes ); /* Save Cn. */ c->cipher->decrypt ( &c->context.c, outbuf, inbuf ); - for (ivp=c->u_iv.iv,i=0; i < restbytes; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, c->u_iv.iv, restbytes); memcpy(outbuf + blocksize, outbuf, restbytes); for(i=restbytes; i < blocksize; i++) c->u_iv.iv[i] = outbuf[i]; c->cipher->decrypt (&c->context.c, outbuf, c->u_iv.iv); - for(ivp=c->lastiv,i=0; i < blocksize; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, c->lastiv, blocksize); /* c->lastiv is now really lastlastiv, does this matter? */ } diff --git a/cipher/cipher-cfb.c b/cipher/cipher-cfb.c index f4152b9..ed84b75 100644 --- a/cipher/cipher-cfb.c +++ b/cipher/cipher-cfb.c @@ -27,6 +27,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -46,10 +47,9 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, { /* Short enough to be encoded by the remaining XOR mask. */ /* XOR the input with the IV and store input into IV. 
*/ - for (ivp=c->u_iv.iv+c->cipher->blocksize - c->unused; - inbuflen; - inbuflen--, c->unused-- ) - *outbuf++ = (*ivp++ ^= *inbuf++); + ivp = c->u_iv.iv + c->cipher->blocksize - c->unused; + buf_xor_2dst(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } @@ -57,8 +57,11 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, { /* XOR the input with the IV and store input into IV */ inbuflen -= c->unused; - for(ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - *outbuf++ = (*ivp++ ^= *inbuf++); + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor_2dst(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. We use a loop as long as we @@ -76,25 +79,25 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, { while ( inbuflen >= blocksize_x_2 ) { - int i; /* Encrypt the IV. */ c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV. */ - for(ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } } if ( inbuflen >= blocksize ) { - int i; /* Save the current IV and then encrypt the IV. */ memcpy( c->lastiv, c->u_iv.iv, blocksize ); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV */ - for(ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } if ( inbuflen ) @@ -105,8 +108,10 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, c->unused = blocksize; /* Apply the XOR. */ c->unused -= inbuflen; - for(ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } @@ -118,8 +123,6 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, const unsigned char *inbuf, unsigned int inbuflen) { unsigned char *ivp; - unsigned long temp; - int i; size_t blocksize = c->cipher->blocksize; size_t blocksize_x_2 = blocksize + blocksize; @@ -130,14 +133,9 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, { /* Short enough to be encoded by the remaining XOR mask. */ /* XOR the input with the IV and store input into IV. */ - for (ivp=c->u_iv.iv+blocksize - c->unused; - inbuflen; - inbuflen--, c->unused--) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor_n_copy(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } @@ -145,12 +143,11 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, { /* XOR the input with the IV and store input into IV. */ inbuflen -= c->unused; - for (ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor_n_copy(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. We use a loop as long as we @@ -171,12 +168,9 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, /* Encrypt the IV. */ c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV. 
*/ - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } } @@ -187,12 +181,9 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, memcpy ( c->lastiv, c->u_iv.iv, blocksize); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV */ - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } @@ -204,12 +195,10 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, c->unused = blocksize; /* Apply the XOR. */ c->unused -= inbuflen; - for (ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } diff --git a/cipher/cipher-ctr.c b/cipher/cipher-ctr.c index a334abc..6bc6ffc 100644 --- a/cipher/cipher-ctr.c +++ b/cipher/cipher-ctr.c @@ -27,6 +27,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -48,11 +49,9 @@ _gcry_cipher_ctr_encrypt (gcry_cipher_hd_t c, { gcry_assert (c->unused < blocksize); i = blocksize - c->unused; - for (n=0; c->unused && n < inbuflen; c->unused--, n++, i++) - { - /* XOR input with encrypted counter and store in output. */ - outbuf[n] = inbuf[n] ^ c->lastiv[i]; - } + n = c->unused > inbuflen ? inbuflen : c->unused; + buf_xor(outbuf, inbuf, &c->lastiv[i], n); + c->unused -= n; inbuf += n; outbuf += n; inbuflen -= n; @@ -75,27 +74,26 @@ _gcry_cipher_ctr_encrypt (gcry_cipher_hd_t c, { unsigned char tmp[MAX_BLOCKSIZE]; - for (n=0; n < inbuflen; n++) - { - if ((n % blocksize) == 0) - { - c->cipher->encrypt (&c->context.c, tmp, c->u_ctr.ctr); + do { + c->cipher->encrypt (&c->context.c, tmp, c->u_ctr.ctr); - for (i = blocksize; i > 0; i--) - { - c->u_ctr.ctr[i-1]++; - if (c->u_ctr.ctr[i-1] != 0) - break; - } - } + for (i = blocksize; i > 0; i--) + { + c->u_ctr.ctr[i-1]++; + if (c->u_ctr.ctr[i-1] != 0) + break; + } - /* XOR input with encrypted counter and store in output. */ - outbuf[n] = inbuf[n] ^ tmp[n % blocksize]; - } + n = blocksize < inbuflen ? blocksize : inbuflen; + buf_xor(outbuf, inbuf, tmp, n); + + inbuflen -= n; + outbuf += n; + inbuf += n; + } while (inbuflen); /* Save the unused bytes of the counter. */ - n %= blocksize; - c->unused = (blocksize - n) % blocksize; + c->unused = blocksize - n; if (c->unused) memcpy (c->lastiv+n, tmp+n, c->unused); diff --git a/cipher/cipher-ofb.c b/cipher/cipher-ofb.c index e5868cd..e194976 100644 --- a/cipher/cipher-ofb.c +++ b/cipher/cipher-ofb.c @@ -27,6 +27,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -45,30 +46,31 @@ _gcry_cipher_ofb_encrypt (gcry_cipher_hd_t c, { /* Short enough to be encoded by the remaining XOR mask. 
*/ /* XOR the input with the IV */ - for (ivp=c->u_iv.iv+c->cipher->blocksize - c->unused; - inbuflen; - inbuflen--, c->unused-- ) - *outbuf++ = (*ivp++ ^ *inbuf++); + ivp = c->u_iv.iv + c->cipher->blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } if( c->unused ) { inbuflen -= c->unused; - for(ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - *outbuf++ = (*ivp++ ^ *inbuf++); + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. */ while ( inbuflen >= blocksize ) { - int i; /* Encrypt the IV (and save the current one). */ memcpy( c->lastiv, c->u_iv.iv, blocksize ); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); - - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = (*ivp++ ^ *inbuf++); + buf_xor(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } if ( inbuflen ) @@ -77,8 +79,10 @@ _gcry_cipher_ofb_encrypt (gcry_cipher_hd_t c, c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); c->unused = blocksize; c->unused -= inbuflen; - for(ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - *outbuf++ = (*ivp++ ^ *inbuf++); + buf_xor(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } @@ -98,27 +102,31 @@ _gcry_cipher_ofb_decrypt (gcry_cipher_hd_t c, if( inbuflen <= c->unused ) { /* Short enough to be encoded by the remaining XOR mask. */ - for (ivp=c->u_iv.iv+blocksize - c->unused; inbuflen; inbuflen--,c->unused--) - *outbuf++ = *ivp++ ^ *inbuf++; + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } if ( c->unused ) { inbuflen -= c->unused; - for (ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - *outbuf++ = *ivp++ ^ *inbuf++; + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. */ while ( inbuflen >= blocksize ) { - int i; /* Encrypt the IV (and save the current one). 
*/ memcpy( c->lastiv, c->u_iv.iv, blocksize ); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = *ivp++ ^ *inbuf++; + buf_xor(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } if ( inbuflen ) @@ -128,8 +136,10 @@ _gcry_cipher_ofb_decrypt (gcry_cipher_hd_t c, c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); c->unused = blocksize; c->unused -= inbuflen; - for (ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - *outbuf++ = *ivp++ ^ *inbuf++; + buf_xor(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 6313ab2..24372d9 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -45,6 +45,7 @@ #include "types.h" /* for byte and u32 typedefs */ #include "g10lib.h" #include "cipher.h" +#include "bufhelp.h" #define MAXKC (256/32) #define MAXROUNDS 14 @@ -1337,8 +1338,6 @@ _gcry_aes_cfb_enc (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - int i; if (0) ; @@ -1351,8 +1350,9 @@ _gcry_aes_cfb_enc (void *context, unsigned char *iv, /* Encrypt the IV. */ do_padlock (ctx, 0, iv, iv); /* XOR the input with the IV and store input into IV. */ - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } #endif /*USE_PADLOCK*/ @@ -1376,8 +1376,9 @@ _gcry_aes_cfb_enc (void *context, unsigned char *iv, /* Encrypt the IV. */ do_encrypt_aligned (ctx, iv, iv); /* XOR the input with the IV and store input into IV. */ - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } @@ -1397,8 +1398,6 @@ _gcry_aes_cbc_enc (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - int i; aesni_prepare (); for ( ;nblocks; nblocks-- ) @@ -1432,8 +1431,7 @@ _gcry_aes_cbc_enc (void *context, unsigned char *iv, #endif /*USE_AESNI*/ else { - for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) - outbuf[i] = inbuf[i] ^ *ivp++; + buf_xor(outbuf, inbuf, iv, BLOCKSIZE); if (0) ; @@ -1470,7 +1468,6 @@ _gcry_aes_ctr_enc (void *context, unsigned char *ctr, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *p; int i; if (0) @@ -1504,8 +1501,9 @@ _gcry_aes_ctr_enc (void *context, unsigned char *ctr, /* Encrypt the counter. */ do_encrypt_aligned (ctx, tmp.x1, ctr); /* XOR the input with the encrypted counter and store in output. */ - for (p=tmp.x1, i=0; i < BLOCKSIZE; i++) - *outbuf++ = (*p++ ^= *inbuf++); + buf_xor(outbuf, tmp.x1, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; /* Increment the counter. 
*/ for (i = BLOCKSIZE; i > 0; i--) { @@ -1694,9 +1692,6 @@ _gcry_aes_cfb_dec (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - unsigned char temp; - int i; if (0) ; @@ -1707,12 +1702,9 @@ _gcry_aes_cfb_dec (void *context, unsigned char *iv, for ( ;nblocks; nblocks-- ) { do_padlock (ctx, 0, iv, iv); - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } #endif /*USE_PADLOCK*/ @@ -1734,12 +1726,9 @@ _gcry_aes_cfb_dec (void *context, unsigned char *iv, for ( ;nblocks; nblocks-- ) { do_encrypt_aligned (ctx, iv, iv); - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } @@ -1759,8 +1748,6 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - int i; unsigned char savebuf[BLOCKSIZE]; if (0) @@ -1871,8 +1858,7 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, else do_decrypt (ctx, outbuf, inbuf); - for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, iv, BLOCKSIZE); memcpy (iv, savebuf, BLOCKSIZE); inbuf += BLOCKSIZE; outbuf += BLOCKSIZE; From jussi.kivilinna at mbnet.fi Thu Nov 29 16:37:14 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 29 Nov 2012 17:37:14 +0200 Subject: [PATCH 2/2] Avoid slow integer multiplication and division with blocksize calculations. In-Reply-To: <20121129153709.7099.62541.stgit@localhost6.localdomain6> References: <20121129153709.7099.62541.stgit@localhost6.localdomain6> Message-ID: <20121129153714.7099.70119.stgit@localhost6.localdomain6> * cipher/cipher-internal.h (gcry_cipher_handle): Add blockmask and blockshift members for blocksize. * cipher/cipher.c (gcry_cipher_open): Precalculate mask and shift from blocksize and add assert for checking that blocksize if power of two. * cipher/cipher-cbc.c (_gcry_cipher_cbc_encrypt) (_gcry_cipher_cbc_decrypt): Use masking/shifting inplace for multiplication/division/modulo for blocksize. * cipher/cipher-cfb.c (_gcry_cipher_cfb_encrypt) (_gcry_cipher_cfb_decrypt): Likewise. * cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Likewise. * configure.ac: Check for function ffs. * src/g10lib.h [!HAVE_FFS] (ffs): New function prototype. * src/missing-string.c [!HAVE_FFS] (ffs): New function. -- Currently all blocksizes are powers of two (and most likely in future), so we can avoid using integer multiplication and division/modulo operations (that are slow on architechtures without hardware units for mul/div/mod). 
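The identity the patch relies on can be checked with a small standalone sketch (not part of the patch; the function name is made up):

#include <assert.h>
#include <strings.h>            /* ffs() */

static void
check_blocksize_identities (unsigned int blocksize, unsigned int inbuflen)
{
  unsigned int blockshift = ffs (blocksize) - 1;
  unsigned int blockmask  = blocksize - 1;

  /* gcry_cipher_open asserts this before computing shift and mask. */
  assert (blocksize && (blocksize & (blocksize - 1)) == 0);

  assert ((inbuflen >> blockshift) == inbuflen / blocksize);
  assert ((inbuflen &  blockmask)  == inbuflen % blocksize);
}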
Signed-off-by: Jussi Kivilinna --- cipher/cipher-cbc.c | 28 ++++++++++++++-------------- cipher/cipher-cfb.c | 16 ++++++++-------- cipher/cipher-ctr.c | 8 ++++---- cipher/cipher-internal.h | 5 +++++ cipher/cipher.c | 14 ++++++++++---- configure.ac | 2 +- src/g10lib.h | 3 +++ src/missing-string.c | 21 +++++++++++++++++++++ 8 files changed, 66 insertions(+), 31 deletions(-) diff --git a/cipher/cipher-cbc.c b/cipher/cipher-cbc.c index 0d30f63..63aa2a9 100644 --- a/cipher/cipher-cbc.c +++ b/cipher/cipher-cbc.c @@ -41,19 +41,19 @@ _gcry_cipher_cbc_encrypt (gcry_cipher_hd_t c, unsigned char *ivp; int i; size_t blocksize = c->cipher->blocksize; - unsigned nblocks = inbuflen / blocksize; + unsigned nblocks = inbuflen >> c->blockshift; if (outbuflen < ((c->flags & GCRY_CIPHER_CBC_MAC)? blocksize : inbuflen)) return GPG_ERR_BUFFER_TOO_SHORT; - if ((inbuflen % c->cipher->blocksize) + if ((inbuflen & c->blockmask) && !(inbuflen > c->cipher->blocksize && (c->flags & GCRY_CIPHER_CBC_CTS))) return GPG_ERR_INV_LENGTH; if ((c->flags & GCRY_CIPHER_CBC_CTS) && inbuflen > blocksize) { - if ((inbuflen % blocksize) == 0) + if ((inbuflen & c->blockmask) == 0) nblocks--; } @@ -61,9 +61,9 @@ _gcry_cipher_cbc_encrypt (gcry_cipher_hd_t c, { c->bulk.cbc_enc (&c->context.c, c->u_iv.iv, outbuf, inbuf, nblocks, (c->flags & GCRY_CIPHER_CBC_MAC)); - inbuf += nblocks * blocksize; + inbuf += nblocks << c->blockshift; if (!(c->flags & GCRY_CIPHER_CBC_MAC)) - outbuf += nblocks * blocksize; + outbuf += nblocks << c->blockshift; } else { @@ -85,10 +85,10 @@ _gcry_cipher_cbc_encrypt (gcry_cipher_hd_t c, int restbytes; unsigned char b; - if ((inbuflen % blocksize) == 0) + if ((inbuflen & c->blockmask) == 0) restbytes = blocksize; else - restbytes = inbuflen % blocksize; + restbytes = inbuflen & c->blockmask; outbuf -= blocksize; for (ivp = c->u_iv.iv, i = 0; i < restbytes; i++) @@ -116,12 +116,12 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, unsigned int n; int i; size_t blocksize = c->cipher->blocksize; - unsigned int nblocks = inbuflen / blocksize; + unsigned int nblocks = inbuflen >> c->blockshift; if (outbuflen < inbuflen) return GPG_ERR_BUFFER_TOO_SHORT; - if ((inbuflen % c->cipher->blocksize) + if ((inbuflen & c->blockmask) && !(inbuflen > c->cipher->blocksize && (c->flags & GCRY_CIPHER_CBC_CTS))) return GPG_ERR_INV_LENGTH; @@ -129,7 +129,7 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, if ((c->flags & GCRY_CIPHER_CBC_CTS) && inbuflen > blocksize) { nblocks--; - if ((inbuflen % blocksize) == 0) + if ((inbuflen & c->blockmask) == 0) nblocks--; memcpy (c->lastiv, c->u_iv.iv, blocksize); } @@ -137,8 +137,8 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, if (c->bulk.cbc_dec) { c->bulk.cbc_dec (&c->context.c, c->u_iv.iv, outbuf, inbuf, nblocks); - inbuf += nblocks * blocksize; - outbuf += nblocks * blocksize; + inbuf += nblocks << c->blockshift; + outbuf += nblocks << c->blockshift; } else { @@ -160,10 +160,10 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, { int restbytes; - if ((inbuflen % blocksize) == 0) + if ((inbuflen & c->blockmask) == 0) restbytes = blocksize; else - restbytes = inbuflen % blocksize; + restbytes = inbuflen & c->blockmask; memcpy (c->lastiv, c->u_iv.iv, blocksize ); /* Save Cn-2. */ memcpy (c->u_iv.iv, inbuf + blocksize, restbytes ); /* Save Cn. 
*/ diff --git a/cipher/cipher-cfb.c b/cipher/cipher-cfb.c index ed84b75..7da80fc 100644 --- a/cipher/cipher-cfb.c +++ b/cipher/cipher-cfb.c @@ -69,11 +69,11 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, also allows to use a bulk encryption function if available. */ if (inbuflen >= blocksize_x_2 && c->bulk.cfb_enc) { - unsigned int nblocks = inbuflen / blocksize; + unsigned int nblocks = inbuflen >> c->blockshift; c->bulk.cfb_enc (&c->context.c, c->u_iv.iv, outbuf, inbuf, nblocks); - outbuf += nblocks * blocksize; - inbuf += nblocks * blocksize; - inbuflen -= nblocks * blocksize; + outbuf += nblocks << c->blockshift; + inbuf += nblocks << c->blockshift; + inbuflen -= nblocks << c->blockshift; } else { @@ -155,11 +155,11 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, also allows to use a bulk encryption function if available. */ if (inbuflen >= blocksize_x_2 && c->bulk.cfb_dec) { - unsigned int nblocks = inbuflen / blocksize; + unsigned int nblocks = inbuflen >> c->blockshift; c->bulk.cfb_dec (&c->context.c, c->u_iv.iv, outbuf, inbuf, nblocks); - outbuf += nblocks * blocksize; - inbuf += nblocks * blocksize; - inbuflen -= nblocks * blocksize; + outbuf += nblocks << c->blockshift; + inbuf += nblocks << c->blockshift; + inbuflen -= nblocks << c->blockshift; } else { diff --git a/cipher/cipher-ctr.c b/cipher/cipher-ctr.c index 6bc6ffc..96ec7a6 100644 --- a/cipher/cipher-ctr.c +++ b/cipher/cipher-ctr.c @@ -59,13 +59,13 @@ _gcry_cipher_ctr_encrypt (gcry_cipher_hd_t c, /* Use a bulk method if available. */ - nblocks = inbuflen / blocksize; + nblocks = inbuflen >> c->blockshift; if (nblocks && c->bulk.ctr_enc) { c->bulk.ctr_enc (&c->context.c, c->u_ctr.ctr, outbuf, inbuf, nblocks); - inbuf += nblocks * blocksize; - outbuf += nblocks * blocksize; - inbuflen -= nblocks * blocksize; + inbuf += nblocks << c->blockshift; + outbuf += nblocks << c->blockshift; + inbuflen -= nblocks << c->blockshift; } /* If we don't have a bulk method use the standard method. We also diff --git a/cipher/cipher-internal.h b/cipher/cipher-internal.h index 025bf2e..b9f77c3 100644 --- a/cipher/cipher-internal.h +++ b/cipher/cipher-internal.h @@ -68,6 +68,11 @@ struct gcry_cipher_handle interface does not easily allow to retrieve this value. */ int algo; + /* To avoid slow integer multiplications and divisions, use precalculated + blocksize shift and mask. */ + unsigned int blockshift; + unsigned int blockmask; + /* A structure with function pointers for bulk operations. Due to limitations of the module system (we don't want to change the API) we need to keep these function pointers here. The cipher diff --git a/cipher/cipher.c b/cipher/cipher.c index 389bf7a..b9e5841 100644 --- a/cipher/cipher.c +++ b/cipher/cipher.c @@ -704,6 +704,11 @@ gcry_cipher_open (gcry_cipher_hd_t *handle, h->mode = mode; h->flags = flags; + /* Blocksize should be power of two. */ + gcry_assert((cipher->blocksize & (cipher->blocksize - 1)) == 0); + h->blockshift = ffs(cipher->blocksize) - 1; + h->blockmask = cipher->blocksize - 1; + /* Setup bulk encryption routines. 
*/ switch (algo) { @@ -853,10 +858,10 @@ do_ecb_encrypt (gcry_cipher_hd_t c, if (outbuflen < inbuflen) return GPG_ERR_BUFFER_TOO_SHORT; - if ((inbuflen % blocksize)) + if ((inbuflen & c->blockmask)) return GPG_ERR_INV_LENGTH; - nblocks = inbuflen / c->cipher->blocksize; + nblocks = inbuflen >> c->blockshift; for (n=0; n < nblocks; n++ ) { @@ -877,9 +882,10 @@ do_ecb_decrypt (gcry_cipher_hd_t c, if (outbuflen < inbuflen) return GPG_ERR_BUFFER_TOO_SHORT; - if ((inbuflen % blocksize)) + if ((inbuflen & c->blockmask)) return GPG_ERR_INV_LENGTH; - nblocks = inbuflen / c->cipher->blocksize; + + nblocks = inbuflen >> c->blockshift; for (n=0; n < nblocks; n++ ) { diff --git a/configure.ac b/configure.ac index a2235a8..78e0a0f 100644 --- a/configure.ac +++ b/configure.ac @@ -810,7 +810,7 @@ fi AC_FUNC_VPRINTF # We have replacements for these in src/missing-string.c -AC_CHECK_FUNCS(stpcpy strcasecmp) +AC_CHECK_FUNCS(stpcpy strcasecmp ffs) # We have replacements for these in src/g10lib.h AC_CHECK_FUNCS(strtoul memmove stricmp atexit raise) # Other checks diff --git a/src/g10lib.h b/src/g10lib.h index f1af399..81c3050 100644 --- a/src/g10lib.h +++ b/src/g10lib.h @@ -206,6 +206,9 @@ char *stpcpy (char *a, const char *b); #ifndef HAVE_STRCASECMP int strcasecmp (const char *a, const char *b) _GCRY_GCC_ATTR_PURE; #endif +#ifndef HAVE_FFS +int ffs(int val) _GCRY_GCC_ATTR_PURE; +#endif #include "../compat/libcompat.h" diff --git a/src/missing-string.c b/src/missing-string.c index 4756c00..9196fa3 100644 --- a/src/missing-string.c +++ b/src/missing-string.c @@ -52,3 +52,24 @@ strcasecmp( const char *a, const char *b ) return *(const byte*)a - *(const byte*)b; } #endif + + +#ifndef HAVE_FFS +/* Find first set bit */ +int +ffs(int val) +{ + unsigned int bits = val; + int i; + + for (i = 0; i < sizeof(bits) * 8; i++) + { + if (bits & 1) + return i + 1; + + bits >>= 1; + } + + return 0; +} +#endif From wk at gnupg.org Thu Nov 29 18:50:48 2012 From: wk at gnupg.org (Werner Koch) Date: Thu, 29 Nov 2012 18:50:48 +0100 Subject: [PATCH 1/2] Optimize buffer xoring. In-Reply-To: <20121129153709.7099.62541.stgit@localhost6.localdomain6> (Jussi Kivilinna's message of "Thu, 29 Nov 2012 17:37:09 +0200") References: <20121129153709.7099.62541.stgit@localhost6.localdomain6> Message-ID: <87pq2w70xz.fsf@vigenere.g10code.de> On Thu, 29 Nov 2012 16:37, jussi.kivilinna at mbnet.fi said: > +#ifndef G10_BUFHELP_H > +#define G10_BUFHELP_H > + > +#include You may only use stdint.h if it is available; see for example sha1.c. We are targeting C90 and not C99. > +#if defined(__i386__) || defined(__x86_64__) > +/* These architechtures are able of unaligned memory accesses and can > + handle those fast. > + */ Really? All of them? > + ldst = (long *)dst; > + lsrc1 = (const long *)src1; > + lsrc2 = (const long *)src2; You assume sizeof(long)<=sizeof(void*) - that should be okay. However this is only optimal on Unix systems where sizeof(long)== sizeof(void*); Windows 64 bit has sizeof(long long)==sizeof(void*). Thus tehre should be some space for further improvement. Shalom-Salam, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz. From wk at gnupg.org Thu Nov 29 19:00:11 2012 From: wk at gnupg.org (Werner Koch) Date: Thu, 29 Nov 2012 19:00:11 +0100 Subject: [PATCH 2/2] Avoid slow integer multiplication and division with blocksize calculations. 
In-Reply-To: <20121129153714.7099.70119.stgit@localhost6.localdomain6> (Jussi Kivilinna's message of "Thu, 29 Nov 2012 17:37:14 +0200")
References: <20121129153709.7099.62541.stgit@localhost6.localdomain6> <20121129153714.7099.70119.stgit@localhost6.localdomain6>
Message-ID: <87lidk70ic.fsf@vigenere.g10code.de>

On Thu, 29 Nov 2012 16:37, jussi.kivilinna at mbnet.fi said:

> Currently all blocksizes are powers of two (and most likely in future), so we
> can avoid using integer multiplication and division/modulo operations (that
> are slow on architechtures without hardware units for mul/div/mod).

Do you have some of your cool benchmarks?

Salam-Shalom,

   Werner

--
Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.

From wk at gnupg.org  Thu Nov 29 19:07:24 2012
From: wk at gnupg.org (Werner Koch)
Date: Thu, 29 Nov 2012 19:07:24 +0100
Subject: [PATCH] Optimize AES-NI CTR mode.
In-Reply-To: <20121129153103.6234.78914.stgit@localhost6.localdomain6> (Jussi Kivilinna's message of "Thu, 29 Nov 2012 17:31:03 +0200")
References: <20121129153103.6234.78914.stgit@localhost6.localdomain6>
Message-ID: <87fw3s706b.fsf@vigenere.g10code.de>

On Thu, 29 Nov 2012 16:31, jussi.kivilinna at mbnet.fi said:

> * cipher/rijndael.c [USE_AESNI] (do_aesni_ctr, do_aesni_ctr_4): Make
> handling of 64-bit overflow and carry conditional. Avoid generic to
> vector register passing of value '1'. Generate and use '-1' instead.

Pushed. Thanks.

Shalom-Salam,

   Werner

--
Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.

From jussi.kivilinna at mbnet.fi  Thu Nov 29 19:25:41 2012
From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna)
Date: Thu, 29 Nov 2012 20:25:41 +0200
Subject: [PATCH 1/2] Optimize buffer xoring.
In-Reply-To: <87pq2w70xz.fsf@vigenere.g10code.de>
References: <20121129153709.7099.62541.stgit@localhost6.localdomain6> <87pq2w70xz.fsf@vigenere.g10code.de>
Message-ID: <20121129202541.58309c8hl7b0z0dc@www.dalek.fi>

Quoting Werner Koch :

> On Thu, 29 Nov 2012 16:37, jussi.kivilinna at mbnet.fi said:
>
>> +#ifndef G10_BUFHELP_H
>> +#define G10_BUFHELP_H
>> +
>> +#include
>
> You may only use stdint.h if it is available; see for example sha1.c.
> We are targeting C90 and not C99.

Ok.

>
>> +#if defined(__i386__) || defined(__x86_64__)
>> +/* These architechtures are able of unaligned memory accesses and can
>> +   handle those fast.
>> + */
>
> Really? All of them?
>

Well, based on my own experience, and on the fact that in the Linux kernel
x86 is one of the two architectures that have the config flag
"HAVE_EFFICIENT_UNALIGNED_ACCESS" always enabled (the other one being
PowerPC).

>> +  ldst = (long *)dst;
>> +  lsrc1 = (const long *)src1;
>> +  lsrc2 = (const long *)src2;
>
> You assume sizeof(long)<=sizeof(void*) - that should be okay. However
> this is only optimal on Unix systems where sizeof(long)== sizeof(void*);
> Windows 64 bit has sizeof(long long)==sizeof(void*). Thus tehre should
> be some space for further improvement.

I'll make a v2 of this patch, with stdint.h/intptr_t corrected and here
use 'uintptr_t' instead of 'long'.

-Jussi

>
>
> Shalom-Salam,
>
>    Werner
>
> --
> Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.
>
>
>

From jussi.kivilinna at mbnet.fi  Thu Nov 29 20:09:03 2012
From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna)
Date: Thu, 29 Nov 2012 21:09:03 +0200
Subject: [PATCH 2/2] Avoid slow integer multiplication and division with blocksize calculations.
In-Reply-To: <87lidk70ic.fsf@vigenere.g10code.de>
References: <20121129153709.7099.62541.stgit@localhost6.localdomain6> <20121129153714.7099.70119.stgit@localhost6.localdomain6> <87lidk70ic.fsf@vigenere.g10code.de>
Message-ID: <20121129210903.10921uoekjjc3b28@www.dalek.fi>

Quoting Werner Koch :

> On Thu, 29 Nov 2012 16:37, jussi.kivilinna at mbnet.fi said:
>
>> Currently all blocksizes are powers of two (and most likely in
>> future), so we
>> can avoid using integer multiplication and division/modulo operations (that
>> are slow on architechtures without hardware units for mul/div/mod).
>
> Do you have some of your cool benchmarks?
>

Well, I currently only have access to x86 machines, and there this didn't
have an easily measurable effect.

However, if I leave the 'buffer xor' patch out and change the cipher-ctr.c
loop to use '& c->blockmask' instead of '% blocksize', AMD Phenom II/x86-64
sees the following improvement:

Before (% blocksize):

Running each test 20 times.
                 CTR
              ---------------
 IDEA           490ms   500ms
 3DES          1040ms  1010ms
 CAST5          370ms   370ms
 BLOWFISH       390ms   380ms
 AES            160ms   170ms
 AES192         190ms   190ms
 AES256         220ms   220ms
 TWOFISH        310ms   310ms
 DES            520ms   530ms
 TWOFISH128     320ms   320ms
 SERPENT128     460ms   460ms
 SERPENT192     460ms   460ms
 SERPENT256     440ms   450ms
 RFC2268_40     600ms   590ms
 SEED           400ms   410ms
 CAMELLIA128    340ms   350ms
 CAMELLIA192    380ms   370ms
 CAMELLIA256    380ms   380ms

After (& c->blockmask):

Running each test 20 times.
                 CTR
              ---------------
 IDEA           350ms   350ms
 3DES           850ms   840ms
 CAST5          220ms   220ms
 BLOWFISH       220ms   220ms
 AES            160ms   160ms
 AES192         180ms   190ms
 AES256         220ms   210ms
 TWOFISH        170ms   160ms
 DES            370ms   370ms
 TWOFISH128     160ms   170ms
 SERPENT128     300ms   310ms
 SERPENT192     300ms   300ms
 SERPENT256     310ms   300ms
 RFC2268_40     430ms   420ms
 SEED           270ms   250ms
 CAMELLIA128    200ms   190ms
 CAMELLIA192    240ms   230ms
 CAMELLIA256    230ms   230ms

With 'buffer xor':

Running each test 20 times.
                 CTR
              ---------------
 IDEA           310ms   320ms
 3DES           810ms   850ms
 CAST5          190ms   200ms
 BLOWFISH       190ms   200ms
 AES            140ms   130ms
 AES192         160ms   160ms
 AES256         190ms   190ms
 TWOFISH        140ms   140ms
 DES            320ms   340ms
 TWOFISH128     130ms   130ms
 SERPENT128     280ms   280ms
 SERPENT192     270ms   270ms
 SERPENT256     280ms   270ms
 RFC2268_40     380ms   390ms
 SEED           230ms   220ms
 CAMELLIA128    170ms   160ms
 CAMELLIA192    200ms   200ms
 CAMELLIA256    200ms   210ms

-Jussi

>
> Salam-Shalom,
>
>    Werner
>
> --
> Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.
>
>
>

From jussi.kivilinna at mbnet.fi  Thu Nov 29 20:54:57 2012
From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna)
Date: Thu, 29 Nov 2012 21:54:57 +0200
Subject: [PATCH] [v2] Optimize buffer xoring.
Message-ID: <20121129195457.26652.27108.stgit@localhost6.localdomain6>

* cipher/Makefile.am (libcipher_la_SOURCES): Add 'bufhelp.h'.
* cipher/bufhelp.h: New.
* cipher/cipher-aeswrap.c (_gcry_cipher_aeswrap_encrypt)
(_gcry_cipher_aeswrap_decrypt): Use 'buf_xor' for buffer xoring.
* cipher/cipher-cbc.c (_gcry_cipher_cbc_encrypt)
(_gcry_cipher_cbc_decrypt): Use 'buf_xor' for buffer xoring and remove
resulting unused variables.
* cipher/cipher-cfb.c (_gcry_cipher_cfb_encrypt): Use 'buf_xor_2dst' for
buffer xoring and remove resulting unused variables.
(_gcry_cipher_cfb_decrypt): Use 'buf_xor_n_copy' for buffer xoring and
remove resulting unused variables.
* cipher/cipher-ctr.c (_gcry_cipher_ctr_encrypt): Use 'buf_xor' for
buffer xoring and remove resulting unused variables.
* cipher/cipher-ofb.c (_gcry_cipher_ofb_encrypt)
(_gcry_cipher_ofb_decrypt): Use 'buf_xor' for buffer xoring and remove
resulting unused variables.
* cipher/rijndael.c (_gry_aes_cfb_enc): Use 'buf_xor_2dst' for buffer xoring and remove resulting unused variables. (_gry_aes_cfb_dev): Use 'buf_xor_n_copy' for buffer xoring and remove resulting unused variables. (_gry_aes_cbc_enc, _gry_aes_ctr_enc, _gry_aes_cbc_dec): Use 'buf_xor' for buffer xoring and remove resulting unused variables. -- Add faster helper functions for buffer xoring and replace byte buffer xor loops. This give following speed up. Note that CTR speed up is from refactoring code to use buf_xor() and removal of integer division/modulo operations issued per each processed byte. This removal of div/mod most likely gives even greater speed increase on CPU architechtures that do not have hardware division unit. Benchmark ratios (old-vs-new, AMD Phenom II, x86-64): ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- IDEA 0.99x 1.01x 1.06x 1.02x 1.03x 1.06x 1.04x 1.02x 1.58x 1.58x 3DES 1.00x 1.00x 1.01x 1.01x 1.02x 1.02x 1.02x 1.01x 1.22x 1.23x CAST5 0.98x 1.00x 1.09x 1.03x 1.09x 1.09x 1.07x 1.07x 1.98x 1.95x BLOWFISH 1.00x 1.00x 1.18x 1.05x 1.07x 1.07x 1.05x 1.05x 1.93x 1.91x AES 1.00x 0.98x 1.18x 1.14x 1.13x 1.13x 1.14x 1.14x 1.18x 1.18x AES192 0.98x 1.00x 1.13x 1.14x 1.13x 1.10x 1.14x 1.16x 1.15x 1.15x AES256 0.97x 1.02x 1.09x 1.13x 1.13x 1.09x 1.10x 1.14x 1.11x 1.13x TWOFISH 1.00x 1.00x 1.15x 1.17x 1.18x 1.16x 1.18x 1.13x 2.37x 2.31x ARCFOUR 1.03x 0.97x DES 1.01x 1.00x 1.04x 1.04x 1.04x 1.05x 1.05x 1.02x 1.56x 1.55x TWOFISH128 0.97x 1.03x 1.18x 1.17x 1.18x 1.15x 1.15x 1.15x 2.37x 2.31x SERPENT128 1.00x 1.00x 1.10x 1.11x 1.08x 1.09x 1.08x 1.06x 1.66x 1.67x SERPENT192 1.00x 1.00x 1.07x 1.08x 1.08x 1.09x 1.08x 1.08x 1.65x 1.66x SERPENT256 1.00x 1.00x 1.09x 1.09x 1.08x 1.09x 1.08x 1.06x 1.66x 1.67x RFC2268_40 1.03x 0.99x 1.05x 1.02x 1.03x 1.03x 1.04x 1.03x 1.46x 1.46x SEED 1.00x 1.00x 1.10x 1.10x 1.09x 1.09x 1.10x 1.07x 1.80x 1.76x CAMELLIA128 1.00x 1.00x 1.23x 1.12x 1.15x 1.17x 1.15x 1.12x 2.15x 2.13x CAMELLIA192 1.05x 1.03x 1.23x 1.21x 1.21x 1.16x 1.12x 1.25x 1.90x 1.90x CAMELLIA256 1.03x 1.07x 1.10x 1.19x 1.08x 1.14x 1.12x 1.10x 1.90x 1.92x Benchmark ratios (old-vs-new, AMD Phenom II, i386): ECB/Stream CBC CFB OFB CTR --------------- --------------- --------------- --------------- --------------- IDEA 1.00x 1.00x 1.04x 1.05x 1.04x 1.02x 1.02x 1.02x 1.38x 1.40x 3DES 1.01x 1.00x 1.02x 1.04x 1.03x 1.01x 1.00x 1.02x 1.20x 1.20x CAST5 1.00x 1.00x 1.03x 1.09x 1.07x 1.04x 1.13x 1.00x 1.74x 1.74x BLOWFISH 1.04x 1.08x 1.03x 1.13x 1.07x 1.12x 1.03x 1.00x 1.78x 1.74x AES 0.96x 1.00x 1.09x 1.08x 1.14x 1.13x 1.07x 1.03x 1.14x 1.09x AES192 1.00x 1.03x 1.07x 1.03x 1.07x 1.07x 1.06x 1.03x 1.08x 1.11x AES256 1.00x 1.00x 1.06x 1.06x 1.10x 1.06x 1.05x 1.03x 1.10x 1.10x TWOFISH 0.95x 1.10x 1.13x 1.23x 1.05x 1.14x 1.09x 1.13x 1.95x 1.86x ARCFOUR 1.00x 1.00x DES 1.02x 0.98x 1.04x 1.04x 1.05x 1.02x 1.04x 1.00x 1.45x 1.48x TWOFISH128 0.95x 1.10x 1.26x 1.19x 1.09x 1.14x 1.17x 1.00x 2.00x 1.91x SERPENT128 1.02x 1.00x 1.08x 1.04x 1.10x 1.06x 1.08x 1.04x 1.42x 1.42x SERPENT192 1.02x 1.02x 1.06x 1.06x 1.10x 1.08x 1.04x 1.06x 1.42x 1.42x SERPENT256 1.02x 0.98x 1.06x 1.06x 1.10x 1.06x 1.04x 1.06x 1.42x 1.40x RFC2268_40 1.00x 1.00x 1.02x 1.06x 1.04x 1.02x 1.02x 1.02x 1.35x 1.35x SEED 1.00x 0.97x 1.11x 1.05x 1.06x 1.08x 1.08x 1.05x 1.56x 1.57x CAMELLIA128 1.03x 0.97x 1.12x 1.14x 1.06x 1.10x 1.06x 1.06x 1.73x 1.59x CAMELLIA192 1.06x 1.00x 1.13x 1.10x 1.11x 1.11x 1.15x 1.08x 1.57x 1.58x CAMELLIA256 1.06x 1.03x 1.10x 1.10x 1.11x 1.11x 1.13x 1.08x 1.57x 
1.62x [v2]: - include stdint.h only when it's available - use uintptr_t instead of long and intptr_t Signed-off-by: Jussi Kivilinna --- cipher/Makefile.am | 1 cipher/bufhelp.h | 179 +++++++++++++++++++++++++++++++++++++++++++++++ cipher/cipher-aeswrap.c | 7 +- cipher/cipher-cbc.c | 14 +--- cipher/cipher-cfb.c | 85 ++++++++++------------ cipher/cipher-ctr.c | 42 +++++------ cipher/cipher-ofb.c | 52 ++++++++------ cipher/rijndael.c | 50 +++++-------- 8 files changed, 294 insertions(+), 136 deletions(-) create mode 100644 cipher/bufhelp.h diff --git a/cipher/Makefile.am b/cipher/Makefile.am index 473e3c8..e8050e3 100644 --- a/cipher/Makefile.am +++ b/cipher/Makefile.am @@ -40,6 +40,7 @@ cipher-cbc.c cipher-cfb.c cipher-ofb.c cipher-ctr.c cipher-aeswrap.c \ pubkey.c md.c kdf.c \ hmac-tests.c \ bithelp.h \ +bufhelp.h \ primegen.c \ hash-common.c hash-common.h \ rmd.h diff --git a/cipher/bufhelp.h b/cipher/bufhelp.h new file mode 100644 index 0000000..a3be24a --- /dev/null +++ b/cipher/bufhelp.h @@ -0,0 +1,179 @@ +/* bufhelp.h - Some buffer manipulation helpers + * Copyright ? 2012 Jussi Kivilinna + * + * This file is part of Libgcrypt. + * + * Libgcrypt is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser general Public License as + * published by the Free Software Foundation; either version 2.1 of + * the License, or (at your option) any later version. + * + * Libgcrypt is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA + */ +#ifndef G10_BUFHELP_H +#define G10_BUFHELP_H + +#ifdef HAVE_STDINT_H +# include /* uintptr_t */ +#else +/* In this case, uintptr_t is provided by config.h. */ +#endif + + +#if defined(__i386__) || defined(__x86_64__) +/* These architechtures are able of unaligned memory accesses and can + handle those fast. + */ +# define BUFHELP_FAST_UNALIGNED_ACCESS 1 +#endif + + +/* Optimized function for buffer xoring */ +static inline void +buf_xor(void *_dst, const void *_src1, const void *_src2, size_t len) +{ + byte *dst = _dst; + const byte *src1 = _src1; + const byte *src2 = _src2; + uintptr_t *ldst; + const uintptr_t *lsrc1, *lsrc2; +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS + const unsigned int longmask = sizeof(uintptr_t) - 1; + + /* Skip fast processing if alignment of buffers do not match. */ + if ((((uintptr_t)dst ^ (uintptr_t)src1) | + ((uintptr_t)dst ^ (uintptr_t)src2)) & longmask) + goto do_bytes; + + /* Handle unaligned head. */ + for (; len && ((uintptr_t)dst & longmask); len--) + *dst++ = *src1++ ^ *src2++; +#endif + + ldst = (uintptr_t *)dst; + lsrc1 = (const uintptr_t *)src1; + lsrc2 = (const uintptr_t *)src2; + + for (; len >= sizeof(uintptr_t); len -= sizeof(uintptr_t)) + *ldst++ = *lsrc1++ ^ *lsrc2++; + + dst = (byte *)ldst; + src1 = (const byte *)lsrc1; + src2 = (const byte *)lsrc2; + +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS +do_bytes: +#endif + /* Handle tail. */ + for (; len; len--) + *dst++ = *src1++ ^ *src2++; +} + + +/* Optimized function for buffer xoring with two destination buffers. Used + mainly by CFB mode encryption. 
*/ +static inline void +buf_xor_2dst(void *_dst1, void *_dst2, const void *_src, size_t len) +{ + byte *dst1 = _dst1; + byte *dst2 = _dst2; + const byte *src = _src; + uintptr_t *ldst1, *ldst2; + const uintptr_t *lsrc; +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS + const unsigned int longmask = sizeof(uintptr_t) - 1; + + /* Skip fast processing if alignment of buffers do not match. */ + if ((((uintptr_t)src ^ (uintptr_t)dst1) | + ((uintptr_t)src ^ (uintptr_t)dst2)) & longmask) + goto do_bytes; + + /* Handle unaligned head. */ + for (; len && ((uintptr_t)src & longmask); len--) + *dst1++ = (*dst2++ ^= *src++); +#endif + + ldst1 = (uintptr_t *)dst1; + ldst2 = (uintptr_t *)dst2; + lsrc = (const uintptr_t *)src; + + for (; len >= sizeof(uintptr_t); len -= sizeof(uintptr_t)) + *ldst1++ = (*ldst2++ ^= *lsrc++); + + dst1 = (byte *)ldst1; + dst2 = (byte *)ldst2; + src = (const byte *)lsrc; + +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS +do_bytes: +#endif + /* Handle tail. */ + for (; len; len--) + *dst1++ = (*dst2++ ^= *src++); +} + + +/* Optimized function for combined buffer xoring and copying. Used by mainly + CFB mode decryption. */ +static inline void +buf_xor_n_copy(void *_dst_xor, void *_srcdst_cpy, const void *_src, size_t len) +{ + byte *dst_xor = _dst_xor; + byte *srcdst_cpy = _srcdst_cpy; + byte temp; + const byte *src = _src; + uintptr_t *ldst_xor, *lsrcdst_cpy; + const uintptr_t *lsrc; + uintptr_t ltemp; +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS + const unsigned int longmask = sizeof(uintptr_t) - 1; + + /* Skip fast processing if alignment of buffers do not match. */ + if ((((uintptr_t)src ^ (uintptr_t)dst_xor) | + ((uintptr_t)src ^ (uintptr_t)srcdst_cpy)) & longmask) + goto do_bytes; + + /* Handle unaligned head. */ + for (; len && ((uintptr_t)src & longmask); len--) + { + temp = *src++; + *dst_xor++ = *srcdst_cpy ^ temp; + *srcdst_cpy++ = temp; + } +#endif + + ldst_xor = (uintptr_t *)dst_xor; + lsrcdst_cpy = (uintptr_t *)srcdst_cpy; + lsrc = (const uintptr_t *)src; + + for (; len >= sizeof(uintptr_t); len -= sizeof(uintptr_t)) + { + ltemp = *lsrc++; + *ldst_xor++ = *lsrcdst_cpy ^ ltemp; + *lsrcdst_cpy++ = ltemp; + } + + dst_xor = (byte *)ldst_xor; + srcdst_cpy = (byte *)lsrcdst_cpy; + src = (const byte *)lsrc; + +#ifndef BUFHELP_FAST_UNALIGNED_ACCESS +do_bytes: +#endif + /* Handle tail. 
*/ + for (; len; len--) + { + temp = *src++; + *dst_xor++ = *srcdst_cpy ^ temp; + *srcdst_cpy++ = temp; + } +} + +#endif /*G10_BITHELP_H*/ diff --git a/cipher/cipher-aeswrap.c b/cipher/cipher-aeswrap.c index b559e7f..8e117eb 100644 --- a/cipher/cipher-aeswrap.c +++ b/cipher/cipher-aeswrap.c @@ -26,6 +26,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -95,8 +96,7 @@ _gcry_cipher_aeswrap_encrypt (gcry_cipher_hd_t c, break; } /* A := MSB_64(B) ^ t */ - for (x=0; x < 8; x++) - a[x] = b[x] ^ t[x]; + buf_xor(a, b, t, 8); /* R[i] := LSB_64(B) */ memcpy (r+i*8, b+8, 8); } @@ -161,8 +161,7 @@ _gcry_cipher_aeswrap_decrypt (gcry_cipher_hd_t c, for (i = n; i >= 1; i--) { /* B := AES_k^1( (A ^ t)| R[i] ) */ - for (x = 0; x < 8; x++) - b[x] = a[x] ^ t[x]; + buf_xor(b, a, t, 8); memcpy (b+8, r+(i-1)*8, 8); c->cipher->decrypt (&c->context.c, b, b); /* t := t - 1 */ diff --git a/cipher/cipher-cbc.c b/cipher/cipher-cbc.c index b852589..0d30f63 100644 --- a/cipher/cipher-cbc.c +++ b/cipher/cipher-cbc.c @@ -28,6 +28,7 @@ #include "cipher.h" #include "ath.h" #include "./cipher-internal.h" +#include "bufhelp.h" @@ -68,8 +69,7 @@ _gcry_cipher_cbc_encrypt (gcry_cipher_hd_t c, { for (n=0; n < nblocks; n++ ) { - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - outbuf[i] = inbuf[i] ^ *ivp++; + buf_xor(outbuf, inbuf, c->u_iv.iv, blocksize); c->cipher->encrypt ( &c->context.c, outbuf, outbuf ); memcpy (c->u_iv.iv, outbuf, blocksize ); inbuf += blocksize; @@ -114,7 +114,6 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, const unsigned char *inbuf, unsigned int inbuflen) { unsigned int n; - unsigned char *ivp; int i; size_t blocksize = c->cipher->blocksize; unsigned int nblocks = inbuflen / blocksize; @@ -150,8 +149,7 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, * this here because it is not used otherwise. */ memcpy (c->lastiv, inbuf, blocksize); c->cipher->decrypt ( &c->context.c, outbuf, inbuf ); - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, c->u_iv.iv, blocksize); memcpy(c->u_iv.iv, c->lastiv, blocksize ); inbuf += c->cipher->blocksize; outbuf += c->cipher->blocksize; @@ -171,15 +169,13 @@ _gcry_cipher_cbc_decrypt (gcry_cipher_hd_t c, memcpy (c->u_iv.iv, inbuf + blocksize, restbytes ); /* Save Cn. */ c->cipher->decrypt ( &c->context.c, outbuf, inbuf ); - for (ivp=c->u_iv.iv,i=0; i < restbytes; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, c->u_iv.iv, restbytes); memcpy(outbuf + blocksize, outbuf, restbytes); for(i=restbytes; i < blocksize; i++) c->u_iv.iv[i] = outbuf[i]; c->cipher->decrypt (&c->context.c, outbuf, c->u_iv.iv); - for(ivp=c->lastiv,i=0; i < blocksize; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, c->lastiv, blocksize); /* c->lastiv is now really lastlastiv, does this matter? */ } diff --git a/cipher/cipher-cfb.c b/cipher/cipher-cfb.c index f4152b9..ed84b75 100644 --- a/cipher/cipher-cfb.c +++ b/cipher/cipher-cfb.c @@ -27,6 +27,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -46,10 +47,9 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, { /* Short enough to be encoded by the remaining XOR mask. */ /* XOR the input with the IV and store input into IV. 
*/ - for (ivp=c->u_iv.iv+c->cipher->blocksize - c->unused; - inbuflen; - inbuflen--, c->unused-- ) - *outbuf++ = (*ivp++ ^= *inbuf++); + ivp = c->u_iv.iv + c->cipher->blocksize - c->unused; + buf_xor_2dst(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } @@ -57,8 +57,11 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, { /* XOR the input with the IV and store input into IV */ inbuflen -= c->unused; - for(ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - *outbuf++ = (*ivp++ ^= *inbuf++); + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor_2dst(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. We use a loop as long as we @@ -76,25 +79,25 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, { while ( inbuflen >= blocksize_x_2 ) { - int i; /* Encrypt the IV. */ c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV. */ - for(ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } } if ( inbuflen >= blocksize ) { - int i; /* Save the current IV and then encrypt the IV. */ memcpy( c->lastiv, c->u_iv.iv, blocksize ); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV */ - for(ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } if ( inbuflen ) @@ -105,8 +108,10 @@ _gcry_cipher_cfb_encrypt (gcry_cipher_hd_t c, c->unused = blocksize; /* Apply the XOR. */ c->unused -= inbuflen; - for(ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } @@ -118,8 +123,6 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, const unsigned char *inbuf, unsigned int inbuflen) { unsigned char *ivp; - unsigned long temp; - int i; size_t blocksize = c->cipher->blocksize; size_t blocksize_x_2 = blocksize + blocksize; @@ -130,14 +133,9 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, { /* Short enough to be encoded by the remaining XOR mask. */ /* XOR the input with the IV and store input into IV. */ - for (ivp=c->u_iv.iv+blocksize - c->unused; - inbuflen; - inbuflen--, c->unused--) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor_n_copy(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } @@ -145,12 +143,11 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, { /* XOR the input with the IV and store input into IV. */ inbuflen -= c->unused; - for (ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor_n_copy(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. We use a loop as long as we @@ -171,12 +168,9 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, /* Encrypt the IV. */ c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV. 
*/ - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } } @@ -187,12 +181,9 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, memcpy ( c->lastiv, c->u_iv.iv, blocksize); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); /* XOR the input with the IV and store input into IV */ - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } @@ -204,12 +195,10 @@ _gcry_cipher_cfb_decrypt (gcry_cipher_hd_t c, c->unused = blocksize; /* Apply the XOR. */ c->unused -= inbuflen; - for (ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } diff --git a/cipher/cipher-ctr.c b/cipher/cipher-ctr.c index a334abc..6bc6ffc 100644 --- a/cipher/cipher-ctr.c +++ b/cipher/cipher-ctr.c @@ -27,6 +27,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -48,11 +49,9 @@ _gcry_cipher_ctr_encrypt (gcry_cipher_hd_t c, { gcry_assert (c->unused < blocksize); i = blocksize - c->unused; - for (n=0; c->unused && n < inbuflen; c->unused--, n++, i++) - { - /* XOR input with encrypted counter and store in output. */ - outbuf[n] = inbuf[n] ^ c->lastiv[i]; - } + n = c->unused > inbuflen ? inbuflen : c->unused; + buf_xor(outbuf, inbuf, &c->lastiv[i], n); + c->unused -= n; inbuf += n; outbuf += n; inbuflen -= n; @@ -75,27 +74,26 @@ _gcry_cipher_ctr_encrypt (gcry_cipher_hd_t c, { unsigned char tmp[MAX_BLOCKSIZE]; - for (n=0; n < inbuflen; n++) - { - if ((n % blocksize) == 0) - { - c->cipher->encrypt (&c->context.c, tmp, c->u_ctr.ctr); + do { + c->cipher->encrypt (&c->context.c, tmp, c->u_ctr.ctr); - for (i = blocksize; i > 0; i--) - { - c->u_ctr.ctr[i-1]++; - if (c->u_ctr.ctr[i-1] != 0) - break; - } - } + for (i = blocksize; i > 0; i--) + { + c->u_ctr.ctr[i-1]++; + if (c->u_ctr.ctr[i-1] != 0) + break; + } - /* XOR input with encrypted counter and store in output. */ - outbuf[n] = inbuf[n] ^ tmp[n % blocksize]; - } + n = blocksize < inbuflen ? blocksize : inbuflen; + buf_xor(outbuf, inbuf, tmp, n); + + inbuflen -= n; + outbuf += n; + inbuf += n; + } while (inbuflen); /* Save the unused bytes of the counter. */ - n %= blocksize; - c->unused = (blocksize - n) % blocksize; + c->unused = blocksize - n; if (c->unused) memcpy (c->lastiv+n, tmp+n, c->unused); diff --git a/cipher/cipher-ofb.c b/cipher/cipher-ofb.c index e5868cd..e194976 100644 --- a/cipher/cipher-ofb.c +++ b/cipher/cipher-ofb.c @@ -27,6 +27,7 @@ #include "g10lib.h" #include "cipher.h" #include "ath.h" +#include "bufhelp.h" #include "./cipher-internal.h" @@ -45,30 +46,31 @@ _gcry_cipher_ofb_encrypt (gcry_cipher_hd_t c, { /* Short enough to be encoded by the remaining XOR mask. 
*/ /* XOR the input with the IV */ - for (ivp=c->u_iv.iv+c->cipher->blocksize - c->unused; - inbuflen; - inbuflen--, c->unused-- ) - *outbuf++ = (*ivp++ ^ *inbuf++); + ivp = c->u_iv.iv + c->cipher->blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } if( c->unused ) { inbuflen -= c->unused; - for(ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - *outbuf++ = (*ivp++ ^ *inbuf++); + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. */ while ( inbuflen >= blocksize ) { - int i; /* Encrypt the IV (and save the current one). */ memcpy( c->lastiv, c->u_iv.iv, blocksize ); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); - - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = (*ivp++ ^ *inbuf++); + buf_xor(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } if ( inbuflen ) @@ -77,8 +79,10 @@ _gcry_cipher_ofb_encrypt (gcry_cipher_hd_t c, c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); c->unused = blocksize; c->unused -= inbuflen; - for(ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - *outbuf++ = (*ivp++ ^ *inbuf++); + buf_xor(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } @@ -98,27 +102,31 @@ _gcry_cipher_ofb_decrypt (gcry_cipher_hd_t c, if( inbuflen <= c->unused ) { /* Short enough to be encoded by the remaining XOR mask. */ - for (ivp=c->u_iv.iv+blocksize - c->unused; inbuflen; inbuflen--,c->unused--) - *outbuf++ = *ivp++ ^ *inbuf++; + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, inbuflen); + c->unused -= inbuflen; return 0; } if ( c->unused ) { inbuflen -= c->unused; - for (ivp=c->u_iv.iv+blocksize - c->unused; c->unused; c->unused-- ) - *outbuf++ = *ivp++ ^ *inbuf++; + ivp = c->u_iv.iv + blocksize - c->unused; + buf_xor(outbuf, ivp, inbuf, c->unused); + outbuf += c->unused; + inbuf += c->unused; + c->unused = 0; } /* Now we can process complete blocks. */ while ( inbuflen >= blocksize ) { - int i; /* Encrypt the IV (and save the current one). 
*/ memcpy( c->lastiv, c->u_iv.iv, blocksize ); c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); - for (ivp=c->u_iv.iv,i=0; i < blocksize; i++ ) - *outbuf++ = *ivp++ ^ *inbuf++; + buf_xor(outbuf, c->u_iv.iv, inbuf, blocksize); + outbuf += blocksize; + inbuf += blocksize; inbuflen -= blocksize; } if ( inbuflen ) @@ -128,8 +136,10 @@ _gcry_cipher_ofb_decrypt (gcry_cipher_hd_t c, c->cipher->encrypt ( &c->context.c, c->u_iv.iv, c->u_iv.iv ); c->unused = blocksize; c->unused -= inbuflen; - for (ivp=c->u_iv.iv; inbuflen; inbuflen-- ) - *outbuf++ = *ivp++ ^ *inbuf++; + buf_xor(outbuf, c->u_iv.iv, inbuf, inbuflen); + outbuf += inbuflen; + inbuf += inbuflen; + inbuflen = 0; } return 0; } diff --git a/cipher/rijndael.c b/cipher/rijndael.c index 6313ab2..24372d9 100644 --- a/cipher/rijndael.c +++ b/cipher/rijndael.c @@ -45,6 +45,7 @@ #include "types.h" /* for byte and u32 typedefs */ #include "g10lib.h" #include "cipher.h" +#include "bufhelp.h" #define MAXKC (256/32) #define MAXROUNDS 14 @@ -1337,8 +1338,6 @@ _gcry_aes_cfb_enc (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - int i; if (0) ; @@ -1351,8 +1350,9 @@ _gcry_aes_cfb_enc (void *context, unsigned char *iv, /* Encrypt the IV. */ do_padlock (ctx, 0, iv, iv); /* XOR the input with the IV and store input into IV. */ - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } #endif /*USE_PADLOCK*/ @@ -1376,8 +1376,9 @@ _gcry_aes_cfb_enc (void *context, unsigned char *iv, /* Encrypt the IV. */ do_encrypt_aligned (ctx, iv, iv); /* XOR the input with the IV and store input into IV. */ - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - *outbuf++ = (*ivp++ ^= *inbuf++); + buf_xor_2dst(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } @@ -1397,8 +1398,6 @@ _gcry_aes_cbc_enc (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - int i; aesni_prepare (); for ( ;nblocks; nblocks-- ) @@ -1432,8 +1431,7 @@ _gcry_aes_cbc_enc (void *context, unsigned char *iv, #endif /*USE_AESNI*/ else { - for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) - outbuf[i] = inbuf[i] ^ *ivp++; + buf_xor(outbuf, inbuf, iv, BLOCKSIZE); if (0) ; @@ -1470,7 +1468,6 @@ _gcry_aes_ctr_enc (void *context, unsigned char *ctr, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *p; int i; if (0) @@ -1504,8 +1501,9 @@ _gcry_aes_ctr_enc (void *context, unsigned char *ctr, /* Encrypt the counter. */ do_encrypt_aligned (ctx, tmp.x1, ctr); /* XOR the input with the encrypted counter and store in output. */ - for (p=tmp.x1, i=0; i < BLOCKSIZE; i++) - *outbuf++ = (*p++ ^= *inbuf++); + buf_xor(outbuf, tmp.x1, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; /* Increment the counter. 
*/ for (i = BLOCKSIZE; i > 0; i--) { @@ -1694,9 +1692,6 @@ _gcry_aes_cfb_dec (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - unsigned char temp; - int i; if (0) ; @@ -1707,12 +1702,9 @@ _gcry_aes_cfb_dec (void *context, unsigned char *iv, for ( ;nblocks; nblocks-- ) { do_padlock (ctx, 0, iv, iv); - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } #endif /*USE_PADLOCK*/ @@ -1734,12 +1726,9 @@ _gcry_aes_cfb_dec (void *context, unsigned char *iv, for ( ;nblocks; nblocks-- ) { do_encrypt_aligned (ctx, iv, iv); - for (ivp=iv,i=0; i < BLOCKSIZE; i++ ) - { - temp = *inbuf++; - *outbuf++ = *ivp ^ temp; - *ivp++ = temp; - } + buf_xor_n_copy(outbuf, iv, inbuf, BLOCKSIZE); + outbuf += BLOCKSIZE; + inbuf += BLOCKSIZE; } } @@ -1759,8 +1748,6 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, RIJNDAEL_context *ctx = context; unsigned char *outbuf = outbuf_arg; const unsigned char *inbuf = inbuf_arg; - unsigned char *ivp; - int i; unsigned char savebuf[BLOCKSIZE]; if (0) @@ -1871,8 +1858,7 @@ _gcry_aes_cbc_dec (void *context, unsigned char *iv, else do_decrypt (ctx, outbuf, inbuf); - for (ivp=iv, i=0; i < BLOCKSIZE; i++ ) - outbuf[i] ^= *ivp++; + buf_xor(outbuf, outbuf, iv, BLOCKSIZE); memcpy (iv, savebuf, BLOCKSIZE); inbuf += BLOCKSIZE; outbuf += BLOCKSIZE; From jussi.kivilinna at mbnet.fi Thu Nov 29 20:55:34 2012 From: jussi.kivilinna at mbnet.fi (Jussi Kivilinna) Date: Thu, 29 Nov 2012 21:55:34 +0200 Subject: [PATCH] Fix building with CC="gcc -std=c90". Message-ID: <20121129195534.26810.71478.stgit@localhost6.localdomain6> * configure.ac: Add check for missing 'asm' keyword in C90 mode and replacement with '__asm__'. -- Signed-off-by: Jussi Kivilinna --- configure.ac | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/configure.ac b/configure.ac index 78e0a0f..b2c5ab1 100644 --- a/configure.ac +++ b/configure.ac @@ -804,6 +804,31 @@ if test "$gcry_cv_gcc_attribute_aligned" = "yes" ; then fi +# +# Check whether the compiler supports 'asm' or '__asm__' keyword for +# assembler blocks +# +AC_CACHE_CHECK([whether 'asm' assembler keyword is supported], + [gcry_cv_have_asm], + [gcry_cv_have_asm=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[void a(void) { asm("":::"memory"); }]])], + [gcry_cv_have_asm=yes])]) +AC_CACHE_CHECK([whether '__asm__' assembler keyword is supported], + [gcry_cv_have___asm__], + [gcry_cv_have___asm__=no + AC_COMPILE_IFELSE([AC_LANG_SOURCE( + [[void a(void) { __asm__("":::"memory"); }]])], + [gcry_cv_have___asm__=yes])]) +if test "$gcry_cv_have_asm" = "no" ; then + if test "$gcry_cv_have___asm__" = "yes" ; then + AC_DEFINE(asm,__asm__, + [Define to supported assembler block keyword, if plain 'asm' was not + supported]) + fi +fi + + ####################################### #### Checks for library functions. #### ####################################### From wk at gnupg.org Fri Nov 30 10:21:08 2012 From: wk at gnupg.org (Werner Koch) Date: Fri, 30 Nov 2012 10:21:08 +0100 Subject: [PATCH 2/2] Avoid slow integer multiplication and division with blocksize calculations. 
In-Reply-To: <20121129210903.10921uoekjjc3b28@www.dalek.fi> (Jussi Kivilinna's message of "Thu, 29 Nov 2012 21:09:03 +0200") References: <20121129153709.7099.62541.stgit@localhost6.localdomain6> <20121129153714.7099.70119.stgit@localhost6.localdomain6> <87lidk70ic.fsf@vigenere.g10code.de> <20121129210903.10921uoekjjc3b28@www.dalek.fi> Message-ID: <87boef78fv.fsf@vigenere.g10code.de> On Thu, 29 Nov 2012 20:09, jussi.kivilinna at mbnet.fi said: > Well, I currently only have access to x86 machines and there this > didn't have easily measurable effect. Okay, I was just curious. > However if I leave 'buffer xor' patch out, and change cipher-ctr.c > loop to use '& c->blockmask' instead of '% blocksize', AMD Phenom > II/x86-64 sees following improvement: Thanks. Salam-Shalom, Werner -- Die Gedanken sind frei. Ausnahmen regelt ein Bundesgesetz.
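
For reference: the timings quoted above appear to come from libgcrypt's own tests/benchmark program. A rough standalone sketch of a comparable CTR measurement through the public API follows; the key, buffer size and iteration count are arbitrary choices made for this illustration, and error checking is omitted.

------------------------------
/* Sketch only -- times 20 in-place AES-CTR passes over a 1 MiB buffer.
   Build with: cc bench-ctr.c `libgcrypt-config --cflags --libs` */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <gcrypt.h>

int
main (void)
{
  gcry_cipher_hd_t hd;
  static unsigned char buf[1024 * 1024];
  unsigned char key[16] = "0123456789abcdef";
  unsigned char ctr[16];
  clock_t start, stop;
  int i;

  gcry_check_version (NULL);
  gcry_cipher_open (&hd, GCRY_CIPHER_AES, GCRY_CIPHER_MODE_CTR, 0);
  gcry_cipher_setkey (hd, key, sizeof key);
  memset (ctr, 0, sizeof ctr);
  gcry_cipher_setctr (hd, ctr, sizeof ctr);

  start = clock ();
  for (i = 0; i < 20; i++)
    gcry_cipher_encrypt (hd, buf, sizeof buf, NULL, 0);  /* in-place CTR */
  stop = clock ();

  printf ("AES CTR, 20 x 1 MiB: %.0f ms\n",
          1000.0 * (double)(stop - start) / CLOCKS_PER_SEC);

  gcry_cipher_close (hd);
  return 0;
}
------------------------------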