[PATCH] cipher:riscv: gate Zvkned AES backend on VLEN == 128
Michael Neuling
mikey at neuling.org
Wed May 6 09:47:35 CEST 2026
Jussi,
Thanks for the reply.
> m4 batching code path selects 128-bit vectors (4 32-bit elements or
> 16 8-bit elements) and "m4" grouping. Whether the HW supports VLEN>128
> or VLEN=128 should not matter here.
I think the code's assumption around __riscv_vset_v_u32m1_u32m4() may be wrong.
I'm not familiar with RVV intrinsics, so I apologize for my ignorance here.
Claude produced the minimal reproducer below, which uses the same pattern as
the libgcrypt code.
Results using qemu-user:
% riscv64-linux-gnu-gcc -O2 -march=rv64gcv -static \
    -o libgcrypt-rvv-vlen128-assumption libgcrypt-rvv-vlen128-assumption.c
% qemu-riscv64 -cpu rv64,v=true,vlen=128 ./libgcrypt-rvv-vlen128-assumption
Element-by-element view of out[0..15]:
out[ 0] = 10001111
out[ 1] = 10002222
out[ 2] = 10003333
out[ 3] = 10004444
out[ 4] = 20001111
out[ 5] = 20002222
out[ 6] = 20003333
out[ 7] = 20004444
out[ 8] = 30001111
out[ 9] = 30002222
out[10] = 30003333
out[11] = 30004444
out[12] = 40001111
out[13] = 40002222
out[14] = 40003333
out[15] = 40004444
libgcrypt-shaped layout (VLEN=128 assumption): OK -- 4 contiguous AES blocks at out[0..15]
% qemu-riscv64 -cpu rv64,v=true,vlen=256 ./libgcrypt-rvv-vlen128-assumption
Element-by-element view of out[0..15]:
out[ 0] = 10001111
out[ 1] = 10002222
out[ 2] = 10003333
out[ 3] = 10004444
out[ 4] = 00000000
out[ 5] = 00000000
out[ 6] = 00000000
out[ 7] = 00000000
out[ 8] = 20001111
out[ 9] = 20002222
out[10] = 20003333
out[11] = 20004444
out[12] = 00000000
out[13] = 00000000
out[14] = 00000000
out[15] = 00000000
libgcrypt-shaped layout (VLEN=128 assumption): BUG -- AES_CRYPT m4 vl=16 will not find the 4 blocks here
Where each loaded m1 register actually lands in g (per RVV intrinsic spec, sub-register N -> elements N*VLMAX_m1 .. (N+1)*VLMAX_m1 - 1):
out[0..3] = sub-register 0 (= r0 + r0-tail)
... and so on for sub-registers 1..3
% cat libgcrypt-rvv-vlen128-assumption.c
/*
* Minimal reproducer: libgcrypt's RVV Zvkned AES kernels miscompute on
* RVV-capable CPUs whose VLEN > 128. The same bug shows up as AES-CFB
* encrypt-decrypt mismatch in libgcrypt's tests/basic on tt-ascalon
* (VLEN=256).
*
* NOT A GCC BUG. The __riscv_vundefined / __riscv_vset / __riscv_vget
* intrinsics behave exactly as the RVV intrinsic spec defines them.
* The bug is in libgcrypt's USAGE pattern, which silently assumes
* VLEN=128.
*
* The pattern (cipher/rijndael-riscv-zvkned.c CFB-DEC m4 path):
*
*   size_t vl = 4;  // u32 elements per AES block
*   vsetvli e32, m1, vl=4  (implicit via intrinsics with vl arg)
*   vuint32m1_t r0, r1, r2, r3;  // each = 1 AES block
*   vuint32m4_t g = __riscv_vundefined_u32m4();
*   g = __riscv_vset_v_u32m1_u32m4(g, 0, r0);
*   g = __riscv_vset_v_u32m1_u32m4(g, 1, r1);
*   g = __riscv_vset_v_u32m1_u32m4(g, 2, r2);
*   g = __riscv_vset_v_u32m1_u32m4(g, 3, r3);
*   AES_CRYPT(e, m4, rounds, g, vl * 4);  // vsetvli e32, m4, vl=16
*
* The author intended "place 4 AES blocks contiguously at elements
* 0..15 of g, then encrypt all four". That's what the code does on
* VLEN=128 (m1 has 4 elements / register; the m4 group has 16 elements
* total at 4 per sub-register).
*
* On VLEN=256 (m1 = 8 elements/register, m4 = 32 elements):
* - vset places r0 in g's *register* 0 (whole register, 8 elements).
* Of those 8 elements, only the first 4 are r0's active data; the
* other 4 are r0's tail (whatever vsetvli e32m1 vl=4 left there --
* "all-1s" on tt-ascalon's tail-agnostic policy, "undisturbed" or
* other garbage on other CPUs).
* - vset slot 1 -> g register 1 (whole register), holds r1 + r1-tail.
* - vset slot 2 -> g register 2 (whole register), holds r2 + r2-tail.
* - vset slot 3 -> g register 3 (whole register), holds r3 + r3-tail.
* - AES_CRYPT with vl=16 then processes only the first 16 elements
* of g (elements 0..15 = sub-registers 0 and 1 only, with each
* sub-register holding ONE valid block + ONE tail block).
*
* Net effect: AES sees blocks
* block 0: r0's valid 4 elements (= intended slot 0) GOOD
* block 1: r0's TAIL WRONG
* block 2: r1's valid 4 elements (intended for slot 1!) WRONG
* block 3: r1's TAIL WRONG
* and r2, r3 in sub-registers 2, 3 are never touched.
*
* This program demonstrates the layout difference. It builds a u32m4
* group via the libgcrypt pattern and stores 16 elements (the same vl
* AES_CRYPT m4 uses) -- showing where each input block actually lands.
*
* Build:
* riscv64-linux-gnu-gcc -O2 -march=rv64gcv -static \
* -o libgcrypt-rvv-vlen128-assumption \
* libgcrypt-rvv-vlen128-assumption.c
*
* Run:
* qemu-riscv64 -cpu rv64,v=true,vlen=128 ./libgcrypt-rvv-vlen128-assumption
* PASS: blocks 0..3 land at out[0..3], out[4..7], out[8..11], out[12..15]
*
* qemu-riscv64 -cpu rv64,v=true,vlen=256 ./libgcrypt-rvv-vlen128-assumption
* FAIL: out[4..7] is r0's tail; out[8..11] is r1; out[12..15] is r1's tail
*
* The libgcrypt fix is to lay the 4 blocks out byte-contiguously in
* memory (4 blocks * 16 bytes = 64 bytes) and reload via
* __riscv_vle8_v_u8m4 with vl=64, so the four blocks always land at
* element positions 0..15 of the m4 group regardless of VLEN. See
* targets/libgcrypt-zvkned-fix.py in the wr2 harness for the patch.
*
* Author: Claude Opus 4.6
*/
#include <riscv_vector.h>
#include <stdint.h>
#include <stdio.h>
int main(void)
{
  /* Distinct per-block payloads so we can identify where each
     block's data ends up in memory. Top nibble = source slot id. */
  uint32_t b0_in[4] = { 0x10001111, 0x10002222, 0x10003333, 0x10004444 };
  uint32_t b1_in[4] = { 0x20001111, 0x20002222, 0x20003333, 0x20004444 };
  uint32_t b2_in[4] = { 0x30001111, 0x30002222, 0x30003333, 0x30004444 };
  uint32_t b3_in[4] = { 0x40001111, 0x40002222, 0x40003333, 0x40004444 };

  size_t vl = 4; /* AES block = 4 u32 = 16 bytes */

  vuint32m1_t r0 = __riscv_vle32_v_u32m1(b0_in, vl);
  vuint32m1_t r1 = __riscv_vle32_v_u32m1(b1_in, vl);
  vuint32m1_t r2 = __riscv_vle32_v_u32m1(b2_in, vl);
  vuint32m1_t r3 = __riscv_vle32_v_u32m1(b3_in, vl);

  /* libgcrypt's pattern: build u32m4 group via vundefined + 4 vsets. */
  vuint32m4_t g = __riscv_vundefined_u32m4();
  g = __riscv_vset_v_u32m1_u32m4(g, 0, r0);
  g = __riscv_vset_v_u32m1_u32m4(g, 1, r1);
  g = __riscv_vset_v_u32m1_u32m4(g, 2, r2);
  g = __riscv_vset_v_u32m1_u32m4(g, 3, r3);

  /* Store the same number of elements that AES_CRYPT m4 vl=16 would
     process (16 u32 = elements 0..15 of the m4 group). */
  uint32_t out[16];
  __riscv_vse32_v_u32m4(out, g, vl * 4);

  printf("Element-by-element view of out[0..15]:\n\n");
  for (int i = 0; i < 16; i++)
    printf(" out[%2d] = %08x\n", i, out[i]);

  /* The libgcrypt-shaped expectation: blocks 0..3 contiguous at
     out[0..3], out[4..7], out[8..11], out[12..15]. This holds on
     VLEN=128 only. */
  int libgcrypt_layout_ok = 1;
  uint32_t *blocks[4] = { b0_in, b1_in, b2_in, b3_in };
  for (int blk = 0; blk < 4; blk++) {
    for (int e = 0; e < 4; e++) {
      int idx = blk * 4 + e;
      if (out[idx] != blocks[blk][e]) {
        libgcrypt_layout_ok = 0;
        break;
      }
    }
  }

  printf("\nlibgcrypt-shaped layout (VLEN=128 assumption): %s\n",
         libgcrypt_layout_ok
           ? "OK -- 4 contiguous AES blocks at out[0..15]"
           : "BUG -- AES_CRYPT m4 vl=16 will not find the 4 blocks here");

  if (!libgcrypt_layout_ok) {
    /* Show where the blocks actually went on this VLEN. */
    printf("\nWhere each loaded m1 register actually lands in g (per "
           "RVV intrinsic spec, sub-register N -> elements "
           "N*VLMAX_m1 .. (N+1)*VLMAX_m1 - 1):\n");
    printf(" out[0..%zu] = sub-register 0 (= r0 + r0-tail)\n",
           (size_t)(vl * 4 / 4 - 1));
    printf(" ... and so on for sub-registers 1..3\n");
  }

  return libgcrypt_layout_ok ? 0 : 1;
}
More information about the Gcrypt-devel mailing list