[PATCH] cipher:riscv: gate Zvkned AES backend on VLEN == 128
Michael Neuling
mikey at neuling.org
Wed May 6 09:47:35 CEST 2026
Jussi,
Thanks for the reply.
> m4 batching code path selects 128-bit vectors (4 32-bit elements or
> 16 8-bit elements) and "m4" grouping. Whether the HW supports VLEN>128
> or VLEN=128 should not matter here.
I think the code's assumption around __riscv_vset_v_u32m1_u32m4() may be wrong.
I'm not familiar with RVV intrinsics, so I apologize for my ignorance here.
Claude produced the minimal reproducer below, which uses the same pattern as
the libgcrypt code.
Results using qemu-user:
% riscv64-linux-gnu-gcc -O2 -march=rv64gcv -static \
    -o libgcrypt-rvv-vlen128-assumption libgcrypt-rvv-vlen128-assumption.c
% qemu-riscv64 -cpu rv64,v=true,vlen=128 ./libgcrypt-rvv-vlen128-assumption
Element-by-element view of out[0..15]:
out[ 0] = 10001111
out[ 1] = 10002222
out[ 2] = 10003333
out[ 3] = 10004444
out[ 4] = 20001111
out[ 5] = 20002222
out[ 6] = 20003333
out[ 7] = 20004444
out[ 8] = 30001111
out[ 9] = 30002222
out[10] = 30003333
out[11] = 30004444
out[12] = 40001111
out[13] = 40002222
out[14] = 40003333
out[15] = 40004444
libgcrypt-shaped layout (VLEN=128 assumption): OK -- 4 contiguous AES blocks at out[0..15]
% qemu-riscv64 -cpu rv64,v=true,vlen=256 ./libgcrypt-rvv-vlen128-assumption
Element-by-element view of out[0..15]:
out[ 0] = 10001111
out[ 1] = 10002222
out[ 2] = 10003333
out[ 3] = 10004444
out[ 4] = 00000000
out[ 5] = 00000000
out[ 6] = 00000000
out[ 7] = 00000000
out[ 8] = 20001111
out[ 9] = 20002222
out[10] = 20003333
out[11] = 20004444
out[12] = 00000000
out[13] = 00000000
out[14] = 00000000
out[15] = 00000000
libgcrypt-shaped layout (VLEN=128 assumption): BUG -- AES_CRYPT m4 vl=16 will not find the 4 blocks here
Where each loaded m1 register actually lands in g (per RVV intrinsic spec, sub-register N -> elements N*VLMAX_m1 .. (N+1)*VLMAX_m1 - 1):
out[0..3] = sub-register 0 (= r0 + r0-tail)
... and so on for sub-registers 1..3
% cat libgcrypt-rvv-vlen128-assumption.c
/*
* Minimal reproducer: libgcrypt's RVV Zvkned AES kernels miscompute on
* RVV-capable CPUs whose VLEN > 128. The same bug shows up as AES-CFB
* encrypt-decrypt mismatch in libgcrypt's tests/basic on tt-ascalon
* (VLEN=256).
*
* NOT A GCC BUG. The __riscv_vundefined / __riscv_vset / __riscv_vget
* intrinsics behave exactly as the RVV intrinsic spec defines them.
* The bug is in libgcrypt's USAGE pattern, which silently assumes
* VLEN=128.
*
* The pattern (cipher/rijndael-riscv-zvkned.c CFB-DEC m4 path):
*
*   size_t vl = 4;  // u32 elements per AES block
*   vsetvli e32, m1, vl=4  (implicit via intrinsics with vl arg)
*   vuint32m1_t r0, r1, r2, r3;  // each = 1 AES block
*   vuint32m4_t g = __riscv_vundefined_u32m4();
*   g = __riscv_vset_v_u32m1_u32m4(g, 0, r0);
*   g = __riscv_vset_v_u32m1_u32m4(g, 1, r1);
*   g = __riscv_vset_v_u32m1_u32m4(g, 2, r2);
*   g = __riscv_vset_v_u32m1_u32m4(g, 3, r3);
*   AES_CRYPT(e, m4, rounds, g, vl * 4);  // vsetvli e32, m4, vl=16
*
* The author intended "place 4 AES blocks contiguously at elements
* 0..15 of g, then encrypt all four". That's what the code does on
* VLEN=128 (m1 has 4 elements / register; the m4 group has 16 elements
* total at 4 per sub-register).
*
* On VLEN=256 (m1 = 8 elements/register, m4 = 32 elements):
* - vset places r0 in g's *register* 0 (whole register, 8 elements).
* Of those 8 elements, only the first 4 are r0's active data; the
* other 4 are r0's tail (whatever vsetvli e32m1 vl=4 left there --
* "all-1s" on tt-ascalon's tail-agnostic policy, "undisturbed" or
* other garbage on other CPUs).
* - vset slot 1 -> g register 1 (whole register), holds r1 + r1-tail.
* - vset slot 2 -> g register 2 (whole register), holds r2 + r2-tail.
* - vset slot 3 -> g register 3 (whole register), holds r3 + r3-tail.
* - AES_CRYPT with vl=16 then processes only the first 16 elements
* of g (elements 0..15 = sub-registers 0 and 1 only, with each
* sub-register holding ONE valid block + ONE tail block).
*
* Net effect: AES sees blocks
* block 0: r0's valid 4 elements (= intended slot 0) GOOD
* block 1: r0's TAIL WRONG
* block 2: r1's valid 4 elements (intended for slot 1!) WRONG
* block 3: r1's TAIL WRONG
* and r2, r3 in sub-registers 2, 3 are never touched.
*
* This program demonstrates the layout difference. It builds a u32m4
* group via the libgcrypt pattern and stores 16 elements (the same vl
* AES_CRYPT m4 uses) -- showing where each input block actually lands.
*
* Build:
* riscv64-linux-gnu-gcc -O2 -march=rv64gcv -static \
* -o libgcrypt-rvv-vlen128-assumption \
* libgcrypt-rvv-vlen128-assumption.c
*
* Run:
* qemu-riscv64 -cpu rv64,v=true,vlen=128 ./libgcrypt-rvv-vlen128-assumption
* PASS: blocks 0..3 land at out[0..3], out[4..7], out[8..11], out[12..15]
*
* qemu-riscv64 -cpu rv64,v=true,vlen=256 ./libgcrypt-rvv-vlen128-assumption
* FAIL: out[4..7] is r0's tail; out[8..11] is r1; out[12..15] is r1's tail
*
* The libgcrypt fix is to lay the 4 blocks out byte-contiguously in
* memory (4 blocks * 16 bytes = 64 bytes) and reload via
* __riscv_vle8_v_u8m4 with vl=64, so the four blocks always land at
* element positions 0..15 of the m4 group regardless of VLEN. See
* targets/libgcrypt-zvkned-fix.py in the wr2 harness for the patch.
*
* Author: Claude Opus 4.6
*/
#include <riscv_vector.h>
#include <stdint.h>
#include <stdio.h>
int main(void)
{
  /* Distinct per-block payloads so we can identify where each
     block's data ends up in memory. Top nibble = source slot id. */
  uint32_t b0_in[4] = { 0x10001111, 0x10002222, 0x10003333, 0x10004444 };
  uint32_t b1_in[4] = { 0x20001111, 0x20002222, 0x20003333, 0x20004444 };
  uint32_t b2_in[4] = { 0x30001111, 0x30002222, 0x30003333, 0x30004444 };
  uint32_t b3_in[4] = { 0x40001111, 0x40002222, 0x40003333, 0x40004444 };

  size_t vl = 4; /* AES block = 4 u32 = 16 bytes */

  vuint32m1_t r0 = __riscv_vle32_v_u32m1(b0_in, vl);
  vuint32m1_t r1 = __riscv_vle32_v_u32m1(b1_in, vl);
  vuint32m1_t r2 = __riscv_vle32_v_u32m1(b2_in, vl);
  vuint32m1_t r3 = __riscv_vle32_v_u32m1(b3_in, vl);

  /* libgcrypt's pattern: build u32m4 group via vundefined + 4 vsets. */
  vuint32m4_t g = __riscv_vundefined_u32m4();
  g = __riscv_vset_v_u32m1_u32m4(g, 0, r0);
  g = __riscv_vset_v_u32m1_u32m4(g, 1, r1);
  g = __riscv_vset_v_u32m1_u32m4(g, 2, r2);
  g = __riscv_vset_v_u32m1_u32m4(g, 3, r3);

  /* Store the same number of elements that AES_CRYPT m4 vl=16 would
     process (16 u32 = elements 0..15 of the m4 group). */
  uint32_t out[16];
  __riscv_vse32_v_u32m4(out, g, vl * 4);

  printf("Element-by-element view of out[0..15]:\n\n");
  for (int i = 0; i < 16; i++)
    printf(" out[%2d] = %08x\n", i, out[i]);

  /* The libgcrypt-shaped expectation: blocks 0..3 contiguous at
     out[0..3], out[4..7], out[8..11], out[12..15]. This holds on
     VLEN=128 only. */
  int libgcrypt_layout_ok = 1;
  uint32_t *blocks[4] = { b0_in, b1_in, b2_in, b3_in };
  for (int blk = 0; blk < 4; blk++) {
    for (int e = 0; e < 4; e++) {
      int idx = blk * 4 + e;
      if (out[idx] != blocks[blk][e]) {
        libgcrypt_layout_ok = 0;
        break;
      }
    }
  }

  printf("\nlibgcrypt-shaped layout (VLEN=128 assumption): %s\n",
         libgcrypt_layout_ok
           ? "OK -- 4 contiguous AES blocks at out[0..15]"
           : "BUG -- AES_CRYPT m4 vl=16 will not find the 4 blocks here");

  if (!libgcrypt_layout_ok) {
    /* Show where the blocks actually went on this VLEN. */
    printf("\nWhere each loaded m1 register actually lands in g (per "
           "RVV intrinsic spec, sub-register N -> elements "
           "N*VLMAX_m1 .. (N+1)*VLMAX_m1 - 1):\n");
    printf(" out[0..%zu] = sub-register 0 (= r0 + r0-tail)\n",
           (size_t)(vl * 4 / 4 - 1));
    printf(" ... and so on for sub-registers 1..3\n");
  }

  return libgcrypt_layout_ok ? 0 : 1;
}
More information about the Gcrypt-devel mailing list