riscv-non-isa / rvv-intrinsic-doc
Home Page: https://jira.riscv.org/browse/RVG-153
License: BSD 3-Clause "New" or "Revised" License
I think there are some intrinsic functions in the explicit vl API which don't need to receive a vl argument.
For example, vmv.x.s and vfmv.f.s always work even when vl == 0 (ref: riscv/riscv-v-spec#284), and the vl argument is meaningless for vundefined and vreinterpret.
Is there anything else missing?
If no one has objections, I will update the function list.
Hi,
Kai wrote the RFC for RVV intrinsic API proposal based on our current discussion.
Any suggestions and feedback are welcome.
https://github.com/sifive/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md
Those instructions use vl in their instruction definitions [1], so I think we should add vl to the parameter list of those functions, e.g.
vbool1_t vmand_mm_b1 (vbool1_t op1, vbool1_t op2);
should be
vbool1_t vmand_mm_b1_vl (vbool1_t op1, vbool1_t op2, size_t vl);
[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#16-vector-mask-instructions
We provide intrinsics with and without vl at the same time:
vop_vv_type(a, b)
vop_vv_type_vl(a, b, gvl)
Treat SEW and LMUL as parameters to vsetvl, instead of providing a bunch of vsetvl intrinsics like the following.
size_t vsetvl_8m1 (size_t avl);
size_t vsetvl_8m2 (size_t avl);
size_t vsetvl_8m4 (size_t avl);
size_t vsetvl_8m8 (size_t avl);
size_t vsetvl_16m1 (size_t avl);
size_t vsetvl_16m2 (size_t avl);
size_t vsetvl_16m4 (size_t avl);
size_t vsetvl_16m8 (size_t avl);
size_t vsetvl_32m1 (size_t avl);
size_t vsetvl_32m2 (size_t avl);
size_t vsetvl_32m4 (size_t avl);
size_t vsetvl_32m8 (size_t avl);
size_t vsetvl_64m1 (size_t avl);
size_t vsetvl_64m2 (size_t avl);
size_t vsetvl_64m4 (size_t avl);
size_t vsetvl_64m8 (size_t avl);
Hi, did you already figure out what the header organization will be? I think the same idea can be applied to the documentation too.
Hi, I noticed that the Mask Register Layout has changed in riscv-v-spec version 0.9.
A vector mask occupies only one vector register regardless of SEW and LMUL. Each element is allocated a single mask bit in a mask vector register.
The mask bit for element i is located in bit i of the mask register, independent of SEW or LMUL.
But if I understand correctly, the design of the mask types in the rvv intrinsics still uses a suffix n (n = SEW/LMUL) to encode MLEN.
Should this part be modified to be consistent with the v0.9 spec? Is there any design content in progress that can be shared?
sincerely,
Yin Zhang
I'm concerned about several intrinsics which take a "shift" amount as uint8_t. This only allows shifts up to 255, which fails to expose the underlying functionality for larger SEW (and ELEN).
Particular examples are vsll_vx, vsr{a,l}_vx, vnsr{a,l}_vx, vssr{a,l}_vx, and vnclip{,u}_wx. I only skimmed quickly; perhaps there are more.
The vv/wv forms use an appropriate type-width --- one that increases with SEW --- for their (vector) op2, and I suggest matching that in the vx/wx forms.
(It's true that the vi forms are restricted to a 5-bit op2, but these forms aren't exposed in the intrinsics API.)
EDIT: on further thought, perhaps it's more appropriate to use uintXLEN_t (or whatever it's called).
This is a placeholder to discuss debug info for the extended types among the RVV intrinsic types. I think we should standardize how debug info is generated for those data types.
For example, asm codes:
lw iState, (pState) # pState is a pointer to a float32_t array; we load its IEEE 754 value
vslide1down.vx vState, vState, iState
Could we have an API like:
vfloat32m8_t vslide1down[_vs_f32m8] (vfloat32m8_t src, float value);
so that we don't need to cast vector types in C code.
Currently the vector spec only defines all vector registers/CSRs as caller saved, but it does not specify how to pass vectors as arguments.
We propose a calling convention where named vector arguments are passed from v1 to v31. A vector type with LMUL > 1 must be allocated to the next vector register number that is aligned to its LMUL. Vector types with fractional LMULs and vector mask types (vbool*_t) are treated as occupying one register. Segment vector types should be passed in consecutive vector registers aligned to the base vector's LMUL. Vector types are returned in the same manner as the first vector argument. If all vector registers for argument passing are exhausted, then the rest of the vector arguments are passed on the stack as whole vector registers, by pointer.
Some examples (the argument name corresponds to the vector register it uses):
// Vector arguments are passed from v1, v2, ..., v31
void f(vint8m1_t v1, vint8m1_t v2);
// For LMUL=8 types, they are passed in v8, v16, v24, and the rest on stack
void f(vint8m8_t v8_v15, vint8m8_t v16_v23);
void f(vint8m8_t v8_v15, vint8m8_t v16_v23, vint8m8_t v24_v31, vint8m8_t on_stack);
// For arguments with mixed LMUL, the vector register number is aligned to LMUL
void f(vint16m2_t v2_v3, vint8m1_t v4, vint64m8_t v8_v15);
// Returning grouped vector should be aligned to its LMUL (v8~v15 in this case)
vint32m8_t f(vint16m4_t v4_v7);
// fractional LMUL or vbool types are treated as LMUL=1
void f(vint8mf2_t v1, vbool8_t v2, vint8mf8_t v3);
// Segment types are aligned to the base LMUL
void f(vint8m1_t v1, vint16m2x3_t v2_v7);
We avoid allocating v0 in the calling convention due to its ubiquitous purpose as the mask register, so the callee does not have to move the first argument off v0 if it needs to use masked instructions.
The spec already defines all vector registers as caller-saved, so all of them may be allocated either as argument-passing registers or as temporary registers. The proposal currently uses all of them for passing arguments, so it is possible to pass up to 3 m8 arguments via registers, but this may be up for debate anyway.
There is also a small optimization opportunity where smaller LMUL arguments can fill holes left by alignments due to previous larger LMUL arguments, for example:
void f(vint8m1_t v1, vint8m8_t v8_v15, vint8m1_t v2_or_v16);
Since v2 to v7 remain unused in the example, the m1 argument following the m8 one may be packed into v2 instead of using the next register, v16. This uses the registers more efficiently at the cost of slightly more complexity in the calling convention.
Any thoughts?
Currently, we have vreadvl() to get the vl; we could have more utility functions for mstatus, vlenb, vstart, and the other status registers.
The current intrinsic RFC does not model tail elements, but some users may want to set the tail undisturbed, as in riscv/riscv-v-spec#157 (comment).
I think maybe we could provide an additional tail operand only for the reduction intrinsic functions to model tail elements.
Any ideas?
I am trying to run the example programs but did not find the riscv_vector.h header file.
I have already installed riscv-gnu-toolchain along with riscv-pk and riscv-spike.
Do I need to install some other tools or set up some other environment to access the RISC-V vector extension intrinsics (riscv_vector.h)?
I notice we use riscv_vector.h for intrinsics both with and without the vl argument. Could we split riscv_vector.h into riscv_vector_vl.h and riscv_vector.h?
If users want to use intrinsics with the vl argument, they could include riscv_vector_vl.h.
If users want to use intrinsics without the vl argument, they could include riscv_vector.h.
In addition, compilers may give a warning (or error) if users include these files in the same source.
Hi, storing mask bits seems like a reasonable thing to want to do.
In assembly we can just use the v0 register, but in the intrinsics vbool and vuint are separate types, and I haven't seen any way to convert between them.
vuint8m8_t v = vmv_v_x_u8m8(0);
vbool1_t m = vmseq_vv_u8m8_b1(v, v);
uint8_t bits[256];
vse8_v_u8m1(bits, m);
// error: cannot convert 'vbool1_t' to 'vuint8m1_t'
Is there a way to do this, or can we add one?
Hi,
May I ask why the "Vector Integer Add-with-Carry / Subtract-with-Borrow Functions" keep their suffixes, e.g., vadc_vvm, vadc_vxm? I think they can be distinguished by their argument types, just like vadd. Thanks.
I would like to propose a different overloading rule to reduce the number of intrinsic functions in the generic API: use function overloading (the overloadable attribute in clang) instead of C11 generic selection, to support overloaded functions with different numbers of arguments. Applying the above rule, we could reduce the number of intrinsic functions by roughly 79% in the overloading API (from 13106 to 2716).
Any suggestions?
The details of the changes are shown below:
vle/vlse and related segment load instructions:
vint8m1_t vle8_v_i8m1 (const int8_t *base);
vint8m2_t vle8_v_i8m2 (const int8_t *base);
viota.m:
vuint8mf8_t viota_m_u8mf8 (vbool64_t op1);
vuint16mf4_t viota_m_u16mf4 (vbool64_t op1);
vmv.v.x, vfmv.v.f:
vint8m1_t vmv_v_x_i8m1 (int8_t src);
vint8m2_t vmv_v_x_i8m2 (int8_t src);
vmclr.m/vmset.m/vid.v (empty argument)
vle/vlse and related segment load instructions
vse, vsse, indexed load/store, and related segment load/store instructions
viota.m
vmv.v.v, vmv.x.s, vmv.s.x, vfmv.f.s, vfmv.s.f
vmadc.vvm, vmadc.vxm, vmadc.vv, vmadc.vx
vmsbc.vvm, vmsbc.vxm, vmsbc.vv, vmsbc.vx
Hi,
I want to use rvv intrinsics to combine/split vectors.
The combining operation is like the vcombine instruction in NEON: concatenating two short vectors.
The splitting operation is something like the vget_low and vget_high instructions in NEON: getting the halves of a vector.
But I didn't find intrinsics to combine/split vectors in the reference manual.
I think maybe I can use shift operations and some arithmetic operations to implement these two functions, but that is too complicated. Is there an easier way to do this? Or should we add such functions to the intrinsics?
Sincerely,
Yin
Those two intrinsic functions have _m as a postfix in the function name in the current RFC:
vint8m1_t vmerge_vvm_i8m1_m (vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
vint8m1_t vcompress_vm_i8m1_m (vbool8_t mask, vint8m1_t maskedoff, vint8m1_t src);
We propose removing _m and changing them to:
vint8m1_t vmerge_vvm_i8m1 (vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
vint8m1_t vcompress_vm_i8m1 (vbool8_t mask, vint8m1_t maskedoff, vint8m1_t src);
Only instructions taking a v0.t mask operand need _m to distinguish the masked from the non-masked operation; vmerge and vcompress already encode the mask in their vvm and vm forms, so they do not need an "_m" version. FYI: v0.t and the v0 mask register have different meanings in the v-spec.
There are some requirements from users who want to keep the current SEW but change LMUL (in #28 and #37).
Maybe intrinsic functions could support LMUL truncation and LMUL extension regardless of vl. (That means those functions would not change the vl register.)
P.S. I think this is not a reinterpret operation, because the source and destination register group sizes differ.
The naming could be vlmul_[ext|trunc]_v_<src_type_with_lmul>_<dst_type_with_lmul>; the interfaces would look like:
// LMUL Extension, vlmul_ext_v_<src_lmul>_<target_lmul>
vint64m2_t vlmul_ext_v_i64m1_i64m2 (vint64m1_t op1);
vint64m4_t vlmul_ext_v_i64m1_i64m4 (vint64m1_t op1);
vint64m8_t vlmul_ext_v_i64m1_i64m8 (vint64m1_t op1);
vint64m4_t vlmul_ext_v_i64m2_i64m4 (vint64m2_t op1);
vint64m8_t vlmul_ext_v_i64m2_i64m8 (vint64m2_t op1);
vint64m8_t vlmul_ext_v_i64m4_i64m8 (vint64m4_t op1);
vint64m1_t vlmul_trunc_v_i64m2_i64m1 (vint64m2_t op1);
vint64m1_t vlmul_trunc_v_i64m4_i64m1 (vint64m4_t op1);
vint64m2_t vlmul_trunc_v_i64m4_i64m2 (vint64m4_t op1);
vint64m1_t vlmul_trunc_v_i64m8_i64m1 (vint64m8_t op1);
vint64m2_t vlmul_trunc_v_i64m8_i64m2 (vint64m8_t op1);
vint64m4_t vlmul_trunc_v_i64m8_i64m4 (vint64m8_t op1);
Any thoughts?
It seems that vfrsqrte7 and vfrece7 from v0.9 have not been added yet, perhaps pending the naming discussion in riscv/riscv-v-spec#601?
Operations that generate a "scalar vector" (e.g. vmv.s.x, vfmv.s.f) have a dst operand of the same type as the destination, so the original value, except element 0, can be preserved.
For instance
vint32m1_t vmv_s_x_i32m1 (vint32m1_t dst, int32_t src);
Reductions also generate "scalar vectors" but don't seem to have the same treatment.
vint32m1_t vredmax_vs_i32m2_i32m1 (vint32m2_t vector, vint32m1_t scalar);
Do we want to have a dest operand in this case? Like this:
vint32m1_t vredmax_vs_i32m2_i32m1 (vint32m1_t dest, vint32m2_t vector, vint32m1_t scalar);
I don't think it is super fundamental but maybe someone wants to preserve the other elements for some reason?
There are several possible ways to implement a vector tuple type.
The advantage of options 2, 3, 4, and 5 is that they provide syntactic sugar for accessing an element of the vector tuple type instead of an intrinsic function call:
/* ------ Primitive style ------ */
vint32m2x3_t vt;
vint32m2_t va;
vint32m2x3_t vt2 = vcreate_i32m2x3(va, va, va); // Creation.
vt = vset_i32m2x3(vt, 0, va); // Insertion.
va = vget_i32m2x3(vt, 0); // Extraction.
/* ------ Array style ------ */
typedef vint32m2_t vint32m2x3_t[3];
vint32m2x3_t vt;
vint32m2_t va;
vint32m2x3_t vt2 = {va, va, va}; // Creation.
vt[0] = va; // Insertion.
va = vt[0]; // Extraction.
/* ------ Struct style ------ */
typedef struct {
vint32m2_t x;
vint32m2_t y;
vint32m2_t z;
} vint32m2x3_t;
vint32m2x3_t vt;
vint32m2_t va;
vint32m2x3_t vt2 = {va, va, va}; // Creation.
vt.x = va; // Insertion.
va = vt.y; // Extraction.
Currently SVE's GCC implementation uses option 4 and disallows declaring arrays and structs of scalable vector types.
Can you please add a few simple convenience functions for copying data between these vector register types and standard C arrays, e.g., to make debugging and things like printf() of a vector easier? I'm thinking of something like: int64_t foo[MVL]; toArray(foo, vs1);
A vint8m1_t type can definitely be changed to vint8m8_t.
A vint8m8_t type can be changed to vint8m1_t if vl is already known. (Of course, users should be aware of what they are doing.)
vint8m2_t vreinterpret_v_i8m1_i8m2 (vint8m1_t src);
vint8m4_t vreinterpret_v_i8m1_i8m4 (vint8m1_t src);
vint8m8_t vreinterpret_v_i8m1_i8m8 (vint8m1_t src);
...
vint8m1_t vreinterpret_v_i8m8_i8m1 (vint8m8_t src);
vint8m2_t vreinterpret_v_i8m8_i8m2 (vint8m8_t src);
vint8m4_t vreinterpret_v_i8m8_i8m4 (vint8m8_t src);
SVE uses _z, _m, and _x in the function suffix to describe the inactive elements in the result of a predicated operation: _z for zeroing, _m for masked-off (merging), and _x for don't-care.
The current mask operation API (applying Romain's suggestion):
vint8m1_t vadd_vv_mask_i8m1 (vbool8_t mask, vint8m1_t maskedoff, vint8m1_t op1, vint8m1_t op2);
Should we support mask functions without the masked-off (merge) parameter in the primitive layer?
(P.S. I assume intrinsics in the primitive layer have 1-to-1 orthogonal naming with the assembly.)
If yes, how should we name them? For example:
vint8m1_t vadd_vv_mask_x_i8m1 (vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
Maybe using the mask abbreviation and putting it into the function suffix is clearer:
vint8m1_t vadd_vv_i8m1(vint8m1_t op1, vint8m1_t op2);
vint8m1_t vadd_vv_i8m1_m(vbool8_t mask, vint8m1_t maskedoff, vint8m1_t op1, vint8m1_t op2);
vint8m1_t vadd_vv_i8m1_x(vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
The general naming rule is to encode the destination type into the function suffix.
Romain Dolbeau suggests:
Then I would suggest retaining the output-type-as-suffix, and altering
the type specifier from the mnemonic (e.g., the 'm' bit in the name) to
make it more significant, as it's the part that differs, i.e. something
along the line of:
unsigned long vpopc_mb1_ulong(vbool1_t op1);
unsigned long vpopc_mb2_ulong(vbool2_t op2);
A generic/overloaded version could drop the extra bit.
Following #25, the current reduction instructions use this form:
vint8m1_t vredsum_vs_i8mf8_i8m1 (vint8m1_t dst, vint8mf8_t vector, vint8m1_t scalar);
vint8m1_t vredsum_vs_i8mf4_i8m1 (vint8m1_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
vint64m1_t vredsum_vs_i64m8_i64m1 (vint64m1_t dst, vint64m8_t vector, vint64m1_t scalar);
I have 2 questions for this interface: why must the dst be an m1 type, and why must the scalar be an m1 type? In my opinion, it should support the following:
vint8mf8_t vredsum_vs_i8mf8_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf8_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf4_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8mf2_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf2_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8m1_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8m1_t scalar);
...
vint8mf8_t vredsum_vs_i8mf4_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf8_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf4_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8mf2_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf2_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8m1_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
vint8mf4_t vredsum_vs_i8mf4_i8mf8_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf8_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8mf4_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf4_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8mf2_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf2_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8m1_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
To fully support the whole combination, we need 343 (7 x 7 x 7) intrinsics for the i8-type redsum.
However, users will be annoyed by this interface design. To solve this, I have the following solutions.
vint8_t get_first(vint8m1_t); // No op. This is not vmv_x_s.
vint8mf8_t vredsum_vs_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8_t scalar);
...
vint8mf8_t vredsum_vs_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8_t scalar);
...
This reduces the number of intrinsics for the i8-type redsum. Users could do the vmerge by themselves, with conversions such as mf8_to_m1 and m8_to_m1 provided. I prefer solution 2. Any ideas?
What kinds of C operators should we support for scalable vector types? What are the semantics of C operators on scalable vector types? Should they operate on VLMAX, on vl, or on something else?
What are the behaviors and limitations of scalable vector types?
We have made good progress, but I'm afraid that release 0.9 of the V spec is coming down fast, and methinks the most radical change it introduces is the new values of LMUL.
Please, share your thoughts about it here.
I find that there is no Narrowing Floating-Point Type-Convert intrinsic from 'vfloat32m1_t' to 'vfloat16mf2_t' in the Intrinsic Functions List.
Like this:
vfloat16mf2_t vfncvt_f_f_w_f16mf2 (vfloat32m1_t src);
Is it missing? Or is there no such intrinsic for some reason?
All of the vslideup and vslidedown intrinsics have an issue.
For example, the vslideup.vx operation is vd[i + rs1] = vs2[i], so we don't know vd's values for the elements below the offset.
If we want all of vd's values to be defined, we need to insert an argument for the initial vd value. Thus, I suggest adding a new argument for the initial vd value.
vint8m1_t vslideup_vx_i8m1 (vint8m1_t src, size_t offset);
change to
vint8m1_t vslideup_vx_i8m1 (vint8m1_t dst, vint8m1_t src, size_t offset);
May we have intrinsics to set vxrm
?
Following the fenv style:
#define VE_TONEARESTUP /*implementation defined*/
#define VE_TONEARESTEVEN /*implementation defined*/
#define VE_DOWNWARD /*implementation defined*/
#define VE_TOODD /*implementation defined*/
int vegetxround();
int vesetxround(int round);
vesetxround returns 0 on success, non-zero otherwise.
The intrinsic API should have the goal to make all the V-ext instructions accessible from C. We will provide intrinsics 1-to-1 mapping to assembly mnemonics and additional intrinsics for semantic reason, e.g. fma, splat, etc.
Dear all,
We are adding new intrinsics based on riscv-v-spec 0.9, but there remain two new intrinsics, vzero and vreinterpret, which I have never seen in riscv-v-spec 0.9.
The definitions of vzero and vreinterpret we have are as follows, in
src/llvm/include/llvm/IR/IntrinsicsRISCVVector.td
def int_riscv_vzero_i8m1
: Intrinsic<[llvm_nxv1i8_ty],
[],
[IntrNoMem]>;
def int_riscv_vreinterpret_u64_i64_u64m2
: Intrinsic<[llvm_nxv2i64_ty],
[llvm_nxv2i64_ty],
[IntrNoMem]>;
But in src/llvm/include/llvm/IR/RISCVInstrInfoV.td, riscv-v-spec 0.9 doesn't give any encoding definition for vzero and vreinterpret. Do you know how to define them?
//__builtin_riscv_vzero_i8m1()
def : Pat<(int_riscv_vzero_i8m1), ()>;
//___builtin_riscv_vreinterpret_u64_i64_u64m2(src)
def : Pat<(int_riscv_vreinterpret_u64_i64_u64m2), ()>;
For comparison, here is an example of an instruction that does have an encoding definition; vzero and vreinterpret lack one.
let hasSideEffects = 0, mayLoad = 1, mayStore = 0 in
multiclass VLoad_UnitStride<bits<3> nf, bits<1> mew, bits<2> mop,
bits<3> width, string opcodestr> {
def _m : RVInstVLoad<nf, mew, mop, width, RVV_Masked, OPC_LOAD_FP,
(outs VR:$vd), (ins GPR:$rs1, Zero:$zero, VMR:$vm),
opcodestr, "$vd, ${zero}(${rs1}), $vm"> {
let Inst{24-20} = 0b00000;
}
def _um : RVInstVLoad<nf, mew, mop, width, RVV_Unmasked, OPC_LOAD_FP,
(outs VR:$vd), (ins GPR:$rs1, Zero:$zero),
opcodestr, "$vd, ${zero}(${rs1})"> {
let Inst{24-20} = 0b00000;
}
}
defm VLE8_V : VLoad_UnitStride<0b000, 0b0, 0b00, 0b000, "vle8.v">, Sched<[]>;
defm VLE16_V : VLoad_UnitStride<0b000, 0b0, 0b00, 0b101, "vle16.v">, Sched<[]>;
//__builtin_riscv_vle8_v_i8m1(base)
def : Pat<(int_riscv_vle8_v_i8m1 GPR:$rs1), (VLE8_V_um GPR:$rs1, 0)>;
//__builtin_riscv_vle16_v_i16m1(base)
def : Pat<(int_riscv_vle16_v_i16m1 GPR:$rs1), (VLE16_V_um GPR:$rs1, 0)>;
Thank you
Best Regards
William
Hi,
we have been testing an early implementation of the intrinsics RFC against the two examples in this repository, using a small emulator of ours. We were able to run rvv_saxpy.c successfully using LMUL=1 instead of LMUL=8, but we found it needed a couple of changes beyond those required to go from m8 to m1.
diff --git a/rvv_saxpy.c b/rvv_saxpy.c
index 6b43025..a4de0f0 100644
--- a/rvv_saxpy.c
+++ b/rvv_saxpy.c
@@ -44,7 +44,7 @@ float output[N] = {
0.2484350696132857};
void saxpy_golden(size_t n, const float a, const float *x, float *y) {
- for (size_t i; i < n; ++i) {
+ for (size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
@@ -55,11 +55,11 @@ void saxpy_vec(size_t n, const float a, const float *x, float *y) {
vfloat32m8_t vx, vy;
for (; (l = vsetvl_e32m8(n)) > 0; n -= l) {
- vx = vle_v_f32m8(x);
+ vx = vle32_v_f32m8(x);
x += l;
- vy = vle_v_f32m8(y);
+ vy = vle32_v_f32m8(y);
vy = vfmacc_vf_f32m8(vy, a, vx);
- vse_v_f32m8 (y, vy);
+ vse32_v_f32m8 (y, vy);
y += l;
}
}
Also, we understand some of the loads in sgemm need updating to use intrinsics, as shown below:
diff --git a/rvv_sgemm.c b/rvv_sgemm.c
index 975ba99..71ebd28 100644
--- a/rvv_sgemm.c
+++ b/rvv_sgemm.c
@@ -49,7 +49,7 @@ float b_array[MAX_BLOCKSIZE] = {1.7491401329284098, 0.1325982188803279,
float golden_array[OUTPUT_LEN];
float c_array[OUTPUT_LEN];
-void sgemm_golden() {
+void sgemm_golden(void) {
for (size_t i = 0; i < MLEN; ++i)
for (size_t j = 0; j < NLEN; ++j)
for (size_t k = 0; k < KLEN; ++k)
@@ -63,23 +63,24 @@ void sgemm_vec(size_t size_m, size_t size_n, size_t size_k,
size_t ldb,
float *c, // m * n matrix
size_t ldc) {
- int i, j, k;
+ int j, k;
size_t vl;
vfloat32m1_t vec_c;
for (int i = 0; i < size_m; ++i) {
j = size_n;
const float *bnp = b;
float *cnp = c;
- for (; vl = vsetvl_e32m1(j); j -= vl) {
+ for (; (vl = vsetvl_e32m1(j)); j -= vl) {
const float *akp = a;
const float *bkp = bnp;
- vec_c = *(vfloat32m1_t *)cnp;
+ vec_c = vle32_v_f32m1(cnp);
for (k = 0; k < size_k; ++k) {
- vec_c = vfmacc_vf_f32m1(vec_c, *akp, *(vfloat32m1_t *)bkp);
+ vec_c = vfmacc_vf_f32m1(vec_c, *akp,
+ vle32_v_f32m1(bkp));
bkp += ldb;
akp++;
}
- *(vfloat32m1_t *)cnp = vec_c;
+ vse32_v_f32m1(cnp, vec_c);
cnp += vl;
bnp += vl;
}
@@ -98,7 +99,7 @@ int fp_eq(float reference, float actual, float relErr)
int main() {
// golden
memcpy(golden_array, b_array, OUTPUT_LEN * sizeof(float));
- sgemm_golden(MLEN, NLEN, KLEN, a_array, KLEN, b_array, NLEN, golden_array, NLEN);
+ sgemm_golden();
// vector
memcpy(c_array, b_array, OUTPUT_LEN * sizeof(float));
sgemm_vec(MLEN, NLEN, KLEN, a_array, KLEN, b_array, NLEN, c_array, NLEN);
However, I'm confused by the sgemm vector code, as it seems to load a row of the "b" matrix rather than a column. Should this load below
vle32_v_f32m1(bkp));
load a column instead, using a strided load, or have I totally misunderstood the code?
Kind regards,
Having VL be an opaque type adds useless complexity in the general case, and is not justified in any way beyond "To avoid users to manipulate vl argument in explicit vl intrinsic", which doesn't make sense. The point of an explicit VL in the intrinsics is to be able to avoid vsetvl for short-term changes. If vsetvl is still needed to obtain the proper opaque type, what's the point?
A typical way of writing this:
for (i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
would be (not the proper syntax, to save time)
while (i < n) {
  unsigned long vl = vsetvl(n - i);
  vtype va = vload(a + i, vl);
  vtype vb = vload(b + i, vl);
  vtype vc = vadd(va, vb, vl);
  vstore(c + i, vc, vl);
  i += vl;
}
Forcing an opaque type on vl will add complexity by imposing an unexpected type on a value that is, after all, semantically a number.
The point of an explicit VL is that this:
vsetvl(X);
vop(a,b);
can be shortened to the following, making the intrinsics fully specified (not just 2 of the 3 implicit values, lmul and sew, but all three of them):
vop(a,b,X);
If a vsetvl is needed anyway, then it's useless...
To summarize:
a) with only non-fully-specified (VL-less) intrinsics, opacity adds complexity and brings nothing;
b) with fully-specified intrinsics, it defeats the point by forcing vsetvl() where none is needed or wanted.
AFAIK, the RISC-V vector 0.7.1 spec is a stable version, like a long-term release of the Linux kernel;
we should consider making the functions cover multiple versions of the vector spec :)
Based on the behavior of the vcompress instruction, this argument seems to be the vector to use for the tail elements. It doesn't have a direct relationship to the mask the way "maskedoff" does for other instructions. The lower popcnt(mask) elements of the result of this intrinsic come from "src", while the upper elements [vl-1:popcnt(mask)] come from "maskedoff". The positions of the mask bits don't control which elements of maskedoff are used; rather, the number of 1s in the mask does.
This also means that this intrinsic must use the tail undisturbed policy.
As https://github.com/sifive/rvv-intrinsic-doc/issues/10#issuecomment-617226293 mentions,
should we support reinterpret functions between different types of the same LMUL?
e.g. i32m1 <-> i64m1
Is there any real scenario for this?
In order to support segment load/store, I would like to introduce a bunch of new types and APIs. The function naming might change in the future to fit the 1-to-1 mapping naming rule, so the main thing I want feedback on is the interface and the new types.
There are 3 parts in this RFC:
v{TYPE}{SEW}m{LMUL}x{NR}_t
Symbol used in naming rule:
TYPE := vector type
SCALAR_TYPE := corresponding scalar type
TYPE_L := Type letter, i, u or f
{TYPE} vseg_loadx_{TYPE_L}{SEW}m{LMUL}x{NR} (const {SCALAR_TYPE} *base, vuint{SEW}m{LMUL}_t bindex)
void vseg_storex_{TYPE_L}{SEW}m{LMUL}x{NR} ({SCALAR_TYPE} *base, vuint{SEW}m{LMUL}_t bindex, {TYPE} value)
void vseg_storeux_{TYPE_L}{SEW}m{LMUL}x{NR} ({SCALAR_TYPE} *base, vuint{SEW}m{LMUL}_t bindex, {TYPE} value)
e.g
Hi,
I find that there are no intrinsics such as vreinterpret_v_f16m1_f32m1(vfloat16m1_t src).
How can I expand float types, for example from vfloat16m1_t to vfloat32m1_t?
For example, vslide1up/down and vmv.s.x have the constraints below:
If XLEN < SEW: the value is sign-extended to SEW bits.
If XLEN > SEW: the least-significant bits are copied over and the high XLEN-SEW bits are ignored.
Current SiFive proposal: provide a signed/unsigned interface, but the second operand's type is always long. (The second operand's type corresponds to XLEN.)
In RV64 with an SEW=64 config (XLEN == SEW):
vuint64m1_t vslide1up_vs_u64m1 (vuint64m1_t src, long value);
// Is this weird? In this HW config, the value's type could be unsigned long.
In RV32 with an SEW=64 config (XLEN < SEW):
vuint64m1_t vslide1up_vs_u64m1 (vuint64m1_t src, long value);
// An unsigned long value is illegal because the value would be sign-extended to 64 bits.
Does this also mean that for almost all vector-scalar integer operations, the type of the scalar should be long or unsigned long?
In addition, FLEN < SEW has a similar problem.
There is another problem(?) in the RV64 SEW=32 config (XLEN > SEW):
vint32m1_t vslide1up_vs_i32m1 (vint32m1_t src, long value);
// The value will be truncated to 32 bits implicitly.
One possible solution is to support the intrinsic functions optionally, depending on the HW config.
For example, on an RV32 platform, SEW=64 vector-scalar integer operations would not be supported.
Any ideas or suggestions?
P.S. vslide1up/down: the slide instructions move one element up or down a vector register group.
P.S. vmv.s.x: the integer scalar move instruction copies the scalar integer register to element 0 of the destination vector register.
P.S. The spec defines the value as SEW bits, up to max(XLEN, FLEN).
P.S. RV32 means XLEN and long are 32 bits.
P.S. In the intrinsic document, the vmv.s.x API has not been finalized yet.
The v{f}splat routines appear to be simple aliases for the v{,f}mv_v_{x,f} routines. There's obviously no harm in including them; however, in such an enormous API I think we should take measures to reduce complexity whenever possible. Thus I propose removing the splats.
Less drastic measures are (1) to deprecate them now and revisit removing them sometime in the future, or (2) to make them an optional extension.
The vector load/store intrinsics could support a non-zero vstart value.
Currently, the vector load/store intrinsics imply a call to vsetvl to update the EMUL, and the vsetvl resets vstart to zero.
Thus, the following code will not work:
set_vstart(new_vstart_value);
dst = vload_e32_m1(ptr); // Since the vload implies a vsetvl to set the EMUL, the "new_vstart_value" will be cleared.
We might need an additional vstart parameter for the vector load/store intrinsics, which would set the new vstart value after the vsetvl instruction.
In a number of places, the V-extension intrinsics API uses long or unsigned long to represent an XLEN-bit integer (signed or unsigned, respectively). It's true that long has size XLEN in the currently defined ILP32 and LP64 ABIs:
https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md#-named-abis
However, note that
A future version of this specification may define an ILP32 ABI for the RV64 ISA
And one can imagine a similar scenario for RV128. Hence, I think it's preferable to future-proof the intrinsics API by using typedefs [Footnote 1]. I don't think this adds much notational overhead.
Please note that we had the same discussion for the P-extension intrinsics, and arrived at the same conclusion
https://lists.riscv.org/g/tech-p-ext/topic/73148653#26
[Footnote 1]
On GCC and Clang/LLVM, you can alias the relevant types as follows:
#include <inttypes.h>
#if __riscv_xlen == 32
typedef int32_t int_xlen_t;
typedef uint32_t uint_xlen_t;
#elif __riscv_xlen == 64
typedef int64_t int_xlen_t;
typedef uint64_t uint_xlen_t;
#endif
Those who enjoy preprocessor wizardry might prefer something like
#define create_type2(a, b, c) a ## b ## c
#define create_type1(a, b, c) create_type2(a, b, c)
typedef create_type1(int,__riscv_xlen,_t) int_xlen_t;
typedef create_type1(uint,__riscv_xlen,_t) uint_xlen_t;
EPI believes it is better to always express the return type. The types of the operands are easy to determine by looking at the declarations of the variables being used; however, if a function call is used as an operand, having the return type in the name is more immediate (i.e., it avoids "recursively" going to the declarations of the operands to understand their types). This may be even more relevant for conversions, where the result type differs from the operand types.
However, there are some operations with the same return type. We need to decide how to encode return type in intrinsics name to avoid overloading.
We want to remove the implicit VL API, make some API changes, and change the high-level semantics of the vector intrinsics so that we can do more aggressive optimization.
The goal of this RFC is to improve the intrinsics for vector programming and to choose the explicit VL API as the only intrinsic API. This RFC consists of two parts: the first part explains why we picked the explicit VL API as the final proposal, and the second part removes the concept of low-level register status from the intrinsic programming model.
Last year we announced the intrinsics spec for vector programming in C. We got lots of useful feedback from several different parties, including BSC, SiPearl, Andes, OpenCV China, PLCT Lab, and Alibaba/T-Head.
Today we have open-source implementations in both GCC and LLVM*, implemented with two different approaches: implicit VL and explicit VL. We expected that a compiler could implement one API in terms of the other via a simple C wrapper, e.g. implementing the explicit VL API as a wrapper around the implicit VL API; however, this turned out to be a barrier to optimization.
The issue is that the concepts behind the two APIs are fundamentally incompatible. After long discussion and exploration, we think it's time to pick one as the final proposal for the intrinsics spec, in order to reduce the compiler maintenance effort and the learning curve of the intrinsic functions.
Keeping only one intrinsic API also has an advantage on the compiler optimization side: we found several optimization opportunities, but we couldn't exploit them because we had to guarantee correct semantics and behavior for both APIs.
So the question is which one is better. The reason we originally had two different API styles is that we had no conclusion at the time on which API is better. This time is different: we now have enough exploration, experience, and feedback to make the right decision.
After implementing the intrinsic API in both GCC and LLVM, we found several good reasons to use the explicit VL API from the compiler's perspective: the explicit VL def-use relationship is more natural to the compiler, which makes analysis and optimization easier.
We also got feedback from users of both APIs: the implicit VL API is less verbose, but it makes the status of the VL register hard to track, which also makes debugging harder; having an explicit VL argument makes programming easier.
So based on both sides, user feedback and compiler implementation considerations, we believe explicit VL is the right way to go, and based on that decision we propose the following changes to the vector intrinsic API.
During the implementation phase, we found that the status of the VL register becomes an optimization barrier: we must maintain the correct order between vreadvl, vsetvli, and all other explicit VL intrinsics, because the explicit VL API has the semantics of writing the VL register.
For example, we can't reorder the operations across vreadvl in the following program.
n = 10000;
avl = vsetvl_e32m4(n);    // Assume VLEN=128, so avl = 16
vl1 = vreadvl();          // 16
vint32m4_t const_one = vmv_v_x_i32m4_vl(1, 4);
vl2 = vreadvl();          // 4
vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, avl / 2);
vl3 = vreadvl();          // 8
vse32_v_i32m4_vl(a, tmp, vl3);
Furthermore, we can't move an explicit VL vector intrinsic across any other explicit VL vector intrinsic, because the def-use relationship is modeled coarse-grained: we only model that an intrinsic writes some global status, but it's hard to describe and track in detail which status is changed. That's a compiler implementation limitation of the current mainstream open-source compilers.
So abstracting the low-level register status away from the high-level language layer is a straightforward option: abstract the VL register and make the vector length just an argument. That frees us from the implementation limitation, and it also brings several advantages from the optimization point of view: we can model almost every intrinsic function as a pure function (except loads/stores and a few special instructions), i.e. with no side effects, which is a fantastic property in compiler optimization land.
To demonstrate the power of treating most intrinsic functions as pure functions, here is a function with a loop, containing a loop-invariant computation.
void foo(int *a, int n) {
  size_t vl, vlmax;
  while ((vl = vsetvl_e32m4(n))) {
    vlmax = vsetvlmax_i32m4();
    vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
    vl = vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
Since vsetvl and vsetvlmax are now pure functions, vlmax = vsetvlmax_i32m4();
can be safely hoisted outside the loop:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4();
  while ((vl = vsetvl_e32m4(n))) {
    vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
    vl = vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
Then all arguments of vmv_v_x_i32m4_vl are loop-invariant, so we can hoist it too:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4();
  vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
  while ((vl = vsetvl_e32m4(n))) {
    vl = vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
We can also see that vsetvl is now called twice with the same input; because it is a pure function, the CSE pass can easily eliminate the redundant call:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4();
  vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
  while ((vl = vsetvl_e32m4(n))) {
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
The demonstration above makes the advantage of abstracting the VL register status obvious. The same transformations are hard to do for the implicit VL API; consider the following example:
void foo(int *a, int n) {
  size_t vl;
  while ((vl = vsetvl_e32m4(n))) {
    vsetvlmax_i32m4();
    vint32m4_t const_one = vmv_v_x_i32m4(1);
    vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4(const_one, 4);
    vse32_v_i32m4(a, tmp);
    n -= vl;
    a += vl;
  }
}
Since there are hidden dependencies between all vector intrinsic functions, it is impossible to reorder any pair of them, which means no optimization can be done at all.
Additionally, we found this could fix other potential issues in code generation for GNU vector extension types, since the rest of the generic compiler infrastructure is not aware of the VL register status, and that causes poor performance.
So what are GNU vector extension types? GCC and Clang/LLVM both provide a vector type extension to make it easier to write SIMD programs: we can declare a type with a vector_size attribute and then operate on variables of that type like normal scalar types, via the ordinary operators.
typedef int int32x4_t __attribute__ ((vector_size (16)));
int32x4_t a, b, c;
a = b + c; // NO explicit VL reg use or def in middle-end
// but it will expand to vsetvli_e32m1(4) and vadd
We can generate vector instructions for that code, but this code-generation path might require changing VL to fit the semantics of the operation: in the example above, VL should be set to 4 before performing the operation, and that would be an issue if we modeled VL in the compiler middle-end.
LLVM's VPlan IR faces a similar situation in the compiler middle-end for GNU vector code generation.
For the above reasons, we believe abstracting low-level register modeling out of the high-level language layer is the right way to go.
However, several APIs must change as a result of removing the concept of VL from the C language layer.
The first is the vreadvl API, which exposes the status of the VL register; we must remove it to prevent leaking low-level information into the high-level programming language.
Second, there is one class of instructions in the RISC-V vector ISA, other than vsetvl[i], that changes VL: the vle*ff.v (fault-only-first load) instructions, which update the VL register if an exception occurs before VL elements have been loaded. We introduce an extra argument to return the new content of the VL register:
vint16m1_t vle16ff_v_i16m1 (const int16_t *base); // Current API
vint16m1_t vle16ff_v_i16m1 (const int16_t *base, size_t *vl); // New API
The last change is the API naming: the _vl suffix becomes redundant and verbose once the explicit VL API is the only vector intrinsic API.
In summary, this RFC proposes choosing the explicit VL API as the final vector intrinsic API, to reduce the complexity of the compiler implementation; second, it proposes abstracting low-level register modeling out of the high-level language layer, to enable further optimization of vector intrinsic programs; and last, parts of the intrinsic API must change as a consequence of the above.
For example:
#include <riscv_vector.h>
unsigned short x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
unsigned char y[16];
void foo() {
vsetvl_e16m1(8);
vuint16m1_t vx = vle16_v_u16m1(x);
vuint8m1_t vy = vreinterpret_v_u16m1_u8m1(vx);
// vsetvl_e8m1(16); // should compiler update vl correctly?
vse8_v_u8m1(y, vy);
}
When casting from vuint16m1_t to vuint8m1_t, the element count changes from 8 to 16. So should the compiler update vl, or must the programmer do it manually?