riscv-non-isa / rvv-intrinsic-doc
Home Page: https://jira.riscv.org/browse/RVG-153
License: BSD 3-Clause "New" or "Revised" License
I think there are some intrinsic functions in the explicit vl API which don't need to receive a vl argument.
For example, vmv.x.s and vfmv.f.s always work even when vl == 0 (ref: riscv/riscv-v-spec#284), and the vl argument is meaningless for vundefined and vreinterpret.
Is there anything else missing?
If no one has objections, I will update the function list.
Hi,
Kai wrote the RFC for RVV intrinsic API proposal based on our current discussion.
Any suggestions and feedback are welcome.
https://github.com/sifive/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md
Those instructions use vl in their instruction definitions [1], so I think we should add vl to the parameter list of those functions, e.g.
vbool1_t vmand_mm_b1 (vbool1_t op1, vbool1_t op2);
should be
vbool1_t vmand_mm_b1_vl (vbool1_t op1, vbool1_t op2, size_t vl);
[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#16-vector-mask-instructions
We provide intrinsics with and without vl at the same time:
vop_vv_type(a, b)
vop_vv_type_vl(a, b, gvl)
Treat SEW and LMUL as parameters to vsetvl, instead of providing a bunch of vsetvl intrinsics like the following.
size_t vsetvl_8m1 (size_t avl);
size_t vsetvl_8m2 (size_t avl);
size_t vsetvl_8m4 (size_t avl);
size_t vsetvl_8m8 (size_t avl);
size_t vsetvl_16m1 (size_t avl);
size_t vsetvl_16m2 (size_t avl);
size_t vsetvl_16m4 (size_t avl);
size_t vsetvl_16m8 (size_t avl);
size_t vsetvl_32m1 (size_t avl);
size_t vsetvl_32m2 (size_t avl);
size_t vsetvl_32m4 (size_t avl);
size_t vsetvl_32m8 (size_t avl);
size_t vsetvl_64m1 (size_t avl);
size_t vsetvl_64m2 (size_t avl);
size_t vsetvl_64m4 (size_t avl);
size_t vsetvl_64m8 (size_t avl);
Hi, did you already figure out what the header organization will be? I think the same idea can be applied to the documentation too.
Hi, I noticed that the Mask Register Layout has changed in riscv-v-spec version 0.9.
A vector mask occupies only one vector register regardless of SEW and LMUL. Each element is allocated a single mask bit in a mask vector register.
The mask bit for element i is located in bit i of the mask register, independent of SEW or LMUL.
But if I understand correctly, the design of the mask types in the rvv intrinsics still uses a suffix n (n = SEW/LMUL) to encode MLEN.
Should this part be modified to be consistent with the v0.9 spec? Is there any design content in progress that can be shared?
sincerely,
Yin Zhang
I'm concerned about several intrinsics which take a "shift" amount as uint8_t. This only allows shifts up to 255, which fails to expose the underlying functionality for larger SEW (and ELEN).
Particular examples are vsll_vx, vsr{a,l}_vx, vnsr{a,l}_vx, vssr{a,l}_vx, and vnclip{,u}_wx. I only skimmed quickly; perhaps there are more.
The vv/wv forms use an appropriate type-width --- one that increases with SEW --- for their (vector) op2, and I suggest matching that in the vx/wx forms.
(It's true that the vi forms are restricted to a 5-bit op2, but these forms aren't exposed in the intrinsics API.)
EDIT: on further thought, perhaps it's more appropriate to use uintXLEN_t (or whatever it's called).
This is a placeholder to discuss debug info for the extended types among the RVV intrinsic types. I think we should standardize how debug info is generated for those data types.
For example, asm codes:
lw iState, (pState) # pState is a pointer to a float32_t array; we load its IEEE 754 value
vslide1down.vx vState, vState, iState
Could we have an API like:
vfloat32m8_t vslide1down[_vs_f32m8] (vfloat32m8_t src, float value);
so that we don't need to cast vector types in C code.
Currently the vector spec only defines all vector registers/CSRs as caller saved, but it does not specify how to pass vectors as arguments.
We propose a calling convention where named vector arguments are passed from v1 to v31. A vector type with LMUL > 1 must be allocated to the next vector register number that is aligned to its LMUL. Vector types with fractional LMULs and vector mask types (vbool*_t) are treated as occupying one register. Segment vector types should be passed in consecutive vector registers aligned to the base vector's LMUL. Vector types are returned in the same manner as the first vector argument. If all vector registers for argument passing are exhausted, then the rest of the vector arguments are passed on the stack as whole vector registers, by pointer.
Some examples (the argument name corresponds to the vector register it uses):
// Vector arguments are passed from v1, v2, ..., v31
void f(vint8m1_t v1, vint8m1_t v2);
// For LMUL=8 types, they are passed in v8, v16, v24, and the rest on stack
void f(vint8m8_t v8_v15, vint8m8_t v16_v23);
void f(vint8m8_t v8_v15, vint8m8_t v16_v23, vint8m8_t v24_v31, vint8m8_t on_stack);
// For arguments with mixed LMUL, the vector register number is aligned to LMUL
void f(vint16m2_t v2_v3, vint8m1_t v4, vint64m8_t v8_v15);
// Returning grouped vector should be aligned to its LMUL (v8~v15 in this case)
vint32m8_t f(vint16m4_t v4_v7);
// fractional LMUL or vbool types are treated as LMUL=1
void f(vint8mf2_t v1, vbool8_t v2, vint8mf8_t v3);
// Segment types are aligned to the base LMUL
void f(vint8m1_t v1, vint16m2x3_t v2_v7);
We avoid allocating v0 in the calling convention due to its ubiquitous purpose as the mask register, so the callee does not have to move the first argument off v0 if it needs to use masked instructions.
The spec already defines all vector registers as caller-saved, so all of them may be allocated either as argument-passing registers or as temporary registers. The proposal currently uses all of them for passing arguments, so it is possible to pass up to 3 m8 arguments via registers, but this may be up for debate anyway.
There is also a small optimization opportunity where smaller LMUL arguments can fill holes left by alignments due to previous larger LMUL arguments, for example:
void f(vint8m1_t v1, vint8m8_t v8_v15, vint8m1_t v2_or_v16);
Since v2 to v7 remain unused in the example, the m1 argument following the m8 one may be packed into v2 instead of using the next register, v16. This uses the registers more efficiently at the cost of slightly more complexity in the calling convention.
Any thoughts?
Currently, we have vreadvl() to get the vl; we could have more utility functions for mstatus, vlenb, vstart, and the other status registers.
The current intrinsic RFC does not model tail elements, but some users may want to set the tail undisturbed, as in riscv/riscv-v-spec#157 (comment).
I think maybe we could provide an additional tail operand only for the reduction intrinsic functions to model tail elements.
Any ideas?
I am trying to run the example programs but did not find the riscv_vector.h header file.
I have already installed riscv-gnu-toolchain along with riscv-pk and riscv-spike.
Do I need to install some other tools or set up some other environment to access the RISC-V vector extension intrinsics (riscv_vector.h)?
I notice we use riscv_vector.h for intrinsics both with and without the vl argument. Could we split riscv_vector.h into riscv_vector_vl.h and riscv_vector.h?
If users want to use intrinsics with the vl argument, they could include riscv_vector_vl.h.
If users want to use intrinsics without the vl argument, they could include riscv_vector.h.
In addition, compilers may give a warning (or error) if users include these files in the same source.
Hi, storing mask bits seems like a reasonable thing to want to do.
In assembly we can just use the v0 register, but in the intrinsics vbool and vuint are separate types, and I haven't seen any way to convert between them.
vuint8m8_t v = vmv_v_x_u8m8(0);
vbool1_t m = vmseq_vv_u8m8_b1(v, v);
uint8_t bits[256];
vse8_v_u8m1(bits, m);
// error: cannot convert 'vbool1_t' to 'vuint8m1_t'
Is there a way to do this, or can we add one?
Hi,
May I ask why the "Vector Integer Add-with-Carry / Subtract-with-Borrow Functions" keep their suffixes, e.g., vadc_vvm, vadc_vxm? I think they can be distinguished by their argument types, just like vadd. Thanks.
I would like to propose a different overloading rule to reduce the number of intrinsic functions in the generic API: use function overloading (the overloadable attribute in clang) instead of C11 generic selection, to support overloaded functions with different numbers of arguments. Applying the above rule, we could reduce the number of intrinsic functions by roughly 79% in the overloading API (from 13106 to 2716).
Any suggestions?
The details of the changes are shown below:
vle/vlse and related segment load instructions:
vint8m1_t vle8_v_i8m1 (const int8_t *base);
vint8m2_t vle8_v_i8m2 (const int8_t *base);
viota.m:
vuint8mf8_t viota_m_u8mf8 (vbool64_t op1);
vuint16mf4_t viota_m_u16mf4 (vbool64_t op1);
vmv.v.x, vfmv.v.f:
vint8m1_t vmv_v_x_i8m1 (int8_t src);
vint8m2_t vmv_v_x_i8m2 (int8_t src);
vmclr.m/vmset.m/vid.v (empty argument)
vle/vlse and related segment load instructions
vse, vsse, indexed load/store, and related segment load/store instructions
viota.m
vmv.v.v, vmv.x.s, vmv.s.x, vfmv.f.s, vfmv.s.f
vmadc.vvm, vmadc.vxm, vmadc.vv, vmadc.vx
vmsbc.vvm, vmsbc.vxm, vmsbc.vv, vmsbc.vx
Hi,
I want to use rvv intrinsics to combine/split vectors.
The combining operation is like the vcombine instruction in NEON: concatenating two short vectors.
The splitting operation is something like the vget_low and vget_high instructions in NEON: getting the halves of a vector.
But I didn't find intrinsics to combine/split vectors in the reference manual.
I think maybe I can use shift operations and some arithmetic operations to implement these two functions, but that is too complicated. Is there an easier way to do this? Or should we add such functions to the intrinsics?
Sincerely,
Yin
Those two intrinsic functions have _m as a postfix in the function name in the current RFC:
vint8m1_t vmerge_vvm_i8m1_m (vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
vint8m1_t vcompress_vm_i8m1_m (vbool8_t mask, vint8m1_t maskedoff, vint8m1_t src);
We propose removing _m and changing them to:
vint8m1_t vmerge_vvm_i8m1 (vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
vint8m1_t vcompress_vm_i8m1 (vbool8_t mask, vint8m1_t maskedoff, vint8m1_t src);
Only instructions taking a v0.t mask operand need _m to distinguish the masked from the non-masked operation; vmerge and vcompress already encode the mask in their vvm and vm forms, so they do not need an "_m" version. FYI: v0.t and the v0 mask register have different meanings in the v-spec.
There are some requirements from users who want to keep the current SEW but change LMUL (in #28 and #37).
Maybe intrinsic functions could support LMUL truncation and LMUL extension regardless of vl. (That means those functions would not change the vl register.)
P.S. I think this is not a reinterpret operation, because the source and destination register group sizes differ.
The naming could be vlmul_[ext|trunc]_v_<src_type_with_lmul>_<dst_type_with_lmul>; the interfaces would look like:
// LMUL Extension, vlmul_ext_v_<src_lmul>_<target_lmul>
vint64m2_t vlmul_ext_v_i64m1_i64m2 (vint64m1_t op1);
vint64m4_t vlmul_ext_v_i64m1_i64m4 (vint64m1_t op1);
vint64m8_t vlmul_ext_v_i64m1_i64m8 (vint64m1_t op1);
vint64m4_t vlmul_ext_v_i64m2_i64m4 (vint64m2_t op1);
vint64m8_t vlmul_ext_v_i64m2_i64m8 (vint64m2_t op1);
vint64m8_t vlmul_ext_v_i64m4_i64m8 (vint64m4_t op1);
vint64m1_t vlmul_trunc_v_i64m2_i64m1 (vint64m2_t op1);
vint64m1_t vlmul_trunc_v_i64m4_i64m1 (vint64m4_t op1);
vint64m2_t vlmul_trunc_v_i64m4_i64m2 (vint64m4_t op1);
vint64m1_t vlmul_trunc_v_i64m8_i64m1 (vint64m8_t op1);
vint64m2_t vlmul_trunc_v_i64m8_i64m2 (vint64m8_t op1);
vint64m4_t vlmul_trunc_v_i64m8_i64m4 (vint64m8_t op1);
Any thoughts?
It seems that vfrsqrte7 and vfrece7 from v0.9 have not been added yet, perhaps pending the naming discussion in riscv/riscv-v-spec#601?
Operations that generate a "scalar vector" (e.g. vmv.s.x, vfmv.s.f) have a dst operand of the same type as the destination, so the original value, except element 0, can be preserved.
For instance
vint32m1_t vmv_s_x_i32m1 (vint32m1_t dst, int32_t src);
Reductions also generate "scalar vectors" but don't seem to have the same treatment.
vint32m1_t vredmax_vs_i32m2_i32m1 (vint32m2_t vector, vint32m1_t scalar);
Do we want to have a dest operand in this case? Like this:
vint32m1_t vredmax_vs_i32m2_i32m1 (vint32m1_t dest, vint32m2_t vector, vint32m1_t scalar);
I don't think it is super fundamental but maybe someone wants to preserve the other elements for some reason?
There are several possible ways to implement a vector tuple type.
The advantage of options 2, 3, 4, and 5 is that they provide syntactic sugar for accessing an element of the vector tuple type instead of an intrinsic function call:
/* ------ Primitive style ------ */
vint32m2x3_t vt;
vint32m2_t va;
vint32m2x3_t vt2 = vcreate_i32m2x3(va, va, va); // Creation.
vt = vset_i32m2x3(vt, 0, va); // Insertion.
va = vget_i32m2x3(vt, 0); // Extraction.
/* ------ Array style ------ */
typedef vint32m2_t vint32m2x3_t[3];
vint32m2x3_t vt;
vint32m2_t va;
vint32m2x3_t vt2 = {va, va, va}; // Creation.
vt[0] = va; // Insertion.
va = vt[0]; // Extraction.
/* ------ Struct style ------ */
typedef struct {
vint32m2_t x;
vint32m2_t y;
vint32m2_t z;
} vint32m2x3_t;
vint32m2x3_t vt;
vint32m2_t va;
vint32m2x3_t vt2 = {va, va, va}; // Creation.
vt.x = va; // Insertion.
va = vt.y; // Extraction.
Currently SVE's GCC implementation uses option 4 and disallows declaring arrays and structs of scalable vector types.
Can you please add a few simple convenience functions for copying data between these vector register types and standard C arrays, e.g., to make debugging and things like printf() of a vector easier? I'm thinking of something like: int64_t foo[MVL]; toArray(foo, vs1);
A vint8m1_t type can definitely be changed to vint8m8_t.
A vint8m8_t type can be changed to vint8m1_t if vl is already known. (Of course, users should be aware of what they are doing.)
vint8m2_t vreinterpret_v_i8m1_i8m2 (vint8m1_t src);
vint8m4_t vreinterpret_v_i8m1_i8m4 (vint8m1_t src);
vint8m8_t vreinterpret_v_i8m1_i8m8 (vint8m1_t src);
...
vint8m1_t vreinterpret_v_i8m8_i8m1 (vint8m8_t src);
vint8m2_t vreinterpret_v_i8m8_i8m2 (vint8m8_t src);
vint8m4_t vreinterpret_v_i8m8_i8m4 (vint8m8_t src);
SVE uses _z, _m, and _x in the function suffix to describe the inactive elements in the result of a predicated operation: _z for zeroing, _m for masked-off (merging), and _x for don't-care.
The current mask operation API (applying Romain's suggestion):
vint8m1_t vadd_vv_mask_i8m1 (vbool8_t mask, vint8m1_t maskedoff, vint8m1_t op1, vint8m1_t op2);
Should we support mask functions without the masked-off (merge) parameter in the primitive layer?
(P.S. I assume intrinsics in the primitive layer have 1-to-1 orthogonal naming with the assembly.)
If yes, how should we name them? For example:
vint8m1_t vadd_vv_mask_x_i8m1 (vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
Maybe using the mask abbreviation and putting it into the function suffix is clearer:
vint8m1_t vadd_vv_i8m1(vint8m1_t op1, vint8m1_t op2);
vint8m1_t vadd_vv_i8m1_m(vbool8_t mask, vint8m1_t maskedoff, vint8m1_t op1, vint8m1_t op2);
vint8m1_t vadd_vv_i8m1_x(vbool8_t mask, vint8m1_t op1, vint8m1_t op2);
The general naming rule is to encode the destination type into the function suffix.
Romain Dolbeau suggests:
Then I would suggest retaining the output-type-as-suffix, and altering
the type specifier from the mnemonic (e.g., the 'm' bit in the name) to
make it more significant, as it's the part that differs, i.e. something
along the line of:
unsigned long vpopc_mb1_ulong(vbool1_t op1);
unsigned long vpopc_mb2_ulong(vbool2_t op2);
A generic/overloaded version could drop the extra bit.
Following #25, the current reduction instructions use this form:
vint8m1_t vredsum_vs_i8mf8_i8m1 (vint8m1_t dst, vint8mf8_t vector, vint8m1_t scalar);
vint8m1_t vredsum_vs_i8mf4_i8m1 (vint8m1_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
vint64m1_t vredsum_vs_i64m8_i64m1 (vint64m1_t dst, vint64m8_t vector, vint64m1_t scalar);
I have 2 questions for this interface: why must the dst be an m1 type, and why must the scalar be an m1 type? In my opinion, it should support the following:
vint8mf8_t vredsum_vs_i8mf8_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf8_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf4_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8mf2_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf2_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8m1_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8m1_t scalar);
...
vint8mf8_t vredsum_vs_i8mf4_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf8_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf4_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8mf2_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf2_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8m1_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
vint8mf4_t vredsum_vs_i8mf4_i8mf8_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf8_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8mf4_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf4_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8mf2_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf2_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8m1_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
To fully support the whole combination, we need 343 (7 x 7 x 7) intrinsics for the i8-type redsum.
However, users will be annoyed by this interface design. To solve this, I have the following solutions.
vint8_t get_first(vint8m1_t); // No op. This is not vmv_x_s.
vint8mf8_t vredsum_vs_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8_t scalar);
...
vint8mf8_t vredsum_vs_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8_t scalar);
...
This reduces the number of intrinsics for the i8-type redsum. Users could do the vmerge by themselves, with conversions such as mf8_to_m1 and m8_to_m1 provided. I prefer solution 2. Any ideas?
What kinds of C operators should we support for scalable vector types? What are the semantics of C operators on scalable vector types? Should they operate on VLMAX, on vl, or on something else?
What are the behaviors and limitations of scalable vector types?
We have made good progress, but I'm afraid that release 0.9 of the V spec is coming down fast, and methinks the most radical change it introduces is the new values of LMUL.
Please, share your thoughts about it here.
I find that there is no Narrowing Floating-Point Type-Convert intrinsic from 'vfloat32m1_t' to 'vfloat16mf2_t' in the Intrinsic Functions List.
Like this:
vfloat16mf2_t vfncvt_f_f_w_f16mf2 (vfloat32m1_t src);
Is it missing? Or is there no such intrinsic for some reason?
All of the vslideup and vslidedown intrinsics have an issue.
For example, the vslideup.vx operation is vd[i + rs1] = vs2[i], so we don't know vd's values for the elements below the offset.
If we want all of vd's values to be defined, we need to insert an argument for the initial vd value. Thus, I suggest adding a new argument for the initial vd value.
vint8m1_t vslideup_vx_i8m1 (vint8m1_t src, size_t offset);
change to
vint8m1_t vslideup_vx_i8m1 (vint8m1_t dst, vint8m1_t src, size_t offset);
May we have intrinsics to set vxrm
?
Following the fenv style:
#define VE_TONEARESTUP /*implementation defined*/
#define VE_TONEARESTEVEN /*implementation defined*/
#define VE_DOWNWARD /*implementation defined*/
#define VE_TOODD /*implementation defined*/
int vegetxround();
int vesetxround(int round);
vesetxround returns 0 on success, non-zero otherwise.
The intrinsic API should have the goal to make all the V-ext instructions accessible from C. We will provide intrinsics 1-to-1 mapping to assembly mnemonics and additional intrinsics for semantic reason, e.g. fma, splat, etc.
Dear all,
We are adding new intrinsics based on riscv-v-spec 0.9, but there remain two new intrinsics, vzero and vreinterpret, which I have never seen in riscv-v-spec 0.9.
The definitions of vzero and vreinterpret we have are as follows, in
src/llvm/include/llvm/IR/IntrinsicsRISCVVector.td
def int_riscv_vzero_i8m1
: Intrinsic<[llvm_nxv1i8_ty],
[],
[IntrNoMem]>;
def int_riscv_vreinterpret_u64_i64_u64m2
: Intrinsic<[llvm_nxv2i64_ty],
[llvm_nxv2i64_ty],
[IntrNoMem]>;
But in src/llvm/include/llvm/IR/RISCVInstrInfoV.td, riscv-v-spec 0.9 doesn't give any encoding definition for vzero and vreinterpret. Do you know how to define them?
//__builtin_riscv_vzero_i8m1()
def : Pat<(int_riscv_vzero_i8m1), ()>;
//___builtin_riscv_vreinterpret_u64_i64_u64m2(src)
def : Pat<(int_riscv_vreinterpret_u64_i64_u64m2), ()>;
For comparison, here is an example of an instruction that does have an encoding definition; vzero and vreinterpret lack one.
let hasSideEffects = 0, mayLoad = 1, mayStore = 0 in
multiclass VLoad_UnitStride<bits<3> nf, bits<1> mew, bits<2> mop,
bits<3> width, string opcodestr> {
def _m : RVInstVLoad<nf, mew, mop, width, RVV_Masked, OPC_LOAD_FP,
(outs VR:$vd), (ins GPR:$rs1, Zero:$zero, VMR:$vm),
opcodestr, "$vd, ${zero}(${rs1}), $vm"> {
let Inst{24-20} = 0b00000;
}
def _um : RVInstVLoad<nf, mew, mop, width, RVV_Unmasked, OPC_LOAD_FP,
(outs VR:$vd), (ins GPR:$rs1, Zero:$zero),
opcodestr, "$vd, ${zero}(${rs1})"> {
let Inst{24-20} = 0b00000;
}
}
defm VLE8_V : VLoad_UnitStride<0b000, 0b0, 0b00, 0b000, "vle8.v">, Sched<[]>;
defm VLE16_V : VLoad_UnitStride<0b000, 0b0, 0b00, 0b101, "vle16.v">, Sched<[]>;
//__builtin_riscv_vle8_v_i8m1(base)
def : Pat<(int_riscv_vle8_v_i8m1 GPR:$rs1), (VLE8_V_um GPR:$rs1, 0)>;
//__builtin_riscv_vle16_v_i16m1(base)
def : Pat<(int_riscv_vle16_v_i16m1 GPR:$rs1), (VLE16_V_um GPR:$rs1, 0)>;
Thank you
Best Regards
William
Hi,
we have been testing an early implementation of the intrinsics RFC against the two examples in this repository, using a small emulator of ours. We were able to run rvv_saxpy.c successfully using LMUL=1 instead of LMUL=8, but we found it needed a couple of changes beyond those required to go from m8 to m1.
diff --git a/rvv_saxpy.c b/rvv_saxpy.c
index 6b43025..a4de0f0 100644
--- a/rvv_saxpy.c
+++ b/rvv_saxpy.c
@@ -44,7 +44,7 @@ float output[N] = {
0.2484350696132857};
void saxpy_golden(size_t n, const float a, const float *x, float *y) {
- for (size_t i; i < n; ++i) {
+ for (size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
@@ -55,11 +55,11 @@ void saxpy_vec(size_t n, const float a, const float *x, float *y) {
vfloat32m8_t vx, vy;
for (; (l = vsetvl_e32m8(n)) > 0; n -= l) {
- vx = vle_v_f32m8(x);
+ vx = vle32_v_f32m8(x);
x += l;
- vy = vle_v_f32m8(y);
+ vy = vle32_v_f32m8(y);
vy = vfmacc_vf_f32m8(vy, a, vx);
- vse_v_f32m8 (y, vy);
+ vse32_v_f32m8 (y, vy);
y += l;
}
}
Also, we understand some of the loads in sgemm need updating to use intrinsics, as shown below:
diff --git a/rvv_sgemm.c b/rvv_sgemm.c
index 975ba99..71ebd28 100644
--- a/rvv_sgemm.c
+++ b/rvv_sgemm.c
@@ -49,7 +49,7 @@ float b_array[MAX_BLOCKSIZE] = {1.7491401329284098, 0.1325982188803279,
float golden_array[OUTPUT_LEN];
float c_array[OUTPUT_LEN];
-void sgemm_golden() {
+void sgemm_golden(void) {
for (size_t i = 0; i < MLEN; ++i)
for (size_t j = 0; j < NLEN; ++j)
for (size_t k = 0; k < KLEN; ++k)
@@ -63,23 +63,24 @@ void sgemm_vec(size_t size_m, size_t size_n, size_t size_k,
size_t ldb,
float *c, // m * n matrix
size_t ldc) {
- int i, j, k;
+ int j, k;
size_t vl;
vfloat32m1_t vec_c;
for (int i = 0; i < size_m; ++i) {
j = size_n;
const float *bnp = b;
float *cnp = c;
- for (; vl = vsetvl_e32m1(j); j -= vl) {
+ for (; (vl = vsetvl_e32m1(j)); j -= vl) {
const float *akp = a;
const float *bkp = bnp;
- vec_c = *(vfloat32m1_t *)cnp;
+ vec_c = vle32_v_f32m1(cnp);
for (k = 0; k < size_k; ++k) {
- vec_c = vfmacc_vf_f32m1(vec_c, *akp, *(vfloat32m1_t *)bkp);
+ vec_c = vfmacc_vf_f32m1(vec_c, *akp,
+ vle32_v_f32m1(bkp));
bkp += ldb;
akp++;
}
- *(vfloat32m1_t *)cnp = vec_c;
+ vse32_v_f32m1(cnp, vec_c);
cnp += vl;
bnp += vl;
}
@@ -98,7 +99,7 @@ int fp_eq(float reference, float actual, float relErr)
int main() {
// golden
memcpy(golden_array, b_array, OUTPUT_LEN * sizeof(float));
- sgemm_golden(MLEN, NLEN, KLEN, a_array, KLEN, b_array, NLEN, golden_array, NLEN);
+ sgemm_golden();
// vector
memcpy(c_array, b_array, OUTPUT_LEN * sizeof(float));
sgemm_vec(MLEN, NLEN, KLEN, a_array, KLEN, b_array, NLEN, c_array, NLEN);
However, I'm confused by the sgemm vector code, as it seems to load a row of the "b" matrix rather than a column. Should this load below
vle32_v_f32m1(bkp));
load a column instead, using a strided load, or have I totally misunderstood the code?
Kind regards,
Having VL be an opaque type adds useless complexity in the general case, and is not justified in any way beyond "To avoid users to manipulate vl argument in explicit vl intrinsic", which doesn't make sense. The point of an explicit VL in the intrinsics is to be able to avoid vsetvl for short-term changes. If vsetvl is still needed to obtain the proper opaque type, what's the point?
A typical way of writing this:
for (i = 0; i < n; i++) {
  c[i] = a[i] + b[i];
}
would be (not the proper syntax, to save time)
while (i < n) {
  unsigned long vl = vsetvl(n - i);
  vtype va = vload(a + i, vl);
  vtype vb = vload(b + i, vl);
  vtype vc = vadd(va, vb, vl);
  vstore(c + i, vc, vl);
  i += vl;
}
Forcing an opaque type on vl will add complexity by imposing an unexpected type on a value that is, after all, semantically a number.
The point of an explicit VL is that this:
vsetvl(X);
vop(a,b);
can be shortened to the following, making the intrinsics fully specified (not just 2 of the 3 implicit values, lmul and sew, but all three of them):
vop(a,b,X);
If a vsetvl is needed anyway, then it's useless...
To summarize:
a) with only non-fully-specified (VL-less) intrinsics, opacity adds complexity and brings nothing;
b) with fully-specified intrinsics, it defeats the point by forcing vsetvl() where none is needed or wanted.
AFAIK, the RISC-V vector 0.7.1 spec is a stable version, like a long-term release of the Linux kernel;
we should consider making the functions cover multiple versions of the vector spec :)
Based on the behavior of the vcompress instruction, this argument seems to be the vector to use for the tail elements. It doesn't have a direct relationship to the mask the way "maskedoff" does for other instructions. The lower popcnt(mask) elements of the result of this intrinsic come from "src", while the upper elements [vl-1:popcnt(mask)] come from "maskedoff". The positions of the mask bits don't control which elements of maskedoff are used; rather, the number of 1s in the mask does.
This also means that this intrinsic must use the tail undisturbed policy.
As https://github.com/sifive/rvv-intrinsic-doc/issues/10#issuecomment-617226293 mentions,
should we support reinterpret functions between different types of the same LMUL?
e.g. i32m1 <-> i64m1
Is there any real scenario for this?
In order to support segment load/store, I would like to introduce a bunch of new types and APIs. The function naming might change in the future to fit the 1-to-1 mapping naming rule, so the main thing I want feedback on is the interface and the new types.
There are 3 parts in this RFC:
v{TYPE}{SEW}m{LMUL}x{NR}_t
Symbol used in naming rule:
TYPE := vector type
SCALAR_TYPE := corresponding scalar type
TYPE_L := Type letter, i, u or f
{TYPE} vseg_loadx_{TYPE_L}{SEW}m{LMUL}x{NR} (const {SCALAR_TYPE} *base, vuint{SEW}m{LMUL}_t bindex)
void vseg_storex_{TYPE_L}{SEW}m{LMUL}x{NR} ({SCALAR_TYPE} *base, vuint{SEW}m{LMUL}_t bindex, {TYPE} value)
void vseg_storeux_{TYPE_L}{SEW}m{LMUL}x{NR} ({SCALAR_TYPE} *base, vuint{SEW}m{LMUL}_t bindex, {TYPE} value)
e.g
Hi,
I find that there are no intrinsics such as vreinterpret_v_f16m1_f32m1(vfloat16m1_t src).
How can I expand float types, for example from vfloat16m1_t to vfloat32m1_t?
For example, vslide1up/down and vmv.s.x have the constraints below:
If XLEN < SEW: the value is sign-extended to SEW bits.
If XLEN > SEW: the least-significant bits are copied over and the high XLEN-SEW bits are ignored.
Current SiFive proposal: provide a signed/unsigned interface, but the second operand's type is always long. (The second operand's type corresponds to XLEN.)
In RV64 with an SEW=64 config (XLEN == SEW):
vuint64m1_t vslide1up_vs_u64m1 (vuint64m1_t src, long value);
// Is this weird? In this HW config, the value's type could be unsigned long.
In RV32 with an SEW=64 config (XLEN < SEW):
vuint64m1_t vslide1up_vs_u64m1 (vuint64m1_t src, long value);
// An unsigned long value is illegal because the value would be sign-extended to 64 bits.
Does this also mean that for almost all vector-scalar integer operations, the type of the scalar should be long or unsigned long?
In addition, FLEN < SEW has a similar problem.
There is another problem(?) in the RV64 SEW=32 config (XLEN > SEW):
vint32m1_t vslide1up_vs_i32m1 (vint32m1_t src, long value);
// The value will be truncated to 32 bits implicitly.
One possible solution is to support the intrinsic functions optionally, depending on the HW config.
For example, on an RV32 platform, SEW=64 vector-scalar integer operations would not be supported.
Any ideas or suggestions?
P.S. vslide1up/down: the slide instructions move one element up or down a vector register group.
P.S. vmv.s.x: the integer scalar move instruction copies the scalar integer register to element 0 of the destination vector register.
P.S. The spec defines the value as SEW bits, up to max(XLEN, FLEN).
P.S. RV32 means XLEN and long are 32 bits.
P.S. In the intrinsic document, the vmv.s.x API has not been finalized yet.
The v{f}splat routines appear to be simple aliases for the v{,f}mv_v_{x,f} routines. There's obviously no harm in including them; however, in such an enormous API I think we should take measures to reduce complexity whenever possible. Thus I propose removing the splats.
Less drastic measures are (1) to deprecate them now and revisit removing them sometime in the future, or (2) to make them an optional extension.
The vector load/store intrinsics could support a non-zero vstart value.
Currently, the vector load/store intrinsics imply a call to vsetvl to update the EMUL, and the vsetvl resets vstart to zero.
Thus, the following code will not work:
set_vstart(new_vstart_value);
dst = vload_e32_m1(ptr); // Since the vload implies a vsetvl to set the EMUL, the "new_vstart_value" will be cleared.
We might need an additional vstart parameter for the vector load/store intrinsics, which would set the new vstart value after the vsetvl instruction.
In a number of places, the V-extension intrinsics API uses long or unsigned long to represent an XLEN-bit integer (signed or unsigned, respectively). It's true that long has size XLEN in the currently defined ILP32 and LP64 ABIs:
https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md#-named-abis
However, note that
A future version of this specification may define an ILP32 ABI for the RV64 ISA
And one can imagine a similar scenario for RV128. Hence, I think it's preferable to future-proof the intrinsics API by using typedefs [Footnote 1]. I don't think this adds much notational overhead.
Please note that we had the same discussion for the P-extension intrinsics, and arrived at the same conclusion
https://lists.riscv.org/g/tech-p-ext/topic/73148653#26
[Footnote 1]
On GCC and Clang/LLVM, you can alias the relevant types as follows:
#include <inttypes.h>
#if __riscv_xlen == 32
typedef int32_t int_xlen_t;
typedef uint32_t uint_xlen_t;
#elif __riscv_xlen == 64
typedef int64_t int_xlen_t;
typedef uint64_t uint_xlen_t;
#endif
Those who enjoy preprocessor wizardry might prefer something like
#define create_type2(a, b, c) a ## b ## c
#define create_type1(a, b, c) create_type2(a, b, c)
typedef create_type1(int,__riscv_xlen,_t) int_xlen_t;
typedef create_type1(uint,__riscv_xlen,_t) uint_xlen_t;
EPI believes it is better to always express the return type. The types of the operands are easy to determine by looking at the declarations of the variables being used; however, if a function call is used as an operand, having the return type in the name is more immediate (i.e., it avoids "recursively" going to the declarations of the operands to understand their types). This may be even more relevant for conversions, where the result type differs from the operand types.
However, there are some operations with the same return type. We need to decide how to encode return type in intrinsics name to avoid overloading.
We want to remove the implicit VL API, make some API changes, and change the high-level semantics of the vector intrinsics so that we can do more aggressive optimization.
The goal of this RFC is to improve the intrinsics for vector programming and to choose the explicit VL API as the only intrinsic API. This RFC consists of two parts: the first part explains why we picked the explicit VL API as the final proposal, and the second part removes the concept of low-level register status from the intrinsic programming model.
Last year we announced the intrinsics spec for vector programming in C. We got lots of useful feedback from several different parties, including BSC, SiPearl, Andes, OpenCV China, PLCT Lab, and Alibaba/T-Head.
Today we have open-source implementations in both GCC and LLVM*, implemented with two different approaches: implicit VL and explicit VL. We expected that a compiler could implement one API in terms of the other via a simple C wrapper, e.g. implementing the explicit VL API as a wrapper around the implicit VL API; however, this turned out to be a barrier to optimization.
The issue is that the concepts behind the two APIs are fundamentally incompatible. After long discussion and exploration, we think it's time to pick one as the final proposal for the intrinsics spec, in order to reduce the compiler maintenance effort and the learning curve of the intrinsic functions.
Keeping only one intrinsic API also has an advantage on the compiler optimization side: we found several optimization opportunities, but we couldn't exploit them because we had to guarantee correct semantics and behavior for both APIs.
So the question is which one is better. The reason we originally had two different API styles is that we had no conclusion at the time on which API is better. This time is different: we now have enough exploration, experience, and feedback to make the right decision.
After implementing the intrinsic API in both GCC and LLVM, we found several good reasons to use the explicit VL API from the compiler's perspective: the explicit VL def-use relationship is more natural to the compiler, which makes analysis and optimization easier.
We also got feedback from users of both APIs: the implicit VL API is less verbose, but it makes the status of the VL register hard to track, which also makes debugging harder; having an explicit VL argument makes programming easier.
So based on both sides, user feedback and compiler implementation considerations, we believe explicit VL is the right way to go, and based on that decision we propose the following changes to the vector intrinsic API.
During the implementation phase, we found that the status of the VL register becomes an optimization barrier: we must maintain the correct order between vreadvl, vsetvli, and all other explicit VL intrinsics, because the explicit VL API has the semantics of writing the VL register.
For example, we can't reorder the operations across vreadvl in the following program.
n = 10000;
avl = vsetvl_e32m4(n);    // Assume VLEN=128, so avl = 16
vl1 = vreadvl();          // 16
vint32m4_t const_one = vmv_v_x_i32m4_vl(1, 4);
vl2 = vreadvl();          // 4
vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, avl / 2);
vl3 = vreadvl();          // 8
vse32_v_i32m4_vl(a, tmp, vl3);
Furthermore, we can't move an explicit VL vector intrinsic across any other explicit VL vector intrinsic, because the def-use relationship is modeled coarse-grained: we only model that an intrinsic writes some global status, but it's hard to describe and track in detail which status is changed. That's a compiler implementation limitation of the current mainstream open-source compilers.
So abstracting the low-level register status away from the high-level language layer is a straightforward option: abstract the VL register and make the vector length just an argument. That frees us from the implementation limitation, and it also brings several advantages from the optimization point of view: we can model almost every intrinsic function as a pure function (except loads/stores and a few special instructions), i.e. with no side effects, which is a fantastic property in compiler optimization land.
To demonstrate the power of treating most intrinsic functions as pure functions, here is a function with a loop, containing a loop-invariant computation.
void foo(int *a, int n) {
  size_t vl, vlmax;
  while ((vl = vsetvl_e32m4(n))) {
    vlmax = vsetvlmax_i32m4();
    vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
    vl = vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
Since vsetvl and vsetvlmax are now pure functions, vlmax = vsetvlmax_i32m4();
can be safely hoisted outside the loop:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4();
  while ((vl = vsetvl_e32m4(n))) {
    vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
    vl = vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
Then all arguments of vmv_v_x_i32m4_vl are loop-invariant, so we can hoist it too:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4();
  vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
  while ((vl = vsetvl_e32m4(n))) {
    vl = vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
We can also see that vsetvl is now called twice with the same input; because it is a pure function, the CSE pass can easily eliminate the redundant call:
void foo(int *a, int n) {
  size_t vl, vlmax;
  vlmax = vsetvlmax_i32m4();
  vint32m4_t const_one = vmv_v_x_i32m4_vl(1, vlmax);
  while ((vl = vsetvl_e32m4(n))) {
    vint32m4_t tmp = vadd_vx_i32m4_vl(const_one, 4, vl);
    vse32_v_i32m4_vl(a, tmp, vl);
    n -= vl;
    a += vl;
  }
}
The demonstration above makes the advantage of abstracting the VL register status obvious. The same transformations are hard to do for the implicit VL API; consider the following example:
void foo(int *a, int n) {
  size_t vl;
  while ((vl = vsetvl_e32m4(n))) {
    vsetvlmax_i32m4();
    vint32m4_t const_one = vmv_v_x_i32m4(1);
    vsetvl_e32m4(n);
    vint32m4_t tmp = vadd_vx_i32m4(const_one, 4);
    vse32_v_i32m4(a, tmp);
    n -= vl;
    a += vl;
  }
}
Since there are hidden dependencies between all vector intrinsic functions, it is impossible to reorder any pair of them, which means no optimization can be done at all.
Additionally, we found this could fix other potential issues in code generation for GNU vector extension types, since the rest of the generic compiler infrastructure is not aware of the VL register status, and that causes poor performance.
So what are GNU vector extension types? GCC and Clang/LLVM both provide a vector type extension to make it easier to write SIMD programs: we can declare a type with a vector_size attribute and then operate on variables of that type like normal scalar types, via the ordinary operators.
typedef int int32x4_t __attribute__ ((vector_size (16)));
int32x4_t a, b, c;
a = b + c; // NO explicit VL reg use or def in middle-end
// but it will expand to vsetvli_e32m1(4) and vadd
We can generate vector instructions for that code, but this code-generation path might require changing VL to fit the semantics of the operation: in the example above, VL should be set to 4 before performing the operation, and that would be an issue if we modeled VL in the compiler middle-end.
LLVM's VPlan IR faces a similar situation in the compiler middle-end for GNU vector code generation.
For the above reasons, we believe abstracting low-level register modeling out of the high-level language layer is the right way to go.
However, several APIs must change as a result of removing the concept of VL from the C language layer.
The first is the vreadvl API, which exposes the status of the VL register; we must remove it to prevent leaking low-level information into the high-level programming language.
Second, there is one class of instructions in the RISC-V vector ISA, other than vsetvl[i], that changes VL: the vle*ff.v (fault-only-first load) instructions, which update the VL register if an exception occurs before VL elements have been loaded. We introduce an extra argument to return the new content of the VL register:
vint16m1_t vle16ff_v_i16m1 (const int16_t *base); // Current API
vint16m1_t vle16ff_v_i16m1 (const int16_t *base, size_t *vl); // New API
The last change is the API naming: the _vl suffix becomes redundant and verbose once the explicit VL API is the only vector intrinsic API.
In summary, this RFC proposes choosing the explicit VL API as the final vector intrinsic API, to reduce the complexity of the compiler implementation; second, it proposes abstracting low-level register modeling out of the high-level language layer, to enable further optimization of vector intrinsic programs; and last, parts of the intrinsic API must change as a consequence of the above.
For example:
#include <riscv_vector.h>
unsigned short x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
unsigned char y[16];
void foo() {
vsetvl_e16m1(8);
vuint16m1_t vx = vle16_v_u16m1(x);
vuint8m1_t vy = vreinterpret_v_u16m1_u8m1(vx);
// vsetvl_e8m1(16); // should compiler update vl correctly?
vse8_v_u8m1(y, vy);
}
When casting from vuint16m1_t to vuint8m1_t, the element count changes from 8 to 16. So should the compiler update vl, or must the programmer do it manually?