Comments (31)
My opinion is that all functionality exposed to the RVV assembly programmer should also be exposed to the RVV intrinsics programmer; this includes the `vta` and `vma` settings. Since these bits are part of `vtype` (along with `vsew` and `vlmul`), it raises the question of whether they should also be encoded into the intrinsic names, passed in as arguments, etc.

from rvv-intrinsic-doc.
My 2 cents on the subject...
In most cases when coding at a high level, unused values in the registers (both inactive and tail) are irrelevant and the user doesn't care [1] what's in them. For peel/tail iterations they are not loaded, not computed and not stored. For masked-out values (e.g. convergence loops), it's the same. The one obvious big exception is conditionals in a loop, where usually the old value should be kept (either because the condition is false for 'if/then', or we're in the 'else' merging with the 'then' for 'if/then/else').
So in my opinion the default behavior should be 'keep whatever is in there' (undisturbed), as it should be quite enough for most use cases (it satisfies both the 'don't care' and the 'merge' semantics). That's also why vundefined() is in there - it makes the 'don't care' explicit by feeding an undefined value into the 'merge' semantics. Users who need a specific value instead of undefined can easily use an explicit intrinsic or a value as an input.
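To make the 'merge' semantics concrete, here is a minimal scalar model in plain C (not real RVV intrinsics; the function name and shapes are made up for illustration) of a masked add under the "undisturbed" policy: inactive lanes keep the maskedoff operand's values, which covers both the 'don't care' case and the if/then merge case.

```c
#include <assert.h>

/* Scalar model of a masked add with mask-undisturbed semantics:
   where mask[i] is 0 the result lane keeps maskedoff[i]. */
static void vadd_masked_mu(int *res, const int *maskedoff, const int *mask,
                           const int *a, const int *b, int vl) {
    for (int i = 0; i < vl; i++)
        res[i] = mask[i] ? a[i] + b[i] : maskedoff[i];
}
```

Passing the old destination as `maskedoff` gives the 'if/then' merge; passing anything at all models the 'don't care' case.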
So in v0.9-speak, the current default behavior should be either {mu, tu} (if the maskedoff input is valid) or {ma, ta} (if it's vundefined() or not present in the intrinsic, thus implicitly undefined, e.g. the unmasked intrinsics).
The question is, how to also model {ma,tu} and {mu,ta} in the intrinsics... and how does that interact with the above? In the current design, having {,tu} or {mu,} only works if there is a valid maskedoff parameter, otherwise the user can't specify the values to be left undisturbed... but from my reading of v0.9, the non-masked variant should be eligible for {*,tu} (not that {mu,tu} makes much sense w/o a mask...).
A simple solution would be to add a set of 'tail agnostic' intrinsics to the masked ones. So the behavior would be:
a) no maskedoff parameter, or maskedoff is vundefined(): {ma,ta}
b) valid maskedoff parameter, no tail specification: {mu, tu}
c) valid maskedoff parameter, tail agnostic: {mu, ta}
This does not give access to {ma,tu} unfortunately - though to be honest I don't see a use case at the moment for that combination. Would hardware be able to do better on {ma,tu} than on {mu,tu}? [and case b) and c) could be reversed to default to _ta and require a specification for _tu, which probably makes more sense in practice]
Another solution is to add vta/vma to vsetvl() but a) I don't like hidden state in high-level code and b) {mu,tu}, {mu,ta} and {ma,tu} don't make much sense with vundefined() or any intrinsics without a maskedoff parameter...
A third solution would be to assume for {ma,} or {mu,} as above, but rely on the hidden state for {,ta} and {,tu}. It's sort of half-and-half...
[the heavyweight solution is to have 4 variants of the intrinsics: _ma_ta (no maskedoff parameter needed), _mu_tu, _ma_tu and _mu_ta (all three with a mandatory maskedoff parameter), drop vundefined(), and let the user do all the work... perhaps with just _ta (no maskedoff) and _tu (mandatory maskedoff) for non-masked intrinsics... the _ma or _mu bit would take over the current _m suffix for the masked version].
[1] Except for security reasons perhaps (avoiding information leakage), but that's out-of-scope here I think.
from rvv-intrinsic-doc.
In our prototype (which does not account for float16 and fractional LMUL yet) we already have ~7500 intrinsics. Multiplying them by 4 to cover all the combinations is a bit disheartening.
One option is to add a `tail` operand to the unmasked intrinsics. This way:

- unmasked operations with a `tail` that is `vundefined()` would map to `{*, ta}`. There is no mask, so either `mu` or `ma` would be OK; `ma` is the least defined and so it imposes the least behaviour and might be preferred. So `{ma,ta}`.
- unmasked operations with a `tail` that is not a result from `vundefined()` would map to `{ma,tu}`. As Romain mentioned earlier, `{ma,tu}` is not obviously useful, but again we might want to prefer, if possible, modes that impose the least defined behaviour (i.e. `mu` is a valid implementation of `ma`).
Then, for masked operations:

- If their maskedoff operand is `vundefined()`, they would map to `{ma,ta}`.
- If their maskedoff operand is not `vundefined()`, they would map to `{mu,ta}`.
An issue with this approach is that `{mu,tu}` is not directly "expressible": we might want a guarantee that the tail is preserved that is stronger than "the implementation of `ta` is `tu` in this machine".
A way to address this is to add a new version of the masked intrinsics (with suffix, say, `_mt`), with an `undisturbed` operand (rather than `maskedoff`). We can allow `vundefined()` in that operand, but then it would not be different from the existing masked versions:

- If their undisturbed operand is `vundefined()`, they would map to `{ma,ta}`.
- If their undisturbed operand is not `vundefined()`, they would map to `{mu,tu}`.
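A scalar sketch in plain C (hypothetical names, not the proposed API itself) of what the `_mt` semantics would mean: both masked-off lanes and tail lanes keep the `undisturbed` operand's values, i.e. `{mu,tu}`.

```c
#include <assert.h>

/* Scalar model of the proposed _mt form: lanes where mask[i] is 0 and
   lanes past vl (the tail, up to vlmax) all keep `undisturbed`. */
static void vadd_mt(int *vd, const int *undisturbed, const int *mask,
                    const int *a, const int *b, int vl, int vlmax) {
    for (int i = 0; i < vl; i++)
        vd[i] = mask[i] ? a[i] + b[i] : undisturbed[i];
    for (int i = vl; i < vlmax; i++)  /* tail undisturbed */
        vd[i] = undisturbed[i];
}
```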
We can look at it in the other way round, depending on what are the user needs (note that if something is not needed to be preserved it would still be correct, though unnecessary, to preserve it).
Masked? | Needs tail preserved | Needs maskedoff preserved | Intrinsic
---|---|---|---
No | No | N/A | `vadd_vv_<ty>_vl(vundefined(), a, b, vl)`
No | Yes | N/A | `vadd_vv_<ty>_vl(tail, a, b, vl)`
Yes | No | No | `vadd_vv_<ty>_m_vl(vundefined(), mask, a, b, vl)` (1)
Yes | No | Yes | `vadd_vv_<ty>_m_vl(maskedoff, mask, a, b, vl)`
Yes | Yes | No | `vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)` (2)
Yes | Yes | Yes | `vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)`

(1) `vadd_vv_<ty>_mt_vl(vundefined(), va, vb, vl)` could be allowed as an equivalent intrinsic.
(2) We believe this case is not very common, so implementing it as `{mu,tu}` might be a reasonable trade-off.
A downside with this approach is that now we have to specify the `tail` operand, which in many cases is going to be `vundefined()`. We could mitigate this by making the unmasked operations always `{ma,ta}` and then forcing the user to use `_mt` with an "all ones" mask if the tail must be preserved. The compiler could then soften the operation to a `{ma,tu}` unmasked operation.
The table above would look like this now:
Masked? | Needs tail preserved | Needs maskedoff preserved | Intrinsic
---|---|---|---
No | No | N/A | `vadd_vv_<ty>_vl(a, b, vl)`
No | Yes | N/A | `vadd_vv_<ty>_mt_vl(undisturbed, all-ones, a, b, vl)`
Yes | No | No | `vadd_vv_<ty>_m_vl(vundefined(), mask, a, b, vl)`
Yes | No | Yes | `vadd_vv_<ty>_m_vl(maskedoff, mask, a, b, vl)`
Yes | Yes | No | `vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)` (3)
Yes | Yes | Yes | `vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)`

(3) We believe this case is not very common, so implementing it as `{mu,tu}` might be a reasonable trade-off.
I might prefer the second approach: preserving the tail is not usual enough to justify impacting all the unmasked operations. Also, the softening required may be useful in other circumstances, so it seems like something a compiler will want to implement anyway.
from rvv-intrinsic-doc.
The current RFC has almost 20k intrinsic functions and 5k functions with function overloading (including the Zvlsseg extension); we definitely don't want to support 4 variants of the intrinsics...
Although @rofirrim's second approach would increase the number of functions by about 1/3, I agree it's a good way to model tail elements: it avoids an additional merging instruction when `tail` and `maskedoff` arguments are provided, keeps the non-mask, non-tail intrinsic functions simple, etc.
I think the only remaining question is: would hardware be able to do better on {ma,tu} than on {mu,tu}? (@rdolbeau mentioned this above.)
If the answer is yes, the compiler cannot generate the {ma,tu} setting in the second approach.
from rvv-intrinsic-doc.
Agree with @rofirrim's second approach. Provide another set of intrinsics ending with `_mt_vl` to cover tail undisturbed cases.
from rvv-intrinsic-doc.
My colleagues and I were looking today at some code that uses `vfmacc` and I realised that the intrinsics for instructions that use one of the input registers as the output register (such as `vfmacc`, `vmacc`, etc.) won't need any additional parameter in the `_mt` versions.
The reason is that the clobbered input is already what is going to be "undisturbed" in the final result.
So the intrinsic C prototype of the `_mt` version is exactly the same as the agnostic one. They will be emitted under a different governing `vsetvli` though.
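A scalar model in plain C (not real intrinsics; the name and fixed length are made up) of why no extra parameter is needed: under tail undisturbed, lanes past `vl` keep the destination's old contents, and for an fmacc-style operation the destination is already an input.

```c
#include <assert.h>

#define VLEN 4  /* hypothetical register length in elements */

/* Model of vfmacc under tail undisturbed: vd[i] += vs1[i] * vs2[i] for
   i < vl; lanes vl..VLEN-1 are simply left alone. Because vd is already
   an input, vd itself supplies the preserved tail values: no extra
   "undisturbed" operand is required. */
static void fmacc_tu(float vd[VLEN], const float vs1[VLEN],
                     const float vs2[VLEN], int vl) {
    for (int i = 0; i < vl; i++)
        vd[i] += vs1[i] * vs2[i];
}
```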
Does this make sense?
from rvv-intrinsic-doc.
Yes, it makes sense to me.
from rvv-intrinsic-doc.
I'd like to continue this discussion. SiFive has been working on upstreaming the IR intrinsics into llvm.org and we'll be working on the C intrinsics soon. I'd like to make sure what we're doing scales for the future.
What we have in our internal branch is to always use a tail undisturbed policy. But many intrinsics don't take an argument for the undisturbed value so it is not controlled by the user. It will just be whatever is in the register that gets picked by the register allocator. It would also create a false dependency on implementations with vector register renaming. So this doesn't seem like a good long term behavior.
As we started upstreaming we initially used tail agnostic. A recent change was made upstream to use tail undisturbed for any masked intrinsic, fma intrinsic, or vcompress intrinsic, and tail agnostic otherwise, since for those intrinsics the user has control and might want to preserve tail elements. We have not added detection of vundefined yet, so we always use mask undisturbed.
I feel like this has put our implementation into a state that's not easy to explain as there's nothing in the name to tell the user which intrinsics use tail undisturbed and which don't. So I think I like @rofirrim's second approach where the tail undisturbed policy is explicit in each intrinsic.
I think what I would like to say is that the currently defined C interface is always tail agnostic, and we can add new tail undisturbed intrinsics with a consistent naming convention. I think that matches the behavior of the non-"mt" intrinsics in @rofirrim's second approach. We would need to redefine the vcompress intrinsic to drop the "maskedoff" argument, which #55 proposes to rename. The current definition would be used for a tail undisturbed version.
I do worry that, since tail undisturbed is a valid implementation of tail agnostic in hardware, a user could use a masked or FMA intrinsic and accidentally become dependent on tail elements being preserved on today's hardware, but have it break in the future.
What do others think?
from rvv-intrinsic-doc.
I am against encoding the mask and tail preservation policies explicitly in the intrinsics or builtins. Rather, I favor using new function attributes to specify such policies, defaulting to `mu` and `ta`.
from rvv-intrinsic-doc.
> I am against encoding the mask and tail preservation policies explicitly in the intrinsics or builtins. Rather, I favor using new function attributes to specify such policies, defaulting to `mu` and `ta`.
I think @rofirrim's proposal only encodes the tail policy into the intrinsic name. The mask policy is based on vundefined being passed.
Tail undisturbed requires at least one additional operand for many of the intrinsics to provide the undisturbed value. So I don't think we accomplish that with just a function attribute unless we make that extra operand always be there and ignored under tail agnostic.
from rvv-intrinsic-doc.
I wonder if, in the line of @ebahapo's suggestion, we could give an alternative interpretation to the intrinsic based on such an attribute (btw I assume Evandro means `__attribute__`, but correct me if it is not that kind of attribute, or maybe it is an attribute at the call site such as `[[...]]`?).
For instance, one of the operands (e.g. the first) could be used for the tail undisturbed value. It turns something like `vsetvl ... ta,ma; vdest ← vop vsrc1, vsrc2` into something closer to `vsetvl ... tu,ma; vdest ← vop vdest, vsrc2` (assuming unmasked operations here). This way we avoid the explosion of intrinsics at the expense of slightly more complex semantics, as they would now depend on the presence of that attribute.
If our expectation is that tail agnostic is the preferred default, this attribute acting as a modifier might be a reasonable approach.
I think this also disambiguates the case for three-input operations such as FMA, doesn't it? Perhaps I'm missing some nuance here.
from rvv-intrinsic-doc.
> current RFC has almost 20k intrinsic functions and 5k functions with function overloading (include Zvlsseg extension), definitely we don't want to support 4 variant of the intrinsics...
In fact, there are too many intrinsic functions for RVV programming...
Maybe we should teach intrinsic users how to write vectorized programs with the tail agnostic API (e.g. by giving a reduction loop example).
That would mean the RVV intrinsics programming model only supports tail agnostic behavior, because ta (always?) has better performance.
I think the one thing users really care about is how to write high performance RVV code.
from rvv-intrinsic-doc.
I would like a way to set tu (Tail Undisturbed) in the intrinsics as I use it in this (and other) code: https://godbolt.org/z/cdcavjTs1
Can anyone think of a fast way of doing these operations with ta?
I'm OK with ta most of the time, but tu is useful for in-situ operations.
In this conversation: riscv/riscv-v-spec#664 (comment) @aswaterman says:
> On some implementations, the tail-undisturbed operations will have noticeable overhead, as @solomatnikov mentioned. This overhead generally shouldn't be extreme (often <= 1 cycle per instruction in practice) but will add up.
I've also commented: riscv/riscv-v-spec#664 (comment)
from rvv-intrinsic-doc.
In order to mitigate the increase in intrinsics, I'd like to propose adding a tail policy operand to all of the masked intrinsics.
The table would then become the following. The define names can be renamed/shortened in the implementation.
Masked? | Needs tail preserved | Needs maskedoff preserved | Intrinsic
---|---|---|---
No | No | N/A | `vadd_vv_<ty>_vl(a, b, vl)`
No | Yes | N/A | `vadd_vv_<ty>_m_vl(undisturbed, all-ones, a, b, vl, TAIL_UNDISTURBED)`
Yes | No | No | `vadd_vv_<ty>_m_vl(vundefined(), mask, a, b, vl, TAIL_AGNOSTIC)`
Yes | No | Yes | `vadd_vv_<ty>_m_vl(maskedoff, mask, a, b, vl, TAIL_AGNOSTIC)`
Yes | Yes | No | `vadd_vv_<ty>_m_vl(undisturbed, mask, a, b, vl, TAIL_UNDISTURBED)`
Yes | Yes | Yes | `vadd_vv_<ty>_m_vl(undisturbed, mask, a, b, vl, TAIL_UNDISTURBED)`
This still suffers from an inability to specify tu+ma. It would give tu+mu, but tu+ma is likely rare.
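A scalar model in plain C (hypothetical names and enum, not the real API) of what the proposed tail-policy operand would mean: masked-off lanes always keep the first operand's values, and the policy operand decides whether the tail does too.

```c
#include <assert.h>

enum vtail { TAIL_AGNOSTIC, TAIL_UNDISTURBED };

/* Scalar model of a masked add with an explicit tail-policy operand.
   Masked-off lanes keep maskedoff's values ({mu,...}); tail lanes either
   keep them too (TAIL_UNDISTURBED) or are modelled as clobbered with a
   junk value (TAIL_AGNOSTIC). */
static void vadd_m(int *vd, const int *maskedoff, const int *mask,
                   const int *a, const int *b, int vl, int vlmax,
                   enum vtail tp) {
    for (int i = 0; i < vl; i++)
        vd[i] = mask[i] ? a[i] + b[i] : maskedoff[i];
    for (int i = vl; i < vlmax; i++)
        vd[i] = (tp == TAIL_UNDISTURBED) ? maskedoff[i] : -1 /* junk */;
}
```

This also makes the tu+ma limitation visible: the inactive lanes and the preserved tail both come from the same operand, so there is no way to say "preserve the tail but leave inactive lanes agnostic".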
Since not all instructions have masked forms, we would also need tail undisturbed versions with an extra source operand for at least these instructions:
- `vmv.v.v`/`vmv.v.x`/`vmv.v.i`
- `vmerge.vvm`/`vmerge.vxm`/`vmerge.vim`/`vfmerge.vfm`
- `vadc.vvm`/`vadc.vxm`/`vadc.vim`
- `vsbc.vvm`/`vsbc.vxm`
Should we add tail policy to multiply accumulate intrinsics like vfmacc or just say they are always tail agnostic and use the masked version to access tail undisturbed?
from rvv-intrinsic-doc.
> In order to mitigate the increase in intrinsics, I'd like to propose adding a tail policy operand to all of the masked intrinsics.

This looks like a reasonable trade-off without having to introduce `m` and `mt` versions as I suggested earlier.

> This still suffers from an inability to specify tu+ma. It would give tu+mu, but tu+ma is likely rare.

Again reasonable for me (I have not seen an example where `tu+ma` is useful yet).

> Since not all instructions have masked forms we would also need tail undisturbed versions with an extra source operand for at least these instructions: vmv.v.v/vmv.v.x/vmv.v.i vmerge.vvm/vmerge.vxm/vmerge.vim/vfmerge.vfm vadc.vvm/vadc.vxm/vadc.vim vsbc.vvm/vsbc.vxm

I'd be inclined to do so.
Bikeshedding: the new suffix for the intrinsics in the table above could be `mt` (rather than just `m`), as the extra operands are mask and tail policy. For the ones that don't have a mask, the suffix could be just `t`.

> Should we add tail policy to multiply accumulate intrinsics like vfmacc or just say they are always tail agnostic and use the masked version to access tail undisturbed?

I'd say the second option, as it looks to me more consistent with the rest of the intrinsics.
from rvv-intrinsic-doc.
Pulling part of the conversation from a compiler thread, I have a few concerns here:

- Vector processors are going to have very different constraints: an OoO core doesn't want register dependencies / partial-dependence stalls, because it doesn't know whether a vector register is carrying data from its previous value or being fully overwritten. An in-order core doesn't have any problem with it. Furthermore, future processors may have other constraints that will be different in the decades ahead.
- Despite this, we need the C programming model (and the corresponding LLVM IR intrinsic semantics) to remain stable, even though the cores have different performance characteristics. We don't want low-level C intrinsics to paper over the difference; we want to give power to the programmer.
- On the third hand, VLS codegen is in a gray zone. That is effectively a higher-level abstraction on top of the VLA model, and vsetvli insertion etc. for it can absolutely know about the target CPU and generate efficient code for it with different strategies. This is what -mtune etc. are for.

I think that all of this together argues that we need:

- A predictable VLA intrinsic model and a C API that provides full control over the hardware, even at the expense of programmability, ergonomics, and "compiler optimizability". The compiler wouldn't be able to reschedule or optimize things in general, because all the operations would depend on mutable global state registers.
- A usable VLS instruction/intrinsic model that is higher level, effectively allowing C programmers and higher-level compilers (MLIR etc.) to delegate this aspect of codegen to LLVM. This would be optimizable by the compiler. This isn't just the standard LLVM IR / OpenCL generic vector instructions -- we'd likely also need to add RVV-specific intrinsics for VLS.
- It's fine to have clang compiler flags that affect codegen and interpretation of these things, but they should affect the IR generated, not the interpretation of the IR.

Does this make sense or do y'all disagree with this?
-Chris
from rvv-intrinsic-doc.
I use Tail Undisturbed for many things, for some examples see: riscv/riscv-v-spec#664 (comment)
I don't like the idea of adding all 4 tail options to all the intrinsics, that's madness, and I don't like the compiler choosing which tail policy to use either. I need to be able to specify the tail policy when I need to.
It was mentioned above, but surely the simplest way to do this is to have all intrinsics set to a default, and let the programmer switch to a specific mode either for a specific instruction (using `vsetvl*()`) or by changing the mode (maybe by a compiler switch, `#pragma` or pseudo function call).
Proposal, either:
- Add ma/mu & ta/tu parameters to `vsetvl*()` (this will break existing code); or
- Add ma/mu & ta/tu parameters to new `vsetvl*()` intrinsics, e.g. `vsetvl*_tail()`; or
- Add new `vsetvl` intrinsics with the tail policy specified in the function names, e.g. `vsetvl_*_m?_t?()` and `vsetvlmax_*_m?_t?()`

When the programmer wants to use a specific tail policy, they simply place the required tail policy vsetvl intrinsic before the instruction requiring it.
I currently have to do this using an assembly insert, as follows in LLVM:

```c
vl = vsetvl_e16m4(blkCnt);
/* The following seems to force tu policy for the vwadd_wv instruction,
   only if vsetvl was previously set to the same (_e16m4) */
asm("vsetvli %0, %1, e16,m4,tu,mu"
    : "=r" (vl)
    : "r" (blkCnt) );
vecAccQ = vwadd_wv_i32m8(vecAccQ, vecAccW, blkCnt); /* Widen again to Q31 and accumulate the vectors */
```

The above outputs a double vsetvli, but at least tu is set:

```
5f0cc: 57 f7 a6 04   vsetvli a4, a3, e16,m4,ta,mu
5f0d0: 57 f7 a6 00   vsetvli a4, a3, e16,m4,tu,mu
5f0d4: 57 28 8e d6   vwadd.wv v16, v8, v28
```
from rvv-intrinsic-doc.
Adding an extra argument to the vsetvli intrinsic means we have the same issue as the implicit VL model: it introduces an implicit use-def relation between vsetvl and other vector operations.
See #60 for more detail on implicit VL vs explicit VL.
from rvv-intrinsic-doc.
@tony-cole Which compiler are you using? Adding a vsetvl via inline assembly probably shouldn't work based on the programming model we were trying to implement. If we allow inline assembly to work we can't be sure sew/lmul have been set correctly which we were trying to have the compiler manage.
In order to have the compiler manage sew/lmul from the operation intrinsic, the tail policy and mask policy also need to be inferable from the operation intrinsic. This way the compiler can provide each instruction with the vsetvl it needs.
My proposal was only to add the tail policy to the intrinsic not the mask policy too. And only to the masked intrinsics and a few other cases. Tail undisturbed only works for intrinsics that have an operand to provide the value for the undisturbed elements. In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.
from rvv-intrinsic-doc.
@topperc I'm using LLVM 13.0.0; I find it produces better code than the GCC version and I believe it's the one with more support(?).
> In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.
This is the functionality I require.
After reading #60 I see that the LLVM way of doing it (Explicit VL) can produce better optimisations. But as a programmer using the vector intrinsics, I need a way to specify ta/tu and ma/mu for every instruction where they can make a difference to the result.
Is there a way to do this easily?
For instance, in C++, we could have default parameters specifying ta/mu, but change them when required by adding the additional tu/ma parameters.
In C we could use variadic function or macro parameters (`...`, `__VA_ARGS__` and `__VA_OPT__`); these could be used to achieve similar default-parameter functionality for the intrinsics. This would keep compatibility with existing code, but allow the tail policies to be changed when required, in the Explicit way.
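A hypothetical sketch (made-up names, not the real RVV API) of the variadic-macro idea: `vadd(a, b)` defaults to tail agnostic, while `vadd(a, b, policy)` passes an explicit policy, using the common argument-counting trick. This compiles with GCC/Clang.

```c
#include <assert.h>

enum vtail { TAIL_AGNOSTIC, TAIL_UNDISTURBED };

static enum vtail last_policy;  /* recorded so the dispatch is observable */

static int vadd_impl(int a, int b, enum vtail tp) {
    last_policy = tp;  /* a real implementation would steer codegen here */
    return a + b;
}

/* PICK selects vadd3 or vadd2 depending on how many arguments were
   passed: the name lands in the 4th parameter slot. */
#define PICK(_1, _2, _3, NAME, ...) NAME
#define vadd(...)       PICK(__VA_ARGS__, vadd3, vadd2, )(__VA_ARGS__)
#define vadd2(a, b)     vadd_impl((a), (b), TAIL_AGNOSTIC)
#define vadd3(a, b, tp) vadd_impl((a), (b), (tp))
```

So existing two-argument calls keep compiling unchanged, and only code that cares about the tail opts in to the extra argument.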
from rvv-intrinsic-doc.
> @topperc I'm using LLVM 13.0.0, I find it produces better code than the GCC version and I believe it's the one with the more support(?).
>
> > In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.
>
> This is the functionality I require.

Are you saying the v16 is correct, or are you agreeing with me? Your source says `vecAccQ = vwadd_wv_i32m8(vecAccQ, vecAccW, blkCnt);` so I assume vecAccQ is used for both the destination and first source.
Here's where we are in LLVM today.
- Intrinsics are implemented as builtin aliases rather than macros to keep the header size and compile time down. We're looking at making the header file even smaller.
- LLVM always uses mu today. We should change to ma based on maskedoff being vundefined, but I don't think that's been implemented yet.
- LLVM uses tail undisturbed for all masked intrinsics except reductions (which don't only write element 0) and intrinsics that produce mask results like compares. This was chosen as the conservatively correct thing to do, because there is a destination register operand and we don't know if the user wants the tail elements preserved. You can pass an all-ones mask to these to get a tail undisturbed instruction today, but you'll tie up v0 with the mask. Passing vundefined will switch to ta/mu, though it should be ta/ma.
- LLVM also uses tail undisturbed for vfmacc, vfmadd, vmadd, vmacc, etc., instructions where one source is also the destination due to the instruction encoding. Again we chose the most conservative behavior. Ideally the compiler would switch between vfmacc/vfmadd to improve register allocation, but we can't with tail undisturbed.
- Reductions, vslide*, vcompress, vmv.s.x, vfmv.s.f all use tail undisturbed unless vundefined is passed to the `dest` operand.
A patch #101 has been posted to make some interface changes.
- All masked intrinsics that currently default to tu gain an argument to specify the tail behavior. There is no default value, because we can't implement one in C and the intrinsics aren't macros. We'd also have to default to tu to maintain compatibility, but that's wrong for most uses. So forcing the user to explicitly state their intention is good, but breaks compatibility. Hopefully this is acceptable at this stage since the V extension hasn't been ratified yet.
- vfmacc, vfmadd, vmadd, vmacc will also gain a tail policy argument.
- Unmasked vslide will continue to infer tail policy from the dest operand. Masked will get it from the tail policy argument.
- vmv, vmerge, vadc, vsbc gain a dest operand to control tail policy and the tail element value.
We still need to address the intrinsics that don't have a `dest` operand and don't use a mask. The user can use the masked version to get the tail undisturbed policy, at the cost of an all-ones mask in v0 and the register pressure that entails. We could teach the compiler to optimize this pattern, or we can add a dest operand to all unmasked intrinsics, or add new intrinsics that have the extra operand. If we go with adding new intrinsics, we should do the same for vmv, vmerge, vadc, vsbc, I think.
from rvv-intrinsic-doc.
Thank you for the info on how the policy is currently selected, is this documented somewhere?
> @topperc I'm using LLVM 13.0.0, I find it produces better code than the GCC version and I believe it's the one with the more support(?).
>
> > In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.
>
> This is the functionality I require.
>
> Are you saying the v16 is correct or are you agreeing with me? Your source says vecAccQ = vwadd_wv_i32m8(vecAccQ, vecAccW, blkCnt); so I assume the vecAccQ register for both the destination and first source.
Yes, vecAccQ is both the destination and first source, but the compiler uses a different destination register (V16) and moves it back to the source (V8) for the next iteration later on:
Here is my loop code (that accumulates a vector, ready for reduction-sum then mean):

```c
blkCnt = blockSize;
/* Perform the accumulates */
while (blkCnt > 0U)
{
    vl = vsetvl_e32m4(blkCnt);
    vecIn = vle32_v_i32m4(pSrc, blkCnt);
    /* The following vwadd_wv instruction requires tail undisturbed policy for this
       function to work correctly when blockSize > vlmax and (blockSize % vlmax) != 0 */
    /* This seems to force tu policy for the vwadd_wv instruction on the current LLVM compiler */
    asm("vsetvli %0, %1, e32,m4,tu,mu" : "=r" (vl) : "r" (blkCnt) );
    vecAccW = vwadd_wv_i64m8(vecAccW, vecIn, blkCnt); /* Accumulate the vectors */
    pSrc += vl;
    blkCnt -= vl;
}
```

```
5f056: 57 f7 26 05   vsetvli a4, a3, e32,m4,ta,mu
5f05a: 07 6e 05 02   vle32.v v28, (a0)              # Load new vector data into V28
5f05e: 57 f7 26 01   vsetvli a4, a3, e32,m4,tu,mu   # The asm insert adds this vsetvl (for tu policy)
5f062: 57 28 8e d6   vwadd.wv v16, v8, v28          # Widen V28, accumulate with V8 and store in V16 (using tu policy)
5f066: 93 17 27 00   slli a5, a4, 2
5f06a: 99 8e         sub a3, a3, a4
5f06c: 3e 95         add a0, a0, a5
5f06e: 57 b4 03 9f   vmv8r.v v8, v16                # Move V16 back to V8
5f072: f5 f2         bnez a3, 0x5f056 <arm_mean_q31+0x16>
```
I want Tail Undisturbed policy so the tail of vecAccW (V16) remains undisturbed (the tail holds previously accumulated data I want to keep). V16 (dest) gets moved back to V8 (source) for the next iteration later on.
It would be better if the compiler used the same register for source and destination (is that a hardware restriction to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically?
The problem is that automatic tu/mu selection is not explicit. From the C level I can't tell what is going on and have to check the assembly output. Also, will the output be the same in the future? Maybe future compiler releases will change the policy and break my code...
Using a mask register will add code and cycles to set up the mask each loop, or if the setup is taken out of the loop, then extra code is required to conditionally switch between unmasked and masked instructions. Either way not optimal.
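A scalar model in plain C (not intrinsics; the fixed model length is made up) of why tu matters in this loop: on the final, shorter iteration only the first `vl` accumulator lanes are written, and the remaining lanes must keep the partial sums from earlier iterations for the later reduction to be correct.

```c
#include <assert.h>

#define VLMAX 4  /* model vector length in elements */

/* Model of the widening-accumulate loop above: each iteration adds up to
   VLMAX elements into acc. The inner loop models vwadd.wv under tail
   undisturbed: lanes vl..VLMAX-1 are left alone on the short final
   iteration, preserving earlier partial sums for the vredsum that
   follows. Tail agnostic could clobber them. */
static void accumulate_tu(long acc[VLMAX], const int *src, int n) {
    while (n > 0) {
        int vl = n < VLMAX ? n : VLMAX;   /* vsetvl */
        for (int i = 0; i < vl; i++)      /* vwadd.wv, tu policy */
            acc[i] += src[i];
        src += vl;
        n -= vl;
    }
}
```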
Just a thought off the top of my head (not thought through, though): as it's the tail (or masked data) in the destination that we are interested in keeping, could we have additional types (e.g. vint64m8_tumu_t) that signal to the compiler that the tail policy is undisturbed (for both tail and mask)? These would have to be fully interchangeable with the original versions so as not to add any more intrinsics. Maybe implement with clever use of typeof() and _Generic()?
from rvv-intrinsic-doc.
> Thank you for the info on how the policy is currently selected, is this documented somewhere?
>
> Yes, vecAccQ is both the destination and first source, but the compiler uses a different destination register (V16) and moves it back to the source (V8) for the next iteration later on:
>
> [loop code and disassembly quoted above]
>
> I want Tail Undisturbed policy so the tail of vecAccW (V16) remains undisturbed (the tail holds previously accumulated data I want to keep). V16 (dest) gets moved back to V8 (source) for the next iteration later on.
Do v8 and v16 contain the same value on the first iteration of the loop? Wouldn't the code require that with the current register allocation? Or does it not matter because the first iteration has the largest VL and you won't read elements past that VL after the loop?
> It would be better if the compiler used the same register for source and destination (is that a hardware restriction, to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically?
It's not a hardware restriction. It's the compiler not knowing what's going on.
The problem is that automatic tu/mu selection is not explicit. From the C level I can't tell what is going on and have to check the assembly output. Also, will the output be the same in the future? Maybe future compiler releases will change the policy and break my code...
Using a mask register will add code and cycles to set up the mask each loop, or if the setup is taken out of the loop, then extra code is required to conditionally switch between unmasked and masked instructions. Either way not optimal.
Just a thought off the top of my head (I've not thought this through): as it's the tail (or masked data) in the destination that we are interested in keeping, could we have additional types (e.g. vint64m8_tumu_t) that signal to the compiler that the tail policy is undisturbed (for both tail and mask)? These would have to be fully interchangeable with the original versions so as not to add any more intrinsics. Maybe implement with clever use of typeof() and _Generic()?
That wouldn't support cases where the user wants the undisturbed elements to come from a third register. We need to provide an extra argument to the intrinsic to give the user full control.
from rvv-intrinsic-doc.
> Do v8 and v16 contain the same value on the first iteration of the loop? Wouldn't the code require that with the current register allocation?
No and no.
> Or does it not matter because the first iteration has the largest VL and you won't read elements past that VL after the loop?
Note: The accumulator vecAccW (v16) is set to zero before the loop (not shown in the above code, sorry!).
Yes, it doesn't matter:

- If the memory array length (blockSize) is <= VLMAX, there is only one iteration, and blockSize is used for the following vector reduction instruction (vredsum).
- If blockSize > VLMAX and (blockSize % VLMAX) == 0, there is more than one iteration. On the last iteration the complete vector register length (VLMAX) is written, and VLMAX is used for the following vredsum.
- If blockSize > VLMAX and (blockSize % VLMAX) != 0, there is more than one iteration, and on the last iteration the complete vector register (VLMAX elements) is not written. This is where I need Tail Undisturbed: I want to preserve the previous tail elements for the following vredsum over all VLMAX elements (not shown in the code above).
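The role of the tail-undisturbed accumulate in these cases can be modeled in plain scalar C. This is only a sketch of the semantics: the `acc` array stands in for the VLMAX-lane accumulator register, and the names and lane count are illustrative assumptions, not taken from the original code.

```c
#include <assert.h>
#include <stdint.h>

#define VLMAX 4 /* assumed vector length in accumulator lanes, for illustration */

/* Model of the strip-mined accumulate: each iteration writes only the first
   `vl` lanes of acc; with tail-undisturbed, lanes [vl, VLMAX) keep their
   previously accumulated values instead of becoming agnostic/garbage.
   The final reduction (vredsum) then sums all VLMAX lanes. */
static int64_t sum_with_tu(const int32_t *src, int blockSize) {
    int64_t acc[VLMAX] = {0}; /* accumulator register, zeroed before the loop */
    int blkCnt = blockSize;
    while (blkCnt > 0) {
        int vl = blkCnt < VLMAX ? blkCnt : VLMAX; /* vsetvl */
        for (int i = 0; i < vl; i++)              /* vwadd.wv with tu policy: */
            acc[i] += src[i];                     /* tail lanes left untouched */
        src += vl;
        blkCnt -= vl;
    }
    int64_t sum = 0;                              /* vredsum over VLMAX lanes */
    for (int i = 0; i < VLMAX; i++)
        sum += acc[i];
    return sum;
}
```

With a tail-agnostic accumulate, the final partial iteration could clobber lanes [vl, VLMAX) of `acc`, and the full-VLMAX reduction would read corrupted values.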
> It would be better if the compiler used the same register for source and destination (is that a hardware restriction to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically?

> It's not a hardware restriction. It's the compiler not knowing what's going on.
How do I tell the compiler what's going on? Is the asm() insert incorrect or missing information?
> That wouldn't support cases where the user wants the undisturbed elements to come from a third register. We need to provide an extra argument to the intrinsic to give the user full control.
I think it would support it. The tail/mask agnostic/undisturbed policy is always for the destination register, so if the undisturbed elements are to come from a third register, then that third register must be the destination register (so its type could be, say, vint64m8_tumu_t to specify undisturbed).
> 3.3.3. Vector Tail Agnostic and Vector Mask Agnostic (`vta` and `vma`)
>
> These two bits modify the behavior of destination tail elements and destination inactive masked-off elements respectively during the execution of vector instructions.

Also from the same section is this note:

> NOTE: To maintain backward compatibility in the short term and reduce software churn in the move to 0.9, when these flags are not specified on a `vsetvli`, they should default to mask-undisturbed/tail-undisturbed. The use of `vsetvli` without these flags should be deprecated, however, such that specifying a flag setting becomes mandatory. If anything, the default should be tail-agnostic/mask-agnostic, so software has to specify when it cares about the non-participating elements, but given the historical meaning of the instruction prior to the introduction of these flags, it is safest to always require them in future assembly code.
So, the compiler should default to tu/mu until a method of specifying tu/mu is available.
from rvv-intrinsic-doc.
This is just another manifestation of the same problem we are seeing elsewhere (e.g., #84). The V-extension defines several control and status registers which are inadequately exposed to the C programmer.
- The `vl` register is exposed via "explicit-VL" arguments;
- The `vlenb` register is exposed by calling `vsetvlmax*` intrinsics;
- The `vsew` and `vlmul` fields of the `vtype` register are partially[*] exposed by hardcoding them into intrinsic function names;
- The `vta` and `vma` fields of the `vtype` register are partially[*] exposed by complicated heuristics that we're debating here;
- The `vill` field of the `vtype` register is not exposed;
- The `vstart` register is not exposed;
- The `vxrm` and `vxsat` registers (also accessible through `vcsr`) are not exposed;
- If floating-point is present, the `frm` and `fflags` registers (also accessible through `fcsr`) are not exposed.

[*] I say "partially" because their exposure is static, whereas the assembly language allows them to be dynamic. For example, the intrinsics programmer cannot access the `vsetvl` instruction (which inputs a dynamic `vtype`).
from rvv-intrinsic-doc.
> How do I tell the compiler what's going on? Is the asm() insert incorrect or missing information?
Right now the closest way to tell the compiler what is going on is to use `vwadd_wv_i32m8_m(vmset_m_b4(blkCnt), vecAccQ, vecAccQ, vecAccW, blkCnt)`, or to create a VLMAX all-ones mask outside the loop. The `vwadd_wv_i32m8_m` intrinsic defaults to tu; the important thing is that it has a maskedoff operand, which must be the same register as the destination. I know this is not ideal, and we're very likely going to change `vwadd_wv_i32m8_m` to default to ta, so it is going to break. So the only forward-compatible way right now is to use `vid.v` and `vmslt.vx` to create a mask and not use tail undisturbed. I'm very sorry about this. I think the intrinsic interface was defined before tu existed, when tail elements were always zeroed.
> I think it would support it. The tail/mask agnostic/undisturbed policy is always for the destination register, so if the undisturbed elements are to come from a third register, then that third register must be the destination register (so its type could be, say, vint64m8_tumu_t to specify undisturbed).
Like any C/C++ function, the behavior of the intrinsic must be defined completely by the values and types of its operands. The assignment operator is considered a separate operation; its type cannot be used to define the behavior. It's not required to be there: you could pass the result of the intrinsic directly to another intrinsic or another function call. So we need the value containing the tail-undisturbed elements to be passed to the intrinsic. This requires a tail-undisturbed vadd intrinsic to take 3 operands instead of 2.
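The 3-operand design being argued for can be sketched as a scalar model. The signature below is hypothetical (names, parameter order, and the lane count are assumptions), showing only the semantics: the tail comes from an explicit operand rather than from whatever register the result happens to be assigned to.

```c
#include <assert.h>
#include <stdint.h>

#define VLMAX 4 /* illustrative lane count */

/* Hypothetical tail-undisturbed add with an explicit third operand:
   lanes [0, vl) receive a + b; tail lanes [vl, VLMAX) are copied from
   tail_src. The tail source is part of the data flow, so the compiler
   knows exactly which values must survive, regardless of register
   allocation. */
static void vadd_tu_model(int32_t dst[VLMAX], const int32_t tail_src[VLMAX],
                          const int32_t a[VLMAX], const int32_t b[VLMAX],
                          int vl) {
    for (int i = 0; i < VLMAX; i++)
        dst[i] = (i < vl) ? a[i] + b[i] : tail_src[i];
}
```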
> So, the compiler should default to tu/mu until a method of specifying tu/mu is available.
The compiler does default to tu/mu on every intrinsic that has a dest or maskedoff operand, or that is an fma instruction. That operand is needed to tell the compiler which register to pick for the destination. The data flow must be explicit in the C code.
from rvv-intrinsic-doc.
In the meantime, please can we have a `#pragma` and/or command-line switch to default all instructions to Tail and Mask Undisturbed?
Also, I don't think the compiler should be deciding on the mask/tail policy. It should default to the safe Undisturbed policy, because as a programmer I expect registers to remain stable and not have some of their data changed without me deciding that's what I want. So, my vote is to make all instructions default to Mask/Tail Undisturbed and then have mechanisms that allow programmers to specify otherwise for specific speedups.
from rvv-intrinsic-doc.
I believe the compiler could select the best tail policy for users, because a clear API design does what the user wants to do. That's why we need a new API (#101) to model the tail policy.
I think `#pragma` is not workable because different policies have different numbers of arguments. Usually `tu` needs an additional argument to specify the tail value.
from rvv-intrinsic-doc.
> In the meantime, please can we have a `#pragma` and/or command-line switch to default all instructions to Tail and Mask Undisturbed? [...] So, my vote is to make all instructions default to Mask/Tail Undisturbed and then have mechanisms that allow programmers to specify otherwise for specific speedups.
We have vl and mask to control which elements are relevant in vector operations. In most cases, users should not care about the tail elements. We are defining a new set of intrinsics with `_t` and `_mt` suffixes to control the tail policy. With these intrinsics, we could use tail agnostic as the default policy for the `_m` intrinsics.
from rvv-intrinsic-doc.
We have merged RFC #137, which is implemented in GCC; the LLVM implementation is expected to land soon. This issue is expected to be resolved for v1.0. Closing the issue.
from rvv-intrinsic-doc.