Intrinsic functions supported are inconsistent on different HW config

riscv-non-isa commented on August 17, 2024

Intrinsic functions supported are inconsistent on different HW config

from rvv-intrinsic-doc.

Comments (14)

rdolbeau commented on August 17, 2024

All the vector values have the bit width encoded in them, so why not the scalars? That's what 'stdint.h' is for. Using 'long' for anything is non-portable (illustrated by the fact that RV64's long is 64 bits vs. 32 bits in several other architectures; Intel had the same problem in some old MKL interface where x86-64 & IA64 were behaving differently...). I'd avoid it if possible, and use the types from stdint.h everywhere:

vuint64m1_t vslide1up_vs_u64m1 (vuint64m1_t src, uint64_t value);
vint64m1_t vslide1up_vs_i64m1 (vint64m1_t src, int64_t value);
vuint32m1_t vslide1up_vs_u32m1 (vuint32m1_t src, uint32_t value);
vint32m1_t vslide1up_vs_i32m1 (vint32m1_t src, int32_t value);

The semantics are well defined for the user (including whether sign-extension happens or not), so the compiler only needs to make things work as expected (... which brings us back to our discussion of semantic vs. 1-1 mapping, in a way...).

Cordially,

zakk0610 commented on August 17, 2024

vslide1up's rs1 is an XLEN-sized register:
vslide1up.vx vd, vs2, rs1, vm # vd[0]=x[rs1], vd[i+1] = vs2[i]

so in RV64, the interface would look like

vint32m1_t vslide1up_vs_i32m1(vint32m1_t src, int64_t value);  
vint64m1_t vslide1up_vs_i64m1(vint64m1_t src, int64_t value);

in RV32,

vint32m1_t vslide1up_vs_i32m1(vint32m1_t src, int32_t value);
vint64m1_t vslide1up_vs_i64m1(vint64m1_t src, int32_t value);

An explicit type is clear, but unfortunately RV32 and RV64 then cannot share the same API.
That's why we chose the long type for value.
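To make the instruction comment above concrete, here is a minimal scalar model in plain C of vslide1up.vx for SEW=32 on an XLEN=64 machine. It is only a sketch: the function name and the array-based "vector" are hypothetical, and masking and tail handling are ignored. It shows that only the low SEW bits of the XLEN-wide scalar reach vd[0], which is what the vector spec prescribes when XLEN > SEW.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of vslide1up.vx with SEW=32 and XLEN=64.
 * dst and src must not overlap; masking and tail policy are not modeled. */
static void model_vslide1up_e32_xlen64(uint32_t *dst, const uint32_t *src,
                                       size_t vl, int64_t x_rs1)
{
    if (vl == 0)
        return;
    /* vd[0] = x[rs1]: only the low 32 bits of the 64-bit scalar are kept */
    dst[0] = (uint32_t)x_rs1;
    /* vd[i+1] = vs2[i] */
    for (size_t i = 0; i + 1 < vl; i++)
        dst[i + 1] = src[i];
}
```

With this model, passing a 64-bit scalar whose upper half is non-zero still yields only its lower half in element 0, which is why the RV64 prototype can take an int64_t even for 32-bit elements.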

rdolbeau commented on August 17, 2024

@zakk0610 You don't have to have both interfaces if the hardware behaves differently, but it helps.

Let me try to explain what I mean. It would be possible for the headers to look like:

#if __riscv_xlen == 64
vint32m1_t vslide1up_vs_i32m1(vint32m1_t src, int64_t value);  
vint64m1_t vslide1up_vs_i64m1(vint64m1_t src, int64_t value);
#elif __riscv_xlen == 32
vint32m1_t vslide1up_vs_i32m1(vint32m1_t src, int32_t value);
vint64m1_t vslide1up_vs_i64m1(vint64m1_t src, int32_t value);
#else
#error "Oups"
#endif

This means code using vslide1up will have to be aware of the discrepancy. It's a nuisance, but it reflects the hardware... a bit too much for my (software) taste.

I'd much rather have a 'homogeneous' interface, where the code will not compile on the wrong bit width. I think the names should be considered globally rather than per-XLEN, which makes those part of the conflicting namespace from #3. So I would first suggest something along the lines of (I probably don't use the proper naming to discriminate, but you'll get the point):

#if __riscv_xlen == 64
vint32m1_t vslide1up_vs64_i32m1(vint32m1_t src, int64_t value);  
vint64m1_t vslide1up_vs64_i64m1(vint64m1_t src, int64_t value);
#elif __riscv_xlen == 32
vint32m1_t vslide1up_vs32_i32m1(vint32m1_t src, int32_t value);
vint64m1_t vslide1up_vs32_i64m1(vint64m1_t src, int32_t value);
#else
#error "Oups"
#endif

The code won't compile on the wrong architecture, so software is reasonably safe.

Ultimately, it should be possible to implement the 'semantic variants' of those, where for example vslide1up_vs32_i32m1(vint32m1_t src, int32_t value); will have the same behavior as vslide1up_vs64_i32m1 in RV64GCV, but doesn't need to ignore the upper 32 bits of the scalar - trivial to implement.

Similarly, vslide1up_vs64_i64m1 in RV32GCV would have to use two vector instructions, moving 32 bits at a time from the pair of scalar registers (used to implement the scalar type int64_t) into the (semantically) lower 64 bits of the vint64m1_t vector - less trivial, but not excessively complex.

This way, the developer doesn't have to worry about implementation details: it's taken care of by the tools, or it doesn't compile if it won't work as expected. It may hide the performance question a bit, but that's a different problem.
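The two-instruction idea for RV32 can be sketched with a plain C model. This is an illustration under stated assumptions, not compiler output: the helper names are hypothetical, the register is modeled as an array of 32-bit words with the low word of each 64-bit element first, and the vsetvli changes a real lowering would need (SEW=32, doubled vl) as well as masking are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One vslide1up.vx step at SEW=32, in place: every word moves up by one
 * position and the scalar lands in word 0. */
static void slide1up_e32(uint32_t *words, size_t nwords, uint32_t x)
{
    if (nwords == 0)
        return;
    for (size_t i = nwords - 1; i > 0; i--)
        words[i] = words[i - 1];
    words[0] = x;
}

/* Hypothetical RV32 lowering of a 64-bit vslide1up: the register holds
 * n64 64-bit elements, viewed as 2*n64 32-bit words (low word first). */
static void slide1up_e64_on_rv32(uint32_t *words, size_t n64, int64_t value)
{
    uint64_t bits = (uint64_t)value;
    slide1up_e32(words, 2 * n64, (uint32_t)(bits >> 32)); /* high half first */
    slide1up_e32(words, 2 * n64, (uint32_t)bits);         /* then low half */
}

/* Read 64-bit element i back out of the 32-bit word view. */
static uint64_t get_e64(const uint32_t *words, size_t i)
{
    return (uint64_t)words[2 * i] | ((uint64_t)words[2 * i + 1] << 32);
}
```

The high half goes in first because the second slide pushes it up into word 1, leaving the low half in word 0, so element 0 ends up holding the full 64-bit scalar and the old element 0 becomes element 1.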

zakk0610 commented on August 17, 2024

@rdolbeau
Sorry, I don't understand why long is non-portable. The long type corresponds to XLEN, so it works on both RV32 and RV64 platforms.
In the future, if we have RV128GCV, I think the interfaces would become more complicated with your approach.

I would prefer the declarations below, or detecting an error at compile time:
#if __riscv_xlen == 64
vint32m1_t vslide1up_vs_i32m1(vint32m1_t src, long value);
vuint32m1_t vslide1up_vs_u32m1(vuint32m1_t src, unsigned long value);
vint64m1_t vslide1up_vs_i64m1(vint64m1_t src, long value);
vuint64m1_t vslide1up_vs_u64m1(vuint64m1_t src, unsigned long value);
#elif __riscv_xlen == 32
vint32m1_t vslide1up_vs_i32m1(vint32m1_t src, long value);
vuint32m1_t vslide1up_vs_u32m1(vuint32m1_t src, unsigned long value);
#else
#error "Oups"
#endif

because in RV32 with SEW=64, the native HW does not support zero-extending the scalar to SEW.
E.g. vuint64m1_t vslide1up_vs_u64m1(vuint64m1_t src, unsigned long value); is an illegal interface.
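The extension issue can be shown with a small C model. Per the vector spec, when XLEN < SEW the scalar operand is sign-extended to SEW bits; the sketch below (hypothetical function names, plain integers instead of vector registers) compares that with the zero-extension a caller passing an unsigned value would expect.

```c
#include <stdint.h>

/* What a single vslide1up.vx would place in vd[0] on RV32 with SEW=64:
 * the 32-bit register value sign-extended to 64 bits. The xor/subtract
 * form avoids implementation-defined signed casts. */
static uint64_t rv32_scalar_to_e64_hw(uint32_t x_rs1)
{
    return ((uint64_t)x_rs1 ^ 0x80000000u) - 0x80000000u;
}

/* What the caller of an unsigned interface would expect: zero extension. */
static uint64_t rv32_scalar_to_e64_unsigned(uint32_t value)
{
    return (uint64_t)value;
}
```

The two agree as long as bit 31 of the scalar is clear and differ as soon as it is set, which is the case the single-instruction mapping cannot express.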

P.S. I expect any one-to-many instruction expansion to happen in a higher abstraction layer or a semantic intrinsics layer.

rdolbeau commented on August 17, 2024

@zakk0610 'long' is not portable as the amount of data it represents is variable depending on the machine on which you're working. You're right that it's always XLEN on RISC-V, but XLEN varies. So it can hold 32 (RV32, x86) or 64 bits (RV64, IA64).

Algorithms seldom work that way. They tend to work with a known amount of data (usually for integers) or a required accuracy (usually for FP). That's illustrated by the vector types: you don't use 'vint' or 'vlong', you use 'vint32' or 'vint64'.

For me, anything that injects/extracts data into/from a vector should follow the same principle, and specify the number of bits in the source code - thus using 'int32_t' or 'int64_t', rather than 'int' or 'long' - because the latter will vary between platforms despite the fact that the SEW doesn't. So you find yourself in the situation above, where you need to have 'long' mean either 32 bits or 64 bits, which adds a level of complexity.
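A small, hedged illustration of the point in standard C11: the fixed-width types can be checked at compile time, while the size of long can only be queried, because it depends on the platform ABI. The helper name is hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/* Fixed-width types have the same size on every platform that provides them,
 * so these checks can be made at compile time. */
_Static_assert(sizeof(int32_t) * CHAR_BIT == 32, "int32_t is 32 bits");
_Static_assert(sizeof(int64_t) * CHAR_BIT == 64, "int64_t is 64 bits");

/* The size of long is ABI-dependent: typically 32 bits on ILP32 targets
 * such as RV32 and 64 bits on LP64 targets such as RV64. */
static unsigned long_storage_bits(void)
{
    return (unsigned)(sizeof(long) * CHAR_BIT);
}
```

No equivalent static assertion can be written for long without first picking a platform, which is the non-portability being described.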

With my proposal, ultimately (i.e. once 'semantic' is implemented) you don't need to test for architecture in the header:

vint32m1_t vslide1up_vs64_i32m1(vint32m1_t src, int64_t value);
vint64m1_t vslide1up_vs64_i64m1(vint64m1_t src, int64_t value);
vint32m1_t vslide1up_vs32_i32m1(vint32m1_t src, int32_t value);
vint64m1_t vslide1up_vs32_i64m1(vint64m1_t src, int32_t value);
(... you can have the unsigned variant as well ...)

This is fully specified on either RV32 or RV64. The code generated will be different, sure, but that's the compiler's problem, not the problem of whoever is implementing the algorithm.

I do agree to an extent with your postscript; you could have the 'native' version (perhaps even using the native 'long'), and then have the 'semantic' version that would be defined as above, using fixed bit widths for the scalar, illustrating that this is less 'native' and more 'semantic'.

rdolbeau commented on August 17, 2024

@zakk0610 One more thing; when you say 'ex. vuint64m1_t vslide1up_vs_u64m1(vuint64m1_t src, unsigned long value); is illegal interface' - I disagree. What I think you should say is that the operation is not supported as a single instruction in hardware. But the semantics of the function are well defined (if you know how big a 'long' is...). It's perfectly implementable by moving 32 bits twice, the first all-0, the second the value parameter.

That's why I started our discussions by insisting on the difference between '1-to-1 mapping' intrinsics and 'semantic' intrinsics. The first category is a pain to use because it exposes a lot of hardware subtleties that are not important to software developers. The second category is a bit more complex to support because some of the corner cases may need workarounds in the compiler to handle those subtleties, but it allows for much easier writing/debugging and much cleaner code.

Edit: it's also why you often need to reinterpret data - to be able to implement the missing semantics, you sometimes need to fall back on other data widths & some voodoo. The most egregious example I know of is that you don't have 8-bit shifts in SSE/AVX (!), so you use 16-bit shifts + some masking... it's ugly, but it works! (and then you try to remember to fix that ugliness in other, cleaner SIMD ISAs, otherwise you get egg on your face: openzfs/zfs#9725 ;-) )
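The wide-shift-plus-mask trick can be sketched in portable C, with a uint64_t standing in for the vector register (eight 8-bit lanes). This is a scalar model of the idea, not the SSE code itself; on SSE2 the same shape would typically be written with _mm_slli_epi16 followed by _mm_and_si128. The function name is hypothetical.

```c
#include <stdint.h>

/* Shift each of the eight bytes packed in x left by n (0..7) using one
 * wide shift, then mask off the bits that crossed into a neighboring lane. */
static uint64_t shl_u8x8(uint64_t x, unsigned n)
{
    uint64_t lane_mask = (0xFFu << n) & 0xFFu; /* bits that may remain set per lane */
    return (x << n) & (lane_mask * 0x0101010101010101ULL);
}
```

The wide shift lets the top bits of each byte leak into the low bits of the byte above it; the replicated mask clears exactly those low n bits in every lane, which restores per-byte shift semantics.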

zakk0610 commented on August 17, 2024

Thanks for your patience. I now understand the drawback, from an interface design point of view, of putting the long type in the API.
I still think a long-typed scalar is fine and only a long-typed vector would be terrible, as you mentioned. I think that's why the current EPI and SiFive proposals only define the long type for scalar values.

But you are right, we can make everything consistent and friendly in a 'semantic' intrinsics layer, which is very helpful for users.

> Edit: it's also why you often need to reinterpret data - to be able to implement the missing semantics, you sometimes need to fall back on other data widths & some voodoo. The most egregious example I know of is that you don't have 8-bit shifts in SSE/AVX (!), so you use 16-bit shifts + some masking... it's ugly, but it works! (and then you try to remember to fix that ugliness in other, cleaner SIMD ISAs, otherwise you get egg on your face: openzfs/zfs#9725 ;-) )

Agreed, the reinterpret function is important :)
BTW, I am curious: do SSE/AVX have 'semantic' intrinsics to support 8-bit shifts, or are 'semantic' intrinsics always implemented by the intrinsics user?

rdolbeau commented on August 17, 2024

No, they don't have the 8-bit shift...

They do offer many 'semantic-like' intrinsics in the intrinsics guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide), such as the whole SVML (Short Vector Math Library), but they're not really defined as such - it's more an 'ad-hoc' behavior of the compiler...

Most intrinsics will behave 'semantically' in practice. For instance, take "_mm_add_ps". It's documented as an SSE intrinsic, and it will generate an SSE ADDPS by default. But if you specify you want support only for AVX machines (e.g. "icc -xCORE-AVX-I"), then it will emit a 128-bit AVX VADDPS, VEX-encoded. If you're on an AVX-512 machine (with the VL extension, e.g. "icc -xCORE-AVX512"), you may even get a 128-bit VADDPS but EVEX-encoded instead, if it needs to get access to the extra 16 vector registers available in AVX-512. Of course, wider registers will require the corresponding extension or better (_mm256_add_ps doesn't work unless you enable AVX, ...).
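As a small illustration of the same source serving several encodings, the sketch below calls _mm_add_ps through a hypothetical add4 helper. The portable fallback is an addition for this example so that it also builds on non-x86 hosts. With GCC or Clang on x86, building once with -msse2 and once with -mavx and comparing the assembly should show ADDPS in the first case and VADDPS in the second.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Add four floats. On x86 targets with SSE enabled this goes through the
 * SSE intrinsic; which encoding is emitted depends on the compiler flags. */
static void add4(float *out, const float *a, const float *b)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#else
    for (size_t i = 0; i < 4; i++) /* portable fallback for non-x86 hosts */
        out[i] = a[i] + b[i];
#endif
}
```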

And beyond that, if you have something like _mm_add_ps(_mm_mul_ps), on an FMA-enabled machine, the compiler might combine them into an FMA operation if otherwise permitted to do so...

It's quite convenient in practice, but not well defined in theory... but what I do is practice so it works for me ;-)

zakk0610 commented on August 17, 2024

Got it, thanks :)
It looks like the compiler can help support 'semantic' intrinsics.

In your example, the _mm_add_ps interface works on both SSE and AVX because this intrinsic function is implemented with GCC's vector operators, and the AVX backend also supports vector operators.
It means that if an intrinsic function can be translated to the compiler's internal representation, the intrinsic will behave 'semantically'.

In the other case, the _mm_add_ss intrinsic function would not work for AVX because it calls the SSE-specific builtin function __builtin_ia32_addss, which is only available when the -msse option is used. (https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html)

Hsiangkai commented on August 17, 2024

Why do we need to provide different intrinsics under different HW configurations? We could provide a unified interface for users. For example,

vint32m1_t vslide1up_vx_i32m1(vint32m1_t src, int32_t value);
vint64m1_t vslide1up_vx_i64m1(vint64m1_t src, int64_t value);

The type of the scalar is consistent with the SEW of the vector type.
It should make sense to users. Although the code generation will differ under different HW configurations, that is the job of the compiler. The only thing users need to be aware of is that using int64_t to operate on vint64m1_t will be slower if XLEN = 32.

I think we have no need to provide intrinsics with the long type. I doubt users will expect to use long to operate on vint8m1_t, vint16m1_t, etc.

When SEW <= XLEN, it is still a one-to-one mapping. Only when SEW > XLEN does the compiler need to generate a sequence of instructions to deal with it.
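The rule in the last paragraph is simple enough to write down as a tiny C helper. This is only a restatement of the condition, with hypothetical names, and not a description of how any particular compiler decides.

```c
/* How a scalar operand of an intrinsic would be lowered under the rule above. */
enum lowering { LOWER_ONE_TO_ONE, LOWER_SEQUENCE };

static enum lowering scalar_operand_lowering(unsigned sew_bits, unsigned xlen_bits)
{
    /* The scalar fits in one x register when SEW <= XLEN. */
    return sew_bits <= xlen_bits ? LOWER_ONE_TO_ONE : LOWER_SEQUENCE;
}
```

Under this rule only the SEW=64 intrinsics on RV32 need a multi-instruction sequence; every other SEW/XLEN combination discussed in this thread stays a single instruction.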

rdolbeau commented on August 17, 2024

@Hsiangkai Well, that's what I suggested in https://github.com/sifive/rvv-intrinsic-doc/issues/9#issuecomment-615171839 :-)

Then you run into the differences between intrinsics I tried to describe in https://github.com/sifive/rvv-intrinsic-doc/issues/7#issuecomment-615751884

zakk0610 commented on August 17, 2024

https://github.com/sifive/rvv-intrinsic-doc/commit/7d20e8477f3a4dbd11aac31293d4fd2a8795f33b

Hsiangkai commented on August 17, 2024

I wrote a section for this issue.
https://github.com/sifive/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md#scalar-in-vector-operations

topperc commented on August 17, 2024

In LLVM, vmv_x_s_i64m1_i64 reads bits 63:0 of the vector register into two GPRs. For the return from vmvxsi64m1_s_nomask_builtin_test, bits 31:0 will be put in a0, and bits 63:32 will be put in a1. The intent is to make rv32 vs rv64 invisible to the C programmer. ~Craig

…

On Feb 15, 2023, at 10:20 PM, Jin Ma wrote:

Hi, I am trying to handle the function vmv_x_s_i64m1_i64 on rv32 in gcc. My return value is an int64_t (low 32 bits stored in register a0 and high 32 bits stored in register a1), but the actual return value is stored in a 32-bit register (a0), so how should the high 32 bits of the int64_t (a1) be handled? I think a sign extension from int32_t to int64_t may be needed. Simply ignoring the upper 32 bits (a1) would leave unknown data there and produce wrong results. This is my function:

int64_t vmvxsi64m1_s_nomask_builtin_test (int64_t *x, size_t vl)
{
  vint64m1_t vx;
  vx = vle64_v_i64m1 (x, vl);
  return vmv_x_s_i64m1_i64 (vx);
}
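To see why the a0/a1 pair described in the reply matters, here is a minimal C model of a 64-bit return value split across two 32-bit registers. The struct and function names are hypothetical, and nothing here is actual GCC or LLVM code. It compares reading both halves with sign-extending a0 alone, the alternative raised in the question.

```c
#include <stdint.h>

/* Model of a 64-bit value returned in two 32-bit registers on RV32. */
struct reg_pair {
    uint32_t a0; /* bits 31:0 */
    uint32_t a1; /* bits 63:32 */
};

static struct reg_pair split_i64(int64_t v)
{
    uint64_t bits = (uint64_t)v;
    struct reg_pair r = { (uint32_t)bits, (uint32_t)(bits >> 32) };
    return r;
}

/* Recombine both registers: recovers the original 64 bits. */
static uint64_t join_pair(struct reg_pair r)
{
    return (uint64_t)r.a0 | ((uint64_t)r.a1 << 32);
}

/* Sign-extend a0 and ignore a1: loses the upper half of the element. */
static uint64_t sign_extend_a0_only(struct reg_pair r)
{
    return ((uint64_t)r.a0 ^ 0x80000000u) - 0x80000000u;
}
```

For an element such as 0x0000000180000000, joining both registers returns the element unchanged, while sign-extending a0 alone yields 0xFFFFFFFF80000000, so the upper 32 bits have to come from the vector element itself.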
