
Comments (18)

rdolbeau commented on July 18, 2024

@zakk0610 AVX doesn't type the integer vector down to SEW, so in an AVX implementation you never need to 'reinterpret' between integer types - there is only one integer vector type per register width (__m128i, __m256i, __m512i). Of course, other ISAs may have different syntactic requirements; this is just illustrating the kind of things algorithms do.
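
As a minimal C illustration of that point (the AVX2 side is real; the contrast with RVV's per-SEW vector types is the reason a reinterpret intrinsic is being requested here):

    #include <immintrin.h>

    /* AVX2: the same __m256i carries every integer element width, so code can
       switch from 32-bit adds to byte shuffles on the same value with no cast.
       In the RVV intrinsics the two steps would involve distinct types
       (e.g. vint32m1_t vs. vint8m1_t), hence the need for a reinterpret. */
    __m256i mix_element_widths(__m256i v) {
        v = _mm256_add_epi32(v, _mm256_set1_epi32(1));          /* 32-bit view */
        return _mm256_shuffle_epi8(v, _mm256_setzero_si256());  /* 8-bit view  */
    }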


rdolbeau commented on July 18, 2024

I put an example in #8 where I'd use reinterpretation between LMUL as a poor man's version of a 128-bit shuffle...

For examples of code reinterpreting data, see https://bench.cr.yp.to/supercop.html; crypto is a repeat offender...

crypto_core/mult3sntrup761/avx2unsigned/mult3_32x32.c:

  __m256i aodd_b2 = _mm256_mul_epi32(aodd, b_br); // <= 16
(...)
  aeve_b3 = _mm256_add_epi8( aeve_b3 , aodd_b2 ); // <= 32

crypto_encode/761x1531/avx/encode.c:

    x = _mm256_add_epi16(x,_mm256_set1_epi16(2295));
    x &= _mm256_set1_epi16(16383);
    x = _mm256_mulhi_epi16(x,_mm256_set1_epi16(21846));
    y = x & _mm256_set1_epi32(65535);
    x = _mm256_srli_epi32(x,16);

crypto_kem/kyber90s768/avx2/aes256ctr.c:

nv1 = _mm_shuffle_epi8(_mm_add_epi32(nv0i, _mm_set_epi64x(1,0)), _mm_set_epi8(8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7));


zakk0610 commented on July 18, 2024

Hi @rdolbeau
Thanks! I understand the reason now. In the encode.c case:

1: x = _mm256_add_epi16(x,_mm256_set1_epi16(2295));
2: x &= _mm256_set1_epi16(16383);
3: x= _mm256_mulhi_epi16(x,_mm256_set1_epi16(21846));
4: y = x & _mm256_set1_epi32(65535);
5: x = _mm256_srli_epi32(x,16);

Lines 1-3 and 5 are not a problem, but in line 4,
x is the result of 16-bit-element operations and it is combined with a 32-bit-element vector.
Without a reinterpret function, the user needs additional operations to achieve the same result.
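
As a sketch (not from the thread) of what lines 4-5 could look like in RVV if such a reinterpret intrinsic existed - the intrinsic names below are assumed from the draft API and may not match the final spelling:

    #include <riscv_vector.h>

    /* Hypothetical translation of lines 4-5 above, using assumed draft intrinsic
       names. x was produced by 16-bit-element operations (lines 1-3); the
       reinterpret lets the same register bits feed 32-bit-element operations. */
    vuint32m1_t split_low_high(vuint16m1_t x, vuint32m1_t *y, size_t vl) {
        vuint32m1_t x32 = vreinterpret_v_u16m1_u32m1(x);  /* no data movement */
        *y = vand_vx_u32m1(x32, 65535, vl);               /* line 4: y = x & 0xFFFF */
        return vsrl_vx_u32m1(x32, 16, vl);                /* line 5: x >>= 16 */
    }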


zakk0610 commented on July 18, 2024

As https://github.com/sifive/rvv-intrinsic-doc/issues/8#issuecomment-617470760 mentioned, SLEN would change the layout of data, which means "reinterpretation between different types of the same LMUL" only works for LMUL=1.
Examples 5 and 6 in https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#42-mapping-with-lmul--1 show that with SLEN=128b, if we reinterpret SEW=32b as SEW=256b, elements 0-7 at SEW=32b do not correspond to the first element at SEW=256b.
But in an SLEN=256b configuration, reinterpreting SEW=32b as SEW=256b works, because the correspondence is preserved (index 0 at 256b equals indices 0-7 at 32b).

In addition, the RVV ISA was designed to allow the same binary code to work across variations in VLEN and SLEN; I believe this means we should not support intrinsic functions that are not SLEN-portable.


nick-knight commented on July 18, 2024

the RVV ISA was designed to allow the same binary code to work across variations in VLEN and SLEN; I believe this means we should not support intrinsic functions that are not SLEN-portable.

I strongly agree with the first statement, that the architects designed the RVV ISA with this type of portability in mind. I know (from asking them) that they strongly disapprove of an intrinsics library that facilitates such non-portabilities. I think they're overstepping their bounds: they've already had their say, the ISA...

This of course opposes @rdolbeau's perspective from https://github.com/sifive/rvv-intrinsic-doc/issues/8#issuecomment-615358469:

And intrinsics are not used to do it "the way it was designed to be programmed". They're used to push the architecture to its limits, either performance or semantic, so that it does things it wasn't designed to do in the first place.

I'm not optimistic that we'll find common ground. In particular, I have two concerns:

  1. If we don't offer the type-punning tools @rdolbeau is requesting, then it will lead to divergence down the road.
  2. Offering these type-punning tools will further complicate the API and add serious technical hurdles (like I insinuated in https://github.com/sifive/rvv-intrinsic-doc/issues/8#issuecomment-617918874) to an already challenging and understaffed project.

The best way I can think of to navigate these concerns is to go through the effort of formally specifying the type-punning tools in the API, possibly marked as a "black diamond" extension to warn away beginners, and to defer the implementation to future work, when we have the manpower.


rdolbeau commented on July 18, 2024

@zakk0610 I'm sorry if this sounds harsh, but... if you try to send the message "that's not what the philosophy is, so we won't let you do it", the message that will be received is "this is an academic toy, so don't use it"... You design the hardware; don't try to design what the user will do with the hardware. It's their job, not yours, and they won't like it.

Arm pushed scalability for SVE as well. The first question they heard from the HPC guys was "Can we fix the vector width at compile time?"... You need to know how much data you're working with for things like cache blocking or full unrolling, and scalability goes against that. Ditto for algorithms working on a fixed amount of data - sometimes you can vectorize across blocks (chacha20 does that, hence the big transpose at the end), sometimes you can't, and then you need to know how much data you have in a vector.

And in many cases the code is compiled for one single architecture not to be 'portable', because it's going to run on one single supercomputer - if it needs to run elsewhere, it will be recompiled anyway.

@knightsifive I'll comment in #8, and I agree that the implementations of the most difficult/unusual concepts can be deferred.


rofirrim commented on July 18, 2024

@zakk0610 thanks, I see your point: under SLEN != VLEN, moving from a smaller SEW to a larger SEW is problematic. However, moving from a larger one to a smaller one should not, I think, be a problem (under the assumption, which I think still holds, that everything is a power of two).

In that sense, and given the possibility that there can be implementations where SLEN=VLEN, I believe it should be possible to reinterpret vectors in any way within the same LMUL (not just LMUL=1). I admit that a portability warning may be due here.

Perhaps SLEN should be easier for the user to determine; in particular, if the user embraces reinterpretation for whatever reason (let's not try to foresee exactly what those needs are, lest we risk constraining what can be done with the V extension), he or she will need to be aware of SLEN. Maybe another intrinsic? I seem to recall vid can be used for that, so it shouldn't be too difficult to provide this functionality (i.e., no need for a new CSR).


zakk0610 commented on July 18, 2024

It seems to me that there are two types of portability: binary and performance. I was referring to binary portability, based on Krste's slide, and RVV does not promise "portable" performance.

@rdolbeau thanks again, I learned a lot from the HPC developer's point of view. The problem we face with RVV is that it is much more configurable in hardware than traditional SIMD architectures, and we want to design a universal set of intrinsic functions to support all hardware combinations. But HPC developers do not like that; they already have a hardware configuration in mind when they optimize.

@rofirrim
Yes, users can get SLEN via vid (riscv/riscv-v-spec#233 (comment)).
I'm not sure whether assuming users need to be aware of SLEN makes sense, because in riscv/riscv-v-spec#233 (comment) @aswaterman mentioned that "SLEN should be invisible to the programmer".

I don't have strong feelings about those reinterpretation functions now, but I do wonder whether the compiler needs to encode the SLEN=VLEN or SLEN!=VLEN assumption somewhere, to avoid linking binaries built with different assumptions.


rdolbeau commented on July 18, 2024

@zakk0610 Yes, 'portability' means multiple things; you are right that 'portable performance' is pretty much a myth :-) As soon as the hardware has some sort of specific behaviour that can be leveraged, it will be leveraged, and that will not be portable to hardware without the specific behaviour. Examples include the 24-bit multiplier in some GPUs, or the let's-make-the-user-code-cache-line-size-aware dcbz on PowerPC...

'Binary' or 'behaviour' or 'semantic' compatibility is a worthy goal - it has made Intel what it is, so obviously it's commercially important as well. However, some people are willing to forgo it for other objectives, such as performance. That's why a lot of HPC codes are compiled with 'ifort -xCORE-AVX512' nowadays (yes, Fortran, sorry ;-) ), which won't run on anything before Skylake (not even Knights Landing, which has a different subset of AVX-512 and prefers -xMIC-AVX512...). You definitely want the baseline tools to have such compatibility - but you don't want to enforce it, because that takes options out of the user's hands.

Phrased differently - if I can write fast but non-portable code in pure assembly, there's no point preventing me from doing it in intrinsics as well. In fact, it might help portability, because then some of the 'dangerous' behaviours might be identifiable and 'warning-able' by the compiler - thus telling the less-aware users that they probably shouldn't be doing that in the first place...

And in other words - no-one /wants/ to write intrinsics-level, non-portable, ugly, hard-to-write, nightmarish-to-debug, impossible-to-maintain, code. But sometimes, the job requires it...


kito-cheng commented on July 18, 2024

@rdolbeau I couldn't accept reinterpretation between different SEW with the same LMUL before; the reason is that it breaks portability, and my thought was that we should not create APIs that let programmers write non-portable code. But your point convinced me :P

It is like parts of the C language: there is a lot of magic that can be done if we don't care about some portability - pointer casts between types of different sizes, assuming floating point is IEEE 754, assuming the machine is little-endian, assuming strings are ASCII... Those assumptions are not guaranteed by the C/C++ standards, so the code is not portable to certain uncommon machines/environments, but it can be faster if we ignore those portability issues.
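
As a small illustration of that kind of C-level "magic" (a sketch, not from the discussion; it assumes an IEEE 754 binary32 float representation, which the C standard does not require):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Reinterpret a float's bits as an integer to extract its exponent.
       The memcpy-based punning itself is well-defined C, but the bit layout
       assumed here (IEEE 754 binary32) is not guaranteed by the standard. */
    static int float_exponent(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* same 32 bits, new type */
        return (int)((bits >> 23) & 0xff) - 127;
    }

    int main(void) {
        printf("exponent of 6.0f is %d\n", float_exponent(6.0f));  /* prints 2 */
        return 0;
    }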

But personally I still don't like APIs that open the door to non-portability, so I vote +0.5 for adding a reinterpret API for all LMUL :P


Hsiangkai commented on July 18, 2024

I wrote a section for this issue:
https://github.com/sifive/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md#reinterpret-sew
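
As a rough illustration of the kind of API that section describes (the names below follow the draft's vreinterpret naming convention and are assumed; exact spellings may differ), such casts only retype the register bits and generate no instructions:

    #include <riscv_vector.h>

    /* Assumed draft intrinsic names; purely illustrative. */
    vuint8m1_t as_bytes(vuint32m1_t v)  { return vreinterpret_v_u32m1_u8m1(v); }
    vint32m1_t f32_bits(vfloat32m1_t v) { return vreinterpret_v_f32m1_i32m1(v); }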


nick-knight commented on July 18, 2024

@Hsiangkai: Thanks for writing this up! I suggest adding a minor clarification:
https://github.com/sifive/rvv-intrinsic-doc/pull/14


nick-knight commented on July 18, 2024

On a related note, I recently suggested adding a "reinterpret_cast" instruction to the ISA:
riscv/riscv-v-spec#434 (comment)
This addresses the more general case of changing both SEW and LMUL, so handles register-group "fission" and "fusion" as well. (It's a no-op in certain cases, like SLEN = VLEN.) However, I doubt the task group will approve it.


ebahapo commented on July 18, 2024

I guess that when loading and storing registers the fission and fusion is effectively performed on the bits, but it would be better not to have to go through a store and a load to achieve the same result.


nick-knight commented on July 18, 2024

@ebahapo in the worst case of fission/fusion between LMUL = 1 and LMUL = 8, this could require 9 memory operations. A HW implementation would only need to use a subset of the memory logic, e.g., the realignment logic but not the memory port.
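
For reference, a sketch (using assumed draft intrinsic names) of the memory round trip being discussed: fusing eight LMUL=1 registers into one LMUL=8 register group costs eight stores plus one load - the nine memory operations mentioned above.

    #include <stddef.h>
    #include <riscv_vector.h>

    /* "Fusion" through memory: 8 LMUL=1 stores followed by 1 LMUL=8 load.
       scratch must hold at least 8*vl1 elements; intrinsic names are assumed. */
    vint32m8_t fuse_via_memory(vint32m1_t v0, vint32m1_t v1, vint32m1_t v2, vint32m1_t v3,
                               vint32m1_t v4, vint32m1_t v5, vint32m1_t v6, vint32m1_t v7,
                               int32_t *scratch, size_t vl1) {
        vse32_v_i32m1(scratch + 0 * vl1, v0, vl1);
        vse32_v_i32m1(scratch + 1 * vl1, v1, vl1);
        vse32_v_i32m1(scratch + 2 * vl1, v2, vl1);
        vse32_v_i32m1(scratch + 3 * vl1, v3, vl1);
        vse32_v_i32m1(scratch + 4 * vl1, v4, vl1);
        vse32_v_i32m1(scratch + 5 * vl1, v5, vl1);
        vse32_v_i32m1(scratch + 6 * vl1, v6, vl1);
        vse32_v_i32m1(scratch + 7 * vl1, v7, vl1);
        return vle32_v_i32m8(scratch, 8 * vl1);
    }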

Unfortunately, I don't have convincing/realistic applications of this, so I've been unable to make a strong case for it in the task group.


rdolbeau commented on July 18, 2024

I'm not sure if this is the proper issue, but I have another example where reinterpretation is needed - though if I understand how masking is done in RVV, it's probably a tough one (the bits aren't in the same places for all types?).

A convergence loop where the 'stop bit' for each element is in its own array, and its data width is not always the same as that of the FP data. The natural way would be to compare the array to generate the mask, then use the mask throughout the computation (at the end of the computation, you also need to update the array with the new convergence result). You can't do that if the masks are different...

Example code where I end up copying the 32-bit values (they could even be only 8 bits...) into a 64-bit temporary array so I get the properly typed mask: https://github.com/HydroBench/Hydro/, branch risc-v, directory HydroC/HydroC99_2DMpi/Src, file riemann.c.
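
For context, a simplified plain-C sketch of that workaround (names are illustrative, not taken from riemann.c): the 32-bit convergence flags are widened into a 64-bit temporary so that the comparison yields a mask whose element width matches the double-precision data.

    #include <stddef.h>
    #include <stdint.h>

    /* Widen the 32-bit "stop bits" into a 64-bit scratch array so that a
       vectorizer (or intrinsics code) can build a mask with the same element
       width as the double-precision data it predicates. */
    void update_unconverged(double *restrict q, const double *restrict dq,
                            const int32_t *restrict flag32, int64_t *restrict tmp64,
                            size_t n) {
        for (size_t i = 0; i < n; i++)
            tmp64[i] = flag32[i];    /* the extra copy a mask reinterpret would avoid */
        for (size_t i = 0; i < n; i++)
            if (tmp64[i] == 0)       /* 64-bit compare -> 64-bit-element mask */
                q[i] += dq[i];
    }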


rdolbeau commented on July 18, 2024

For a working example of mask conversion for my previous example (in https://github.com/sifive/rvv-intrinsic-doc/issues/12#issuecomment-621762226), see HydroBench/Hydro@494dcb0, where operations on SVE masks avoid an intermediate array.


Hsiangkai commented on July 18, 2024

Thanks @knightsifive for helping me revise the description. If there are no other objections to this issue, I will close it later.

