Hi These days I'm finding a solution to zip 2 vectors using RVV Intrinsics. Using

Another arithmetic sequence is something like <p dir="au

I am not sure whether it is a good example: <a href="https://godbolt

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

[Question] How to zip 2 vectors using RVV Intrinsics? about rvv-intrinsic-doc HOT 11 CLOSED

GaryCAICHI commented on July 19, 2024

[Question] How to zip 2 vectors using RVV Intrinsics?

from rvv-intrinsic-doc.

Comments (11)

topperc commented on July 19, 2024 3

Another arithmetic sequence is something like

__riscv_vwmaccu_vx(__riscv_vwaddu_vv(a, b), -1U, b)

from rvv-intrinsic-doc.

zhongjuzhe commented on July 19, 2024 2

Right, vrgather can be used to do a zip, and I did include that in my message. But, for a situation where it's followed by a store, I'd expect a segmented store to be at least as good, given that, worst-case, hardware can always itself implement the unit-stride segmented store as vrgather followed by a regular store (but of course a segmented store has the potential to be better too; and of course hardware is also free to do a clearly-suboptimal thing, but this would be pretty sad as then there'd be no situation ever where the segmented load/store instructions are best option).

Of note is that clang compiles that example to a segmented store.

Yeah. Segment store should be good in hardware in most cases. But segment load not sure.
The example I shows is just to clarify how to use vmerge + vrgather simulate ZIP with compile option tunning.

Actually, current RVV compiler no matter GCC or Clang, is always using segment load/store by default:

https://godbolt.org/z/jseq14fMf

from rvv-intrinsic-doc.

dzaima commented on July 19, 2024 1

This is indeed a messy thing in RVV; there's no RVV instruction for doing a zip in-register, so you have to emulate it with a vrgather (or, for ≤32-bit elements, there's also the option of doing some widening arith, something like vwadd_wv(vsll_vx(vzext_vf2(a), width), b), which might be faster or slower depending on hardware).

But of note is that if you need to immediately store the result, a segmented store can be used (e.g. __riscv_vsseg2e32_v_i32m1x2)

from rvv-intrinsic-doc.

dzaima commented on July 19, 2024 1

Another arithmetic sequence is something like

__riscv_vwmaccu_vx(__riscv_vwaddu_vv(a, b), -1U, b)

I'm pretty sure this should use vzext instead of the vwmaccu. Alternatively vadd.vv & vwmul.vx.

No, vwaddu is correct - the idea is that the -1 * b does (2^n-1) * b and then the extra b+ makes it exactly (2^n)*b, and we also add a while at it and thus need all adds to work in the wider space as they're actually all adds, not bitwise magic. (I too thought it was wrong until I threw it into a SAT solver, and it accepted the widening add but not a zext)

from rvv-intrinsic-doc.

zhongjuzhe commented on July 19, 2024

I am not sure whether it is a good example:

https://godbolt.org/z/rMK898G1d

For ARM SVE, use zip to interleave the data.

For RVV, we can use vid + vand + vmseq + vmerge with mask to interleave the data.

For example:

vid v (01234....)
vand.vi v , 3 -> (012012012012....)

Then you can use the AND result to generate 101010101010...mask with:
vmseq.vi v, 0

Then you can use the AND result to generate 010101010...mask with:
vmseq.vi v, 1

Then you can use the AND result to generate 001001001001...mask with:
vmseq.vi v, 2

You can use those masks + vmerge to interleave data if you don't want to segment load/store (Which is expensive instructions in hardware).

from rvv-intrinsic-doc.

dzaima commented on July 19, 2024

@zhongjuzhe An ARM zip does a transform like src1=[a,b,c,d], src2=[e,f,g,h] to zip1=[a,e,b,f], zip2=[c,g,d,h], which changes the indices of elements (e.g. e was src2[0] but at zip1[1] in the result). vmerges alone cannot achieve that. Though, yes, if the rearranging is avoidable/unnecessary, it's beneficial to avoid it.

from rvv-intrinsic-doc.

zhongjuzhe commented on July 19, 2024

@zhongjuzhe An ARM zip does a transform like src1=[a,b,c,d], src2=[e,f,g,h] to zip1=[a,e,b,f], zip2=[c,h,d,h], which changes the indices of elements (e.g. e was y[0] but at zip1[1] in the result). vmerges alone cannot achieve that.

No. I am not saying that vmerge alone achieve zip in ARM SVE.
vmerge is generating the index for vrgather.

Actually, vmerge + vrgather. The example has explicitly shows that:

https://godbolt.org/z/rMK898G1d

from rvv-intrinsic-doc.

zhongjuzhe commented on July 19, 2024

@zhongjuzhe An ARM zip does a transform like src1=[a,b,c,d], src2=[e,f,g,h] to zip1=[a,e,b,f], zip2=[c,g,d,h], which changes the indices of elements (e.g. e was src2[0] but at zip1[1] in the result). vmerges alone cannot achieve that. Though, yes, if the rearranging is avoidable/unnecessary, it's beneficial to avoid it.

Oh. Sorry. I realize that I am showing incorrect example:

https://godbolt.org/z/vnboG1rvz

This is the example using vmerge+vrgather.

vmerge generate the index. vrgather shuffle the vector.

from rvv-intrinsic-doc.

dzaima commented on July 19, 2024

Right, vrgather can be used to do a zip, and I did include that in my message. But, for a situation where it's followed by a store, I'd expect a segmented store to be at least as good, given that, worst-case, hardware can always itself implement the unit-stride segmented store as vrgather followed by a regular store (but of course a segmented store has the potential to be better too; and of course hardware is also free to do a clearly-suboptimal thing, but this would be pretty sad as then there'd be no situation ever where the segmented load/store instructions are best option).

Of note is that clang compiles that example to a segmented store.

from rvv-intrinsic-doc.

GaryCAICHI commented on July 19, 2024

@zhongjuzhe @dzaima Thanks! I'll check that out.

from rvv-intrinsic-doc.

camel-cdr commented on July 19, 2024

Another arithmetic sequence is something like

__riscv_vwmaccu_vx(__riscv_vwaddu_vv(a, b), -1U, b)

~~I'm pretty sure this should use vzext instead of the vwmaccu. Alternatively vadd.vv & vwmul.vx.~~

from rvv-intrinsic-doc.

[Question] How to zip 2 vectors using RVV Intrinsics? about rvv-intrinsic-doc HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent