Comments (11)
'native', pre-defined tuple simply needs to exist (for things like Zvlsseg, etc.) and have accessors so they can be (de)constructed; the ability to access/update with either struct-like or array-like syntax falls under the "good-to-have" from my point-of-view - I could live without that if it's too difficult to implement. If it's doable and you want an opinion between the options, mine is 'whichever is easier to implement' :-) (I'm guessing array).
Used-defined arrays with vector element would be good for some algorithms (e.g. storing locally a copy of data for some block-based algorithms like they have in video processing). Arm might consider that for SVE - I've put a request for it and they didn't say 'no' straight away ;-)
User-defined structure with vector element would be nice, but maybe too difficult to implement (alignment rules are going to be hell...) and not worth the effort. I don't think Arm will do that for SVE.
In your examples, when you write vint32m2x3_t
, we agree that means 6 architected registers ? (m2 means LMUL=2 so 2, the x3 is a tuple of 3, so 6 in all), and that the data- layout will be SLEN-dependent because LMUL>1? Whereas vfloat64m1x3_t
would be just 3 registers with a implementation-independent data layout?
from rvv-intrinsic-doc.
@rdolbeau thanks your feedback, and yes, I am seeking feedback between those options.
Honestly I didn't have idea about the implementation cost amount those options yet. So personally I am not rush to decide which approach other than first option, just collect more feedback at this stage.
In your examples, when you write vint32m2x3_t, we agree that means 6 architected registers ? (m2 means LMUL=2 so 2, the x3 is a tuple of 3, so 6 in all), and that the data- layout will be SLEN-dependent because LMUL>1? Whereas vfloat64m1x3_t would be just 3 registers with a implementation-independent data layout?
The data layout of vector tuple type same as the vector type version but repeat NF
times, e.g. vint32m2x3_t has same data layout as vint32m2_t but repeat 3 times, and with extra register allocation constraint that need consecutive registers.
Conceptually vint32m2x3_t is equivalent to vint32m2_t[3].
So yes, vint32m2x3_t is SLEN-dependent if LMUL > 1, and vfloat64m1x3_t is implementation-independent data layout.
from rvv-intrinsic-doc.
I think it might be slightly better to use structs instead of arrays.
C and C++ don't have values of array type and as such they can't be returned from a function call.
That would be the case of a load from Zvlsseg, in which a tuple of registers vfloat64m1x3_t
could be returned from a call to a hypothetical vlseg_v_f64m1x3
.
vfloat64m1x3_t vlseg3e_v_f64m1x3(const double *base);
Arguments don't need to have structs (we could always flatten them) but for consistency I'd expect the store look like this.
void vsseg3e_v_f64m1x3(const double *base, vfloat64m1x3_t);
In that sense that vfloat64m1x3_t
would behave like a struct with fields (say) v0
, v1
, v2
.
vfloat64m1x3_t vt;
vt.v0 = ...;
... = vt.v2
If we focus only on intrinsic functionality, it is unclear to me we need to allow users defining their own structs or array types with RVV vectors in them.
So my stance now would be like @kito-cheng 1 (a primitive type) above plus as much behaviour of 2 that makes sense for it.
In that sense I'd be inclined to do something like Arm's ACLE for SVE ( https://static.docs.arm.com/100987/0000/acle_sve_100987_0000_00_en.pdf ).
Note that in general (page 13)
Members of unions, structures and classes cannot have sizeless type.
sizeless
is Arm's term for we don't necessarily know the size of the object at compile time.
But then in page 14
Each type
svBASExN_t
is sizeless and contains a sequence ofN
svBASE_t
s. The individual vectors are members with namesv0
,v1
, and so on. For example,svfloat64x4_t
contains four svfloat64_t vectors with the namesv0
,v1
,v2
andv3
.
Nothing seems to prevent making those types svBASExN_t
primitive. Arm's implementation in their Arm Compiler for HPC exposes that detail in the headers via a __sizeless_struct
syntax but this seems an implementation detail to me.
typedef __sizeless_struct { svfloat64_t v0, v1, v2; } svfloat64x3_t;
from rvv-intrinsic-doc.
C and C++ don't have values of array type and as such they can't be returned from a function call.
Good point, sounds like array is not an option.
Each type svBASExN_t is sizeless and contains a sequence of N svBASE_ts. The individual vectors are members with names v0, v1, and so on. For example, svfloat64x4_t contains four svfloat64_t vectors with the names v0, v1, v2 and v3.
This paragraph seems gone in later version, svBASExN_t
can't access via v0
..vn
now.
https://static.docs.arm.com/100987/0000/acle_sve_100987_0000_04_en.pdf
For some implementation detail for SVE GCC 10, they allow struct-style initialization, but other operation must call intrinsic function:
svfloat64_t s64;
svfloat64x3_t s64x3 = {s64, s64, s64}; // Creation.
s64x3 = svcreate3(s64, s64, s64); // Creation.
s64x3 = svset3(s64x3, 1, s64); // Insertion
s64 = svget3(s64x3, 2); // Extraction
from rvv-intrinsic-doc.
This paragraph seems gone in later version,
svBASExN_t
can't access viav0
..vn
now.
https://static.docs.arm.com/100987/0000/acle_sve_100987_0000_04_en.pdf
Thanks @kito-cheng I wasn't aware of the new version.
For some implementation detail for SVE GCC 10, they allow struct-style initialization, but other operation must call intrinsic function:
svfloat64_t s64; svfloat64x3_t s64x3 = {s64, s64, s64}; // Creation. s64x3 = svcreate3(s64, s64, s64); // Creation. s64x3 = svset3(s64x3, 1, s64); // Insertion s64 = svget3(s64x3, 2); // Extraction
This is a reasonable alternative to using structs.
from rvv-intrinsic-doc.
@kito-cheng I test "Primitive style" way with vcreate_*
after objdump, we can find vcreate_ will lead to use more vector register and use extra “vmv” instruct , which will lead to poor performance
#include "riscv_vector.h"
#include <cstdio>
int main() {
float src[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
vfloat32m1_t t0, t1, t2, t3, ret0, ret1;
t0 = vle32_v_f32m1(src, 4);
t1 = vle32_v_f32m1(src + 4, 4);
t2 = vle32_v_f32m1(src + 8, 4);
t3 = vle32_v_f32m1(src + 12, 4);
ret0 = vfadd_vv_f32m1(t0, t1, 4);
ret1 = vfadd_vv_f32m1(t2, t3, 4);
float dst [8] = {0};
vse32_v_f32m1(dst, ret0, 4);
vse32_v_f32m1(dst + 4, ret1, 4);
for (size_t i = 0; i < 8; i++) {
printf("%f ", dst[i]);
}
printf("\n");
return 0;
}
10508: 0087f7d7 vsetvli a5,a5,e32,m1,d1
1050c: 181c addi a5,sp,48
1050e: 0207f207 vle.v v4,(a5)
10512: 1004 addi s1,sp,32
10514: 009c addi a5,sp,64
10516: 0207f087 vle.v v1,(a5)
1051a: 0204f107 vle.v v2,(s1)
1051e: 089c addi a5,sp,80
10520: 0207f187 vle.v v3,(a5)
10524: 02221157 vfadd.vv v2,v2,v4
10528: 021190d7 vfadd.vv v1,v1,v3
1052c: e802 sd zero,16(sp)
1052e: ec02 sd zero,24(sp)
10530: e002 sd zero,0(sp)
10532: e402 sd zero,8(sp)
10534: 081c addi a5,sp,16
10536: 840a mv s0,sp
10538: 6941 lui s2,0x10
1053a: 02017127 vse.v v2,(sp)
1053e: 0207f0a7 vse.v v1,(a5)
10542: 00042787 flw fa5,0(s0) # ffffffffffffd000 <__global_pointer$+0xfffffffffffea800>
10546: 67090513 addi a0,s2,1648 # 10670 <__libc_csu_fini+0x4>
1054a: 420787d3 fcvt.d.s fa5,fa5
1054e: 0411 addi s0,s0,4
10550: e20785d3 fmv.x.d a1,fa5
10554: f6dff0ef jal ra,104c0 <printf@plt>
10558: fe8495e3 bne s1,s0,10542 <main+0x72>
1055c: 4529 li a0,10
1055e: f53ff0ef jal ra,104b0 <putchar@plt>
10562: 70e6 ld ra,120(sp)
10564: 7446 ld s0,112(sp)
10566: 74a6 ld s1,104(sp)
10568: 7906 ld s2,96(sp)
1056a: 4501 li a0,0
1056c: 6109 addi sp
use vcreate_
int main() {
float src[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
//vfloat32m1_t t0, t1, t2, t3, ret0, ret1;
vfloat32m1x4_t src4;// = vcreate_f32m1x4(t0, t1, t2, t3);
vfloat32m1x2_t dst2;// = vcreate_f32m1x2(ret0, ret1);
src4 = vset_f32m1x4(src4, 0, vle32_v_f32m1(src, 4));
src4 = vset_f32m1x4(src4, 1, vle32_v_f32m1(src + 4, 4));
src4 = vset_f32m1x4(src4, 2, vle32_v_f32m1(src + 8, 4));
src4 = vset_f32m1x4(src4, 3, vle32_v_f32m1(src + 12, 4));
dst2 = vset_f32m1x2(dst2, 0, vfadd_vv_f32m1(vget_f32m1x4_f32m1(src4, 0), vget_f32m1x4_f32m1(src4, 1), 4));
dst2 = vset_f32m1x2(dst2, 1, vfadd_vv_f32m1(vget_f32m1x4_f32m1(src4, 2), vget_f32m1x4_f32m1(src4, 3), 4));
float dst [8] = {0};
vse32_v_f32m1(dst, vget_f32m1x2_f32m1(dst2, 0), 4);
vse32_v_f32m1(dst + 4, vget_f32m1x2_f32m1(dst2, 1), 4);
for (size_t i = 0; i < 8; i++) {
printf("%f ", dst[i]);
}
printf("\n");
return 0;
}
10508: 0087f7d7 vsetvli a5,a5,e32,m1,d1
1050c: 1004 addi s1,sp,32
1050e: 0204f087 vle.v v1,(s1)
10512: c2002f73 csrr t5,vl
10516: c2102ff3 csrr t6,vtype
1051a: 00807057 vsetvli zero,zero,e32,m1,d1
1051e: 5e008257 vmv.v.v v4,v1
10522: 81ff7057 vsetvl zero,t5,t6
10526: 32002057 vmv.x.s zero,v0
1052a: 181c addi a5,sp,48
1052c: 0207f087 vle.v v1,(a5)
10530: c2002f73 csrr t5,vl
10534: c2102ff3 csrr t6,vtype
10538: 00807057 vsetvli zero,zero,e32,m1,d1
1053c: 5e0082d7 vmv.v.v v5,v1
10540: 81ff7057 vsetvl zero,t5,t6
10544: 32002057 vmv.x.s zero,v0
10548: 009c addi a5,sp,64
1054a: 0207f087 vle.v v1,(a5)
1054e: c2002f73 csrr t5,vl
10552: c2102ff3 csrr t6,vtype
10556: 00807057 vsetvli zero,zero,e32,m1,d1
1055a: 5e008357 vmv.v.v v6,v1
1055e: 81ff7057 vsetvl zero,t5,t6
10562: 32002057 vmv.x.s zero,v0
10566: 089c addi a5,sp,80
10568: 0207f087 vle.v v1,(a5)
1056c: e802 sd zero,16(sp)
1056e: c2002f73 csrr t5,vl
10572: c2102ff3 csrr t6,vtype
10576: 00807057 vsetvli zero,zero,e32,m1,d1
1057a: 5e0083d7 vmv.v.v v7,v1
1057e: 81ff7057 vsetvl zero,t5,t6
10582: 32002057 vmv.x.s zero,v0
10586: ec02 sd zero,24(sp)
10588: c2002f73 csrr t5,vl
1058c: c2102ff3 csrr t6,vtype
10590: 00807057 vsetvli zero,zero,e32,m1,d1
10594: 5e020457 vmv.v.v v8,v4
10598: 81ff7057 vsetvl zero,t5,t6
1059c: 32002057 vmv.x.s zero,v0
105a0: c2002f73 csrr t5,vl
105a4: c2102ff3 csrr t6,vtype
105a8: 00807057 vsetvli zero,zero,e32,m1,d1
105ac: 5e0284d7 vmv.v.v v9,v5
105b0: 81ff7057 vsetvl zero,t5,t6
105b4: 32002057 vmv.x.s zero,v0
105b8: c2002f73 csrr t5,vl
105bc: c2102ff3 csrr t6,vtype
105c0: 00807057 vsetvli zero,zero,e32,m1,d1
105c4: 5e0300d7 vmv.v.v v1,v6
105c8: 81ff7057 vsetvl zero,t5,t6
105cc: 32002057 vmv.x.s zero,v0
105d0: c2002f73 csrr t5,vl
105d4: c2102ff3 csrr t6,vtype
105d8: 00807057 vsetvli zero,zero,e32,m1,d1
105dc: 5e0382d7 vmv.v.v v5,v7
105e0: 81ff7057 vsetvl zero,t5,t6
105e4: 32002057 vmv.x.s zero,v0
105e8: 081c addi a5,sp,16
105ea: 840a mv s0,sp
105ec: 6941 lui s2,0x10
105ee: 02849257 vfadd.vv v4,v8,v9
105f2: 021290d7 vfadd.vv v1,v1,v5
105f6: e002 sd zero,0(sp)
105f8: e402 sd zero,8(sp)
105fa: c2002f73 csrr t5,vl
105fe: c2102ff3 csrr t6,vtype
10602: 00807057 vsetvli zero,zero,e32,m1,d1
10606: 5e020157 vmv.v.v v2,v4
1060a: 81ff7057 vsetvl zero,t5,t6
1060e: 32002057 vmv.x.s zero,v0
10612: c2002f73 csrr t5,vl
10616: c2102ff3 csrr t6,vtype
1061a: 00807057 vsetvli zero,zero,e32,m1,d1
1061e: 5e0081d7 vmv.v.v v3,v1
10622: 81ff7057 vsetvl zero,t5,t6
10626: 32002057 vmv.x.s zero,v0
1062a: c2002f73 csrr t5,vl
1062e: c2102ff3 csrr t6,vtype
10632: 00807057 vsetvli zero,zero,e32,m1,d1
10636: 5e010257 vmv.v.v v4,v2
1063a: 81ff7057 vsetvl zero,t5,t6
1063e: 32002057 vmv.x.s zero,v0
10642: c2002f73 csrr t5,vl
10646: c2102ff3 csrr t6,vtype
1064a: 00807057 vsetvli zero,zero,e32,m1,d1
1064e: 5e0180d7 vmv.v.v v1,v3
10652: 81ff7057 vsetvl zero,t5,t6
10656: 32002057 vmv.x.s zero,v0
1065a: 02017227 vse.v v4,(sp)
1065e: 0207f0a7 vse.v v1,(a5)
10662: 00042787 flw fa5,0(s0) # ffffffffffffd000 <__global_pointer$+0xfffffffffffea800>
10666: 79090513 addi a0,s2,1936 # 10790 <__libc_csu_fini+0x4>
1066a: 420787d3 fcvt.d.s fa5,fa5
1066e: 0411 addi s0,s0,4
10670: e20785d3 fmv.x.d a1,fa5
10674: e4dff0ef jal ra,104c0 <printf@plt>
from rvv-intrinsic-doc.
sometimes need must use array vector type for easy coding
so now ,what is the best solution for declaration rvv array like
vfloat32m1_t src4[4]
what`s more
vfloat32m1_t src4[4][2]
at the same, do not import use more instruct
I build args:
riscv64-unknown-linux-gnu-g++ -march=rv64gcv0p7 -mabi=lp64d
from rvv-intrinsic-doc.
i test -march=rv64gcv with do not have issue, but use -march=rv64gcv0p7 have "more move instruct" issue
BUT, there are so many board only support v0p7 now, so any possible fix this issue on v0p7
from rvv-intrinsic-doc.
Closing this issue and redirecting to #139. The question essentially boils down to how do we enable sizeless struct in the compiler implementation.
from rvv-intrinsic-doc.
i test -march=rv64gcv with do not have issue, but use -march=rv64gcv0p7 have "more move instruct" issue
BUT, there are so many board only support v0p7 now, so any possible fix this issue on v0p7
That sounds like a T-head toolchain issue rather than intrinsic interface issue, I would suggest you could report that to T-head directly.
from rvv-intrinsic-doc.
riscv-collab/riscv-gnu-toolchain#1106
from rvv-intrinsic-doc.
Related Issues (20)
- Question about the rounding mode of the bit right shift functions HOT 1
- Question about intrinsic data types HOT 4
- [Error] Conversion from vuint to vbool HOT 3
- [Error] GCC Crashes while passing some intrinsics as parameter of another intrinsics HOT 3
- [Question] Combining two vector registers with different LMUL HOT 2
- Question about using `__riscv_vlm_v` HOT 2
- vcreate intrinsics for LMUL > 1 HOT 8
- [Question] How to zip 2 vectors using RVV Intrinsics? HOT 11
- Tuple types that goes across the hardware restriction HOT 1
- [Proposal] Support for C operators on RVV types HOT 12
- vget for fractional register doesn't exist HOT 10
- Constraint of vector types in Zve32* HOT 2
- [Requirement]: The RISC-V RVV vector intrinsic must include support for vector groups in the __riscv_vfredosum function HOT 4
- Type-relative overloads for vreinterpret, vlmul_ext, vlmul_trunc, etc. HOT 1
- How to use a class to wrap or derive from a sizeless vector type HOT 1
- Encode all the effects of vsetvl in the return type, for use in subsequent type deductions HOT 1
- Does `__riscv_v_intrinsic >= 1000000` imply overloaded intrinsics are supported?
- Create bibliography from reference section HOT 3
- Simple questions about inline assembly in vmv.x.s instruction HOT 2
- Asterisks are not subscripts
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rvv-intrinsic-doc.