What kinds of C operator should we support for scalable vector types? What is the sema

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

C operator for scalable vector types,about riscv-non-isa/rvv-intrinsic-doc

Comments (15)

ebahapo commented on July 1, 2024 1

Besides the assignment operator, it would make sense to me to also implement pointer indirection, the arithmetic and comparison operators on VL length. Methinks that users would be more comfortable porting the core of their algorithms if they could retain at least some of their original algebraic syntax.

from rvv-intrinsic-doc.

rdolbeau commented on July 1, 2024

I'm not a huge fan of mixing apples and oranges (except in sponge cake ;-) ).

For fixed-width vector, it's already not obvious - AVX for instance doesn't type the register so the 'SEW' isn't known, so no meaning. For NEON/SVE/V with the more specific (a.k.a. 'better ;-) ') types it can be reasonably meaningful, but then there's the relationship with VL that is an issue in V:

vint32m1_t a, b, c, d;
unsigned long gvl = vsetvl_i32m1(16);
a = (some vector stuff)
b = (more vector stuff);
unsigned long gvl2 = vsetvl_i32m1(8);
c = (and some other vector stuff);
(extra stuff that may or may not be vector)
d = a + b;

So, how wide is 'd' now? To a casual reader, it's less than obvious. Both input should be 16-wide (assuming VLMAX>=16 for SEW=32/LMUL=1 ...).
It can be 8 if we obey the implicit, invisible VL that we set for 'c'.
It can be 16 if we assume the user want to use all its data in 'a' and 'b'.
It can be VLMAX (which may or may not be larger than 16) if we assume C operators are full-vector as they are not intrinsics and therefore might not obey any specific VL.

... or the compiler can just spew an error message and force the user to write d = vadd_vv_i32m1(a, b); or (my favorite) the even more explicit d = vadd_vv_i32_m1_vl(a, b, gvl);

The problem with implicit 'VL' is that it semantically works on instructions, but in many developer's mind, the 'VL' is going to be associated to the data itself (i.e. as an analogy 'N' is the size of the array, and therefore is used as the bound of the loop; not the other way around...).

So for this specific issue - I would say, none.

from rvv-intrinsic-doc.

kito-cheng commented on July 1, 2024

Here is several operator we might need to discuss, this table is organized from wiki page: Operators in C and C++

Arithmetic, comparison, relational, logical, bit-wise and compound assignment operators are controversial, because those are relative to the #8, how to pass the VL.

So I would like to discuss other part, operators not list above:

Assume a and b both are vint32m1_t, and n is int :

Operator	Synatx
Assignment	a = b, a = n
Subscript	a[n]
Address-of	&a
Ternary conditional	n ? a : b, a > b ? a : b
sizeof	sizeof (a)
Alignof	alignof (a)
typeid	typeid (a)
Conversion, static_cast	type (a), (type)a, static_cast(a)
new	new a
delete	delete a

Pointer type, assume p are vint32m1_t *:

Operator	Synatx
Subscript	p[n]
Deference	*p
Increment/Decrement	p++, p--, ++p, --p
Pointer Arithmetic	p + 1, p + n
reinterpret_cast<>	reinterpret_cast<vint64m1_t>(p)

The point we need discuss is, should we support those operator with scalable vector type? if so what's the semantics?

from rvv-intrinsic-doc.

kito-cheng commented on July 1, 2024

SiFive's implementation:

Assume a and b both are vint32m1_t, and n is int :

Operator	Synatx	Supported in SiFive's implementation	Semantics
Assignment	a = b, a = n	Y	Same as vcopy, `VL`-aware
Subscript	a[n]	N
Address-of	&a	Y
Ternary conditional	n ? a : b, a > b ? a : b	Y *(partial support)
sizeof	sizeof (a)	N
Alignof	alignof (a)	Y
typeid	typeid (a)	Y
Conversion, static_cast	type (a), (type)a, static_cast(a)	N
new	new a	N	Due to sizeof not supported
delete	delete a	N	Due to sizeof not supported

Pointer type, assume p are vint32m1_t *:

Operator	Synatx	Supported in SiFive's implementation	Semantics
Subscript	p[n]	N
Deference	*p	Y	Same as unit stride load/store, `VL`-aware
Increment/Decrement	p++, p--, ++p, --p	N
Pointer Arithmetic	p + 1, p + n	N
reinterpret_cast<>	reinterpret_cast<vint64m1_t>(p)	Y

from rvv-intrinsic-doc.

rofirrim commented on July 1, 2024

I believe that if we choose to implement the GCC extension for vector types for these types, perhaps the more reasonable thing to do here is to give them VLMAX semantics. Just to agree with existing practice of using vectors like "big scalars".

However I see how this introduces confusion in the context of implicit vector length because one could argue that the VL is also implicit in C builtin operations. Also I think lack of control over VLMAX by the user makes these extensions not very useful in general (how much are we going to load/store?).

So I would be inclined, that for now, C builtin operations not be extended to rvv vectors, to avoid introducing legacy.

Assignment is a bit of a special. Seems too fundamental to disallow (otherwise nothing will work), so I'd expect

va = vb;

to copy the whole vector (aligned with my expectation that these "builtin" operations use VLMAX).

This is important if we want to preserve the as-if behaviour that an assignment allows the user to "name" a value (and this is why we can replace usages of va with vb in the compiler). If we only copy up to VL I think we might be breaking this assumption.

Does this make sense?

from rvv-intrinsic-doc.

nick-knight commented on July 1, 2024

@rofirrim I agree we should support operator= and that it should copy the entire underlying register group (up to VLMAX). This implies uniform semantics in cases of copy-elision like return-value optimization.

vmv.vv is still available if the user wants length-VL and masked copies.

We should all be aware of the possibility of implementation-defined behavior for inactive and tail elements:
https://github.com/riscv/riscv-v-spec/blob/master/v-undisturbed-versus-zeroing.adoc
(This is still under discussion by the V-ext task group.)

from rvv-intrinsic-doc.

kito-cheng commented on July 1, 2024

@ebahapo :
It's little counter-intuitive if assignment and arithmetic has different behavior for VL, that mean a = b + c; d = b + c and a = b + c; d = a is different.

from rvv-intrinsic-doc.

kito-cheng commented on July 1, 2024

I was thought it might kill the performance if we define assignment/operator= always operate with VLMAX, but it should able to optimized by compiler.

Extend the example in my last comment:

a = vadd(b, c, avl);
d = vadd(b, c, avl); // It could optimized by CSE
vstore (x, a, avl); // Store a to pointer x
vstore (x, d, avl); // Store d to pointer x

a = vadd(b, c, avl);
d = vcopy(a, avl);
vstore (x, a, avl); // Store a to pointer x
vstore (x, d, avl); // Store d to pointer x

a = vadd(b, c, avl);
d = a;
vstore (x, a, avl); // Store a to pointer x
vstore (x, d, avl); // Store d to pointer x

For those 3 case vcopy, recompute and operator= should get same performance after optimization.

d = a should able to optimized out or optimized into copy avl elements only, since the reset part of d are unused.

from rvv-intrinsic-doc.

jan-wassenberg commented on July 1, 2024

Joining a bit late :) I'm working on portable wrappers for intrinsics at Google and heard complaints about verbose code. Where possible, operators are very helpful for readability.

Somewhat related: our goal is to reduce the large cost of implementing and porting by having the same code compile for multiple platforms, including RVV. If the code requests VL-aware or even masked load/store unnecessarily, that would be expensive on other platforms.

The same code could be efficient everywhere if we have the main loop using VLMAX, and a second cleanup 'loop' using masks or avl.

Thus it would be nice to have VLMAX operators for the first loop, especially if the app does not need a cleanup because it is able to pad inputs/outputs to VLMAX. Does that make sense?

(BTW an ARM engineer seemed receptive to such operators for SVE ACLE.)

from rvv-intrinsic-doc.

nick-knight commented on July 1, 2024

Hi @jan-wassenberg, sorry, I don't completely follow your proposal. Please let me clarify where I am confused.

This thread concerned extending C operators to the new RVV types, which exposes a kind of impedance mismatch between the underlying assembly language and the C abstraction. There was consensus that the assignment operator would operate up to VLMAX, but it was unclear whether these semantics should also apply to the other operators.

For example, currently we can express N-by-N matrix multiply (C += A*B) in RVV intrinsics something like this:

for (int i = 0; i < N; ++i)
  for (int j = 0; vl = vsetvl_f32m1(N - j); j += vl)
    for (int k = 0; k < N; ++k) {
      vfloat32m1_t B_vec = vle32_v_f32m1(&B[k][j]);
      vfloat32m1_t C_vec = vle32_v_f32m1(&C[i][j]);
      C_vec = vfmacc_vf_f32m1(C_vec, A[i][k], B_vec);
      vse32_v_f32m1(&C[i][j], C_vec);
    }

It's tempting to extend the += and * operators to replace the vfmacc_vf_f32m1 intrinsic by:

C_vec += A[i][k] * B_vec;

However, like you mention, if these operators are defined to work on VLMAX elements, then we will need to add cleanup code to handle the fringe case, which will use the vfmacc_vf_f32m1 intrinsic anyway. So this alternative appears to me to be strictly more verbose, and also likely less performant because of increased (static and dynamic) code size.

EDIT: to be clear, I'm not opposed to the VLMAX semantics --- it doesn't remove any functionality --- I'm just concerned that it doesn't yield a net decrease on the intrinsics programmer's cognitive burden.

from rvv-intrinsic-doc.

jan-wassenberg commented on July 1, 2024

Hi @knightsifive , thanks for looking into this. For c += a*b it makes sense to use FMA despite the increased verbosity, but c += a is enough to show the operator.

Let's imagine we take your code, which looks good for RVV, and replace each intrinsic with a wrapper function, then re-implement the wrappers using AVX2. Because the loop relies on VL, each iteration would have to check whether vl==vlmax, because AVX2 has neither VL nor masks and fairly expensive masked load/store. That is wasteful because only the last iteration actually needs it.

Now if we have a VLMAX first loop followed by cleanup, I agree with you that it is more verbose and also larger code. In the AVX2 case, we can still expect a performance benefit. (ICC also generates two such loops even for AVX3.)

In my experience with the JPEG XL image codec, we are often able to arrange for N to be a multiple of VLMAX, or at least make it safe to pretend it is by padding all inputs/outputs. Then we do not need a cleanup loop, and it would be nice if the first loop is able to use the shorter and more readable operators.

Why talk about AVX2 here? I imagine not all software is going to be rewritten specifically for RVV in the VL style. Projects such as OpenCV/JPEG XL already have such wrappers and would hope to write (performance-portable) code only once, not per platform. Is that something we would want to enable for faster adoption and porting?

I am actually not sure the above use case cares whether += uses VL or VLMAX, but I do hope that operators would be included/allowed for readability.

from rvv-intrinsic-doc.

jan-wassenberg commented on July 1, 2024

@kito-cheng now that we are moving to explicit VL, it seems a good time to resume this discussion.
In operator+= etc functions, we do not have a user-provided AVL argument, so they would now use VLMAX, right?

Can the compiler define operator+= builtin functions? Unfortunately overloading them in normal C++ code is not possible because the arguments (vuint*) are built-in types, not user-defined.

from rvv-intrinsic-doc.

kito-cheng commented on July 1, 2024

Apologize for very late reply, I would prefer block most operation at this first and then relax later if needed.

The list should be allowed in first version in my mind is:

Assignment operator, with VLMAX semantic.
typeid operator.
Address-of operator.

I think for those operators should be supported when size is known, and maybe we should only supported on VLS type (e.g. int32x8_t), but this part we don't have well discussion yet, although upstream LLVM has some initial support there.

from rvv-intrinsic-doc.

jan-wassenberg commented on July 1, 2024

@kito-cheng
For user ergonomics, do you agree operators are much easier to read/understand at a glance? Here are some real-world examples:

const V y1 = (x2 / (sqrt_x2_plus_1 + kOne)) + abs_x;
const V y1 = Add(Div(x2, Add(sqrt_x2_plus_1, kOne)), abs_x);

const V z = (y + kTwo) / (y + kOne) * (y * kHalf);
const V z = Mul(Div(Add(y, kTwo), Add(y, kOne)), Mul(y, kHalf));

"relax later if needed" would have unfortunate consequences: users of a generic interface (e.g. Highway) would have to use Div etc now instead of operators, and once written I doubt code would be changed back to operators (some risk of introducing mistakes).

Is it infeasible to provide a builtin that behaves as if the following were allowed?
vfloat32m8_t operator/(vfloat32m8_t a, vfloat32m8_t b) { return vfdiv_vf_f32m1(a, b, MaxVL()); }

from rvv-intrinsic-doc.

sh1boot commented on July 1, 2024

FWIW, Clang (but not GCC) lets you pass its own built-in vector types to RVV intrinsics: https://godbolt.org/z/vTYsPWsMT

That is:

#define VECTOR_BITS 256  // use __riscv_v_min_vlen if you don't care (or __riscv_v_fixed_vlen if it is defined)
using fixed_vuint8m1_t = uint8_t __attribute__((vector_size(VECTOR_BITS / 8)));
fixed_vuint8m1_t add(fixed_vuint8m1_t a, fixed_vuint8m1_t b) {
    return __riscv_vadd(a, b, 32);
}

works perfectly well provided the type is no larger than an m8 register of the minimum vector length of the compilation target. It also works for other architectures (but still not with GCC).

That example isn't interesting because addition is already supported by the compiler, but it does mean you can switch freely to and from intrinsics where necessary. Consequently, you can introduce VL by switching to intrinsics.

So in a sense C operator support is already halfway there. The only missing part is the ability to create an unsized vector with __attribute__((vector_size())) which would always use VLMAX.

from rvv-intrinsic-doc.

C operator for scalable vector types about rvv-intrinsic-doc HOT 15 OPEN

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent