
text-format's Issues

rvv1

Proposal to introduce register groups, SLEN and fractional SLEN into the simple register fractional LMUL model.

What has not changed:
Fractional LMUL will

  • still be denoted as 1/2, 1/4, 1/8,
  • continue to reduce the usable VLEN to match its denotation, and
  • conceptually remain adjacent to LMUL=1 at 1/2.

New, but fundamentally the same as for LMUL>1

Fractional groups (striped groups of fractional registers).
A striped group of fractional registers (a fractional group) parallels LMUL>1 registers, in that:

  • the number of fractional registers in the group is a power of 2,
  • the group is aligned on a multiple of the group size,
  • all fractional registers in the group have the same bit length,
  • elements are filled from 0 to vl in round robin, beginning at the lowest register number, and
  • filling proceeds to the next register once a striping width worth of bits has been filled (see the sketch below).
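Below is a minimal C sketch of that round-robin fill rule, purely as an illustration: the parameter names and the values printed in main() (SEW, stripe width, registers per group) are assumptions for the example, not part of the proposal.

```c
/* Illustrative only: map an element index to (register within group, bit offset)
 * under the striped round-robin fill described above. */
#include <stdio.h>

typedef struct { unsigned reg; unsigned bit; } elem_loc;

static elem_loc stripe_map(unsigned elem, unsigned sew_bits,
                           unsigned stripe_bits, unsigned regs_in_group)
{
    unsigned per_stripe = stripe_bits / sew_bits;      /* elements per stripe    */
    unsigned stripe     = elem / per_stripe;           /* global stripe number   */
    elem_loc loc;
    loc.reg = stripe % regs_in_group;                  /* round robin over regs  */
    loc.bit = (stripe / regs_in_group) * stripe_bits   /* completed passes       */
            + (elem % per_stripe) * sew_bits;          /* position inside stripe */
    return loc;
}

int main(void)
{
    for (unsigned e = 0; e < 12; e++) {
        elem_loc l = stripe_map(e, 8 /*SEW*/, 32 /*stripe bits*/, 4 /*regs*/);
        printf("elem %2u -> reg %u, bit %3u\n", e, l.reg, l.bit);
    }
    return 0;
}
```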

The rest of this proposal talks about what has changed (even if only subtly in some cases).

Some convenient definitions:

Define “SEW-instructions” as those in which vs1, vs2 and vd all match the SEW from vtype.
To clarify, they are not:
widening or narrowing,
whole register moves, or
mask-register-only operations.

Introduce register group characterization:
This proposal allows fractional groups to originate at multiple levels, with their width determined by that level.
For example, fractional groups with a physical width of VLEN/8 originate at LMUL=1/8.
A shorthand to identify such groups will make the narrative much more readable.

Consider LMUL>=1 register groups.
They all start in LMUL=1 via a widening operation.
So 1 should be in their designation even though it is superfluous without fractional LMUL.

Consider an n:m format where VLEN/n is the vector length and m is the number of base-arch registers in the group.
Then we designate

  • LMUL=2 addressable registers as 1:2
  • LMUL=4 addressable registers as 1:4
  • LMUL=8 addressable registers as 1:8 and
  • LMUL=1 addressable registers are 1:1 (for completeness)

In the previously presented simple mappings of fractional LMUL, there was a presumptive understanding that widening operations sourcing LMUL=1/n registers widen to LMUL=2*(1/n) registers.

This would be represented by a table such as this:

LMUL 1/8 1/4 1/2 1 2 4 8
------------
group type
1:8 x a=0,8,16,24
1:4 x a=0,4,8,12 ...
1:2 x a=0,2,4,6, ...
1:1 x a=all
2:1 x a=all
4:1 x a=all
8:1 a=all

a = Accessible at LMUL level by SEW instructions
x = Created by widening instructions at LMUL level
(Narrowing instructions also source from this LMUL)

Note: 16:1 is intentionally omitted from the diagram although it works the same.

This proposal acknowledges that such a simplistic approach can be inefficient for many reasonable implementations.
It also acknowledges that some mandatory RVV instructions are comparably inefficient: vgather, slideup/down, and others that similarly have to operate across lanes.
And further that striped register support is already present in the base design.

So this proposal introduces striped groups beginning with table:

LMUL 1/16 1/8 1/4 1/2 1 2 4 8
------
group type
1:8 x a= 0,8, 16,24
1:4 x a= 0,4, 8,12 ...
1:2 x a=0,2, 4,6, ...
1:1 a=all
16:8 x a= 0,8, 16,24
16:4 x a= 0,4, 8,12 ...
16:2 x a=0, 2,4,6, ...
16:1 a= all
8:1 a= odd
4:1 a= odd
2:1 a= odd
LMUL 1/16 1/8 1/4 1/2 1 2 4 8

This is the same legend as above and will be assumed for all further diagrams:
a = Accessible at LMUL level by SEW instructions
x = Created by widening instructions at LMUL level
(Narrowing instructions also source from this LMUL)

Note: 8:1, 4:1 and 2:1 were added to the table though technically not required to illustrate fractional groups. More below.

This has two undesirable features, both of which present trade-offs.

LMUL now determines both the level's fractional size and the fractional group's size:

  • Registers used for fractional groups are not available for fractional registers (halved at the first level).
  • There was no need to provide such addressing for other registers in LMUL>=1, as all registers were the same physical length.

The smallest fractional register size is used as the base for LMUL grouping:

  • this is necessary to achieve 8 levels of grouping;
  • however, the usefulness of the smallest vector register is generally limited to small element sizes.

Although it is possible to provide an even wider LMUL field or additional fields in vtype to facilitate more states to address these concerns, the approach here will be to enlist the register numbers to provide context information.

First, note that at any level the register numbers used by register groups are specific. In LMUL>=2 the only operands available to any operation (including widening and narrowing) were register groups. Widening to 1:8 can only be performed with 1:4 inputs, and conversely for narrowing. Widening to 16:8 must use 16:4 inputs to parallel that behaviour. Taking both these observations together, the comparable-behaviour constraint can be incorporated into the instruction decoding using register addresses.

This allows widening to originate at other levels concurrently, as diagrammed here:

LMUL 1/16 1/8 1/4 1/2 1 2 ...
------
group type
1:2 x a=0,2, 4,6, ...
1:1 a=all
16:8 x a= 0,8, 16,24
16:4 x a= 0,4, 8,12 ...
16:2 x a=0, 2,4,6, ...
16:1 a=all
8:8 x
8:4 x a= 4,12, 18,20, ...
8:2 x a= 2,6, 10,14, ...
8:1 a= odd
4:4 x
4:2 x a= 2,6, 10,14, ...
4:1 a= odd
2:2 x
2:1 a= odd
LMUL 1/16 1/8 1/4 1/2 1 2

Note: I dropped LMUL=4 and 8 from the illustration only.
Note: 16:8 is addressable (from LMUL=1/2), but 8:8, 4:4 and 2:2 are not addressable from LMUL=1.
They are, however, addressable by widening and narrowing instructions from LMUL=1/2.

To be continued ......

Relax LMUL coupling of register group size and striping structure

Relax the LMUL coupling of register group size and striping structure,
and the implications that arise therefrom.

I have been working on a full proposal based on avoiding the limitations of the current LMUL design.

I still plan to present that here as I flesh out some details, but as one of the salient characteristics is applicable even to the current LMUL design, I thought I would propose the ideas separately and then consolidate them into the full proposal.

Current LMUL couples the striping structure unnecessarily to register group size.
Instead, make the potential register length depend upon the register address and limit it via vl.

Thought Experiment:
Thought Experiment:
Although LMUL=8 necessitates a register group of at least 8 base-arch registers, there is no inherent reason that only 8 registers can participate. (We can certainly envision a 64-bit RVV encoding allowing 2 sets of LMUL=8 structures in 16 base-arch registers. These could be concatenated base-arch registers transparently providing one register group to the 32-bit encoding, with similarly associated double-length mask registers.) [To make 16-register groups work, mask registers would also need to double in length when vl * SEW exceeds 8 * VLEN. But this is just a thought experiment to help understand the concept. Register groups of 16 registers are not proposed at this time.]

Moving on to an immediately applicable scenario: even with the current 32-bit design limitation, a double-length LMUL=4 result can coexist within the current maximum register groups.

LMUL=4 groups of double length can analogously exist in the current LMUL=8 register group space, at V0, V8, V16 and V24. The instruction's vl will determine whether the following group of 4 registers is used, or whether only one register group of 4 base-arch registers is used (the current behaviour).

However, a widening operator cannot create a double-length LMUL=4 register group without a double-length LMUL=2 register group; analogously, LMUL=2 register groups can double or quadruple into groups previously designated for LMUL=4 and LMUL=8 register groups.

Similarly down the chain, so that at LMUL=1 (unaffected by SLEN) v0, v8, v16 and v24 can have vl up to 8 * VLEN, using up to 8 consecutive base-arch registers, or a minimum of 1 base-arch register, depending upon vl. (This is the “effect” if tail undisturbed is used.)
Also at LMUL=1, vector addresses on a multiple of 4 (V0, V4, V8, V12, V16, V20, V24, obviously including the multiples of 8) can have vl up to 4 * VLEN, using up to 4 consecutive base-arch registers, or a minimum of 1 base-arch register, depending upon vl.
Similarly, all even vector register addresses allow multiples of 2.
And finally, all odd vector register addresses allow only one base-arch register to participate.

A consideration that should be mentioned here is that the current LMUL address limitation for each level can be enforced as is.

  1. Mask format is sufficient for address specific maximum register length
    even though mask format differs for striping and interleaving levels

There is sufficient room in the mask for address dependent register grouping.
However, the current LMUL mapping is not generally applicable as element layout within physical registers varies with LMUL level.
A possible mapping for LMUL=1 of a register at V8 follows:

VLEN=32b
. Byte              3    2    1    0
LMUL=1, SEW=8b
phys reg 0          3    2    1    0   Element
.                 [24] [16] [08] [00]  Mask bit position in decimal
phys reg 1          7    6    5    4   Element
.                 [25] [17] [09] [01]  Mask bit position in decimal
phys reg 2         11   10    9    8   Element
.                 [26] [18] [10] [02]  Mask bit position in decimal
phys reg 3         15   14   13   12   Element
.                 [27] [19] [11] [03]  Mask bit position in decimal
phys reg 4         19   18   17   16   Element
.                 [28] [20] [12] [04]  Mask bit position in decimal
and so on ...
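The following C sketch reproduces the mapping in the example table. The formula (interleaving by the maximum group size of 8) is inferred from the table above and is only illustrative, not a normative layout.

```c
/* Illustrative only: mask-bit position for an element, inferred from the
 * example table (VLEN=32, SEW=8, up to 8 physical registers per group). */
#include <stdio.h>

static unsigned mask_bit(unsigned elem, unsigned vlen_bits,
                         unsigned sew_bits, unsigned max_regs)
{
    unsigned per_reg = vlen_bits / sew_bits;       /* elements per physical reg */
    unsigned reg     = elem / per_reg;             /* which physical register   */
    unsigned within  = elem % per_reg;             /* element within that reg   */
    return within * max_regs + reg;                /* interleaved mask position */
}

int main(void)
{
    for (unsigned e = 0; e < 20; e++)
        printf("element %2u -> mask bit %2u\n", e, mask_bit(e, 32, 8, 8));
    return 0;
}
```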

It should be sufficiently clear that alternatives are possible with varying trade-offs, and that the end design can be nearly as effective as the current LMUL method.
I have not determined the optimal mapping, applicable across all LMUL and interleave levels.
Indeed, it may not be necessary, and so I have put effort into other considerations.
Which leads me to this:

  2. How many widening / narrowing levels do we need?

Having decoupled effective register length from widening / narrowing levels (and corresponding structures) how many widening / narrowing levels are necessary? Can software approaches suffice to maintain efficiency but reduce hardware complexity?

a) A rationale for 4 levels (as in standard LMUL) is substantially weakened. Low-end machines especially benefit from an 8-times increase in effective register length. However, with that benefit divorced from register group sizing, LMUL>2 is no longer required for minimal machines to have such a benefit.

b) Mask structure may need to vary between “LMUL levels” for efficient operation. Even so,
these invariants relax the need to identify the LMUL level:
i) All unmasked single-width integer and float arithmetic can operate at any LMUL>=1 level and provide identical results.
ii) For an unchanged LMUL, all sequences of mask compares and mask-to-mask ops on them (thus excluding vfirst.m and the like), intermixed with single-width integer/float operations, including operations under mask, can run at any LMUL>=1 level and provide identical results. These include add-with-carry and subtract-with-borrow.

As a result, dependence upon element layout in physical registers and mask layout is limited (I believe solely, but I may have missed a category) to

  • load/store
  • widening/narrowing ops
  • element ordinality, including vfirst.m and the like
  • element rearrangement (a subset of ordinality), including slideup/down and vgather

If these were limited to hardware support for only one level on either side of LMUL=1, that is LMUL=2 and LMUL=1/2, then either:

  • software could address the less common iteration of widening/narrowing, or
  • hardware could support variants of the layout sensitive instructions, or
  • a new instruction to transform a register format in the current LMUL (2 or 1/2) to LMUL=1.

The last option of a new transforming instruction would allow multiple and iterative widening operations, with intervening and intermixed type (ii) operations, followed by a much smaller number of executions (1 or 2, let's say) of transforms on the final aggregated result (the most typical scenario) to allow a store at LMUL=2.

Further, if LMUL=1/2 is supported, starting at LMUL=1/2 gives 2 levels of widening without any explicit leveling transforms.

  3. In the light of the above, vstart may require a looser meaning.

It may be best that it is defined as a token that, for the given machine, determines the location of unprocessed elements for the sole purpose of restart. The current description, that it determines a consecutive number of elements from element zero, is problematic when ordinality is variable. It all depends upon the acceptable performance hit for most (all and fringe) architectures and how the issues in 2) and 3) are resolved.

More work to come

In that the above is applicable to current LMUL considerations I am “publishing” early.

There are various considerations that I do not address here but within the full proposal.

  • mixed register input (can arise when the max designated length for a register address conflicts with the vl setting)

  • Judicious use of undisturbed and agnostic and how they are applied to subsequent unused register groups with address dependent group lengths.
  • Further agnostic implications to interleave.
  • And interleave itself which will also be left to the full proposal.

add Branch Bit Set Instruction to Fast Interrupt Repertoire

Critical code sections in interrupt routines operate under various
constraints to provide optimal response and throughput:

  • fewest saved registers (to stack and xscratches)
  • fewest active registers (to reduce preemption overhead)
  • fewest instructions

Testing bits in such an environment is expensive in register usage and
instructions.

The bit-manipulation TG's rotate could be used as a non-destructive approach: move the desired bit to
the sign position and branch relative to zero.
But it requires two instructions to return the bits to normal, which may require disabling interrupts for the duration.

A single Branch Bit Set avoids these overheads.

The suggested formulation uses a brownfield minor op in the branch opcode.

  • a positive 8-bit offset from the pc, in 16-bit units, held in bits 8-11 and 25-28;
    this would provide a +512 PC-relative forward branch range
  • rs1 (as 5 bits, any of the 31 registers) to be tested, and
  • a 5-bit immediate in the rs2 field to select the bit to test
    (zero selects the least significant bit, counting consecutively from there)

This would use the same decoding as the branch instructions

  • opcode=1100011,
    
  • with funct3=011
    
  • rs2 the bit selector immediate field
    
  • and remaining bits (31-29 and 7) are zero.
    

In RV32 the value 31 is redundant with the existing signed branch against zero, but it is more useful in RV64.
For RV64 we could consider a 6-bit variant, perhaps incorporating bit 7?
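As a hedged illustration only, the sketch below assembles the proposed 32-bit encoding and models the branch condition in C. The exact placement of the offset bits (offset[3:0] in bits 11:8, offset[7:4] in bits 28:25, mirroring the standard B-type immediate) is my reading of the bullets above, not a settled encoding.

```c
/* Hypothetical encoder and behavioural model for the proposed branch-bit-set. */
#include <stdint.h>
#include <stdio.h>

static uint32_t encode_bbs(unsigned rs1, unsigned bit_sel, unsigned offset16)
{
    /* offset16: forward offset in 16-bit units, 0..255 (about +512 bytes max). */
    uint32_t insn = 0x63;                      /* opcode = 1100011 (BRANCH)     */
    insn |= 0x3u << 12;                        /* funct3 = 011                  */
    insn |= (rs1 & 0x1f) << 15;                /* register holding bits to test */
    insn |= (bit_sel & 0x1f) << 20;            /* rs2 field = bit selector imm  */
    insn |= (offset16 & 0xf) << 8;             /* offset[3:0] -> bits 11:8      */
    insn |= ((offset16 >> 4) & 0xf) << 25;     /* offset[7:4] -> bits 28:25     */
    return insn;                               /* bits 31:29 and 7 stay zero    */
}

/* Branch is taken iff the selected bit of the tested register's value is set. */
static int bbs_taken(uint64_t rs1_value, unsigned bit_sel)
{
    return (rs1_value >> bit_sel) & 1;
}

int main(void)
{
    printf("encoding: 0x%08x\n", encode_bbs(10 /*a0*/, 3, 24 /*+48 bytes*/));
    printf("taken?    %d\n", bbs_taken(0x08, 3));
    return 0;
}
```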

8 bits of 16-bit-unit offset is sufficient for most fast interrupt handler code, which is, by nature, compact.
All 31 registers may not be needed. sp and ra (x2 and x1) are heavily used, but so is the compressed register set (x8 through x15), as the use of compressed instructions helps with code locality related to cache size and cache lines.

The lower bits are in general more valuable to test, as:

  1. the low bit in code addresses can be used as a flag, ignored by xret and jumps (jalr being a possible exception).
  2. most CSRs map significant bits to low bits, partly to make the 5-bit set/clear immediate instructions useful.
  3. the most significant bit is already directly testable with the signed branch.
  4. rarely are more than a few bits needed in interrupt handlers, as the complexity of the code is limited;
    the number of branch variations goes up by a power of two for each additional bit.

Because this encoding stipulates that the remaining bits are zeroed, these are effectively reserved,
potentially for multi-bit test variants.

There is a tipping point at which vectoring on the "embedded" value is more effective than single-bit checks
(e.g. branch tables).

Thus, testing sets of bits may be more valuable than being able to test them all.
However, it is valuable to have better decoding than branching on the low bits ANDed with an immediate test field;
i.e. testing that any of the bits are set is a low-frequency check for control transfer,
and testing that all but one are set is similarly of low value.

CALLVEC FAQ (frequently asked questions)

CALLVEC frequently asked questions

In that CALLVEC is new, a question asked even once is relatively frequent. So here goes with some asked so far:

Q1) Why has CALLVEC been called "The Swiss Army Knife Of Instructions"?

A1) Because I thought it was an apt description.

Q2) Why was the instruction called CALLVEC?

A2) Because TSAKOI was too hard for me to pronounce.
And because its primary use is expected to be a VECtored CALL;
VECCALL was ignored as ambiguous; too hard to spell and pronounce.
The vector calling allows for the Application Execution Environment to tailor the instruction to specific hardware.

Q3) WHAT DOES IT LOOK LIKE (instruction format).

A3) The final instruction format has not been determined yet,
but its logical structure is < OPCODE>.

Q4) Where is CALLVEC used?
A4) CALLVEC should always be used in a call context.
The instruction always writes t0 [ this may be refined to another

Q) HOW DOES IT WORK (What does it do)
A)
Minimally, the hardware-implemented CALLVEC:
a) stores the instruction info into t0 [when emulated it stores the instruction word]
b) formats the instruction encoding into a 17-bit unsigned immediate in t0
c) conditionally stores the return address into ra [depending upon the immediate value]
d) if ra was stored it conditionally ra

Q: Is that all?
A: No. Standard hardware optimizations can be made that, because they are in "library routines", are transparent to the application code. Further, some of the more sophisticated hardware optimizations can be made transparent to typical level 1 and 2 code.

Q: What are these levels you mention, 1, 2 and 3?
A:
Transition to Level 3 code is the end of the CALLVEC [direct] influence.
To user code, Level 3 is the standard code environment that exists without CALLVEC implemented.
[either as hardware or emulation].

Level 2 code runs in memory space designated as such. The AEE is aware of this, even in emulation.
Code run in this mode can be standard RVI code and will work across all implementations.
Specialized code can also work with new opcodes and enhanced facilities, but will only work on similarly configured machines.

Level 1 code, by design, will have limitations imposed on it, and although standard RVI code will run, some implementations will not allow certain behaviour. Just as correct LR/SC use is constrained, so level 1 code will be on any given implementation.

Q: What and where are these hardware assist modes?
A: Anywhere in executable memory can be designated level 1 or level 2 CALLVEC code.
Code that runs in either of these two levels runs under whatever hardware assists are available/enabled.
The hardware assists can be
The location from which the code is executed determines the CALLVEC acceleration features, and therefore any code executing in those areas is hardware assisted.
However, it is likely that implementations will use various methods to limit the access to these code areas, specifically some may limit such hardware acceleration to only be accessed from user-mode by CALLVEC. System modes will likely not be so constrained.

Q: What are these hardware assists?
A: The opportunity for such assists is open ended, CALLVEC is exceedingly extensible.
However, the envisioned initial assists are:
a) cache-line prefetch for
b) branch optimizations ; drop through code from level 1 to 2, and level 2 to 3
c) escape level 1 or 2 by branching to fixed location [typically level1 or level 2 zero entry]
c) cache and interrupt management optimizations ; level 1 code jamming.
d) special instructions for CALLVEC handling; level escape opcode rather than branch and drop through or branch zero entry.
e) special instructions for CALLVEC meta-execution; allow specifying register dynamically with prefix-instruction.

Q: what is [for each of a) through e) name hardware assist.
A: for each named hardware assist explain.

Q: Are there limitations to the instructions used in CALLVEC?
A: Yes, in levels 1 and 2. But what is typical

Q: Is CALLVEC itself interruptible?
A: Yes, in level 1 and level 2 execution.
Q: Is CALLVEC callable within CALLVEC?
A: Yes. But it may have limitations.

a) Level 1 is anticipated to be jammed into the instruction queue to be processed while other background activities occur, such as cache-line loads of level 2 and 3. So although saving ra and t0 to other temps is possible, it makes those temps unavailable for the next CALLVEC, and the nested CALLVEC must restore ra and t0 before return to level 1 so that interrupts will correctly configure restart; the constraints on level 1 are therefore formidable in a short code sequence.

Q: How performant is CALLVEC when not emulated?
A: When fully assisted, CALLVEC will be as performant as in-line code.

#552

Some of the related issues were discussed in #418 Introduce vlmt (vl multiplicative threshold) / VLMT Vector LiMiT

It was a proposal to remove vlmul from vtype including non-fractional.

Some of the concerns are now obsolete:

riscv/riscv-v-spec#418 (comment)
3. While the data layout does not now depend on LMUL, the mask register layout does.
Now that load and store explicitly designate element width the prime motivator for fractional lmul is absent.

riscv/riscv-v-spec#418 (comment)
Additional code that was required to emulate fractional lmul is unnecessary in many use cases.
And where emulation is required, the burden is not substantial compared to fractional lmul use.

Of these, some are specific to non-fractional LMUL.

riscv/riscv-v-spec#418 (comment)

2. A certain class of code errors would not be caught.
4. The dynamic error checking logic becomes somewhat more complex.
5. The benefit of dropping LMUL in vtype is small.
6. A later ILEN=64 encoding would not need this.

Things of which I am still unsure:
1. Some current instructions use VLMAX=LMUL*VLEN/SEW to describe the input operand.

From 3.3.2

LMUL can also be a fractional value, reducing the number of bits used in a vector register. LMUL can have fractional values 1/2, 1/4, 1/8.

Fractional LMUL is used to increase the number of usable architectural registers when operating on mixed-width values, by not requiring that larger-width vectors occupy multiple vector registers.
Instead, wider values can occupy a single vector register and narrower values can occupy a fraction of
a vector register.

In the base, which only allows widening/narrowing to/from an EEW of 2*SEW, there is little value in the smaller fractions.
No quad or higher widening or narrowing is in the base.
So at best fractional LMUL=1/2 is currently useful.

However, as I suggested

Towards quantifying Optimization: explicit ideas related to #24 in riscv-code-size-reduction

Deprecate JALR with the low bit set.

We could provide an optional compatibility mode, but it will not be used.
No compiler generates this code, nor would it. It is useless.

First make it reserved. Then determine its best use.

A possible use is as the high bit below the sign in the offset of a revised JALR.
If the new code is run on an old implementation, the problem should manifest in an obvious way.
To quote spec:

Although there is potentially a slight loss of error checking in this case, in practice
jumps to an incorrect instruction address will usually quickly raise an exception.

TODOs for V extension

Big-endian behavior of whole register load/store riscv/riscv-v-spec#549
What support for big-endian is actually needed overall?
Do we need in-register layout to match in-memory layout, as per little-endian?
More code may be immediately portable if in-register order is always little-endian memory order.
If loads/stores observe endianness, is that sufficient?

will be

TODOs for fast-interrupt

enhance support for co-routines and thus allow a level change to use already-saved registers.

can be used horizontally

application can check that co-routine did not clobber the registers reserved for itself

one halt loop suffices for multiple levels

current code assumes level 0 and pause for horizontal interrupt?
riscv/riscv-fast-interrupt@2fc1965

current thought: enhance xnxti to use low bits to define a co-routine flag.

csrrci and csrrsi can provide different polarities, especially as it is coupled with interrupt disable/enable (xie clear/set).

use the low 8 bits of xintstatus for tracking co-routine / interrupt saved-stack status.

This status is a declared state, by any privilege level, that their x2 register is a stack pointer
  to saved data that, if restored, will allow correct ongoing program behaviour.
 In addition to being readable here,
  the state is set through xnxti instructions and  
  the state is aggregated to provide status  in the low bits of 
   the vector in rd provided by the xnxti instructions.
Aggregated status is also provided in the low bits of xscratchcsw[l]


Perhaps unexpectedly, the plan is that xintthresh will not affect these bits.

xintstatus bit recovqq

 a code section sets this bit with a csrrsi xnxti when setrecovqq is 1
         // not sure of this yet ------ and clears it when the bit is 0.
 when set, it informs the co-routine (including interrupts) that the partner process is able to fully restore qq state
           the coroutine is thus able to use qq state with impunity iff it resets this bit
      the partner routine must, at the end of the recoverable qq section, check statrecqq and recover qq state if cleared
           further, it will clear statrecqq to terminate the shared qq state code region
           a single csrrci xnxti instruction does both:
                 current recovqq state is placed in the setrecovqq bit location and
                             recovqq is cleared if setrecovqq is set in the immediate field.

new branch instruction on low bits set.

 generally available functionality, but especially valuable in a minimal register usage case where either:
         no other register is available (without spill) to mask specific bits,
                     or reload of the value is costly (as a CSR read may be).

TODO for isa-manual

clarify:
section 2.2

The behavior upon decoding a reserved instruction is unspecified.

add prior to this:

When the RISC-V standard privileged architecture is implemented,
the behavior upon decoding an instruction in the base that is not implemented is to trap (as an illegal instruction).

vxsat in high bit like interrupt in cause.

Fortunately fcsr does not have this issue:

Bits 31–8 of the fcsr are reserved for other standard extensions, including the “L” standard extension for decimal floating-point. If these extensions are not present, implementations shall ignore writes to these bits and supply a zero value when read. Standard software should preserve the contents of these bits.

3.9. Vector Control and Status Register vcsr: The vxrm and vxsat separate CSRs can also be accessed via fields in the vector control and status CSR, vcsr.
And as we are on the topic of reworking:

Perhaps vxsat should be in the high-order bit, like the interrupt bit is in cause, and reduce our CSR load.
Do we really need the convenience of it?

augment comment on register wrap-around.

add this

Vector registers do not wrap around through zero. This constraint is to help provide forward-compatibility with a future longer instruction encoding that has more addressable vector
registers.

perhaps move to general register addressing section with reminder at this location
near end of section 7.8

Solve widening problems by using fractional (interleaved) structures for source.

A1.

Even though v0.9-draft-20200424 establishes a new meaning for the SLEN parameter, which interleaves SEW elements, there is pressure to retain SLEN=VLEN so that the byte/half-word/word/double/etc. in-register structure does not mismatch in-memory order.

The new SLEN, and the SLEN before it, were introduced to support widening operations.

This is the fundamental challenge, how to widen from a given element width to an effective width efficiently across multiple micro-architectures; avoiding element positioning skew and long wiring lengths.

A2.

So far, vertical and horizontal approaches to accommodate double-width results from fully packed source registers have proved challenging. Each creates anomalies in the register structure.

The original vertical striping rigidly allocated power-of-2 register groups, as it compounded at each level of widening. The applied striping length partially mitigated this, but still allowed the in-register structure to mismatch in-memory order on a register-group and implementation-specific SLEN basis.

The current proposal provides horizontal striping at LMUL=1 and above to mitigate for machines with large VLEN. However, as “smaller” machines will not need this functionality, it has the potential to fragment the eco-system, especially as (H)SLEN < VLEN introduces in-register vs in-memory anomalies. Although there are various proposals to mitigate this disparity, all retain some risk of fragmentation.

A3.

I propose an alternate approach to resolve the widening operation dilemma.
It is inherent in the existing #421 fractional fill proposal.

Proposal:

Overview:

The fundamental concept is rather than attempt to accommodate widened results from fully packed sources, instead size and format the sources so that the results can be accommodated in a fully packed target register set.

Specifically:

The two source widening operations only source from fractional register structures.
Fractional structures are defined that are 1/2 and 1/4 populated (per operand) depending upon whether quad or double widening operators are defined. See**
The fractional structures allow two components of the structure to be filled and accessed independently.
(it would be an extension to access each component for 1/4, 1/8 etc. )
The structure has modes in vtype that select from these independent segments for widening and load/store.
Register multipliers (LMUL) work on all register structures, providing Effective VLEN of LMUL * VLEN. see***

So, initially fill ratios are 1, 1/2 and 1/4 to support dual source widening operations and 1/8 to support load/store related enhanced single source widening.

Relevant details from #421:

The new field, vfill, fulfills two distinct purposes: fractional cluster order (fill) and selection (element location).
Corresponding mask segments are active for each selected cluster. see****
These are specific to three cases for fractional data for:

  1. For one vector operand instructions: provides the fill degree and order.
    Examples:
    load/store
    vclstr/vdclstr
    mask ordinal
    narrowing

  2. For two operand single-SEW instructions it determines the participating clusters.
    Examples:
    vadd.vv vadd.vi vfadd.vv
    vmseq.vv vmseq.vx

  3. For two operand widening instructions it determines the participating clusters.
    Examples:
    vwadd.vv vwadd.vx vwadd.wv

The structure and values are chosen to minimize the vfill state changes in typical code sequences.
The encoding is independent of LMUL>=1 to allow register groups for all values of LMUL, from 1 to 8. vlmul is reduced to 2 bits, with LMUL values 1,2,4 and 8. Only the power of 2 limits are needed to validate the used register groups. see***

CLSTR and vclstr are defined to determine the clustering size. CLSTR is the minimum size in bytes allocated to successive elements until it is met or exceeded, before moving on to the next cluster chunk. The CLSTR value is defined by the field vclstr in vcsr.

Although #461 defines cluster size in the context of the horizontal interleaving introduced with v0.9-draft-20200424 , it retains the same definition here. For further details see #461.

Define vfill, a new 3 bit field stored within the lower 11 bits of vsetvli. see*****

The following table specifies the register layout(structure) and sources established with different codes in vfill:

vfill | One vector operand | Two vector operands | Widening   | equivalent to original LMUL
      | odd : even         | odd : even          | odd : even |
000   | X0                 | X0                  | ~          | LMUL=1
001   | -  X1              | -  X1               | -  X1      | LMUL=1/2
010   | X1 -               | X1 -                | X1 -       | LMUL=1/2
011   | ~                  | ~                   | W1 X1      | LMUL=1/2
101   | -  X2              | -  X2               | -  X2      | LMUL=1/4
110   | X2 -               | X2 -                | X2 -       | LMUL=1/4
111   | ~                  | Y2 X2               | W2 X2      | LMUL=1/4 : Note 1
100   | -  X4              | -  X4               | -  X4      | LMUL=1/8

Notes:
1 – For vfill=111 two operand, vl counts the pairs of operations.

Legend:

~ not a valid combination (reserved)

"-" gap of size equal to CLSTR size

X0 consecutively numbered elements (clusters with no gaps)
[i+n-1] .... [i+2] [i +1] [i+0] where n is 2 * CLSTR
and i is determined by two cluster boundary.

X1 consecutively numbered elements (clusters with equal size gap)
[i+n-1] .... [i+2] [i +1] [i+0] where n is number elements in a cluster
and i is determined by cluster boundary.

X2 same as X1 except effective cluster size is CLSTR / 2

  X1 and X2 can occupy even or odd sides of gap/cluster pair. 

X4 same as X1 except effective cluster size is CLSTR / 4
and is only allocated on even side.

Y2 equivalent to X2 but can occupy odd cluster location only.
These odd clusters are processed in tandem with the X even clusters, such that vl * 2 operations are performed.

W1 and W2 are equivalent to X1 and X2 but occupy odd cluster location only.
For widening ops vs1 is sourced from this odd cluster location.
while vs2 is sourced from even cluster location.
When vs1 = vs2 a single physical register sources both operands.

One vector operand instructions:
Load exemplifies the processing. Either the odd or even cluster in the ‘gap/cluster’ or ‘cluster/gap’ pair is chosen by vfill.

For even clusters, elements are filled from the lower bits until the cluster is filled, the gap is skipped and the next cluster filled, etc. until vl is exhausted.

For odd clusters, the initial gap of CLSTR bytes is skipped, the cluster is filled, the rest (if any) of the CLSTR bytes is skipped to the next CLSTR gap/cluster pair, and the process repeated until vl is exhausted.

Note: the corresponding bits in V0 are used to mask elements for instruction with vm=0.

The same element numbering derived for load applies to store and all other one-vector-register instructions.
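A small C sketch of this even/odd cluster fill follows; it assumes a register laid out as repeating pairs of CLSTR-byte chunks and is only meant to illustrate the byte offsets, with the parameter values in main() chosen arbitrarily.

```c
/* Illustrative only: byte offset of element `elem` when filling even or odd
 * clusters of CLSTR bytes, with the unselected side of each pair left as a gap. */
#include <stdio.h>

static unsigned cluster_offset(unsigned elem, unsigned elem_bytes,
                               unsigned clstr_bytes, int odd_side)
{
    unsigned per_cluster = clstr_bytes / elem_bytes;   /* elements per cluster    */
    unsigned cluster     = elem / per_cluster;         /* which selected cluster  */
    return cluster * 2 * clstr_bytes                   /* skip earlier pairs      */
         + (odd_side ? clstr_bytes : 0)                /* skip gap for odd side   */
         + (elem % per_cluster) * elem_bytes;          /* position in the cluster */
}

int main(void)
{
    for (unsigned e = 0; e < 8; e++)
        printf("elem %u: even @ byte %2u, odd @ byte %2u\n", e,
               cluster_offset(e, 1, 4, 0), cluster_offset(e, 1, 4, 1));
    return 0;
}
```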

Single operand widening instructions

The selection is the same as for “One vector operand instructions”.
The vd results are aligned right or left for a source at vfill=100, 101 and 110 if the effective vfill level is 101, 001 or 010. If the resultant level is 000 it fills the full 2 * cluster chunk.
Similarly, if the source is at 001 or 010, then vd's effective vfill is 000, and it fills the full 2 * cluster chunk.

For two operand single-SEW instructions:
The same element numbering derived for load applies to each vector and the corresponding mask bits, whether selected from the even or odd clusters.

For vfill = (100, x01) or x10, both operands for the instruction are selected from (even, even) or odd clusters, respectively, one group of elements from each of the two registers vs1 and vs2. The result is stored in the corresponding elements of the (even, even) or odd cluster of vd, respectively.

For vfill=111, two operations occur for each value of vl. The even ( X ) elements are processed as described for vfill=101, with the result written to the element of the even vd cluster. The odd ( Y ) elements are processed as described for vfill=110, with the result written to the element of the odd vd cluster.

Note: the setting vfill=011 is currently reserved for two operand single-SEW operations because it is the same as if the operation were performed with vfill=000.

In all cases the corresponding mask bits in v0 for each used cluster element are in effect.

For two operand widening instructions

For vfill= 100, x01 or x10 double widening instructions select cluster source elements the same way as for two operand single-SEW instructions. However, the corresponding vd is in the next higher vfill level. This even/odd works for vfill=101 and 110 with correspondingly larger even/odd effective vfill=001 and 010 respectively.

When vfill=001 or 010 the vd result is always in the 2 * CLSTR sized group of elements at vfill=000.

For quad widening the result is 2 vfill levels up. For vfill=100 source the result is at vfill=001.
Note, vfill=0xx cannot be quad widened.

For vfill=011 or 111 and double widening operations, the operands for the instruction are selected from both an even and an odd cluster. For vs1 the elements are selected from the even clusters within the register. For vs2 the elements are selected from the odd clusters within the register. The vd result has an effective vfill one level higher than the sources. The result is thus either vfill=000 for a vfill=011 source, or vfill=001 for a vfill=111 source.

For quad widening, vfill=011 is not allowed.
However, vfill=111 is allowed for quad widening and yields a vfill=000 vd result.

A4.

Applications that know beforehand when they are going to perform widening operations can readily tailor the input to match those operations.
This design has many favourable characteristics:

  • the design is not ILEN32 specific; the same concerns occur regardless of instruction encoding, and thus the ILEN64 model is not hampered by it.
  • logical register groups from 1 to 8 are orthogonally supported.
  • Widening using vfill=011 with vs1 = vs2 uses a fully packed physical register group to create another fully packed register group.
    Two of the same widening instruction, executed first with vfill=001 and then with vfill=010, will widen a vfill=000 register group into two like-sized register groups.
  • two of the same widening instruction using vfill=011 but exchanging vs1 and vs2 will, for many aggregating operations, perform a valid step without changing vtype.
  • Non-widening operations are also fully supported at all levels by this design.
  • single-SEW operations at vfill=000 execute on all elements equivalently to performing an odd (vfill=001) then an even (vfill=010) pair of the same operation.
  • This is a minimal vfill design. More functionality, such as full component selection at lower vfill levels (higher interleave levels), notably 1/2 CLSTR and 1/8 CLSTR, is possible. see****

A5.

This proposal is as radical a departure from vertical striping as horizontal interleave via SLEN<VLEN.
- Should we make such a change so near a v0.9 release? (various perception and retooling concerns)

Fractional register dependency implies lesser register usage.
The design (appears to) require more vsetvli changes. It may require augmenting widening and load/store operations to reduce that effect.
This proposal should be improved by making CLSTR/vclstr programmable;
e.g. when CLSTR is set to EW, then vfill=001 will select the even of an element pair, and vfill=010 will select the odd.

see**

Octal and higher will be defined if such higher order operations are defined.

see*** LMUL should allow all values from 1 to 8 as explained in #460

see****

for simplicity assumes #448 Ordinal based mask, but other mask encode can be compatible.

see*****

But other than odd side support for 1/4 CLSTR size (1/8th EW for “Normal physical register format) which would enable widening from both sides and increase register use, I don’t know if it is worth it.


recoveryqq dump

enhance support for co-routines and thus allow a level change to use already-saved registers.
can be used horizontally
application can check that co-routine did not clobber the registers reserved for itself
one halt loop suffices for multiple levels
current code assumes level 0 and pause for horizontal interrupt?
riscv/riscv-fast-interrupt@2fc1965

scenarios :

  1. define code section that can be re-runnable even if interrupt clobbers a specified set of registers
    co-routine code
    <>
    set redoable flag on register set q1
    1l:
    reset redo flag for q1
    .... execute re-runnable code segment ....
    if redo flag for q1 is set branch to 1l
    reset redoable flag for q1
    <>

interrupt co-routine handler code
save non-q1 registers that are used in handler
if redoable flag clear save q1 registers
‘’’’ execute some interrupt handler code ‘’’’
if redoable q1 flag was set on entry
and q1 register set modified
set redo flag for q1
if redoable q1 flag not set on entry
restore q1 saved registers
restore non-q1 save registers
return
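The C sketch below mirrors the control flow of scenario 1. The two flags stand in for the proposed per-register-set state, and the "interrupt" is simulated by a direct call, so it only illustrates the ordering of the checks, not real asynchronous behaviour.

```c
/* Control-flow illustration only; flag names are hypothetical stand-ins. */
#include <stdbool.h>
#include <stdio.h>

static volatile bool redoable_q1; /* co-routine declares q1 is re-runnable */
static volatile bool redo_q1;     /* handler reports it clobbered q1       */

static void interrupt_handler(bool clobbers_q1)
{
    /* save non-q1 registers ... */
    bool q1_saved = !redoable_q1;          /* save q1 only if not re-runnable */
    if (q1_saved) { /* save q1 registers */ }
    /* ... handler body, possibly using q1 ... */
    if (!q1_saved && clobbers_q1)
        redo_q1 = true;                    /* ask the co-routine to redo      */
    if (q1_saved) { /* restore q1 registers */ }
    /* restore non-q1 registers and return */
}

static void coroutine_section(void)
{
    int pass = 0;
    redoable_q1 = true;                    /* enter re-runnable region        */
    do {
        redo_q1 = false;
        /* ... re-runnable code segment using only q1 registers ... */
        interrupt_handler(pass++ == 0);    /* simulate one clobbering interrupt */
    } while (redo_q1);                     /* rerun if q1 was clobbered       */
    redoable_q1 = false;                   /* leave re-runnable region        */
    printf("segment completed after %d pass(es)\n", pass);
}

int main(void) { coroutine_section(); return 0; }
```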

Uses include
a) reducing code segments that need interrupts disabled
b) reducing interrupt save/restores

Specifically this code is applicable to nested interrupt handlers.
2) on return from interrupt handler allow saved register set to be reused if immediately followed by an interrupt.
In the examples below the necessary modifications to the standard process are defined.

Note: The exact order, relative to each other and the existing functionality, is sub-optimal and would be incorrect for some implementations. For example, in 2a the reset of the reusable q1 flag in the handler prolog could be performed coincident with the read of its state.

Note: In practice, the discrete functionality will need to be reordered to avoid special cases in the hardware design. For example,

This functionality requires hardware support.

 A) This functionality allows reuse of the register sets saved in the trap handler
         by the next interrupt at the same level if:
           i) xret does not complete, or
           ii) no instruction is executed before the next interrupt is actioned.

      For 2a hardware support is required but is minimal:
        i)   two bits in CSRs, a reusable q1 flag for each privilege level
        ii)  if an interrupt is imminent (pending and enabled),
             MRET cannot execute any instructions until the interrupt is engaged.
        iii) the reusable q1 flag is reset if, after return, any instruction is executed.

B) The functionality in 2b is enhanced over 2a to allow
M-mode and S-mode to share the other's saved register sets.

  2b requires the same hardware support as 2a, and
    additionally requires S-mode to clear
    M-mode's reusable q1 flag.

2a) horizontal interrupts with minimal hardware flag reset.
<<modified interrupt handler epilog
a) leaves x2 in xswap pointing to saved q1 register set
b) set reusable q1 flag for current privilege level x
c) execute modified xret >>
<< modified xret
at end of standard xret if pending interrupt
a) do not execute target instruction
b) start interrupt process
otherwise
a) reset reusable q1 flag
b) resume interrupted code >>

<<modified interrupt prolog
<<perform standard preliminary sp setup except
do not advance sp by -FRAMESIZE>>
save non-q1 registers
if reusable q1 flag clear
save q1 set registers
otherwise
a) leave previously saved q1 registers unchanged
b) [[ possibly reset reusable q1 flag ]]
<>

This 2a variant is comparable to other architectures' hardware stacking feature that avoids writing registers if just saved.
(specifically when an interrupt arises during return)

2b) vertical interrupts with minimal hardware flag reset.
<<modified S-mode interrupt handler epilog
a) leaves x2 in sswap pointing to saved q1 register set
b) set S-mode reusable q1 flag
c) execute modified sret >>
<< modified sret
at end of standard sret if pending S-mode or M-mode interrupt
a) do not execute target instruction
b) start interrupt process (S-mode or M-mode)
otherwise
a) reset S-mode reusable q1 flag
b) resume interrupted code >>

<<modified M-mode interrupt prolog
<<perform standard preliminary sp setup except
do not advance sp by -FRAMESIZE>>
save non-q1 registers
if neither the S-mode nor the M-mode reusable q1 flag is set
a) save q1 set registers
otherwise
a) leave previously saved q1 registers unchanged
b) reset reusable q1 flags [Not technically needed?]
<>

<<modified M-mode interrupt handler epilog
a) if S-mode reusable q1 flag was set on entry to M-mode handler
execute
b) restore all registers from M-mode saved register frame
(which will include q1 register set)
b) leaves x2 in mswap pointing to saved register frame
c) set M-mode reusable q1 flag
d) execute modified mret >>

a) if S-mode reusable q1 flag was set on entry to M-mode handler
do not load q1 set of registers from M-mode stack

<< modified mret
at end of standard mret if pending S-mode or M-mode interrupt
a) do not execute target instruction
b) start interrupt process (S-mode or M-mode)
otherwise
a) reset S-mode reusable q1 flag
b) resume interrupted code >>

  3. Allow reuse of saved registers by interrupt handlers if the saved data is still current.

3b) horizontal interrupts with comprehensive hardware flag reset.
<< as in 2a: modified interrupt handler epilog
a) leaves x2 in xswap pointing to saved q1 register set
b) set reusable q1 flag for current privilege level x
c) execute modified xret >>
<< standard xret, revised from 2a
resume interrupted code >>
<< modified resumed code behaviour, new for 2b
a) execute instructions as normal
b) however, if executed instruction modifies a q1 register
reset reusable q1 flag
Note: reset of q1 flag occurs in all privilege modes >>

<<modified interrupt prolog
<<perform standard preliminary sp setup except
do not advance sp by -FRAMESIZE>>
save non-q1 registers
if reusable q1 flag clear
save q1 set registers
otherwise
a) leave previously saved q1 registers unchanged
b) [[ possibly reset reusable q1 flag ]]
<>

This 2b variant goes beyond other architectures' hardware stacking feature that avoids writing registers if just saved.
Interruptible routines that are q1-register-set aware can avoid their use in heavily dynamically executed code segments in an interrupt-heavy environment, and make considerably more progress than without this feature.
2c) horizontal interrupts with multiple hardware flag reset.
<< as in 2b: modified interrupt handler epilog
a) leaves x2 in xswap pointing to saved qx register sets
b) set reusable qx flags for current privilege level x
c) execute modified xret >>
<< standard xret, revised from 2a
resume interrupted code >>
<< modified resumed code behaviour, parallels 2b
a) execute instructions as normal
b) however, if executed instruction modifies a qx register
reset that reusable qx flag in all priv modes>>

<<modified interrupt prolog
<<perform standard preliminary sp setup except
do not advance sp by -FRAMESIZE>>
save non-qx registers
for each reusable qx flag that is clear
save qx set registers
otherwise
a) leave previously saved qx registers unchanged
b) [[ possibly reset reusable qx flags ]]
<>

This 2c variant provides further granularity of register sets.
As before, qx-register-set-aware interruptible routines can tailor register use in heavily dynamically executed code segments, leveraging the additional flexibility that the multiple qx flags provide.
I believe the sweet spot may be 2 qx sets that encompass all the registers, {x16...x31} and {x1,x3...x15}. Note, sp (by convention x2) will (also by convention) be saved in xswap. As such, x2 can be the sole register modified in such coroutine sections and avoid any register saving to memory.
2d) vertical interrupts with hardware flag reset (s to m).
<< as in 2c: modified supervisor interrupt handler epilog
a) leaves x2 in sswap pointing to saved S-mode qx register sets
b) set reusable qx flags for supervisor privilege levels
c) execute standard sret >>
<< modified resumed code behaviour, same as 2c except
a) execute instructions as normal
b) however, if executed instruction modifies a qx register
reset both M-mode and S-mode reusable qx flag
Note: initially only U-mode is affected >>

<<modified M-mode interrupt prolog
<<perform standard M-mode sp setup except
do not advance sp by -FRAMESIZE>>
save non-qx registers
for each M-mode reusable qx flag that is clear
save qx set registers
otherwise
a) leave previously saved qx registers unchanged
b) [[ possibly reset reusable qx flag ]]
<>

<<modified M-mode epilog
a) restore all registers saved by prolog
i) all non-qx registers
ii) all qx registers with reusable qx flag set on entry
b) if no reusable qx flag set on entry
i) set continue with normal epilog (mepc etc. and mret)
including setting M-mode
c) if any reusable qx flag set on entry,
perform special M-mode trampoline return> >>

<<special M-mode trampoline return

 a) select first of the qx set that had 
      reusable qx flag set on entry (needs 3+ registers).
 b) store saved x2 in first register of that set.
 c) store saved mret
     (interrupt return address, to be used by sret)
     in second register of selected qx space.
 d) store saved mstatus  (to be used by sret)
     in third register of selected qx space.
 e) set x2 to identify the to be recovered qx sets.
 f) set mepc to <trampoline code in S-mode space>
 g) set MPP to S-mode
 h) set MPIE to disable S-mode interrupts.
 i) execute (standard) mret to <trampoline code in S-mode space> >>

<<trampoline code in S-mode space
a) check x2 for first qx set to use,
branch to appropriate code to do the following:
b) recover x2 from sswap
c) recover all or most of set qx SSIE and SPP from register containing mstatus data
ii)

current thought: enhance xnxti to use low bits to define a co-routine flag.
csrrci and csrrsi can provide different polarities, especially as it is coupled with interrupt disable/enable (xie clear/set).
use the low 8 bits of xintstatus for tracking co-routine / interrupt saved-stack status.
This status is a declared state, by any privilege level, that their x2 register is a stack pointer
to saved data that, if restored, will allow correct ongoing program behaviour.
In addition to being readable here,
the state is set through xnxti instructions and
the state is aggregated to provide status in the low bits of
the vector in rd provided by the xnxti instructions.
Aggregated status is also provided in the low bits of xscratchcsw[l]
Perhaps unexpectedly, the plan is that xintthresh will not affect these bits.
xintstatus bit recovqq

qq is a set of saved registers on the stack pointed to by x2.
As a result, x2 is not considered part of the recoverable set; it must be re-established before return to the co-routine.
a code section sets this bit with a csrrsi xnxti when setrecovqq is 1
// not sure of this yet ------ and clears when bit is 0.
when set, it informs the co-routine (including interrupts) that the partner process is
able to fully restore qq state.
the coroutine is thus able to use qq state with impunity iff it signals dorecovqq to the co-routine.
the partner routine must, at the end of the recoverable qq section, check dorecovqq and recover qq state.
further, it will clear statrecqq to terminate the shared qq state code region.

plan: a single csrrci xnxti instruction does both
current dorecovqq state is placed in setrecovqq bit location and
recovqq is cleared if setrecovqq is set in immediate field.
Bit dorecovqq needs to be set on return from co-routine if coroutine messed with qq state.

Dorecovqq could be an offset from zero that allows C.JR to return right back to the process.

It is cleared when recovqq is set by csrrsi.
new branch instruction on low bits set.
generally available functionality, but especially valuable in a minimal register
use case where either:
no other register is available (without spill) to mask specific bits,
or reload of the value is costly (as a CSR read may be).

Note: for 2a some implementations may already have this functionality.
Specifically, if the implementation already detects a pending enabled interrupt and
a) vectors to the start of the interrupt routine and
b) leaves xepc unchanged,
then the low bit of xepc can be used as the hardware signal for M and S modes.

dealing in hypotheticals?

It appears we are trading in hypotheticals.

In the 2020/7/17 meeting we went over this issue again. The consensus at the meeting was to stay with PoR, which might alter vl when SEW/LMUL ratio changes rather than set vill. The sentiment was that the debugging value was small, and there was some potential uses for the behavior.

Rereading Guy's comments above again, the register form of vsetvl was not considered in TG discussion, though that is a relatively rare form of the instruction and in general won't benefit from static checking (e.g., for unsupported SEW).

We did discuss making the behavior in these cases "reserved", so we could redefine those cases in future to perform some useful alternative vl setting function, but this would not enable forward compatibility via emulation (neither would setting vill), so the decision was that a different encoding should be used when a different function is required. This makes me reconsider our PoR.

For a while, I've been trying to think of cases where a vsetvli x0,x0, with different SEW/LMUL would perform a useful function and I struggle with coming up with plausible use cases. One example I had thought of was when taking a vector of say bytes and wanting to cast these into vector of halfwords. The instruction could be used to change vl appropriately, though only in one direction and this only really works when vl=VLMAX initially, which we can accomplish more reliably using the rd!=x0, rs1=x0 variant.

More methodically, when new SEW'/LMUL' > existing SEW/LMUL, then vl might either stay the same or be clipped. I struggle to see the utility there for portable software.

When SEW'/LMUL' < SEW/LMUL then vl will be unchanged, but upper portions of vector register groups at SEW' will be inaccessible. There might be use there, but it seems a bit esoteric. While one could possibly use the lower portions of a vector register group to work with other vector register groups, the same effect would be had simply by using a different LMUL to keep the upper portion undisturbed (and maybe setting tail-undisturbed for fractional LMUL).

I am coming around to the idea that the vsetvl{i} x0, x0 form should set vill if the SEW/LMUL ratio would change, leaving vl unchanged in all cases. This might also simplify hardware, as there is no potential change in vl with this instruction.
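For illustration only, a small C sketch of the arithmetic being discussed: VLMAX before and after a vtype change, and whether an in-flight vl exceeds the new VLMAX (and so could be clipped under the PoR behaviour). LMUL is expressed in eighths so fractional values stay integral; the example numbers are arbitrary.

```c
/* Illustrative only: VLMAX = LMUL*VLEN/SEW, with LMUL given in eighths. */
#include <stdio.h>

static unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul_x8)
{
    return (vlen_bits * lmul_x8) / (sew_bits * 8);
}

int main(void)
{
    unsigned vlen = 128, vl = 16;            /* e.g. vl = VLMAX at SEW=8, LMUL=1 */
    unsigned old_max = vlmax(vlen, 8, 8);    /* SEW=8,  LMUL=1 -> 16             */
    unsigned new_max = vlmax(vlen, 16, 8);   /* SEW=16, LMUL=1 ->  8             */
    printf("old VLMAX=%u, new VLMAX=%u, vl=%u %s\n", old_max, new_max, vl,
           vl > new_max ? "(exceeds new VLMAX; could be clipped under PoR)"
                        : "(fits unchanged)");
    return 0;
}
```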

TEMP

= Zero page relocation

[NOTE]

This proposal is entirely based on David Horner's work, written up by Tariq.

This proposal adds new CSRs to control the behaviour of unusual encodings, which are not very useful in general, and changes the behaviour to make it more useful.

== zero page JALR

=== CSR

  • MZPJALR[0] = enable
  • MZPJALR[31:12] = base
  • MZPJALR[11:1] - reserved

[NOTE]

The https://github.com/riscv/riscv-code-size-reduction/blob/master/ISA%20proposals/Huawei/table%20jump.adoc[table jump proposal] reduces the usefulness of this

== Behaviour

If MZPJALR.enable=0 then the behaviour of JALR rd, offset(x0) is unchanged.

If MZPJALR.enable=1 then the behaviour of JALR rd, offset(x0) takes on a new meaning.

. x0 is substituted for MZPJALR.base.
. offset is shifted left twice before use to give a bigger range.

Therefore the behaviour is:

[source,sourceCode,text]

jalr rd, offset(x0);# executes as jalr rd, MZPJALR.base+offset*4


This gives a 16KB region which can always be accessed by jalr from anywhere in the address map. Note that there is no 16-bit form as x0 cannot be specified as the base register for c.jalr.
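A hedged C sketch of the address formation described above (assuming the MZPJALR field layout from the CSR section) follows.

```c
/* Illustrative only: zero-page JALR target formation per the proposal above. */
#include <stdint.h>
#include <stdio.h>

static uint32_t zp_jalr_target(uint32_t mzpjalr, int offset /* sign-extended 12-bit imm */)
{
    if ((mzpjalr & 1) == 0)                 /* enable clear: ordinary JALR off x0 */
        return (uint32_t)offset & ~1u;
    uint32_t base = mzpjalr & 0xfffff000u;  /* MZPJALR[31:12]                     */
    return base + (uint32_t)(offset * 4);   /* offset shifted left twice          */
}

int main(void)
{
    uint32_t csr = 0x00f00000u | 1u;        /* base 0x00f00000, enabled           */
    printf("target = 0x%08x\n", zp_jalr_target(csr, 0x10));
    return 0;
}
```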

== Zero Page load/store

=== CSR

  • MZPLDST[0] = enable
  • MZPLDST[31:12] = base
  • MZPLDST[11:1] - reserved

== Behaviour

If MZPLDST.enable=0 then the behaviour of [lq|ld|ldu|lw|lwu|lh|lhu|lb|lbu] rd, offset(x0) is unchanged.

If MZPLDST.enable=0 then the behaviour of [sq|sd|sw|sh|sb] rs2, offset(x0) is unchanged.

If MZPLDST.enable=1 then the behaviour of [lq|ld|ldu|lw|lwu|lh|lhu|lb|lbu] rd, offset(x0) takes on a new meaning.

If MZPLDST.enable=1 then the behaviour of [sq|sd|sw|sh|sb] rs2, offset(x0) takes on a new meaning.

. x0 is substituted for MZPLDST.base.
. offset is shifted left by log2 of the access width in bytes before use, to give a bigger range.

Therefore the behaviour is:

[source,sourceCode,text]

lb[u] rd, offset(x0);# executes as lb[u] rd, MZPLDST.base+offset
lh[u] rd, offset(x0);# executes as lh[u] rd, MZPLDST.base+offset*2
lw[u] rd, offset(x0);# executes as lw[u] rd, MZPLDST.base+offset*4
ld[u] rd, offset(x0);# executes as ld[u] rd, MZPLDST.base+offset*8
lq rd, offset(x0);# executes as lq rd, MZPLDST.base+offset*16

sb rs2, offset(x0);# executes as sb rs2, MZPLDST.base+offset
sh rs2, offset(x0);# executes as sh rs2, MZPLDST.base+offset*2
sw rs2, offset(x0);# executes as sw rs2, MZPLDST.base+offset*4
sd rs2, offset(x0);# executes as sd rs2, MZPLDST.base+offset*8
sq rs2, offset(x0);# executes as sq rs2, MZPLDST.base+offset*16


This gives a region of between 4KB (byte accesses) and 64KB (quadword accesses) which can always be accessed by loads and stores from anywhere in the address map without using the global pointer, although the whole range cannot be accessed with all data widths.
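A minimal C sketch of the address computation, under the same modelling assumptions as the JALR sketch above (names are illustrative only):

[source,c]

#include <stdint.h>

/* Hypothetical model of the zero-page load/store address when rs1 = x0.
 * mzpldst:    raw CSR value, bit 0 = enable, bits 31:12 = base.
 * offset:     sign-extended 12-bit load/store immediate.
 * width_log2: 0 for lb/sb, 1 for lh/sh, 2 for lw/sw, 3 for ld/sd, 4 for lq/sq. */
static uint32_t zp_ldst_addr(uint32_t mzpldst, int32_t offset, unsigned width_log2)
{
    if ((mzpldst & 1) == 0)
        return (uint32_t)offset;                        /* legacy: x0 + offset */
    uint32_t base = mzpldst & 0xFFFFF000u;              /* MZPLDST.base is 4KB aligned */
    return base + ((uint32_t)offset << width_log2);     /* offset scaled by access width */
}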

== Application

If compiling with the GCC option -fstack-protector-strong then every function in the Huawei IoT code has these:

[source,sourceCode,text]

e04a5e: 00f00437 lui s0,0xf00
e04a62: 02c42783 lw a5,44(s0) # f0002c <__stack_chk_guard>

Some functions also have this (sometimes it's a 32-bit sequence to call it)

[source,sourceCode,text]

10bef2c: ffd47097 auipc ra,0xffd47
10bef30: f52080e7 jalr -174(ra) # e05e7e <__stack_chk_fail>

These could be replaced by a zero-page jalr and lw, meaning that 64-bit sequences would never be required. Additionally, table jump can be used for the calls to __stack_chk_fail.

== Link Time Optimisation?

Can the linker make use of this feature, so the compiler doesn't need to know about it?

Introduce vlmt (vl multiplicative threshold) / VLMT Vector LiMiT

This proposal expands register groups to 8 physical registers for all LMUL<8, including fractional LMUL.

LMUL=8 already uses 8 physical registers and continues to use all 8.

Recapping LMUL>=1 register groups:

Register groups addressed by register numbers that
  • are a multiple of 8 have 8 physical registers,
  • are a multiple of 4 but not 8 have 4 physical registers,
  • are a multiple of 2 but not 4 have 2 physical registers,
  • are not a multiple of 2 have a single physical register.

These are named reggroup types 8, 4, 2 and 1 respectively.

As with LMUL>1, the register number determines the maximum number of physical registers that the underlying register group can provide.
These are equivalent to the previous constraints of LMUL 8, 4, 2 and 1 respectively.

If vl exceeds any source or destination reggroup type size, an illegal instruction exception is raised.

Register Group processing:
For LMUL<=1 the addressed reggroup is processed as if successive consecutively numbered physical registers are appended to the first (up to the reggroup type).
This results in linear registers with an effective VLEN of up to 8 times the original length.

For LMUL=8 no extension is possible. The register groups of type 8 are not affected.

For LMUL=2 and LMUL=4 the reggroup is processed as if each successive register group is concatenated as register pairs or quads to the first group. The effective VLEN is up to 4 times VLEN for LMUL=2 and twice VLEN for LMUL=4.

VLMT, with values 1 through 8, specifies the maximum number of physical registers that vl can address.
When LMUL>1, VLMT must be a power of two with LMUL <= VLMT.

VLMT is set by the vlmt 3-bit field in vtype.
When vlmt = 0, VLMT is set to max(1,LMUL) [the max function is needed only if fractional LMUL values are defined; otherwise the setting is simply LMUL]. Thus a hardcoded value of zero disables the facility. All other values of vlmt set VLMT to 2 through 8 according to the vlmt encoding.

On execution of vsetvl[i], if LMUL>1 and LMUL > VLMT then vill is set (and the rest of the bits of vtype are set to 0).

The derived value VLMAX is now defined in terms of VLMT, and is equal to VLMT*VLEN/SEW and represents the maximum number of elements that can be operated on with a single vector instruction given the current SEW and VLMT settings.
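As a worked example (values chosen purely for illustration), with VLEN=128, SEW=32 and VLMT=8 this gives VLMAX = 8*128/32 = 32 elements, eight times the 4 elements available at LMUL=1. A minimal sketch of the calculation:

[source,c]

#include <stdint.h>

/* Sketch of the derived VLMAX under this proposal: VLMT*VLEN/SEW.
 * Parameters are plain integers for illustration; no encoding is implied. */
static uint32_t vlmax(uint32_t vlen_bits, uint32_t sew_bits, uint32_t vlmt)
{
    return vlmt * vlen_bits / sew_bits;   /* e.g. 8 * 128 / 32 = 32 elements */
}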

Possible encodings of vlmt, with these considerations:

  1. vl is directly affected by vlmt, so placing it in the lower 11 bits allows the vsetvli instruction to control it directly. These bits are also competing for other important configuration fields, specifically: sew, lmul and ediv. There are more fields pending, including expanding lmul.

  2. In that vlmt is used only to determine the value of vl (allowing it to span physical registers), the value of vlmt does not need to be persistent. When re-establishing vl after a context switch it suffices to set vlmt to the representation of VLMT=8 in the saved vtype and execute vsetvl as usual with the saved vl as AVL (rs1).

  3. The values 3, 5, 6 and 7 can reasonably be expected to occur less frequently, which suggests splitting the field into two sections, with one bit selecting between the powers of 2 and the other values.

Some alternatives:

  1. standard bit representation of 3 consecutive bits in the lower 12 bits.
  2. standard bit representation of 3 consecutive bits in the upper 20 bits [31:12].
  3. split fields (see the decode sketch after this list) with
    vlmt.h one bit in the upper 20 bits [31:12]
    vlmt.l two bits in the lower 11 bits.
    When vlmt.h = 0, VLMT = 2 ** vlmt.l
    When vlmt.h = 1, VLMT = 3, 5, 6 and 7 when vlmt.l = 0, 1, 2, 3 respectively.
  4. same as (3) with vlmt.h not set/reset by vsetvli.
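A minimal decode sketch for alternative 3, assuming vlmt.h is a single bit and vlmt.l a two-bit field (function and variable names are illustrative only):

[source,c]

#include <stdint.h>

/* Decode VLMT from the split vlmt.h / vlmt.l fields of alternative 3.
 * vlmt_h = 0: VLMT = 2 ** vlmt_l, i.e. 1, 2, 4 or 8.
 * vlmt_h = 1: VLMT = 3, 5, 6 or 7 for vlmt_l = 0, 1, 2, 3. */
static uint32_t decode_vlmt(uint32_t vlmt_h, uint32_t vlmt_l)
{
    static const uint32_t non_pow2[4] = {3, 5, 6, 7};
    if (vlmt_h == 0)
        return 1u << (vlmt_l & 3);
    return non_pow2[vlmt_l & 3];
}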

Mask mappings:

A significant aspect of the expanded register groups is the mapping of the mask bits to the elements.
LMUL>1 has already established a mapping appropriate to the fixed grouping (powers of 2) inherent in its structure.
LMUL=8 is completely established and cannot be expanded.

Mapping for LMUL<=1.

Within the mask, MLEN=SEW/8. Each mask position corresponding to a SEW-wide element is therefore divided into 8 equal segments. As previously, only the least significant bit in each segment is the mask bit. On write the other upper bits are cleared, and on read only the low-order bit is checked.
The fill order places the even levels (physical register offsets 0,2,4,6) in the low segments of the SEW-length field, and the odd levels (physical register offsets 1,3,5,7) in the higher segments.
The set order, from most significant segment to least, is specifically:
7,3,5,1,6,2,4,0
(A lookup-table sketch of this mapping follows the fill table below.)
fill table

phys reg \ bit |  7 |  6 |  5 |  4 |  3 |  2 |  1 |  0
fill order     |    |    |    |    |    |    |    |
0              |    |    |    |    |    |    |    |  x
1              |    |    |    |  x |    |    |    |
2              |    |    |    |    |    |  x |    |
3              |    |  x |    |    |    |    |    |
4              |    |    |    |    |    |    |  x |
5              |    |    |  x |    |    |    |    |
6              |    |    |    |    |  x |    |    |
7              |  x |    |    |    |    |    |    |
set order      |    |    |    |    |    |    |    |
7              |  x |    |    |    |    |    |    |
3              |    |  x |    |    |    |    |    |
5              |    |    |  x |    |    |    |    |
1              |    |    |    |  x |    |    |    |
6              |    |    |    |    |  x |    |    |
2              |    |    |    |    |    |  x |    |
4              |    |    |    |    |    |    |  x |
0              |    |    |    |    |    |    |    |  x
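Read as above (even register offsets in the low four segments, set order 7,3,5,1,6,2,4,0 from most to least significant), the mapping can be captured as a small lookup table; this is a sketch of that reading, not a normative definition:

[source,c]

/* Segment (bit position within the SEW-wide mask field) holding the mask bit
 * for each physical register offset 0..7, per the fill/set order above:
 * bit 7 <- reg 7, bit 6 <- reg 3, bit 5 <- reg 5, bit 4 <- reg 1,
 * bit 3 <- reg 6, bit 2 <- reg 2, bit 1 <- reg 4, bit 0 <- reg 0. */
static const unsigned mask_segment_of_reg[8] = {0, 4, 2, 6, 1, 5, 3, 7};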

LMUL=2 is fully compatible with LMUL=1, and the fill order for the first two physical registers matches.

For LMUL=4 the current MLEN fields are subdivided and the upper halves of each are used for the allocation of the 5 through 7 physical registers. The fill order could go either of two ways: to match the 5 through 7 of LMUL=1 and 2, or to match the lower halves.

With the current strawman model, fractional LMUL (1/2,1/4,1/8) halves the effective mask length at each lower level, matching the cluster size. As the mask is SEW-length based, the masking approach applies identically within the corresponding mask field of each element. The same gap between clusters applies to the mask as well as within the physical registers.

Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location

The new field, vfill, fulfills two distinct purposes: fractional cluster fill order and cluster selection (element location).
The corresponding mask segments are active for each selected cluster.
Three cases apply for fractional data:

  1. For one vector operand instructions: provides the fill degree and order.
    Examples:
    load/store
    vclstr/vdclstr
    mask ordinal
    narrowing

  2. For two operand single-SEW instructions it determines the participating clusters.
    Examples:
    vadd.vv vadd.vi vfadd.vv
    vmseq.vv vmseq.vx

  3. For two operand widening instructions it determines the participating clusters.
    Examples:
    vwadd.vv vwadd.vx vwadd.wv

The structure and values are chosen to provide backward compatibility with LMUL>=1, and to minimize vfill state changes in typical code sequences.
Conceptually the field lmul is superseded by a two-field pair: vlvl (identical to the current lmul field in size, values and location) and vfill (a new 2-bit field).
When vfill is zero, vlvl determines the LMUL level exactly as the lmul field would, with all non-fractional functionality working as before.
When vfill is non-zero, additional fractional LMUL functionality is in effect.
Specifically, the fractional levels 1/2, 1/4 and 1/8 are selected by non-zero vfill together with vlvl values 1, 2 and 3 respectively.
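A minimal decode sketch of this vlvl/vfill pairing (LMUL is returned as a numerator/denominator pair; everything beyond the encodings stated above is an assumption for illustration):

[source,c]

#include <stdint.h>

/* Decode LMUL from vlvl and vfill as described above.
 * vfill == 0: integral LMUL = 2 ** vlvl (1, 2, 4, 8), as with the current lmul field.
 * vfill != 0: fractional LMUL 1/2, 1/4, 1/8 for vlvl = 1, 2, 3.
 * (vlvl = 0 with non-zero vfill is not listed in the table below; it is
 *  returned here as LMUL = 1 purely to keep the sketch total.) */
static void decode_lmul(uint32_t vlvl, uint32_t vfill, uint32_t *num, uint32_t *den)
{
    if (vfill == 0) {
        *num = 1u << (vlvl & 3);   /* LMUL = 1, 2, 4, 8 */
        *den = 1;
    } else {
        *num = 1;
        *den = 1u << (vlvl & 3);   /* LMUL = 1, 1/2, 1/4, 1/8 */
    }
}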
table:

vlvl (prev lmul) | vfill | One vector operand | Two vector operands | Widening   | comment
                 |       | odd : even         | odd : even          | odd : even |
00               | 00    | X0                 | X0                  | N/A        | LMUL=1 : Note 1
01               | 00    | N/A                | N/A                 | N/A        | LMUL=2
10               | 00    | N/A                | N/A                 | N/A        | LMUL=4
11               | 00    | N/A                | N/A                 | N/A        | LMUL=8
01               | 01    | -  : X1            | -  : X1             | -  : X1    | LMUL=1/2
01               | 10    | X1 : -             | X1 : -              | X1 : -     | LMUL=1/2
01               | 11    | ~                  | Y1 : X1             | W1 : X1    | LMUL=1/2 : Note 3&4
10               | 01    | -  : X2            | -  : X2             | -  : X2    | LMUL=1/4
10               | 10    | X2 : -             | X2 : -              | X2 : -     | LMUL=1/4
10               | 11    | ~                  | Y2 : X2             | W2 : X2    | LMUL=1/4
11               | 01    | -  : X4            | -  : X4             | -  : X4    | LMUL=1/8
11               | 10    | X4 : -             | X4 : -              | X4 : -     | LMUL=1/8
11               | 11    | ~                  | Y4 : X4             | W4 : X4    | LMUL=1/8

Notes:
1 – LMUL=1 is transitional for fractional LMUL.
The structure is compatible in the limiting case for both clustered and striped.
2 – For vfill=11 two operand instructions, vl counts the pairs of operations.
3 – Consider: using vfill=11 single operand to process double the iterations of vfill=01.

Legend:

N/A not applicable to fractional operations

~ not a valid combination (reserved)

"-" gap of size equal to LMUL=1/2

X0 consecutively numbered elements (clusters with no gaps)
[i+n-1] .... [i+2] [i+1] [i+0] where n is 2 * CLSTR
and i is determined by the two-cluster boundary.

X1/2/4 can occupy even or odd sides of a gap/cluster pair.

X1 consecutively numbered elements (clusters with an equal-size gap)
[i+n-1] .... [i+2] [i+1] [i+0] where n is the number of elements in a cluster
and i is determined by the cluster boundary.

X2 same as X1 except effective cluster size is CLSTR / 2

X4 same as X1 except effective cluster size is CLSTR / 4

Y1/2/4 equivalent to X1/2/4 but occupy odd cluster locations only.
These odd clusters are processed in tandem with the X even clusters, such that vl * 2 operations are performed.

W1/2/4 equivalent to X1/2/4 but occupy odd cluster locations only.
For widening ops vs1 is sourced from this odd cluster location
(while vs2 is sourced from the even cluster location).
When vs1 = vs2 a single physical register sources both operands.

One vector operand instructions:
Load exemplifies the processing. Either the odd or the even cluster of each gap/cluster (or cluster/gap) pair is chosen by vfill.

For even clusters, elements are filled from the lower bits until the cluster is filled, the gap is skipped and the next cluster filled, etc. until vl is exhausted.

For odd clusters, the initial gap of CLSTR bytes is skipped, the cluster is filled, the rest (if any) of the CLSTR bytes is skipped to the next CLSTR gap/cluster pair, and the process is repeated until vl is exhausted.

Note: the corresponding bits in v0 are used to mask elements for instructions with vm=0.

The same element numbering derived by load applies to store and to all other one-vector-operand instructions.
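A sketch of the element-to-position mapping this describes, working in elements rather than bytes (elements per cluster = CLSTR / (SEW/8)); the names and the flat view of the register group are illustrative assumptions:

[source,c]

#include <stdint.h>

/* Position, in elements from the low end of the register group, of element i
 * when filling only even clusters (odd = 0) or only odd clusters (odd = 1).
 * Each cluster is paired with an equally sized gap, so clusters of the same
 * parity are 2*clstr_elems apart. */
static uint32_t cluster_element_pos(uint32_t i, uint32_t clstr_elems, int odd)
{
    uint32_t cluster = i / clstr_elems;   /* which cluster of this parity */
    uint32_t within  = i % clstr_elems;   /* offset inside that cluster */
    return cluster * 2 * clstr_elems + within + (odd ? clstr_elems : 0);
}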

For two operand single-SEW instructions:
The same element numbering derived by load applies to each vector and to the corresponding mask bits, whether selected from the even or odd clusters.

For vfill=01 or 10, both operands for the instruction are selected from either even or odd clusters, respectively, one from each of the two registers vs1 and vs2. The result is stored in the corresponding element in the even or odd cluster of vd, respectively.

For vfill=11, two operations occur for each value of vl. The even ( X ) elements are processed as described for vfill=01, with the result written to the element of the even vd cluster. The odd ( Y ) elements are processed as described for vfill=10, with the result written to the element of the odd vd cluster.

In all cases the corresponding bits in v0 for each used cluster element are in effect.

For two operand widening instructions:

For vfill=01 or 10, widening instructions select cluster source elements the same way as two operand single-SEW instructions. However, the corresponding vd is in the next higher LMUL level. This odd/even placement works for LMUL=1/8 and 1/4, with correspondingly larger odd/even destination clusters at 1/4 and 1/2.
When LMUL=1/2 (vlvl=1) the vd result is always in the 2 * CLSTR sized group of elements.

For vfill=11 and LMUL=1/8 or 1/4, the two widening operations (even and odd) with the same vl value occur as described for single-SEW instructions (replacing (Y) with (W)).

For vfill=11 and LMUL=1/2 only one widening operation occurs. As with vfill=01 or 10, vd is written into the 2 * CLSTR sized group of elements. However, the sources are chosen from both even and odd clusters. The element from vs2 ( X ) is selected from the even cluster, the element from vs1 ( W ) is selected from the odd cluster. This allows a single register to source both the elements for the widening operation when vs1 = vs2.
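A sketch of that source selection, modelling the register group as a flat element array and using a widening add purely as an example operation (all names are illustrative):

[source,c]

#include <stdint.h>

/* Widening op at vfill=11, LMUL=1/2: for each element i, vs2 supplies the
 * even-cluster element (X) and vs1 the odd-cluster element (W); the widened
 * result fills the 2*CLSTR-sized destination group consecutively.
 * Passing the same array for vs1 and vs2 models vs1 = vs2, where a single
 * physical register sources both operands. */
static void widen_vfill11_half(const int16_t *vs2, const int16_t *vs1,
                               int32_t *vd, uint32_t vl, uint32_t clstr_elems)
{
    for (uint32_t i = 0; i < vl; i++) {
        uint32_t cluster  = i / clstr_elems;
        uint32_t within   = i % clstr_elems;
        uint32_t even_pos = cluster * 2 * clstr_elems + within;   /* X element */
        uint32_t odd_pos  = even_pos + clstr_elems;               /* W element */
        vd[i] = (int32_t)vs2[even_pos] + (int32_t)vs1[odd_pos];   /* e.g. vwadd-style */
    }
}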
