A pipelined, in-order, scalar VHDL implementation of the MRISC32 ISA

Home Page: https://gitlab.com/mrisc32/mrisc32-a1


mrisc32-a1's Introduction

MRISC32-A1

This is a VHDL implementation of a single issue, in-order CPU that implements the MRISC32 ISA. The working name for the CPU is MRISC32-A1.

Overview

MRISC32-A1 pipeline diagram

Progress

The CPU is nearing completion but still under development. The following components have been implemented:

  • A 9-stage pipeline.
    • PC and branching logic.
    • Instruction fetch.
    • Decode.
    • Register fetch.
    • Execute.
    • Data read/write logic (scalar and vector).
    • Register write-back.
    • Operand forwarding.
  • The integer ALU.
    • Supports all packed and unpacked integer ALU operations.
    • All ALU operations finish in one cycle.
  • A pipelined (three-cycle) integer multiply unit.
    • Supports all packed and unpacked integer multiplication operations.
  • A semi-pipelined integer and floating point division unit.
    • The integer division pipeline is 3 stages long, while the floating point division pipeline is 4 stages long.
    • 32-bit division: 15/12 cycles stall (integer/float).
    • 2 x 16-bit division: 7/5 cycles stall (integer/float).
    • 4 x 8-bit division: 3/2 cycles stall (integer/float).
  • A pipelined (two-cycle) Saturating Arithmetic Unit (SAU).
    • Supports all packed and unpacked saturating and halving arithmetic instructions.
  • An IEEE 754 compliant(ish) FPU.
    • The following single-cycle FPU instructions are implemented:
      • FMIN, FMAX
      • FSEQ, FSNE, FSLT, FSLE, FSUNORD, FSORD
    • The following three-cycle FPU instructions are implemented:
      • ITOF, UTOF, FTOI, FTOU, FTOIR, FTOUR
    • The following four-cycle FPU instructions are implemented:
      • FADD, FSUB, FMUL
    • Both packed and unpacked FPU operations are implemented.
  • The scalar register file.
    • There are three read ports and one write port.
  • The vector register file.
    • There are two read ports and one write port.
    • Each vector register has 16 elements (configurable).
  • An address generation unit (AGU).
    • The AGU supports all addressing modes.
  • Branch prediction and correction.
    • A direct mapped 2-bit dynamic branch predictor (512 entries, configurable).
    • A return address stack predictor (16 entries, configurable).
    • The branch misprediction penalty is 3 cycles (a correctly predicted branch incurs no penalty).
  • A direct mapped instruction cache.
  • Two 32-bit Wishbone (B4 pipelined) interfaces to the memory.
    • Instruction and data requests have separate Wishbone interfaces.
    • One memory request can be completed every cycle per interface.

TODO: Data cache, interrupt logic.

Configurability

The aim is for the MRISC32-A1 to implement the complete MRISC32 ISA, which means that it is a fairly large design (including an FPU, hardware multiplication and division, packed operations, etc).

If the design is too large or complex for a certain target chip (FPGA), it is possible to disable many features via T_CORE_CONFIG (see config.vhd). E.g. setting HAS_MUL to false will disable support for hardware multiplication.

It is also possible to change the vector register size by changing the value of C_LOG2_VEC_REG_ELEMENTS (4 means 16 elements, 5 means 32 elements, 8 means 256 elements, and so on).
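
For illustration, the vector register file footprint implied by C_LOG2_VEC_REG_ELEMENTS can be modeled with a small Python sketch (the 32 registers and the 32-bit element width are fixed by the MRISC32 ISA; the helper name is made up):

```python
def vector_regfile_bits(log2_vec_reg_elements: int) -> int:
    """Bits stored per vector register file RAM instance."""
    num_registers = 32    # vector registers defined by the MRISC32 ISA
    element_bits = 32     # each element is a 32-bit word
    elements = 1 << log2_vec_reg_elements
    return num_registers * elements * element_bits

# Default configuration (4 -> 16 elements): 16384 bits per instance.
assert vector_regfile_bits(4) == 16384
```

Doubling C_LOG2_VEC_REG_ELEMENTS thus doubles the RAM needed per vector register file instance.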

Performance

The MRISC32-A1 can issue one operation per clock cycle.

When synthesized against an Intel Cyclone V FPGA, the maximum clock frequency is close to 100 MHz.

mrisc32-a1's People

Contributors

mbitsnbites


mrisc32-a1's Issues

Implement FSQRT

This probably requires different solutions for different float widths. E.g. for f8 we can most likely use a simple LUT solution, while for f32 we may need to use Newton-Raphson or similar.
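
As a point of reference, the Newton-Raphson approach mentioned above can be sketched in Python (a behavioral model only; a hardware implementation would typically iterate on the reciprocal square root to avoid a divider in the loop, and seed the initial guess from a LUT):

```python
def newton_sqrt(a: float, iterations: int = 10) -> float:
    """Newton-Raphson on f(x) = x^2 - a, i.e. x' = (x + a/x) / 2."""
    if a == 0.0:
        return 0.0
    x = a if a > 1.0 else 1.0  # crude seed; hardware would use a small LUT
    for _ in range(iterations):
        x = 0.5 * (x + a / x)
    return x
```

Convergence is quadratic, so the iteration count (and thus the latency) depends heavily on how good the LUT seed is.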

Turn CPU config into a generic on the core entity

Right now the config (CPU capabilities) is given in a separate file in the design.

It would be better for users of the VHDL code if the configuration could be passed as generics to the entity instantiation. This would also enable a single system to instantiate several cores with different configs.

A1: Investigate treating short fw branches as predicates

If a branch instruction only advances the PC by 4 bytes when the branch is taken, we essentially have an instruction (the one following the branch) that is predicated. In some situations it would be beneficial to treat that instruction as conditionally executed instead of handling the branch as usual.

If we have a branch misprediction, we could just let the execution flow continue but replace the predicated instruction with a bubble. Or something like that.

Redesign the register files to save BRAM

Currently we use five RAM instances for the register files (35840 effective bits in total):

  • Three 1024-bit RAM instances for the scalar register file (three read ports).
  • Two 16384-bit RAM instances for the vector register file (two read ports).

In a Cyclone V FPGA this translates to 70 Kbits BRAM usage in total, as follows:

  • 3 x M10K BRAM blocks for the scalar register file.
  • 4 x M10K BRAM blocks for the vector register file.

That means that we are wasting 50% of the memory bits.

Try out different strategies. E.g. try using MLABs / distributed RAM for the scalar register file.
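
The arithmetic behind the 50% figure, as a quick Python check (M10K capacity per Intel's Cyclone V documentation):

```python
M10K_BITS = 10 * 1024  # one Cyclone V M10K block stores 10 Kbits

scalar_bits = 3 * 1024       # three copies of the 32 x 32-bit scalar RF
vector_bits = 2 * 16384      # two copies of the 32 x 16 x 32-bit vector RF
used_bits = scalar_bits + vector_bits      # 35840 effective bits

allocated_bits = (3 + 4) * M10K_BITS       # 7 M10K blocks = 70 Kbits
waste = 1.0 - used_bits / allocated_bits   # 0.5 -> 50% of the bits unused
```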

Implement an ICache

The ICache is more important than the DCache for most applications, and it is easier to implement.

Having an ICache will leave the shared memory bus free for the data interface most of the time, letting the instruction fetch stage run uninterrupted even during data operations.

An ICache is also very useful for systems with slow memory (e.g. SDRAM). More so than a DCache since the CPU needs one instruction per clock cycle, while it may not perform data accesses on every clock cycle.

Implement late forwarding for MADD

With late forwarding of the addend, the MADD instruction would work as a MAC with zero latency for consecutive multiply+add operations such as:

    madd  r1, r2, r3
    madd  r1, r4, r5
    madd  r1, r6, r7  ; r1 = r1 + r2 * r3 + r4 * r5 + r6 * r7

We probably only need to worry about forwarding of outputs from the MADD unit.

Don't stall bubbles

It should be relatively easy to "pop bubbles" during a stall (i.e. don't propagate the stall signal to earlier stages if a stage is currently holding a bubble).
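
A minimal Python model of the idea (not the VHDL; stage contents are represented as instruction names, with None for a bubble):

```python
def stall_signals(stages, stall_in):
    """Compute per-stage stall flags, walking from the downstream end
    of the pipeline toward fetch. A stage holding a bubble absorbs the
    stall: its bubble is overwritten, so earlier stages keep moving."""
    stalls = []
    stall = stall_in
    for insn in stages:  # stages[0] is closest to write-back
        if stall and insn is None:
            stall = False  # "pop" the bubble instead of stalling upstream
        stalls.append(stall)
    return stalls

# A bubble between the two real instructions shields the upstream
# stage from the downstream stall.
assert stall_signals(["mul", None, "add"], stall_in=True) == [True, False, False]
```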

Execute PC- and Z-relative unconditional branches in the ID stage rather than in the EX1 stage (?)

B and BL do not need access to any registers, and so should be possible to execute in the ID stage (essentially compute PC + Imm and always branch). The same goes for zero-relative branches (j/jl z, addr).

This would reduce the branch misprediction penalty to 1 cycle for B/BL. C-style for-loops would benefit from this, for instance:

  .loop:
    slt     s9, s20, s21    ; s20 < s21?
    ...
    bns     s9, .loop_done  ; Predicted not taken: +3 cycles on last iteration
    add     s20, s20, 1
    ...
    b       .loop           ; Predicted not taken: +1 cycle on first iteration

.loop_done:
    ...

The potential extra cost is additional muxing in the PC / BTB, as well as more logic for calculating the branch target in ID.

Caveat: A branch in ID must be considered speculative, since up to 2 earlier branches may be further down the pipeline, waiting to potentially invalidate the branch instruction in the ID stage. One solution is to not let the ID branch update the BTB, but wait until the instruction reaches EX to do the update.

Another problem is that more information may be required in order to determine the correctness of the program flow (i.e. whether or not the EX stage should cancel the following instructions).

Implement late forwarding for memory stores

A memory store does not need the data operand until the 2nd execute pipeline stage. Being able to start the store instruction (to calculate the address) before the data operand is ready can save one clock cycle in certain situations, e.g.:

    ldw s1, s3, #0 
    stw s1, s4, #0 

Improve the branch predictor

The current branch predictor is a one-bit predictor (taken / not taken). Add weakly taken / weakly not-taken states (i.e. use a two-bit saturating counter).

Also, a return-address stack predictor would be useful.
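
The proposed two-bit scheme is a saturating counter per predictor entry; a Python sketch of the state machine (states 0-1 predict not-taken, 2-3 predict taken):

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: a single mispredict in a steady loop
    no longer flips the prediction, unlike the 1-bit scheme."""
    def __init__(self, state: int = 1):  # start at weakly not-taken
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2           # 2, 3 -> predict taken

    def update(self, taken: bool) -> None:
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)
```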

Investigate current BTB implementation:

  • Is it optimal or does it contain redundant bits?
  • Can we measure the BTB hit/fail rate?
