DarkRISCV

Opensource RISC-V implemented from scratch in one night!

Introduction

Developed during a magic night on 19 Aug, 2018, between 2am and 8am, the DarkRISCV softcore started as a proof of concept for the open-source RISC-V instruction set.

Although the code is small and crude when compared with other RISC-V implementations, the DarkRISCV has lots of impressive features:

  • implements most of the RISC-V RV32E instruction set
  • implements most of the RISC-V RV32I instruction set (missing csr*, e* and fence*)
  • works up to 250MHz in an UltraScale KU040 (400MHz w/ overclock!)
  • up to 100MHz in a cheap Spartan-6; fits in a small Spartan-3E such as the XC3S100E!
  • can sustain 1 clock per instruction most of the time (typically 71% of the time)
  • flexible Harvard architecture (easy to integrate a cache controller, bus bridges, etc)
  • works fine in real Xilinx FPGAs (Spartan-3, Spartan-6, Spartan-7, Artix-7, Kintex-7 and Kintex UltraScale)
  • works fine with some real Altera and Lattice FPGAs
  • works fine with gcc 9.0.0 for RISC-V (no patches required!)
  • uses between 850-1500 LUTs (core only with LUT6 technology, depending on enabled features and optimizations)
  • optional RV32E support (works better with LUT4 FPGAs)
  • optional 16x16-bit MAC instruction (for digital signal processing)
  • optional coarse-grained multi-threading (MT)
  • no interlock between pipeline stages!
  • BSD license: can be used anywhere with no restrictions!

Some extra features are planned for the future or under development:

  • interrupt controller (under tests)
  • cache controller (under tests)
  • gpio and timer (under tests)
  • sdram controller w/ data scrambler
  • branch predictor (under tests)
  • ethernet controller (GbE)
  • multi-processing (SMP)
  • network on chip (NoC)
  • rv64i support (not as easy as it appears...)
  • dynamic bus sizing and big-endian support
  • user/supervisor modes
  • debug support
  • misaligned memory access
  • bridge for 8/16/32-bit buses

And many other features!

Feel free to make suggestions and good hacking! o/

History

The initial concept was based on my other early 16-bit RISC processors and composed of a simplified two-stage pipeline, where an instruction is fetched from the instruction memory in the first clock and then decoded/executed in the second clock. The pipeline is overlapped without interlocks, so the DarkRISCV can reach the performance of one clock per instruction most of the time, except for taken branches, where one clock is lost in the pipeline flush. Of course, in order to perform read operations in blockrams in a single clock, a single-phase clock with combinational memory OR a two-phase clock with blockram memory is required, so that no wait states are needed in those cases.

As a result, the code is very compact, with around three hundred lines of obfuscated but beautiful Verilog code. After lots of exciting sleepless nights of work and the help of lots of colleagues, the DarkRISCV reached a very good quality level, to the point that the code compiled by the standard GCC for RV32I worked fine.

After two years of development, a three-stage pipeline working with a single clock phase is also available, resulting in a better distribution of logic between the decode and execute stages. In this case the instruction is fetched from a blockram in the first clock, decoded in the second clock and executed in the third clock.

Since the load instruction cannot load data from a blockram in a single clock, the external logic inserts one extra clock in IO operations. Also, there are two extra clocks in order to flush the pipeline in the case of taken branches. The impact of the pipeline flush depends on the compiler optimizations, but according to the latest measurements, the 3-stage pipeline version can reach an instructions-per-clock (IPC) ratio of 0.7, lower than the measured IPC of 0.85 of the 2-stage pipeline version.

Anyway, with the 3-stage pipeline and some other expensive optimizations, the DarkRISCV can reach up to 100MHz in a low-cost Spartan-6, which results in more performance than the 2-stage pipeline version (typically 50MHz).

Project Background

The main motivation for the DarkRISCV was to create a migration path for some projects built around the 680x0/ColdFire family.

Although there are lots of 680x0 cores available, they are designed around different concepts and requirements, and I found few options that matched my requirements (more than 50MIPS with around 1000 LUTs). The best option at this moment, the TG68, requires at least 2400 LUTs (after removing the MUL/DIV instructions) and works up to 40MHz in a Spartan-6. In addition, the TG68 core requires at least 2 clocks per instruction, which means a peak performance of 20MIPS. Given how complex the 680x0 instruction set is, this result is really not bad at all and the TG68 is probably the best open-source option to replace the 68000.

Anyway, it does not match my requirements regarding space and performance. As part of the investigation, I tested other cores, but I found few options as good as the TG68, and I even started designing a risclized-68000 core in order to try to find a solution.

Unfortunately, due to compiler requirements (standard GCC), I found few ways to reduce the space and increase the performance, so I started investigating non-680x0 cores.

After lots of tests with different cores, I found the picorv32 core and all the ecosystem around the RISC-V. The picorv32 is a very nice project and can peak up to 150MHz in a low-cost Spartan-6. Although most instructions require 3 or 4 clocks, the picorv32 resembles the 68020 in some ways, but running at 150MHz and providing a peak performance of 50MIPS, which is very impressive.

Although the picorv32 is a very good option to directly replace the 680x0 family, it is not powerful enough to replace some Coldfire processors (more than 75MIPS).

Since I had some good experience with experimental 16-bit RISC cores for DSP-like applications, I started coding the DarkRISCV only to check the level of complexity and compare it with my risclized-68000. To my surprise, in the first night I mapped almost all instructions of the rv32i specification and the DarkRISCV started to execute the first instructions correctly at 75MHz and with one clock per instruction, which not only resembles a fast and nice 68040, but can also beat some ColdFires! wow! :)

After the success of the first night of work, I started working on fixing small details in the hardware and software implementation.

Directory Description

Although the DarkRISCV is only a small processor core, a small eco-system is required in order to test the core, including RISC-V compatible software, support for simulation and support for peripherals, so that the processor core produces observable results. Related elements are grouped in directories, so that the top level has the following organization:

  • README.md: the top level README file (points to this document)
  • LICENSE: unlimited freedom! o/
  • Makefile: the show start here!
  • src: the source code for the test firmware (boot.c, main.c etc in C language)
  • rtl: the source code for the DarkRISCV core and the support logic (Verilog)
  • sim: the source code for the simulation to test the rtl files (currently via icarus)
  • boards: support and examples for different boards (currently via Xilinx ISE)
  • tmp: empty, but the ISE will create lots of files here

Setup Instructions:

Step 1: Clone the DarkRISCV repo to your local machine using the command below:

git clone https://github.com/darklife/darkriscv.git

Pre Setup Guide for MacOS:

This section covers all the dependencies, and the steps to install them, required to successfully use the DarkRISCV ecosystem on macOS.

Essentially, the ecosystem cannot be used natively on macOS because one of the dependencies, the Xilinx ISE 14.7 Design Suite, currently does not support macOS.

In order to overcome this issue, we need to run Linux/Windows on macOS using one of the two methods below:

a) WineSkin, which is a kind of Windows emulation layer that runs the Windows application natively but intercepts and emulates the Windows calls, mapping them directly to macOS.

b) VirtualBox (or VMware, Parallels, etc) in order to run a complete Windows or Linux OS, which appears to be far better than the WineSkin option.

I used the second method and installed VMware Fusion to run Linux Mint. The download links I used are given in the steps below.

Dependencies:

  1. Icarus Verilog (build dependencies: Bison, Flex, GCC and G++)

  2. Xilinx 14.7 ISE

Icarus Verilog Setup:

The steps below have been condensed for the Linux operating system. Complete steps for all other OS platforms are available at https://iverilog.fandom.com/wiki/Installation_Guide.

Step 1: Download the Icarus Verilog tar file from ftp://ftp.icarus.com/pub/eda/verilog/ . Always install the latest version; verilog-10.3 is the latest version as of this writing.

Step 2: Extract the tar file using ‘% tar -zxvf verilog-version.tar.gz’.

Step 3: Go to the Verilog folder using ‘cd verilog-version’; here it is ‘cd verilog-10.3’.

Step 4: Check if you have the following libraries installed: Flex, Bison, g++ and gcc. If not, use ‘sudo apt-get install flex bison g++ gcc’ in a terminal to install them. Restart the system once for the changes to take effect.

Step 5: Run the commands below in the verilog-10.3 directory:

  1. ./configure
  2. make
  3. sudo make install

Step 6: Alternatively, ‘sudo apt-get install verilog’ installs Icarus Verilog from the distribution packages instead of building it from source.

Optional Step: sudo apt-get install gtkwave

Xilinx Setup:

Follow the video below on YouTube for the complete installation:

https://www.youtube.com/watch?v=meO-b6Ib17Y

Note: Make sure you have libncurses libraries installed in linux.

If not, use the commands below:

  1. For 64-bit architecture: sudo apt-get install libncurses5 libncursesw-dev
  2. For 32-bit architecture: sudo apt-get install libncurses5:i386

Once all prerequisites are installed, go to the cloned repository directory and run the commands below:

cd darkriscv
make (use sudo if required)

The top-level Makefile is responsible for building everything, but it must be edited first: at the very least, the user must select the compiler path and the target board.

By default, the top level Makefile uses:

CROSS = riscv32-embedded-elf
CCPATH = /usr/local/share/gcc-$(CROSS)/bin/
ICARUS = /usr/local/bin/iverilog
BOARD  = avnet_microboard_lx9

Just update the configuration according to your system, type make and hope everything is in the correct location! You will probably need to fix some paths and set some others in the PATH environment variable, but it will eventually work.

And, when everything is correctly configured, the result will be something like this:

# make
make -C src all             CROSS=riscv32-embedded-elf CCPATH=/usr/local/share/gcc-riscv32-embedded-elf/bin/ ARCH=rv32e HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S boot.c -o boot.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c boot.s -o boot.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:20 -0300\"" -DARCH="\"rv32e\"" -S stdio.c -o stdio.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c stdio.s -o stdio.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S main.c -o main.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c main.s -o main.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S io.c -o io.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c io.s -o io.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-gcc -Wall -I./include -Os -march=rv32e -mabi=ilp32e -D__RISCV__ -DBUILD="\"Sat, 30 May 2020 00:55:21 -0300\"" -DARCH="\"rv32e\"" -S banner.c -o banner.s
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c banner.s -o banner.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-cpp -P  -DHARVARD=1 darksocv.ld.src darksocv.ld
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld -Tdarksocv.ld -Map=darksocv.map -m elf32lriscv  boot.o stdio.o main.o io.o banner.o -o darksocv.o
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-ld: warning: section `.data' type changed to PROGBITS
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objdump -d darksocv.o > darksocv.lst
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary  darksocv.o darksocv.text --only-section .text* 
hexdump -ve '1/4 "%08x\n"' darksocv.text > darksocv.rom.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.rom.mem
rm darksocv.text
wc -l darksocv.rom.mem
1016 darksocv.rom.mem
echo rom ok.
rom ok.
/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-objcopy -O binary  darksocv.o darksocv.data --only-section .*data*
hexdump -ve '1/4 "%08x\n"' darksocv.data > darksocv.ram.mem
#xxd -p -c 4 -g 4 darksocv.o > darksocv.ram.mem
rm darksocv.data
wc -l darksocv.ram.mem
317 darksocv.ram.mem
echo ram ok.
ram ok.
echo sources ok.
sources ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/src'
make -C sim all             ICARUS=/usr/local/bin/iverilog HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
/usr/local/bin/iverilog -I ../rtl -o darksocv darksimv.v ../rtl/darksocv.v ../rtl/darkuart.v ../rtl/darkriscv.v
./darksocv
WARNING: ../rtl/darksocv.v:280: $readmemh(../src/darksocv.rom.mem): Not enough words in the file for the requested range [0:1023].
WARNING: ../rtl/darksocv.v:281: $readmemh(../src/darksocv.ram.mem): Not enough words in the file for the requested range [0:1023].
VCD info: dumpfile darksocv.vcd opened for output.
reset (startup)

              vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
                  vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvv  
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvv    
rr                vvvvvvvvvvvvvvvvvvvvvv      
rr            vvvvvvvvvvvvvvvvvvvvvvvv      rr
rrrr      vvvvvvvvvvvvvvvvvvvvvvvvvv      rrrr
rrrrrr      vvvvvvvvvvvvvvvvvvvvvv      rrrrrr
rrrrrrrr      vvvvvvvvvvvvvvvvvv      rrrrrrrr
rrrrrrrrrr      vvvvvvvvvvvvvv      rrrrrrrrrr
rrrrrrrrrrrr      vvvvvvvvvv      rrrrrrrrrrrr
rrrrrrrrrrrrrr      vvvvvv      rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr      vv      rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr          rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr      rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr  rrrrrrrrrrrrrrrrrrrrrr

       INSTRUCTION SETS WANT TO BE FREE

boot0: text@0 data@4096 stack@8192
board: simulation only (id=0)
build: darkriscv fw build Sat, 30 May 2020 00:55:21 -0300
core0: [email protected] with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)

Welcome to DarkRISCV!
> no UART input, finishing simulation...
echo simulation ok.
simulation ok.
make[1]: Leaving directory `/home/marcelo/Documents/Verilog/darkriscv/v38/sim'
make -C boards all          BOARD=piswords_rs485_lx9 HARVARD=1
make[1]: Entering directory `/home/marcelo/Documents/Verilog/darkriscv/v38/boards'
cd ../tmp && xst -intstyle ise -ifn ../boards/piswords_rs485_lx9/darksocv.xst -ofn ../tmp/darksocv.syr
Reading design: ../boards/piswords_rs485_lx9/darksocv.prj

*** lots of weird FPGA related messages here *** 

cd ../tmp && bitgen -intstyle ise -f ../boards/avnet_microboard_lx9/darksocv.ut ../tmp/darksocv.ncd
echo done.
done.

This means that the software compiled and linked correctly, the simulation worked correctly and the FPGA build produced an image that can be loaded into your FPGA board with a make install (in case you have an FPGA board and, of course, a JTAG support script in the board directory).

If the FPGA is correctly programmed and the UART is attached to a terminal emulator, the FPGA will be configured with the DarkRISCV, which will run the test software and produce the following result:

              vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
                  vvvvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrrrr    vvvvvvvvvvvvvvvvvvvvvvvv
rrrrrrrrrrrrrrrr      vvvvvvvvvvvvvvvvvvvvvv  
rrrrrrrrrrrrr       vvvvvvvvvvvvvvvvvvvvvv    
rr                vvvvvvvvvvvvvvvvvvvvvv      
rr            vvvvvvvvvvvvvvvvvvvvvvvv      rr
rrrr      vvvvvvvvvvvvvvvvvvvvvvvvvv      rrrr
rrrrrr      vvvvvvvvvvvvvvvvvvvvvv      rrrrrr
rrrrrrrr      vvvvvvvvvvvvvvvvvv      rrrrrrrr
rrrrrrrrrr      vvvvvvvvvvvvvv      rrrrrrrrrr
rrrrrrrrrrrr      vvvvvvvvvv      rrrrrrrrrrrr
rrrrrrrrrrrrrr      vvvvvv      rrrrrrrrrrrrrr
rrrrrrrrrrrrrrrr      vv      rrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrr          rrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrr      rrrrrrrrrrrrrrrrrrrr
rrrrrrrrrrrrrrrrrrrrrr  rrrrrrrrrrrrrrrrrrrrrr

       INSTRUCTION SETS WANT TO BE FREE

boot0: text@0 data@4096 stack@8192
board: piswords rs485 lx9 (id=6)
build: darkriscv fw build Fri, 29 May 2020 23:56:39 -0300
core0: [email protected] with rv32e+MT+MAC
uart0: 115200 bps (div=868)
timr0: periodic timer=1000000Hz (io.timer=99)

Welcome to DarkRISCV!
> 

The beautiful ASCII RISCV logo was produced by Andrew Waterman! [6]

As long as the build works, it is possible to start making changes, but my recommendation when working with soft processors is not to work on the hardware and the software at the same time! This means that it is better to freeze the hardware and work only on the software, or freeze the software and work only on the hardware. It is perfectly possible to do research on both, but not at the same time, otherwise you will find the DarkRISCV in a non-working state after software and hardware changes and you will not be sure where the problem is.

"src" Directory

The src directory contains the source code for the test firmware, which includes the boot code, the main process and auxiliary libraries. The code is compiled via gcc, and some auxiliary files are produced along the way, for example:

  • boot.c: the original C code for the boot process
  • boot.s: the assembler version of the C code, generated automatically by the gcc
  • boot.o: the compiled version of the C code, generated automatically by the gcc

When all .o files are produced, the result is linked into a darksocv.o ELF file, which is used to produce the darksocv.bin file, which is converted to hexadecimal and split into ROM and RAM files (which are loaded by the Verilog code into the blockRAMs). The linker also produces a darksocv.lst file with a complete listing of the generated code and the darksocv.map file, which shows the map of all functions and variables in the produced code.
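
For reference, each .mem file is a plain list of 32-bit hexadecimal words, one per line, which the RTL loads into inferred blockrams with $readmemh. A minimal sketch of the idea follows (module and signal names are illustrative, not the actual darksocv.v code; the 1024-word depth matches the $readmemh warnings shown in the simulation log above):

module rom_sketch
(
    input             CLK,
    input      [31:0] IADDR,  // instruction address from the core
    output reg [31:0] IDATA   // fetched instruction
);
    // 1024 x 32-bit = 4KB instruction memory, preloaded from the hex image
    reg [31:0] rom [0:1023];

    initial $readmemh("darksocv.rom.mem", rom); // one 8-digit hex word per line

    // synchronous read, as expected from an inferred blockram
    always@(posedge CLK) IDATA <= rom[IADDR[11:2]];
endmodule

The darksocv.ram.mem file is loaded the same way into the data memory.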

The firmware concept is very simple:

  • boot.c contains the boot code
  • main.c contains the main application code (shell)
  • banner.c contains the riscv banner
  • stdio.c contains a small version of stdio
  • io.c contains the IO interfaces

Extra code can be easily added in the compilation by editing the src/Makefile.

For example, in order to add a Lempel-Ziv coder lz.c, it is necessary to make the Makefile aware that we need lz.s and lz.o:

OBJS = boot.o stdio.o main.o io.o banner.o lz.o
ASMS = boot.s stdio.s main.s io.s banner.s lz.s
SRCS = boot.c stdio.c main.c io.c banner.c lz.c

And add an "lz" command to main.c, so that it is possible to call the function via the prompt. Alternatively, it is possible to entirely replace the provided firmware with your own firmware.

"sim" Directory

The simulation, on the other hand, will show some waveforms and makes it possible to check the DarkRISCV operation when running the example code.

The main simulation tool for the DarkRISCV is iSIM from Xilinx ISE 14.7, but the Icarus simulator is also supported via the Makefile in the sim directory (the changes regarding Icarus are active when the symbol ICARUS is detected). I also included a workaround for ModelSim, as pointed out by our friend HYF (the changes regarding ModelSim are active when the symbol MODEL_TECH is detected).

The simulation runs the same firmware as the real FPGA, but in order to improve the simulation performance, the UART code is not simulated, since the 115200 bps rate would require lots of dead simulation time.

"rtl" Directory

The rtl directory contains the DarkRISCV core and some auxiliary files, such as the DarkSoCV (a small system-on-chip with ROM, RAM and IO), the DarkUART (a small UART for debug) and the configuration file, where it is possible to enable and disable some features described in the Implementation Notes section.

"board" Directory

The current supported boards are:

  • id==0 simulation only
  • id==1 avnet_microboard_lx9
  • id==2 xilinx_ac701_a200
  • id==3 qmtech_sdram_lx16
  • id==4 qmtech_spartan7_s15
  • id==5 lattice_brevia2_lxp2
  • id==6 piswords_rs485_lx9
  • id==7 digilent_spartan3_s200
  • id==8 aliexpress_hpc40gbe_k420
  • id==9 qmtech_artix7_a35
  • id==10 aliexpress_hpc40gbe_ku040
  • id==11 papilio_duo_logicstart
  • id==12 qmtech_kintex7_k325
  • id==13 scarab_minispartan6-plus_lx9
  • id==14 colorlighti9_ecp5-45f
  • id==15 colorlighti5_ecp5-25f
  • id==16 ulx3s_ecp5-85f

The organization is self-explanatory, with the vendor, board and FPGA model in the name of the directory. Each board directory contains the project files to be opened in Xilinx ISE 14.x, as well as Makefiles to build the FPGA image for that board model. Although a ucf file is provided in order to generate a complete build with a UART and some LEDs, the FPGA is NOT fully wired in any particular configuration and you must add the pins that you will use in your FPGA board.

Anyway, although not wired, the build always gives you a good estimate of the FPGA utilization and the timing (because the UART output ensures that the complete processor must be synthesized).

Since there are many supported boards, there is no way to test all of them every time, which means that sometimes changes regarding one board may break another board.

Implementation Notes*

[*This section is kept for reference, but the description may not match exactly with the current code]

Since my target is the ultra-low-cost Xilinx Spartan-6 family of FPGAs, the project is currently based on Xilinx ISE 14.7 for Linux, which is the latest available ISE version. However, there is no explicit reference to Xilinx primitives and all logic is inferred directly from Verilog, which means that the project is easily portable to other FPGA families and other environments, as can be observed in the case of the Lattice XP2 support. Anyway, keep in mind that certain Verilog structures may not work well in some FPGAs.

In the last update I included a way to test the firmware on an x86 host, which helps a lot, since it is possible to interact with the firmware and quickly fix some obvious bugs. Of course, the x86 build does not run the boot.c code, since it makes no sense (?) to run the RISC-V boot code on x86.

Anyway, the main recommendation when working with softcores is to never work on the hardware and the software at the same time! Start with the minimum software configuration possible and freeze the software. When implementing new software updates, use the minimum hardware configuration possible and freeze the hardware.

The RV32I specification itself is really impressive and easy to implement (see [1], page 16). Of course, there are some drawbacks, such as the funny little-endian bus (as opposed to the network-oriented big-endian bus found in the 680x0 family), but after some empirical tests it is easy to make it work.

The funny part here is that, after lots of research on adding big-endian support to the DarkRISCV, I found no way to make GCC generate the code and data correctly.

Another drawback in the specification is the lack of delayed branches. Although I understand that they are bad from the conceptual point of view, they are a good trick in order to extract more performance. As a reference, the lack of delayed branches or a branch predictor in the DarkRISCV may reduce the performance by between 20 and 30%, so that the real measured performance may be between 1.25 and 1.66 clocks per instruction.

Although branch prediction is not complex to implement, I found the experimental multi-threading support far more interesting, since it enables using the idle time in the branches to swap the processor thread. Anyway, I will try to debug the branch prediction code in order to improve the single-thread performance.

The core supports 2- or 3-stage pipelines and, although the main logic is almost the same, there are huge differences in how they work. Just for reference, the following section reflects the historic evolution of the core and may not reflect the current core code.

The original 2-stage pipeline design has a small problem concerning the ROM and RAM timing: in order to pre-fetch and execute the instruction in two clocks and keep the pre-fetch continuously working at the rate of 1 instruction per clock (and the same in the execution), the ROM and RAM must respond before the next clock. This means that the memories must be combinational or, at least, use a 2-phase clock.

The first solution for the 2-stage pipeline version with a 2-phase clock is the default solution and makes the DarkRISCV work as a pseudo 4-stage pipeline:

  • 1/2 stage for instruction pre-fetch (rom)
  • 1/2 stage for static instruction decode (core)
  • 1/2 stage for address generation, register read and data read/write (ram)
  • 1/2 stage for data write (register write)

From the processor point of view, there are only 2 stages and from the memory point of view, there are also 2 stages, but they are in different clock phases. Under normal conditions, this is not recommended because it decreases the performance by a 2x factor, but in the case of the DarkRISCV the performance is always limited by the combinational logic regarding the instruction execution.

The second solution with a 2-stage pipeline is to use combinational logic in order to provide the needed results before the next clock edge, so that it is possible to use a single-phase clock. This solution is composed of instruction and data caches, so that when the operand is stored in a small LUT-based combinational cache, the processor can perform the memory operation with no extra wait states. However, when the operand is not stored in the cache, extra wait-states are inserted in order to fetch the operand from a blockram or external memory. According to some preliminary tests, an instruction cache with 64 direct-mapped instructions can reach a hit ratio of 91%. The data cache performance, although not so good (with a hit ratio of only 68%), will be a requirement in order to access external memory and reduce the impact of slow SDRAMs and FLASHes.
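
Just to illustrate the idea of the LUT-based combinational instruction cache, here is a minimal sketch of a 64-entry direct-mapped lookup; the entry count comes from the text above, but the tag split, signal names and refill handshake are assumptions, not the actual DarkRISCV cache code:

module icache_sketch
(
    input         CLK,
    input  [31:0] IADDR,     // instruction address from the core
    input  [31:0] IDATA_MEM, // instruction arriving from the slow memory
    input         FILL,      // asserted by the external logic when a miss is resolved
    output [31:0] IDATA,     // instruction delivered to the core
    output        HIT        // combinational hit: no extra wait states when set
);
    // 64 direct-mapped entries: index = IADDR[7:2], tag = IADDR[31:8]
    reg [31:0] data [0:63];
    reg [23:0] tag  [0:63];
    reg        valid[0:63];

    wire [5:0] index = IADDR[7:2];

    assign HIT   = valid[index] && (tag[index] == IADDR[31:8]);
    assign IDATA = HIT ? data[index] : IDATA_MEM;

    integer i;
    initial for(i=0;i<64;i=i+1) valid[i] = 0;

    // on a miss, the instruction fetched from memory refills the selected entry
    always@(posedge CLK)
        if(FILL)
        begin
            data [index] <= IDATA_MEM;
            tag  [index] <= IADDR[31:8];
            valid[index] <= 1;
        end
endmodule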

Unfortunately, the instruction and data caches are not working anymore for the 2-stage pipeline version and only the instruction cache is working in the 3-stage pipeline. The problem is probably related to the HLT signal and/or the write byte enable in the cache memory.

Neither the use of the cache nor the 2-phase clock performs really well. For this reason, a 3-stage pipeline version is provided, in order to use a single clock phase with blockrams.

The concept in this case is to separate the pre-fetch and decode, so that the pre-fetch can be done entirely on the blockram side of the instruction bus. The decode, in a separate stage, provides extra performance, and the execute stage works in one clock almost all the time, except when a load instruction is executed; in this case, the external memory logic inserts one wait-state. The write operation, however, is executed in a single clock.
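
A minimal sketch of the 3-stage timing follows, just to make the flow explicit; the register names and the NOP-based flush are simplified assumptions, not the actual core code:

// stage 1 (fetch)  : the blockram itself registers the instruction
// stage 2 (decode) : the fetched instruction is registered and pre-decoded
// stage 3 (execute): register read, ALU, load/store and write-back
module pipe3_sketch
(
    input         CLK,
    input  [31:0] IDATA,  // instruction coming from the blockram (stage 1)
    input         JUMP    // taken branch detected in the execute stage
);
    reg [31:0] FETCH;     // stage 1 -> 2 register
    reg [31:0] DECODE;    // stage 2 -> 3 register

    localparam [31:0] NOP = 32'h0000_0013; // addi x0,x0,0

    always@(posedge CLK)
    begin
        // a taken branch flushes both registers, losing two clocks,
        // which matches the FLUSH=2 figure quoted below
        FETCH  <= JUMP ? NOP : IDATA;
        DECODE <= JUMP ? NOP : FETCH;
    end
endmodule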

The solution with wait-states can be used in the 2-stage pipeline version, but it decreases the performance too much. If it were possible to run all versions at the same clock, the theoretical performance in clocks per instruction (CPI), number of clocks to flush the pipeline on a taken branch (FLUSH) and memory wait-states (WSMEM) would be:

  • 2-stage pipe w/ 2-phase clock: CPI=1, FLUSH=1, WSMEM=0: real CPI=~1.25
  • 3-stage pipe w/ 1-phase clock: CPI=1, FLUSH=2, WSMEM=1: real CPI=˜1.66
  • 2-stage pipe w/ 1-phase clock: CPI=2, FLUSH=1, WSMEM=1, real CPI=~2.00

Empirically, the impact of the FLUSH in the 2-stage pipeline is around 20% and in the 3-stage pipeline around 30%. The real impact depends on the code itself, of course... In the case of the wait-states in the memory access for the load instruction, the impact ranges between 5 and 10%, again depending on the code.

However, the clock in the case of the 3-stage pipeline is far better than in the 2-stage pipeline, especially because of the better distribution of the logic between the decode and execute stages.

Currently, the most expensive path in the Spartan-6 is the address bus for the data side of the core (connected to RAM and peripherals). The problem is that the following actions must be done in a single clock:

  • generate the DADDR[31:0] = REG[SPTR][31:0]+EXTSIG(IMM[11:0])
  • generate the BE[3:0] according to the operand size and DADDR[1:0]

In the case of a read operation, the DATAI path also includes a small mux in order to separate the RAM and peripheral buses, as well as the different peripherals, which means that the path grows as the number of peripherals and their complexity increase.
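
A minimal sketch of that critical path, using the DADDR/BE names mentioned above; the SIZE encoding and the remaining signal names are assumptions for illustration only:

module daddr_be_sketch
(
    input      [31:0] S1REG,  // source register (e.g. the stack pointer)
    input      [11:0] IMM,    // 12-bit immediate from the instruction
    input      [1:0]  SIZE,   // operand size: 0=byte, 1=half, 2=word (assumed encoding)
    output     [31:0] DADDR,  // data address, all in one combinational path
    output reg [3:0]  BE      // byte enables for the 32-bit data bus
);
    // address generation: register + sign-extended immediate
    assign DADDR = S1REG + { {20{IMM[11]}}, IMM };

    // byte enable generation, depending on the operand size and DADDR[1:0]
    always@(*)
        case(SIZE)
            2'd0:    BE = 4'b0001 << DADDR[1:0];        // byte access
            2'd1:    BE = DADDR[1] ? 4'b1100 : 4'b0011; // half-word access
            default: BE = 4'b1111;                      // word access
        endcase
endmodule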

Of course, the best-performance setup uses a 3-stage pipeline and a single clock phase (posedge) in the entire logic, so the 2-stage pipeline and the dual-phase clock will be kept only for reference.

The only disadvantages of the 3-stage pipeline are one extra wait-state in the load operation and the longer pipeline flush of two clocks on taken branches.

Just for reference, I registered some details regarding the performance measurements:

The current firmware example, running on the 3-stage pipeline version clocked at 100MHz, runs at a verified performance of 62 MIPS. The theoretical 100MIPS performance is not reached: 5% is lost due to the extra wait-state in the load instruction and 32% due to pipeline flushes after taken branches. The 2-stage pipeline version, on the other side, runs at a verified performance of 79MIPS with the same clock; the only loss is 20% due to pipeline flushes after taken branches.

Of course, the impact of the pipeline flush also depends on the software, and the software is currently optimized for size. When compiled with -O2 instead of -Os, the performance increases to 68MIPS in the 3-stage pipeline and the losses change to 6% for loads and 25% for the pipeline flush. The -O3 option resulted in 67MIPS and the best result was the -O1 option, which produced 70MIPS in the 3-stage version and 85MIPS in the 2-stage version.

So, in case performance is a requirement, the src/Makefile must be changed in order to use the -O1 optimization instead of the -Os default.

And although the 2-stage version is 15% faster than the 3-stage version, the 3-stage version can reach higher clocks and, this way, will provide better performance.

Regarding the pipeline flush, it is required after a taken branch, since the RISC-V does not support delayed branches. The solution for this problem is to implement a branch cache (branch predictor), so that the core populates a cache with the last branches and can predict the future branches. In some initial tests, a branch predictor with 4 entries appears to reach a hit ratio of 60%.

Another possibility is to use the flush time for other tasks, for example handling interrupts. Since interrupt handling and, in a general way, threading require flushing the current pipeline in order to change context, matching the interrupt/threading with the pipeline flush makes some sense!

With the THREADING option it is possible to test this feature.

The implementation is in a very early stage of development and does not correctly handle the initial SP and PC. Anyway, it works and enables the main() code to stop in a gets() while the interrupt handler changes the GPIO at a rate of more than 1 million interrupts per second, without affecting the execution and with little impact on the performance! :)

The interrupt support can be expanded to a more complete threading support, but requires some tricks in the hardware and in the software, in order to populate the different threads with the correct SP and PC.

The interrupt handling uses a concept based on threading and, with some extra effort, it is probably possible to support 4, 8 or even 16 threads. The drawback in this case is that the register bank increases in size, which explains why the rv32e is an interesting option for threading: with half the number of registers per thread, it is possible to store twice as many threads in the core.

Currently, the time to switch context in the darkriscv is two clocks in the 3-stage pipeline, which matches the pipeline flush itself. At 100MHz, the maximum empirical number of context switches per second is around 2.94 million.
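
A minimal sketch of the register bank trick behind the multi-threading: with rv32e, two threads fit in the same 32-entry register file that rv32i would need for a single thread, and the thread pointer simply becomes the top bit of the register index. Signal names and the 2-thread limit here are illustrative assumptions, not the actual core code:

module regbank_mt_sketch
(
    input         CLK,
    input         TPTR,       // current thread pointer (1 bit = 2 threads)
    input  [3:0]  RS1, RS2,   // rv32e source registers (x0..x15)
    input  [3:0]  RD,         // destination register
    input         WE,         // write enable
    input  [31:0] WDATA,
    output [31:0] RDATA1, RDATA2
);
    // 2 threads x 16 registers share one 32-entry register file
    reg [31:0] REGS [0:31];

    integer i;
    initial for(i=0;i<32;i=i+1) REGS[i] = 0; // keeps x0 reading as zero in this sketch

    assign RDATA1 = REGS[{TPTR,RS1}];
    assign RDATA2 = REGS[{TPTR,RS2}];

    always@(posedge CLK)
        if(WE && RD!=0) REGS[{TPTR,RD}] <= WDATA; // x0 is never written
endmodule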

NOTE: the interrupt controller is currently working only with the -Os flag in the gcc!

About the new MAC instruction, it is implemented in a very preliminary way with the OPCODE 7'b1111111 (this works, but it is a very bad decision!). I am checking the possibility of using the p.mac instruction, but at this time the instruction is hand encoded in the mac() function available in stdio.c (i.e. the darkriscv libc). The details about including new instructions and making them work with GCC can be found in reference [5].

The preliminary tests showed, as expected, that the maximum clock decreases to 90MHz and, although it was possible to run at 100MHz with a non-zero timing score and reach a peak performance of 100MMAC/s, the small 32-bit accumulator saturates too fast and requires extra tricks in order to avoid overflows.

The mul operation uses two 16-bit integers and the result is added to a separate 32-bit register, which works as the accumulator. Since the operation is always signed and the sign always uses the MSB, a 15x15 mul produces a 30-bit result which is added to a 31-bit value, so the overflow can be reached after only two MAC operations.
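
A minimal sketch of that MAC datapath follows; only the custom 7'b1111111 opcode and the 16x16-bit multiply feeding a 32-bit accumulator come from the paragraphs above, while the operand wiring and signal names are illustrative assumptions:

module mac_sketch
(
    input                    CLK,
    input             [31:0] IDATA, // current instruction
    input signed      [15:0] A, B,  // 16-bit signed operands (low halves of rs1/rs2, assumed)
    output reg signed [31:0] ACC    // 32-bit accumulator register
);
    // the experimental MAC is encoded with the custom opcode 7'b1111111
    wire MAC = (IDATA[6:0] == 7'b1111111);

    // a signed 16x16 multiply produces up to 31 significant bits, so the
    // 32-bit accumulator can overflow after very few operations unless the
    // operands are pre-shifted (see the awk experiment below)
    always@(posedge CLK)
        if(MAC) ACC <= ACC + (A * B);
endmodule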

In order to avoid overflows, it is possible to shift the input operands. For example, in the case of G711 w/ u-law encoding, the effective resolution is 14 bits (13 bits for the integer and 1 bit for the sign), which means that a 13x13-bit mul is used and a 26-bit result is produced to be added to a 31-bit integer, enough to run 32 MAC operations before overflow (in this case, when the ACC reaches a negative value):

# awk 'BEGIN { ACC=2**31-1; A=2**13-1; B=-A; for(i=0;ACC>=0;i++) print i,A,B,A*B,ACC+=A*B }'
0 8191 -8191 -67092481 2080391166
1 8191 -8191 -67092481 2013298685
2 8191 -8191 -67092481 1946206204
...
30 8191 -8191 -67092481 67616736
31 8191 -8191 -67092481 524255
32 8191 -8191 -67092481 -66568226

Is this theory correct? I am not sure, but looks good! :)

As a complement, I included in stdio.c the support for the GCC helper functions behind the native *, / and % (mul, div and mod) operations with 32-bit signed and unsigned integers, which means true 32x32-bit operations producing 32-bit results. The code was derived from an old 68000-related project (as most of the code in stdio.c) and, although it is not so fast, I guess it is working. Once the MAC instruction is better defined in syntax and features, I think it is possible to optimize the mul/div/mod in order to use it and increase the performance.

Here are some additional performance results (synthesis only, 3-stage version) for other Xilinx devices available in the ISE, for speed grade 2:

  • Spartan-6: 100MHz (measured 70MIPS w/ gcc -O1)
  • Artix-7: 178MHz
  • Kintex-7: 225MHz

For speed grade 3:

  • Spartan-6: 117MHz
  • Artix-7: 202MHz
  • Kintex-7: 266MHz

The Kintex-7 can theoretically reach 186MIPS w/ gcc -O1.

This performance is reached without the MAC and THREADING options activated. Thanks to the RV32E option, synthesis for the Spartan-3E is now possible, resulting in 95% LUT occupation in the case of the low-cost 100E model and a 70MHz clock (synthesis only and speed grade 5):

  • Spartan-3E: 70MHz

For the 2-stage version and speed grade 2, we have less impact from the pipeline flush (20%), no impact on loads and some impact on the clock due to the use of a 2-phase clock:

  • Spartan-6: 56MHz (measured 47MIPS w/ -O1)

About the compiler performance, measured from boot until the prompt with the 3-stage pipeline core at 100MHz and no interrupts (rom and ram measured in 32-bit words):

  • gcc w/ -O3: t=289us rom=876 ram=211
  • gcc w/ -O2: t=291us rom=799 ram=211
  • gcc w/ -O1: t=324us rom=660 ram=211
  • gcc w/ -O0: t=569us rom=886 ram=211
  • gcc w/ -Os: t=398us rom=555 ram=211

Due to reduced ROM space in the FPGA, the -Os is the default option.

On the other hand, regarding support for Vivado, it is possible to convert the Artix-7 project (Xilinx AC701, available in the ise/boards directory) to Vivado and make some interesting tests. The only problem in the conversion is that the UCF file is not converted, which means that a new XDC file with the pin description must be created.

Vivado is very slow compared to ISE and needs lots of time to synthesize while giving minimal feedback about the performance... but after some weeks of waiting, and lots of empirical calculations, I got some numbers for speed grade 2 devices:

  • Artix7: 147MHz
  • Spartan-7: 146MHz

And one number for speed grade 3 devices:

  • Kintex-7: 221MHz

Although Vivado is far slower and shows pessimistic numbers for the same FPGAs when compared with ISE, I guess Vivado is more realistic and, at least, it supports the new Spartan-7, which shows very good numbers (almost the same as the Artix-7!).

Those values are only for reference. The real values depend on some core options, such as the number of pipeline stages, how the memories are connected, etc. Basically, the best clock is reached by the 3-stage pipeline version (up to 100MHz in a Spartan-6), but it requires at least 1 wait state in the load instruction and 2 extra clocks on taken branches in order to flush the pipeline. The 2-stage pipeline requires no extra wait states and only 1 extra clock on taken branches, but runs at a lower clock (56MHz).

Well, my conclusion after some years of research is that branch prediction solves lots of problems regarding the performance.

Development Tools

About the gcc compiler, I am working with the experimental gcc 9.0.0 for RISC-V. No patches or updates are required for the DarkRISCV other than the -march=rv32i flag. Although the fence*, e* and csr* instructions are not implemented, gcc appears not to use those instructions, so they are not available in the core.

Although it is possible to use the compiler set available on the official RISC-V site, our colleagues from the lowRISC project pointed out a cleverer way to build the toolchain:

https://www.lowrisc.org/blog/2017/09/building-upstream-risc-v-gccbinutilsnewlib-the-quick-and-dirty-way/

Basically:

git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
mkdir combined
cd combined
ln -s ../newlib-cygwin/* .
ln -sf ../binutils-gdb/* .
ln -sf ../gcc/* .
mkdir build
cd build	
../configure --target=riscv32-unknown-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib --with-arch=rv32ima --with-abi=ilp32 --prefix=/usr/local/share/gcc-riscv32-unknown-elf
make -j4
make
make install
export PATH=$PATH:/usr/local/share/gcc-riscv32-unknown-elf/bin/
riscv32-unknown-elf-gcc -v

and everything will magically work! (:

In case you have no success building the compiler, have no interest in changing the firmware or are just curious about the darkriscv running in an FPGA, the project includes the compiled ROM and RAM images, so it is possible to examine all derived objects, sources and correlated files generated by the compiler without needing to compile anything.

Finally, since the DarkRISCV is not yet fully tested, sometimes it is a very good idea to compare the code execution with another, stable reference!

In this case, I am working with the project picorv32:

https://github.com/cliffordwolf/picorv32

When I have some time, I will try to create a better organized setup in order to easily test both the DarkRISCV and the picorv32 in the same cache, memory and IO sub-systems, making it possible to select the core according to the desired features, for example, the DarkRISCV for more performance or the picorv32 for more features.

About the software, the most complex issue is making the memory design match the linker layout. Of course, it is a gcc issue and it is not even a problem; in fact, it is just the way the software guys work when linking the code and data!

In the most simplified version, directly connected to blockRAMs, the DarkRISCV is a pure Harvard architecture processor and requires the separation between the instruction and data blocks!

When the cache controller is activated, it provides separate memories for instruction and data, but presents an interface for a more conventional von Neumann memory architecture.

In both cases, a proper designed linker script (darksocv.ld) probably solves the problem!

The current memory map in the linker script is the following:

  • 0x00000000: 4KB ROM
  • 0x00001000: 4KB RAM

Also, the linker maps the IO at the following addresses (a decode sketch follows the list):

  • 0x80000000: UART status
  • 0x80000004: UART xmit/recv buffer
  • 0x80000008: LED buffer
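
On the hardware side, a minimal sketch of the corresponding read decode could look like the module below; only the addresses come from the list above, while the register names and the exact select logic are assumptions:

module iodecode_sketch
(
    input  [31:0] DADDR,     // data address from the core
    input  [31:0] UART_STAT, // 0x80000000: UART status
    input  [31:0] UART_FIFO, // 0x80000004: UART xmit/recv buffer
    input  [31:0] LED_REG,   // 0x80000008: LED buffer
    input  [31:0] RAM_RDATA, // data coming from the 4KB RAM
    output [31:0] DATAI      // read data returned to the core
);
    // DADDR[31] separates RAM (0x0000xxxx) from IO (0x8000xxxx),
    // DADDR[3:2] selects the register inside the small IO page
    wire IO = DADDR[31];

    assign DATAI = !IO              ? RAM_RDATA :
                   DADDR[3:2]==2'd0 ? UART_STAT :
                   DADDR[3:2]==2'd1 ? UART_FIFO :
                                      LED_REG;
endmodule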

The RAM contains the .data area, the .bss area (after the .data, initialized with zeros), the .rodata area and the stack area at the end of the RAM.

Although the RISC-V is defined as little-endian, it appears to be easy to change this configuration in GCC. In that case, it is assumed that all variables are stored in big-endian format. Of course, the change requires a similar change in the core itself, which is not so complex, since it affects only the load and store instructions. In the future, I will try to test a big-endian version of GCC and darkriscv, in order to evaluate possible performance enhancements in the case of network-oriented applications! :)
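
Just to illustrate how localized such a change would be, the sketch below shows the 32-bit byte swap that the load/store path (or a selected IO/memory region, as discussed later) would need; this is an illustrative sketch, not code from the core:

module bswap_sketch
(
    input  [31:0] DIN,  // word as stored in the little-endian memory
    output [31:0] DOUT  // same word, presented big-endian to the core
);
    // swapping the four bytes is essentially the only datapath change
    // required in the load/store logic
    assign DOUT = { DIN[7:0], DIN[15:8], DIN[23:16], DIN[31:24] };
endmodule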

Finally, the last update regarding the software included a new option to build an x86 version in order to help development by testing exactly the same firmware on x86.

In a preliminary way, it is possible to build gcc for RV32E with the following configuration:

git clone --depth=1 git://gcc.gnu.org/git/gcc.git gcc
git clone --depth=1 git://sourceware.org/git/binutils-gdb.git
git clone --depth=1 git://sourceware.org/git/newlib-cygwin.git
mkdir combined
cd combined
ln -s ../newlib-cygwin/* .
ln -sf ../binutils-gdb/* .
ln -sf ../gcc/* .
mkdir build
cd build
../configure --target=riscv32-embedded-elf --enable-languages=c --disable-shared --disable-threads --disable-multilib --disable-gdb --disable-libssp --with-newlib  --with-arch=rv32e --with-abi=ilp32e --prefix=/usr/local/share/gcc-riscv32-embedded-elf
make -j4
make
make install
export PATH=$PATH:/usr/local/share/gcc-riscv32-embedded-elf/bin/
riscv32-embedded-elf-gcc -v

Currently, I found no easy way to make GCC build big-endian code for RISC-V. Instead, the easy way is to make the endian switch directly in the IO device or in the memory region.

Since it is not so easy to build GCC on some machines, I left on a public share the source and the pre-compiled binary set of GCC tools for RV32E:

https://drive.google.com/drive/folders/1GYkqDg5JBVeocUIG2ljguNUNX0TZ-ic6?usp=sharing

As far as I remember it was compiled on a Slackware Linux or something like that; anyway, it worked fine on Windows 10 w/ WSL and on other Linux-like environments.

Development Boards

Currently, the following boards are supported:

  • Avnet Microboard LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
  • Xilinx AC701 A200: equipped with a Xilinx Artix-7 A200 running at 90MHz
  • QMTech SDRAM LX16: equipped with a Xilinx Spartan-6 LX16 running at 100MHz
  • QMTech NORAM S15: equipped with a Xilinx Spartan-7 S15 running at 100MHz
  • Lattice Brevia2 XP2: equipped with a Lattice XP2-6 running at 50MHz
  • Piswords RS485 LX9: equipped with a Xilinx Spartan-6 LX9 running at 100MHz
  • Digilent S3 Starter Board: equipped with a Xilinx Spartan-3 S200 running at 50MHz

The speeds are related to the clocks available on the boards, and different clocks may be generated by programming a clock generator. The Spartan-6 is found in most boards and the core runs fine at ~100MHz, regardless of the frequency of the main oscillator (typically 50MHz).

All Xilinx-based boards typically support a 115200 bps UART for console, some LEDs for debug and on-chip 4KB ROM and 4KB RAM (as well as a RESET button to restart the core and the DEBUG signals for an oscilloscope).

In the case of the QMTECH boards, which include neither the JTAG nor the UART/USB port, an external USB/UART converter and a low-cost JTAG adapter solve the problem easily!

The Lattice Brevia is clocked by the on-board 50MHz oscillator, with the UART operating at 115200 bps and the LED and DEBUG ports wired to the on-board LEDs.

The Digilent Spartan-3 Starter Board is a very useful board to work as a reference for LUT4 technology, making it possible to improve the support in the future for alternative low-cost LUT4 FPGAs.

On the software side, a small shell is available with some basic commands:

  • clear: clear display
  • dump : dumps an area of the RAM
  • led : change the LED register (which turns on/off the LEDs)
  • timer : change the timer prescaler, which affects the interrupt rate
  • gpio : change the GPIO register (which changes the DEBUG lines)

The purpose of the shell is to provide some basic test features that give a go/no-go status about the current hardware.

Useful memory areas:

  • 4096: the start of RAM (data)
  • 4608: the start of RAM (data)
  • 5120: empty area
  • 5632: empty area
  • 6144: empty area
  • 6656: empty area
  • 7168: empty area
  • 7680: the end of RAM (stack)

Since the DarkRISCV uses separate instruction and data buses, it is not possible to dump the ROM area. However, this limitation is not present when the HARVARD option is activated, because the core is then constructed in a way that the ROM bus is connected to one port of a dual-ported memory and the RAM bus is connected to a different port of the same dual-ported memory. From the DarkRISCV point of view, they are fully separated and independent buses, but in reality they are in the same memory, which makes it possible for the data bus to change the area where the code is stored. With this feature, it will be possible in the future to create code loadable from the FLASH memory! :)
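
A minimal sketch of that arrangement, assuming an inferred dual-port blockram (port and signal names are illustrative, not the actual darksocv.v code):

module dpram_sketch
(
    input             CLK,
    // port A: instruction bus (read-only from the core's point of view)
    input      [11:2] IADDR,
    output reg [31:0] IDATA,
    // port B: data bus (read/write), reaching the same storage
    input      [11:2] DADDR,
    input      [31:0] DATAO,
    input             WE,
    output reg [31:0] DATAI
);
    // one single memory array behind two independent buses: the core still
    // sees a Harvard interface, but a store through the data port can change
    // the area where the code is stored
    reg [31:0] mem [0:1023];

    always@(posedge CLK) IDATA <= mem[IADDR];

    always@(posedge CLK)
    begin
        if(WE) mem[DADDR] <= DATAO;
        DATAI <= mem[DADDR];
    end
endmodule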

FuseSoC support

Last Xmas (2022) our colleague Lucas Teske added support for FuseSoC to the darkriscv... I am not very aware of how it works and how to use it, but it is supposed to handle the build tools automatically! It is the same tool used by SERV and I used it to set some kilocore records in the past... in order to make it work with the darkriscv, please try:

  • fusesoc run --target=qmtech_artix7_a35 darklife:darkriscv:darksocv

At this moment, not all boards are really supported yet. Supported boards are:

  • Colorlight i9
  • Colorlight i5
  • Lattice iCE40 Devkit
  • QMtech Artix 7 (Vivado)

Creating a RISCV from scratch

I found that some people are very skeptical about the possibility of designing a RISC-V processor in one night. Of course, it is not as easy as it appears and, in fact, it requires a lot of experience, planning and luck. Also, the fact that the processor correctly runs a few instructions and puts some garbage on the serial port does not really mean that the design is perfect; instead, you will need lots and lots of debug time in order to fix all the hidden problems.

Just in case, I found a set of online videos from my friend (Lucas Teske) that shows the design of a RISC-V processor from scratch (playlist with 9 videos):

Alternatively, the original videos are available on Twitch:

Unfortunately the video set is currently in Portuguese only, and there are a lot of parallel discussions about technology, including fixing Teske's notebook live! I hope in the future it will be possible to edit the video set and, maybe, create English subtitles.

About the processor itself, it is a microcode-oriented concept with a classic von Neumann architecture, designed to more easily support different ISAs. It is really very different from the traditional RISC cores that we find around! Also, it includes a very good eco-system around open-source tools, such as Icarus, Yosys and GTKWave!

Although not finished yet (95% done!), I think it is very illustrative of the RISC-V design:

  • rv32e instruction set: very reduced (37 instructions) and very orthogonal bit patterns (6)
  • rv32e register set: 16x32-bit register bank and a 32-bit program counter
  • rv32e ALU with basic operations for reg/imm and reg/reg instructions
  • rv32e instruction decode: very simple to understand, very direct to implement
  • rv32e software support: the GCC support provides an easy way to generate code and test it!

Teske's proposal is not to design the fastest RISC-V core ever (we already have lots of fast cores with CPI ~ 1, such as the darkriscv, vexriscv, etc), but to create a clean, reliable and comprehensible RISC-V core.

You can check the code in the following repository:

Academic Papers and Applications

In a funny way, the DarkRISCV appears in some academic papers, sometimes in a comparative way, sometimes as a laboratory mouse.

Regarding real-world applications, standard embedded C code typically runs very well on the DarkRISCV. Some examples of applications currently in use:

  • microcontroller programmers (JTAG and other complex protocols)
  • data compression/decompression (LZ streams)
  • cryptography (RSA and SHA256, requires hardware accelerators)
  • digital signal processing (requires mac instruction)

In the case of RSA, the simple inclusion of a pipelined 32x32-bit multiplier mapped via IO (i.e. not the M extension; instead, an IO-mapped register accessible via load/store instructions) increased the RSA performance by a factor of 20x when compared to the bare RV32E instruction set. In the case of SHA256, a complete SHA256 accelerator in hardware mapped via IO can transform seconds of processing into a few milliseconds. In both cases, there is no need for complex instruction integration in the core and no need to wait -- the core and the accelerator can work in parallel.
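
A hedged sketch of the IO-mapped multiplier idea: the core stores the two operands to IO addresses and loads the product back a few clocks later, with no ISA change at all. The register layout and the 2-clock latency are assumptions for illustration, not the actual accelerator used for the RSA measurement:

module iomul_sketch
(
    input             CLK,
    input             WE,    // IO write strobe from the SoC address decode
    input             SEL_A, // store to the "operand A" IO address
    input             SEL_B, // store to the "operand B" IO address
    input      [31:0] DATAO, // data written by the core
    output reg [63:0] PROD   // 64-bit product, read back via two loads
);
    reg [31:0] A, B;
    reg [63:0] P1;           // pipeline register: assumed 2-clock latency

    always@(posedge CLK)
    begin
        if(WE && SEL_A) A <= DATAO;
        if(WE && SEL_B) B <= DATAO;

        // the multiplier runs in parallel with the core, which is free to
        // execute other instructions until it loads the result back
        P1   <= A * B;
        PROD <= P1;
    end
endmodule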

Performance Comparisons

I tried to prepare a fair performance comparison between the DarkRISCV and different cores, but it is not as easy as it appears! The first problem is locating HDL versions of each core, since the tools require Verilog or VHDL files in order to build something. The second problem is deciding what to compile: there are lots of combinations of top-level blocks, with different peripherals and concepts. In this case, I included only the core. The third problem regards the core configuration: it is not easy or clear how to configure them for minimum area or maximum speed, so I just used the default configuration for all cores. In short, I solved the problem in a dummy way (which reflects reality 95% of the time).

Once I had separate builds for each core, there was the final problem: how to analyse the results and rank the cores. My option was to calculate how many MIPS are possible in a fixed FPGA, in this case the Kintex-7 K420 with 260600 6-input LUTs. Different cores require different amounts of LUTs; just divide the total number by the required amount and we have the theoretical number of cores per FPGA (peak cores/FPGA). Also, different cores have different theoretical peak IPC (guessed numbers) and, according to the synthesis tool in the default setup, will run at different maximum frequencies. Since we know the number of instructions per clock and the maximum number of clocks per second (the maximum frequency), we have the maximum number of instructions per second (peak MIPS) per core. Since we also know the maximum number of cores, we can calculate the maximum peak MIPS per FPGA.

The following list is far from complete, but it is my suggestion to compare different cores:

Core		LUT	FF	DSP	BRAM	IPC	MFreq	Ncores	MIPS/k7	REPO
DarkRISCV	1177	246	0	0	1	150	221	33232	https://github.com/darklife/darkriscv
VexRISCV	1993	1345	4	5	1	214	130	28013	https://github.com/m-labs/VexRiscv-verilog
PicoRV32	1291	568	0	1	1/4	309	201	15630	https://github.com/cliffordwolf/picorv32
SERV		217	174	0	0	1/32	367	1200	13788	https://github.com/olofk/serv
RPU		2943	1103	12	1	1	111	88	9905	https://github.com/Domipheus/RPU
UE RISC-V	3676	2289	4	0	1	124	70	8811	https://github.com/ultraembedded/riscv
UE BiRISC-V	15667	6324	4	0	2	87	16	2917	https://github.com/ultraembedded/biriscv

Are these results fair enough? From my point of view, as long as the conditions are the same, yes. Are these results true enough? From my point of view, maybe. At least two cases do not match: the DarkRISCV reaches up to 240MHz in the same FPGA when the top level is included in the build. I am not sure why the core-only build results in a smaller maximum frequency, but the point is to find a way to compare different cores, so it is okay. The second case is the SERV, which shows a maximum frequency of 367MHz and up to 1200 cores/FPGA. In fact, I tested up to 1000 SERV cores in that FPGA and it appears to be possible to fit between 1100 and 1200 cores as a result of optimizations across the multi-core hierarchy. The bad news is that the maximum clock barely reached 128MHz per core. Again, the point is to find a way to compare different cores, so it is okay.

For sure these numbers will be very useful for future designers and the message is clear: keep it simple! Less area per core means more cores in the same area and, in some cases, better performance. In most cases, less area also means better clock performance, but the better clock is useful only when the IPC is around 1, which is not so easy to keep. Just as an example, the DarkRISCV can truly reach an IPC of ~1, but the code must be optimized by hand and the top level must be changed so that there is no latency regarding the BRAM. In a more general way, with less effort, the compiler and a more standard top level can keep the IPC around 0.7, which is good enough for most applications. So, keep your expectations very, very low! :)

Acknowledgments

Special thanks to my old colleagues from the Verilog/VHDL/IT area:

  • Paulo Matias (jedi master and verilog/bluespec/riscv guru)
  • Paulo Bernard (co-worker and verilog guru)
  • Evandro Hauenstein (co-worker and git guru)
  • Lucas Mendes (technology guru)
  • Marcelo Toledo (technology guru)
  • Fabiano Silos (technology guru)

Also, special thanks to the "friends of darkriscv" that found the project on the internet and contributed in any way to make it better:

  • Guilherme Barile (technology guru and first guy to post anything about the darkriscv! [2]).
  • Alasdair Allan (technology guru, posted an article about the darkriscv [3])
  • Gareth Halfacree (technology guru, posted an article about the DarkRISCV [4])
  • Ivan Vasilev (ported DarkRISCV for Lattice Brevia XP2!)
  • timdudu from github (fix in the LDATA and found a bug in the BCC instruction)
  • hyf6661669 from github (lots of contributions, including the fixes regarding the AUIPC and S{B,W,L} instructions, ModelSIM simulation, the memory byte select used by store/load instructions and much more!)
  • zmeiresearch from github (support for Lattice XP2 Brevia board)
  • All other colleagues from github that contributed with fixes, corrections and suggestions.

Finally, thanks to all people who directly and indirectly contributed to this project, including the company I work for and all colleagues that tested the DarkRISCV.

References

[1] https://www.amazon.com/RISC-V-Reader-Open-Architecture-Atlas/dp/099924910X
[2] https://news.ycombinator.com/item?id=17852876
[3] https://blog.hackster.io/the-rise-of-the-dark-risc-v-ddb49764f392
[4] https://abopen.com/news/darkriscv-an-overnight-bsd-licensed-risc-v-implementation/
[5] http://quasilyte.dev/blog/post/riscv32-custom-instruction-and-its-simulation/
[6] https://github.com/riscv/riscv-pk/blob/master/bbl/riscv_logo.txt

darkriscv's People

Contributors

erjanmx, gsilos, jeshu54, kingsumos, llturtle22ll, phuclv90, piso77, racerxdl, robertoalcantara, samsoniuk, splinedrive, vgegok, zmeiresearch

darkriscv's Issues

[BUG] boot.s printf data@ is "_heap+4"

Yes, it's me again!

When I use HARVARD, I get 'boot0: text@0 data@10224 stack@16384 (6160 bytes free)'.

But, actually, the data is at 10228:

000027f0 <heapcheck>:
    27f0:	deadbeef          	jal	t4,fffdddda <io+0x7ffdddda>

0x27f0+4=10228

So I revised it as follows:

_normal_boot:

	la	sp,_stack
	la	gp,_global

	call 	banner

	la	a3,_stack
	la	a2,_heap
+	addi	a2,a2,4
	sub	a4,a3,a2
	la	a1,_boot
	la	a0,_boot0msg
	call	printf

	call	main

	j	_normal_boot

But I think text@0 and data@10228 are also unreasonable:
The text@: do you mean the size of the text? If so, it should not be 0!
The data@: do you mean the size of the data? If so, it should be the end address of the data minus the start address, like 10228-0x2000 (the address of RAM(0)).

I know this may be complicated in a von Neumann structure. I think it should be like this:

HARVARD:
`'boot0: text@textsize data@datasize stack@_stack (ROM  <LENGTH(ROM)-textsize> bytes free, RAM  <LENGTH(RAM)-datasize> bytes free)'`
Neumann:
`'boot0: text@textsize data@datasize stack@_stack (MEM <stack-textsize-datasize> bytes free)'`

In addition, if the RAM storage is almost full, the stack may cause the program to run incorrectly.

Forgive my attention to detail. : )

I think there is a mistake in the RTL code

In darkriscv.v, the following code is used to choose the data that will be written into the register file:

 REG[DPTR] <=   RES ? RESET_SP  :        // reset sp
                       HLT ? REG[DPTR] :        // halt
                     !DPTR ? 0 :                // x0 = 0, always!
                     AUIPC ? NXPC+SIMM :
                      JAL||
                      JALR ? NXPC :
                       LUI ? SIMM :
                       LCC ? LDATA :
                       MCC ? MDATA : 
                       RCC ? RDATA : 
                       CCC ? CDATA : 
                             REG[DPTR];

For instruction "AUIPC", the data should be PC + SIMM. But I only tested it by using the core as a 2-stage CPU without cache. I hope you can have a look.

Something about riscv-tests from https://github.com/riscv

I'm not a native English speaker so I am sorry for my poor expression :).

I have tested darkriscv by running test code from https://github.com/riscv/riscv-tests/tree/master/isa/rv32ui. I think there are 2 bugs in the current RTL code:

  1. The "AUIPC" instruction.
    As mentioned in my issue #8, the data that will be written into the regfile when executing "AUIPC" should be "PC + SIMM" instead of "NXPC + SIMM".
REG[DPTR] <=   RES ? RESET_SP  :        // reset sp
                       HLT ? REG[DPTR] :        // halt
                     !DPTR ? 0 :                // x0 = 0, always!
                     AUIPC ? PC+SIMM :   //original data is NXPC + SIMM
                      JAL||
                      JALR ? NXPC :
                       LUI ? SIMM :
                       LCC ? LDATA :
                       MCC ? MDATA : 
                       RCC ? RDATA : 
                       CCC ? CDATA : 
                             REG[DPTR];
  1. The "STORE (SW, SH, SB)" instruction.
    I think we must use a signal DATA_MASK to describe which byte the "STORE" instruction wants to write into the RAM. Otherwise unexpected data will be written. So I have written my extral code in "darkriscv.v" like this:
output [3:0] DATA_MASK,

assign DATA_MASK = FCT3==0 ? (DADDR[1:0]==3 ? 4'b1000 :   // sb
                              DADDR[1:0]==2 ? 4'b0100 :
                              DADDR[1:0]==1 ? 4'b0010 :
                                              4'b0001) :
                   FCT3==1 ? (DADDR[1]==1   ? 4'b1100 :   // sh
                                              4'b0011) :
                                              4'b1111;    // sw

And in "darksoc.v":

always @(posedge CLK)
begin
    if(WR & !DADDR[31])
    begin
        if(DATA_MASK[0])
            RAM[DADDR[12:2]][0 * 8 + 7: 0 * 8] <= DATAO[0 * 8 + 7: 0 * 8];
        if(DATA_MASK[1])
            RAM[DADDR[12:2]][1 * 8 + 7: 1 * 8] <= DATAO[1 * 8 + 7: 1 * 8];
        if(DATA_MASK[2])
            RAM[DADDR[12:2]][2 * 8 + 7: 2 * 8] <= DATAO[2 * 8 + 7: 2 * 8];
        if(DATA_MASK[3])
            RAM[DADDR[12:2]][3 * 8 + 7: 3 * 8] <= DATAO[3 * 8 + 7: 3 * 8];
    end
end

I have used Xilinx ISE 14.7 for synthesis and downloaded the bitstream into a Xilinx Spartan-6 LX16 for testing. I used RAM and ROM IPs like this:

wire [3:0] ram_wea = {WR, WR, WR, WR} & DATA_MASK;
ram_ip ram_ip_u0 (
  .clka(system_clk), // input clka
  .wea(ram_wea), // input [3 : 0] wea
  .addra(ram_addr), // input [10 : 0] addra
  .dina(DATAO), // input [31 : 0] dina
  .douta(DATAI) // output [31 : 0] douta 
);


rom_ip rom_ip_u0 (
  .clka(system_clk), // input clka
  .addra(rom_addr), // input [9 : 0] addra
  .douta(IDATA) // output [31 : 0] douta
);

Luckily, after I fixed the 2 bugs above, all tests in https://github.com/riscv/riscv-tests/tree/master/isa/rv32ui passed, except for "fence_i.S". The clock frequency I used for the test was 75MHz.

I also have to point out that I only ran the test with darkriscv set as a 2-stage pipeline CPU without cache, which means that the RAM/ROM and the core use opposite clock edges. I am still a beginner in digital IC design, so I don't know whether it is practical for a CPU to operate like that.
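
For reference, the opposite-edge arrangement described above boils down to something like the following (a minimal sketch with hypothetical module and signal names, not the actual darksocv code):

// Minimal sketch: instruction memory read on the falling edge while the core
// runs on the rising edge, so IDATA is stable before the core's next rising
// edge and no wait state is required.
module sketch_rom_negedge
(
    input             CLK,
    input      [31:0] IADDR,   // instruction address from the core
    output reg [31:0] IDATA    // instruction data back to the core
);
    reg [31:0] ROM [0:1023];   // contents loaded elsewhere (e.g. via $readmemh)

    always @(negedge CLK)
        IDATA <= ROM[IADDR[11:2]];  // opposite clock edge w.r.t. the core
endmodule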

I'm sorry that I know little about cache, so I couldn't test it as a 3-stage pipeline cpu with I$/D$ :(.

Best
HYF

Why set two general-purpose register groups

Why are two general-purpose register groups designed in darkriscv.v,

instead of the single register file described in the RISC-V specification?

Is it to support multi-core?

reg [31:0] REG1 [0:31];	// general-purpose 32x32-bit registers (s1)
reg [31:0] REG2 [0:31];	// general-purpose 32x32-bit registers (s2)

I tried to use a single general-purpose register set and it still works.

Problem in src/Makefile

In src/Makefile, you should set ASFLAGS = -march=rv32i. For example, if I add some code (whose related operations are not included in rv32i) like this in main.c:
image

Then the riscv-gcc will try to use the standard library function; without ASFLAGS = -march=rv32i, some errors occur:
image

With ASFLAGS = -march=rv32i, the compilation succeeds, and we can see the desired asm code in darksocv.lst like this:
image

START HERE!

Comments, suggestions and discussions can be posted here:

https://github.com/darklife/darkriscv/discussions

For effective changes in the code (support for new boards, applications, etc), please open an issue first to discuss the topic and, if indicated by the issue discussion, submit a pull request!

Works on DE0 Nano Altera Cyclone IV

I could not find any reference to somebody trying this on the Altera brand of FPGA. (I did not want to buy another kind of FPGA and download yet another gigantic development environment.) The pre-built demo program works fine at 50MHz. In the config.vh, I turned off THREADING - just to keep things simple. The only change I made was to install the PLL clock generator code and mess with the RES (reset) to connect it to a button on the DE0 Nano board.

BTW I first did this by stripping out most of the 'ifdef' so that I could follow the code with more ease. This was interesting since I made errors which resulted in much fun with ModelSim. Learned much more than if it had just worked.

Thanks for this cool project. Sorry to place this in the ISSUES but I could not figure out another good place for a comment.

rename darkrisc to norisk

New projects should not be using "dark" in their names. Using semantically more meaningful terms is better than terms that conjure up images of humans forced to work in dark places.

Suggested alternates:

norisk
riskless

Instruction for compiling on Arty board

Hi Samsoniuk,

Sorry, this is not an issue, but I couldn't figure out another way to ask.

I was wondering if you could help with instructions on how to build this project for the Arty board from Digilent. I was able to pull in the Verilog and implement the design in Vivado, but I needed help with running the Hello World program.

Thanks,
Balaji

stdio.c bugs

Bugs to fix:

putx: change ranges to <=
strtok: set nxt to null when done
xtoi: (val&~0x20)-'a'+10 ???

replace all defines by parameters

Typical synthesis tools treat the defines as global, so a multi-core environment with non-symmetric configurations does not work... also, the defines may conflict with legacy defines already in use. For this reason, it is highly recommended to replace all defines with parameters, as sketched below.
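
A minimal sketch of the idea (hypothetical module names, not the actual darkriscv interface): a per-instance parameter replaces the global define, so each core of a multi-core build can be configured independently:

// Before: a global define shared by every instance
// `define THREADS 2

// After: a parameter with a default value, overridable per instance
module sketch_core #(parameter THREADS = 2) (input CLK, input RES);
    // ... core logic would use THREADS instead of `THREADS ...
endmodule

module sketch_soc (input CLK, input RES);
    // two cores with different, non-symmetric configurations
    sketch_core #(.THREADS(1)) core0 (.CLK(CLK), .RES(RES));
    sketch_core #(.THREADS(4)) core1 (.CLK(CLK), .RES(RES));
endmodule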

Expected data bus latency?

I'm testing this core in simulation using code generated from riscv32-unknown-elf-gcc with -march=rv32i with no optimizations. All my instruction and data bus accesses should have a latency of 1 clock cycle. I've commented the assembly code for what I expect to happen, and then highlighted the area where I think it's failing, as well as put a marker at the simulation readout:

Fail

So I am expecting the opcode at 0x8f8 to read from address 0xab030000, which should return a value of 0x00000000. I can see the read being performed on the data bus at address 0xab030000, and then 1 clock cycle later the read data on the bus is 0x00000000 as expected. However, at that point, the incorrect value is loaded into a4. It looks like it loaded a4 with the value of the read data from the exact same clock cycle that 0xab030000 appeared on the address lines, which means maybe it's expecting instantaneous return data?

Is this a bug, or does this core only support zero latency reads?
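
For context, the difference between the two memory models looks like this (a sketch with hypothetical module names; whether the core really requires the first, combinational form is exactly the question above):

// A combinational ("zero latency") read returns the data in the same cycle
// the address is presented:
module sketch_ram_comb (input [31:0] DADDR, output [31:0] DATAI);
    reg [31:0] RAM [0:1023];
    assign DATAI = RAM[DADDR[11:2]];   // data valid immediately
endmodule

// A registered read adds one cycle of latency, which then has to be hidden
// by an opposite clock edge or by stalling the core for one cycle:
module sketch_ram_reg (input CLK, input [31:0] DADDR, output reg [31:0] DATAI);
    reg [31:0] RAM [0:1023];
    always @(posedge CLK)
        DATAI <= RAM[DADDR[11:2]];     // data valid on the next cycle
endmodule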

"srai" instruction doesn't work on ISE/Spartan 6

For some reason the ">>>" operator

    FCT7[5] ? U1REG>>>U2REGX[4:0] : U1REG>>U2REGX[4:0];

is synthesized into a logical shift right by ISE 14.7. Replacing U1REG with S1REG didn't help. I was able to work around this by

wire signed [31:0] RMDATA = 
     ...
    FCT7[5] ? $signed(S1REG>>>U2REGX[4:0]) : U1REG>>U2REGX[4:0];

mixed little/big-endian

Although RISC-V is by default a little-endian architecture, the DarkRISCV core is designed almost entirely as a big-endian processor, which obviously does not work well in some conditions and requires reversing the byte order of the binaries generated by gcc in order to load and execute the code. As a result, the DarkRISCV runs big- and little-endian code without changes, which obviously is not correct.

The reversal of the binaries was done initially with hexdump, but it was replaced by a proper option in objcopy, so that the xxd tool generates hex code that matches the binary. The fix for this issue requires adding the proper configuration so that byte reversal is no longer needed in objcopy, and making the DarkRISCV support big- and little-endian binaries, although not at the same time.

Additional problems: although the core appears to be working fine, there is a small issue regarding the integers converted from the UART input: they are stored as little-endian numbers. Also, it appears that the mac() function was affected by the endian changes (the instruction works, but the parameter order in the stack appears to be reversed).
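
For illustration, the byte-order reversal currently done by objcopy amounts to swapping the four bytes of each 32-bit word, something like this sketch (not the project's actual code):

// Sketch: swap the byte order of a 32-bit word (little-endian <-> big-endian),
// the same transformation currently applied to the binaries by objcopy.
module sketch_bswap (input [31:0] DIN, output [31:0] DOUT);
    assign DOUT = { DIN[7:0], DIN[15:8], DIN[23:16], DIN[31:24] };
endmodule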

Issue in LDATA

In the LDATA expression, for byte loads (FCT3==0||FCT3==4), the high bits should be ALL1[31:8]/ALL0[31:8]; currently they are ALL1[31:24]/ALL0[31:24] (see the sketch below).
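
A minimal sketch of the intended extension for byte loads (hypothetical module and signal names, not the actual LDATA expression):

// Sketch: sign/zero extension for a byte load.
// BYTE is the byte already selected from the data bus; SIGNED is 1 for LB.
module sketch_byte_load (input [7:0] BYTE, input SIGNED, output [31:0] LDATA);
    assign LDATA = { {24{SIGNED & BYTE[7]}}, BYTE };  // bits 31:8 are all extension
endmodule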

Suggestion for higher frequency

In #12, you said that on the Artix-7 200T, the darkriscv can barely reach 90MHz. I want to give some suggestions:

  1. I have found that you use opposite clock edges for the RAM/ROM and the CPU in the 2-stage configuration. That will slow down the speed because the "Load" instruction must get data from the RAM and write it into the registers. I suggest implementing a 3-stage configuration without I/D caches. As far as I know, many low-end ARM cores, such as the Cortex M0, M1, M3 and M4, don't have caches. They only have some embedded SRAM and FLASH, and they are designed for simple application scenarios which don't need much room for user code and data. Therefore caches might be unnecessary and dozens of KBytes of SRAM are enough.

image

  2. Use the latest Vivado for synthesis and implementation instead of ISE 14.7. I find that the results from Vivado are better than ISE for Xilinx 7-series FPGAs.

Enabling __INTERACTIVE__ option when running make file

Hello,

I am using darkriscv in a project for school and am in the process of writing software testbenches. I would like to be able to include scanf functions in the main.c file, but it does not seem to be working. Is there a setting somewhere I could change to enable the interactive option/allow scanf function to be used?

Thank you

Compiling error

Hello @samsoniuk,

First of all thank you for this amazing project. I am new to this field and currently pursuing my masters. I want to develop my skills in VLSI and someday make a career out of it.

I am facing some issues while compiling the environment. Please see below the error.

Screen Shot 2020-07-09 at 3 28 47 PM

Error: Command not found.

As mentioned in the readme, I made some changes to the makefile to choose only the default paths and boards.

My run did not generate the same results as in the screenshot in the readme.

Screen Shot 2020-07-09 at 3 32 53 PM

I am unable to find the reason of this error. Your help is much appreciated.

Best,
Avro Mukherjee

Running CoreMark benchmark in DarkRISCV!

I am checking some suggestions and updates for DarkRISCV from @vgegok:

vgegok@ca04bff

There are some open points, but the most interesting update is the support for the CoreMark application! It appears to work, and the result of 0.925 CoreMark/MHz is comparable to other RISC-V cores listed here:

https://www.esa.informatik.tu-darmstadt.de/assets/publications/materials/2019/heinz_reconfig19.pdf

Well, there are some fixes needed in order to make it work:

a) it appears that the boot.s is missing in your version:

$ make
make -C src all
make -C darksocv all
make[2]: *** No rule to make target `../common/boot.s', needed by `../common/boot.o'.  Stop.
make[1]: *** [all] Error 2
make: *** [all] Error 2
$ find . -name boot.s
$ 

I got it from my version, but I made some changes (see below).

b) there is no darksocv.lds:

/usr/local/share/gcc-riscv32-embedded-elf/bin//riscv32-embedded-elf-as -march=rv32e -c ../common/boot.s -o ../common/boot.o
make[2]: *** No rule to make target `darksocv.lds', needed by `darksocv.ld'.  Stop.
make[1]: *** [all] Error 2
make: *** [all] Error 2

It is probably because of the .gitignore... I got it from the coremark.lds.

c) The lds is asking for an include/io.o, but it is better to use the following syntax:

.io : { *io.o(COMMON) } > IO

So it will need no paths inside the lds...

d) I think the new src/ directory structure is good, but I think using "darklibc" instead of "common" sounds better! Maybe I can split the stdio.c into different files, so it is possible to include only what is needed for specific applications.

e) I moved the banner and printf from boot.s to main.c, in order to remove the putchar/printf dependency in the boot.s.

f) Regarding the coremark app, I found that it is too large:

WARNING: ../rtl/darksocv.v:250: $readmemh(../src/darksocv.mem): Too many words in the file for the requested range [0:4095].
But you mentioned it in the README.md, so when the memory size was adjusted, I got the correct results... in the case of the Kintex-7 FPGA, I got the following results:

CoreMark start in 92932555 us
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks     : 18537518
Total time (secs): 18
Iterations/Sec   : 222
Iterations       : 4000
Compiler version : GCC9.0.0 20180818 (experimental)
Compiler flags   : -O2 -DPERFORMANCE_RUN=1
Memory location  : STACK
seedcrc          : 0xe9f5
[0]crclist       : 0x%x
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x65c5
Correct operation validated. See README.md for run and reporting rules.
CoreMark finish in 111516081 us

I am not sure how to read it... is it the iterations/second divided by the clock in MHz? In this case, 222 iterations/s at the ~240MHz clock gives roughly 0.925 CoreMarks/MHz. I tested with rv32i too, but it shows almost the same speed. Of course, the performance is pretty deterministic regardless of the clock.

g) I am not very familiar with the CoreMark numbers, so I guess the Dhrystone is much more comprehensive, as it shows a magic number that, divided by another magic number, shows the performance in VAX MIPS... maybe I will try to port it as an application in the new scheme!

@vgegok: I will make some additional changes and update your repo... once we close the open points, we can merge it into the main version! :)

branch issue

Hi,
In darkriscv.v, there are some issues in branch condition BMUX.

// J/B-group of instructions (OPCODE==7'b1100011)
wire BMUX = BCC==1 && (
FCT3==4 ? S1REG>=S2REG : // signed
FCT3==5 ? S1REG<=S2REG : // signed
FCT3==6 ? U1REG>=U2REG : // unsigned
FCT3==7 ? U1REG<=U2REG : // unsigned
FCT3==0 ? U1REG==U2REG :
FCT3==1 ? U1REG!=U2REG :
0);

FCT3==4 is the BLT instruction. BLT takes the branch if rs1 is less than rs2, so it should be FCT3==4 ? S1REG < S2REG : // signed
The same applies when FCT3==5,6,7:
FCT3==5 ? S1REG >= S2REG : // signed BGE
FCT3==6 ? U1REG < U2REG : // unsigned BLTU
FCT3==7 ? U1REG >= U2REG : // unsigned BGEU
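
Putting the corrections together, the whole mux would read as follows (same signal names as the snippet above, shown only as a reference sketch):

// J/B-group branch conditions with the RISC-V semantics applied
// FCT3: 0=BEQ, 1=BNE, 4=BLT, 5=BGE, 6=BLTU, 7=BGEU
wire BMUX = BCC==1 && (
            FCT3==4 ? S1REG <  S2REG :  // BLT  (signed)
            FCT3==5 ? S1REG >= S2REG :  // BGE  (signed)
            FCT3==6 ? U1REG <  U2REG :  // BLTU (unsigned)
            FCT3==7 ? U1REG >= U2REG :  // BGEU (unsigned)
            FCT3==0 ? U1REG == U2REG :  // BEQ
            FCT3==1 ? U1REG != U2REG :  // BNE
            0);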

bus bridge to AXI

Although the DarkRISCV native data bus is very simple, flexible and faster (although not faster and flexible at the same time), the standard bus for FPGAs appears to be converging to the AXI bus... so a bus bridge between the DarkRISCV bus and the AXI bus may increase the possibilities to interconnect the DarkRISCV with the available FPGA IPs, such as DDR controllers, 10GbE ethernet interfaces, etc.
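
A very rough idea of what such a bridge could look like (a sketch only: it assumes native-bus signal names like DADDR/DATAO/DATAI/WR/RD/HLT, single-cycle WR/RD strobes with DADDR/DATAO held stable by the halted core, omits the usual M_AXI_ prefixes, byte strobes and error handling, and is not the project's actual interface):

// Sketch: minimal native-bus to AXI4-Lite master bridge.
module sketch_darkaxi
(
    input             CLK, RES,
    // assumed DarkRISCV native data bus
    input      [31:0] DADDR, DATAO,
    input             WR, RD,
    output     [31:0] DATAI,
    output            HLT,
    // AXI4-Lite master side (M_AXI_ prefixes omitted)
    output reg        AWVALID, WVALID, ARVALID,
    output            BREADY, RREADY,
    output     [31:0] AWADDR, WDATA, ARADDR,
    output      [3:0] WSTRB,
    input             AWREADY, WREADY, ARREADY, BVALID, RVALID,
    input      [31:0] RDATA
);
    reg WPEND = 0, RPEND = 0;   // transaction in flight

    assign AWADDR = DADDR;      // DADDR/DATAO assumed stable while the core is halted
    assign ARADDR = DADDR;
    assign WDATA  = DATAO;
    assign WSTRB  = 4'b1111;    // word stores only, for simplicity
    assign BREADY = 1'b1;       // always ready for the write response
    assign RREADY = 1'b1;       // always ready for the read data
    assign DATAI  = RDATA;      // valid while RVALID is asserted

    // stall the core from the start of the access until the response arrives
    assign HLT = ((WR|WPEND) & !BVALID) | ((RD|RPEND) & !RVALID);

    always@(posedge CLK)
        if(RES)
        begin
            AWVALID <= 0; WVALID <= 0; ARVALID <= 0;
            WPEND   <= 0; RPEND  <= 0;
        end
        else
        begin
            // hold each VALID until the corresponding READY is observed
            AWVALID <= AWVALID ? !AWREADY : WR;
            WVALID  <= WVALID  ? !WREADY  : WR;
            ARVALID <= ARVALID ? !ARREADY : RD;
            // remember that a transaction is pending until B/R arrives
            WPEND   <= WPEND   ? !BVALID  : WR;
            RPEND   <= RPEND   ? !RVALID  : RD;
        end
endmodule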

big-endian support working on simulation only

The big-endian support works only in simulation. The problem is probably related to gcc: even in simulation, the -Os optimization appears not to match the equivalent optimization in the little-endian version of gcc.

Problem with sample firmware

Dear author(s),

I tried to do a fresh compilation by removing all the .o and .s files from the src directory but no other changes. The compilation is successful but the simulation does not run (blank output, no banner etc). The error seems to be with one particular file io.s.

Everything works if I use the io.s supplied in the repository, but nothing works if I compile io.s from io.c locally. I have done a meld compare of the io.s from the repo and the one that I have compiled. The error seems to be with the location and type of the "io" symbol, but I can't figure out exactly what's wrong.

My aim is to run the riscv-tests on the core while keeping the existing firmware workflow (using darkuart.v for console output in simulation).

Using gcc version 10.2.0 (GCC)

Please let me know if you can share some insight on what could be done.

Thanks,
Monideep

Need help simulating with iverilog and gtkwave

I'm trying to use iverilog and gtkwave to see how this RISC-V core works. First, I modified darksimv.v to add the missing parameters of darksocv, and modified darksocv.v to dump core0 to a VCD file.

[irudog@mylaptop darkriscv]$ git diff
diff --git a/rtl/darksocv.v b/rtl/darksocv.v
index 88b5c83..310274e 100644
--- a/rtl/darksocv.v
+++ b/rtl/darksocv.v
@@ -32,7 +32,7 @@
 
 // pseudo-soc for testing purposes
 
-`define CACHE_CONTROLLER 1
+`define CACHE_CONTROLLER 0
 //`define STAGE3           1
 
 module darksocv
@@ -271,4 +271,9 @@ module darksocv
         .DEBUG(DEBUG)
     );
 
+       initial begin
+               $dumpfile("darkriscv.vcd");
+               $dumpvars(0, core0);
+       end
+
 endmodule
diff --git a/sim/darksimv.v b/sim/darksimv.v
index 5247e45..4fe8eb2 100644
--- a/sim/darksimv.v
+++ b/sim/darksimv.v
@@ -42,6 +42,9 @@ module darksimv;
 
     always@(posedge CLK) RES <= RES ? RES-1 : 0;
 
-    darksocv darksocv(CLK,|RES);
+       wire [31:0] dummyin = 32'b0;
+       wire [3:0] dummyout;
+
+    darksocv darksocv(CLK,|RES, dummyin, dummyin, dummyin[0], dummyin[0], dummyout);
 
 endmodule

But after watching the waves in gtkwave, I saw a problem: while the PC advances as expected, the instruction input IDATA is always xxxxxxx, and I don't know why this happens.

Can you give good instructions on how to simulate the darkriscv core using iverilog and gtkwave?

3-stage pipeline without ICache/DCache

Hi! Thanks for your amazing work!
It seems that darkriscv can't run as a 3-stage pipeline cpu without ICache/DCache. Do you plan to add this configuration?

Expanding the DarkRISCV Family!

Hello Colleagues!

I think it is the moment to expand the DarkRISCV family in order to keep it simple and clear and, at the same time, introduce new features. The proposal is to split the core and the upper layers so that the complexity of new features does not affect the existing features, resulting in three lines of development:

  • the 2-stage pipeline core w/ MAC and on-chip DPRAM, to provide ~40MIPS@50MHz
  • the 3-stage pipeline core w/ MT, MAC and on-chip DPRAM support, to provide ~70MIPS@100MHz
  • the new configurable and modular pipeline core w/ MT, MAC designed in a more modular way, with support for on-chip DPRAM, on-chip caches, external SDRAM/DDR and other new features.

Basically, the first two cores will focus on stability and simplicity, while the third core will be used to introduce and test new features, as well as optimize them, so that the updates can be released in the first two cores or in new intermediary cores to be introduced. In some sense, the first two cores will work as a kind of "design freeze", in a way that there is some guarantee that new features will not create incompatibilities in the already deployed cores.

Finally, in order to keep the chronology, the proposal is to call the cores:

  • DarkRISCV V1 or DarkRISCV "Classic" in the case of 2-stage pipeline core
  • DarkRISCV V2 or DarkRISCV "Advanced" in the case of 3-stage pipeline core
  • DarkRISCV V3 or DarkRISCV "Research" in the case of the new modular core

The names "Advanced" and "Research" are really, really bad choices and I probably need think in something better... anyway, feel free to comment and make suggestions.

Best regards,
Marcelo

RV64 possibilities?

Hi,

What is your estimate of the effort to support 64-bit RISC-V? I'm still pretty new to this. Could it run at 100MHz on an Artix board? Would it even fit?

Development tool install error

Hi,
I'm having trouble installing the development tools.
I followed the install script steps, but the following error message appears.

--------------- Error Message --------------------------------------------
/tmp/CC455VaP.s: Assembler messages:
/tmp/CC455VaP.s:437: Error: non-constant .uleb128 is not supported
/tmp/CC455VaP.s:438: Error: non-constant .uleb128 is not supported
Makefile:500: recipe for target 'all -target '_negdi2.o' failed
........

How can I solve the problem?
Thank you.

from Chris Lee

New board: Alinx AX309

PXL_20220614_003347449.jpg

I've got this FPGA board that I wish to support. It's got multiple buttons, LEDs, a camera port, and a VGA port. There's also a small 7-segment display that I think could be used by firmware for error codes. There's even an SD card slot which could be used to load an OS. I think this would be a great board to support with all the connectivity it comes with. I would port DarkRISC-V except I've never written Verilog and I just got a programmer.

Adding an In-System Programmability (ISP) function?

I advise adding an ISP function to darkriscv. At present, if I write some new C code and generate new data for the RAM and ROM, I have to use ISE or Vivado to run synthesis and implementation again, which takes a lot of time.

If the PC could send this data to the BlockRAMs for the RAM and ROM via a USB/JTAG/UART interface, it would be very nice. Then the .mcs file would only store information about darkriscv itself, which means less time running ISE or Vivado. We could also use some buttons to control this process.

Confused about interrupts

Hi Marcelo,

I have read the new code which was committed recently.
I am confused about the NXPC2[2] thing, whose indexes 0 and 1 represent different threads, if I am right.
And in the C code, 'int tmp = 1&threads++;' is used to distinguish different threads.
My question is: why implement these two PC values?
In my opinion, if an interrupt happens, the PC register should be set to a special value (in RISC-V it is mtvec), and the PC should be reverted back to the old value when the interrupt returns, right?
It is the hardware that decides whether to enter an interrupt, and it is the OS that decides whether to switch the current thread to another one and save the current context.
So why not implement mtvec and the things related to that?
At least, for your new code, I couldn't understand your purpose, and there are not even any context-save operations for thread switching.
So please tell me your purpose and thoughts about this if you are free, thanks!
And if you have an email or Slack, please let me know, if you are OK with it.

BR,
Peng

libncursesw-dev does not exist

When I run the command sudo apt-get install libncurses5 libncursesw-dev, the error message displayed is:
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package libncursesw-dev
Due to this my Xilinx installation also fails. Kindly help me out with setting this up.

Some suggestions

  1. I find that you use riscv32-unknown-elf-objcopy and hexdump to convert the ELF file to a HEX file for simulation/FPGA BRAM initialization. I think you could have a look at https://github.com/sifive/elf2hex. This tool is more convenient if you change the regions of ROM/RAM or even make them non-consecutive.

  2. The opensource RTL simulation tool Verilator is said to be very fast and efficient. Many opensource RISC-V cores also use it for simulation, such as rocket-chip and scr1. It would be very nice if you could upload a Verilator simulation script.

Finally, darkriscv is becoming more and more powerful, thanks for your amazing work!

Will you be willing to add documentation on memories?

Hi,

I am trying to follow your code to figure out what memories are being used, and how and by whom they are read/written.
There are too many `ifdefs and I am not sure if I am getting it correctly.
Is it possible for you to add some documentation on this?

Thanks and best regards
Bhawandeep SIngh
