hughperkins / verigpu


OpenSource GPU, in Verilog, loosely based on RISC-V ISA

License: MIT License

SystemVerilog 53.97% Python 10.35% Tcl 0.05% Shell 3.93% Dockerfile 0.36% Verilog 2.70% C++ 25.13% CMake 1.27% C 2.23%
verilog risc-v risc-v-assembly hardware-designs asic-design gpu gpu-acceleration machine-learning verification

verigpu's Introduction

OpenSource GPU

Build an opensource GPU, targeting ASIC tape-out, for machine learning ("ML"). Hopefully, we can get it to work with the PyTorch deep learning framework.

Vision

Create an opensource GPU for machine learning.

I don't actually intend to tape this out myself, but I intend to do what I can to verify that a tape-out would work: that the design synthesizes, meets timing, and so on.

We intend to implement a HIP API that is compatible with the PyTorch machine learning framework. We are open to providing other APIs, such as SYCL or NVIDIA® CUDA™.

The internal GPU core ISA is loosely compliant with the RISC-V ISA. Where RISC-V conflicts with designing for a GPU setting, we break with RISC-V.

We intend to keep the cores very focused on ML. For example, we use brain floating point ("BF16") throughout, to keep core die area low; this should keep the per-core cost low. Similarly, we intend to implement only the few float operations critical to ML, such as exp, log, tanh, and sqrt.
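
To make the area argument concrete: BF16 keeps the sign bit and the 8-bit exponent of IEEE-754 float32, but truncates the mantissa from 23 bits to 7, so each value fits in 16 bits. A minimal SystemVerilog sketch of the format (illustrative only; the type and function names are not taken from the VeriGPU source):

    // Minimal sketch of the BF16 format: same sign and exponent layout as
    // IEEE-754 float32, but only 7 mantissa bits, halving storage per value.
    typedef struct packed {
        logic       sign;      // 1 bit
        logic [7:0] exponent;  // 8 bits, same bias (127) as float32
        logic [6:0] mantissa;  // 7 bits (float32 has 23)
    } bf16_t;

    // Converting float32 to BF16 can be as cheap as taking the top 16 bits
    // (truncation; real designs usually add rounding logic).
    function automatic bf16_t f32_to_bf16(input logic [31:0] f32);
        return bf16_t'(f32[31:16]);
    endfunction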

Architecture

Big Picture:

[diagram: big-picture architecture]

GPU Die Architecture:

[diagram: GPU die architecture]

Single Core:

[diagram: single-core architecture]

Single-source compilation and runtime

[diagram: end-to-end architecture]

Simulation

Single-source C++

[screenshot: single-source C++ example code]

Compile the GPU and runtime:

[screenshot: compiling the GPU and runtime]

Compile the single-source C++, and run:

[screenshot: running the single-source example]

Planning

What direction are we thinking of going in? What works already? See:

Tech details

Our assembly language implementation and progress. Design of GPU memory, registers, and so on. See:

Verification

If we want to tape out, we need solid verification. Read more at:

Metrics

We want the GPU to run quickly, and to use minimal die area. Read how we measure timings and area at:

verigpu's People

Contributors

hughperkins


verigpu's Issues

Generation of files in "examples/direct/expected"

Hi, I am adding some more direct test ASMs to the project, so the corresponding xxxx_expected.txt files should also be added.

I'd like to know how the files in "examples/direct/expected" are generated. I didn't see any info about this in the docs.

Consider doing a tapeout using SKY130 using Google's no cost shuttle program

Have you considered doing a tape out of your VeriGPU using the SKY130 PDK and no-cost shuttle program?

My team launched the fully open source 130nm PDK with SkyWater (http://github.com/google/skywater-pdk), together with a no-cost shuttle program (https://efabless.com/open_shuttle_program). You can see more information in my talk at ESSCIRC/ESSDERC last year: slides at https://j.mp/esscxxrc21-sky130 and the recording at https://j.mp/esscxxrc21-sky130-video

While 10 mm² might be a bit small for a GPU, it could still be used to test a lot of interesting parts. Things like DFFRAM (https://github.com/Cloud-V/DFFRAM) can be used to optimize things like the register file.

Does anyone have an understanding of the trade-offs of providing an instruction pointer to each core, versus only to each compute unit?


Things that occur to me, in favor of only having a single program counter per compute unit (see the sketch after these lists):

  • one fewer register needed for each core...
  • probably easier to cache instructions
  • easier to heuristically consolidate memory requests
    • but a cache would handle that anyway: if two threads request data from the same cache line, the first thread to make the request will load that cache line, and the second one can use it too, as long as the threads don't diverge in time too much

On the other hand, in favor of separate program counters in each core:

  • increases potential parallelization; e.g. one thread might warm up the cache block when it hits the load first, and then the second thread to hit the load will see decreased latency
  • easier to handle if statements and branching: each thread just executes what it needs to execute; no need for all threads to execute an if block that only one thread actually needs, and then throw away the results
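
To make the first option concrete, here is a minimal SystemVerilog sketch of one program counter shared by all cores in a compute unit, so that the cores fetch and advance in lockstep. All module and signal names are hypothetical, not taken from the VeriGPU source:

    // Minimal sketch: one program counter per compute unit, shared by
    // NUM_CORES cores that execute in lockstep. A single instruction fetch
    // serves every core, which is what makes instruction caching and
    // memory-request consolidation easier.
    module compute_unit_pc #(
        parameter NUM_CORES = 4
    ) (
        input  logic        clk,
        input  logic        rst,
        input  logic        stall,         // e.g. waiting on a memory request
        input  logic        branch_taken,
        input  logic [31:0] branch_target,
        output logic [31:0] fetch_addr     // one fetch address for all cores
    );
        logic [31:0] pc;

        always_ff @(posedge clk) begin
            if (rst) begin
                pc <= '0;
            end else if (!stall) begin
                pc <= branch_taken ? branch_target : pc + 4;
            end
        end

        assign fetch_addr = pc;
    endmodule

With separate program counters, each core would instead own its own copy of the pc register above, at the cost of one extra 32-bit register per core.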

deviations from RISC-V

There are 31 general-purpose registers, x1 to x31, along with x0, which always reads as zeros. We use the same registers for both integers and floats. (This latter point deviates from RISC-V, because we are creating a GPU, where locality is organized around each of thousands of tiny cores, rather than around the FP unit vs the integer ALU.)

RISC-V has the "Zfinx" extension specifically for this, so if you follow that, you're not deviating.

https://github.com/riscv/riscv-zfinx
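
For illustration, a unified register file along these lines, with x0 hardwired to zero and the same 32-bit registers holding both integer and BF16 values, might look like the following minimal SystemVerilog sketch (the port names are hypothetical, not VeriGPU's actual module):

    // Minimal sketch of a unified (Zfinx-style) register file: the same
    // 32-bit registers hold integer and float values, and x0 reads as zero.
    module regfile #(
        parameter NUM_REGS = 32,
        parameter WIDTH    = 32
    ) (
        input  logic                        clk,
        input  logic                        wr_en,
        input  logic [$clog2(NUM_REGS)-1:0] wr_addr,
        input  logic [WIDTH-1:0]            wr_data,
        input  logic [$clog2(NUM_REGS)-1:0] rd_addr,
        output logic [WIDTH-1:0]            rd_data
    );
        logic [WIDTH-1:0] regs [NUM_REGS-1:0];

        // x0 always reads as zero; writes to x0 are ignored.
        assign rd_data = (rd_addr == '0) ? '0 : regs[rd_addr];

        always_ff @(posedge clk) begin
            if (wr_en && wr_addr != '0) begin
                regs[wr_addr] <= wr_data;
            end
        end
    endmodule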

What else?

DDR4 Controller

We need a DDR4 Controller, to manage global memory, which sits in DDR chips, separate from the main GPU chip.

The DDR4 Controller will be used to copy data to and from the GPU global memory, in order to populate shared memory, caches and similar on the GPU chip itself; and in order for the GPU to be able to write data back to global memory. For now, we will assume that all communications are with a single GPU Controller module on the GPU die. For example, we will assume for now that any data copied from mainboard main memory to GPU global memory will pass via the GPU controller.
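
As a sketch of the kind of interface the GPU controller might present to such a DDR4 controller, here is a minimal request/response handshake in SystemVerilog. All names and widths are hypothetical; this is not the interface of any existing DDR4 IP:

    // Minimal sketch of a request/response interface between the GPU
    // controller and a DDR4 controller. Names and widths are illustrative.
    interface mem_ctrl_if #(
        parameter ADDR_WIDTH = 32,
        parameter DATA_WIDTH = 256   // one burst's worth of data
    );
        logic                  req_valid;   // GPU controller starts a transfer
        logic                  req_write;   // 1 = write to global memory, 0 = read
        logic [ADDR_WIDTH-1:0] req_addr;
        logic [DATA_WIDTH-1:0] req_wdata;
        logic                  req_ready;   // DDR4 controller can accept a request

        logic                  resp_valid;  // read data (or write ack) available
        logic [DATA_WIDTH-1:0] resp_rdata;

        modport gpu  (output req_valid, req_write, req_addr, req_wdata,
                      input  req_ready, resp_valid, resp_rdata);
        modport dram (input  req_valid, req_write, req_addr, req_wdata,
                      output req_ready, resp_valid, resp_rdata);
    endinterface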

What we need for VeriGPU:

  • comprehensive verification, ideally including formal verification
  • clear documentation on how to use and integrate with VeriGPU
    • how to use the interface to write to GPU global memory?
    • how to use the interface to read from GPU global memory?
  • ideally, a PR that integrates the DDR4 controller into VeriGPU


Bear in mind that tape-out at 5nm costs $250M or so, so we want things to work first time. Therefore verification is important :)

Network on a Chip (NoC) implementation

We need a network-on-a-chip implementation.

First, what is a network on a chip? Buses are becoming spaghetti, so chips nowadays use an internal packet-switching network instead. See https://amstel.estec.esa.int/tecedm/NoC_workshop/GinosarNOC_Tutorial.pdf

[slide from the presentation linked above]
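
To make the packet-switching idea concrete, NoCs typically move data as small "flits" (flow-control units), each carrying its own routing information. A minimal SystemVerilog sketch of a flit format (field names and widths are purely illustrative):

    // Minimal sketch of a NoC flit: instead of a shared bus, data moves as
    // small packets, each carrying its own routing information.
    typedef enum logic [1:0] {
        FLIT_HEAD, FLIT_BODY, FLIT_TAIL, FLIT_SINGLE
    } flit_kind_t;

    typedef struct packed {
        flit_kind_t  kind;     // position of this flit within a packet
        logic [3:0]  dest_x;   // destination router coordinates,
        logic [3:0]  dest_y;   //   e.g. in a 2D mesh topology
        logic [1:0]  vc;       // virtual channel, used to avoid deadlock
        logic [63:0] payload;  // data carried by this flit
    } flit_t;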

Then the tasks for NoC for VeriGPU are:

  • what implementations currently exist?
  • what do similar projects use?
  • for any existing NoC implementations:
    - good points?
    - anything missing, or not quite working yet?
    - how recently maintained are they? (i.e. are there recent commits?)
    - have they already been used in actual taped-out ASICs?
    - if it looks good, could we get a full independent verification? (you would create this, in a new repo, your own project :) )
    - ideally, including formal verification (again, your own repo, your own project :) )
  • of course, you could provide your own implementation too, but NoC is complex, so even just verifying someone else's NoC is hard. Bear in mind that 80% of semiconductor development effort in industry is spent on verification.
  • what configuration(s) do you recommend for a GPU? (Please let me know if you need more information on what logical architecture I'm envisaging).
  • Clear documentation on how to integrate the NoC with VeriGPU
  • Ideally, PR that integrates the NoC with VeriGPU (could be multiple PRs)

CMake issues

Hello, I'm having trouble building the project, as the documentation is not that clear.

What is happening is that during execution of the CMake verilate command, the flag "--make" is not found, and hence it doesn't create the further .cmake files (e.g., gpu_card_copy.cmake).

Is this issue related to the version of Verilator being used? If so, could you please tell me which version of Verilator you are using? Thanks.

PCIe 4+ interface

We need a PCIe 4+ interface, for communications between the main computer's CPUs and the GPU board.

The PCIe interface will be used to copy data to and from the mainboard memory; to receive kernel launch requests from a mainboard CPU core; and to inform the mainboard CPU core once the kernel has finished running.
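
As a sketch of the kernel-launch part of this, the PCIe interface could expose a small memory-mapped mailbox to the host: the host writes the kernel address and an arguments pointer, rings a doorbell, and then polls (or is interrupted on) a done flag. A minimal SystemVerilog illustration, with all names hypothetical:

    // Minimal sketch of a kernel-launch mailbox behind the PCIe interface.
    // Register writes are assumed to be decoded from PCIe TLPs upstream.
    module kernel_mailbox (
        input  logic        clk,
        input  logic        rst,
        // Host-side register writes.
        input  logic        host_wr_en,
        input  logic [3:0]  host_wr_reg,   // 0 = kernel addr, 1 = args ptr, 2 = doorbell
        input  logic [31:0] host_wr_data,
        // GPU side.
        output logic        launch_valid,  // pulses when the doorbell is rung
        output logic [31:0] kernel_addr,
        output logic [31:0] args_ptr,
        input  logic        kernel_done,   // GPU asserts when the kernel finishes
        output logic        status_done    // sticky flag the host can poll
    );
        always_ff @(posedge clk) begin
            if (rst) begin
                launch_valid <= 1'b0;
                status_done  <= 1'b0;
            end else begin
                launch_valid <= 1'b0;
                if (kernel_done) status_done <= 1'b1;
                if (host_wr_en) begin
                    case (host_wr_reg)
                        4'd0: kernel_addr <= host_wr_data;
                        4'd1: args_ptr    <= host_wr_data;
                        4'd2: begin
                            launch_valid <= 1'b1;  // doorbell: start the kernel
                            status_done  <= 1'b0;
                        end
                        default: ;
                    endcase
                end
            end
        end
    endmodule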

What we need for VeriGPU:

  • comprehensive verification of the PCIe interface
    • ideally including formal verification
  • need clear documentation on how to integrate the PCIe interface with VeriGPU
    • how to use the interface to copy data to and from mainboard main memory?
    • how to receive data/instructions from a mainboard CPU?
    • how to send data/responses to a mainboard CPU?
  • ideally need a PR to integrate the PCIe 4+ interface with VeriGPU


Introduction to PCIe https://pcisig.com/sites/default/files/files/PCI_Express_Basics_Background.pdf

[slide from the linked presentation]

Bear in mind that taping out at 5nm costs $250M or so, so we want things to work first time. Therefore verification is important :)

Any specific environment?

Hi, I got an error message while building the GPU runtime with make.

error: ‘class llvm::ElementCount’ has no member named ‘getFixedValue’
         int elementCount = vectorType->getElementCount().getFixedValue();

The environment is Ubuntu 20 with Clang 10.
