hughperkins / verigpu


OpenSource GPU, in Verilog, loosely based on RISC-V ISA

License: MIT License

SystemVerilog 53.97% Python 10.35% Tcl 0.05% Shell 3.93% Dockerfile 0.36% Verilog 2.70% C++ 25.13% CMake 1.27% C 2.23%
verilog risc-v risc-v-assembly hardware-designs asic-design gpu gpu-acceleration machine-learning verification

verigpu's Introduction

OpenSource GPU

Build an opensource GPU, targeting ASIC tape-out, for machine learning ("ML"). Hopefully, we can get it to work with the PyTorch deep learning framework.

Vision

Create an opensource GPU for machine learning.

I don't actually intend to tape this out myself, but I intend to do what I can to verify that a tape-out would work: that the design synthesizes, meets timing, and so on.

We intend to implement a HIP API that is compatible with the PyTorch machine learning framework. We are open to providing other APIs, such as SYCL or NVIDIA® CUDA™.

The internal GPU core ISA is loosely compliant with the RISC-V ISA. Where RISC-V conflicts with designing for a GPU setting, we break with RISC-V.

We intend to keep the cores very focused on ML. For example, we use brain floating point ("BF16") throughout, to keep core die area low; this should keep the per-core cost low. Similarly, we intend to implement only the few float operations critical to ML, such as exp, log, tanh, and sqrt.
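
To make the area argument concrete: BF16 keeps the sign bit and the 8-bit exponent of IEEE-754 float32, but truncates the mantissa from 23 bits to 7, so each value fits in 16 bits. A minimal SystemVerilog sketch of the format (illustrative only; the type and function names are not taken from the VeriGPU source):

    // Minimal sketch of the BF16 format: same sign and exponent layout as
    // IEEE-754 float32, but only 7 mantissa bits, halving storage per value.
    typedef struct packed {
        logic       sign;      // 1 bit
        logic [7:0] exponent;  // 8 bits, same bias (127) as float32
        logic [6:0] mantissa;  // 7 bits (float32 has 23)
    } bf16_t;

    // Converting float32 to BF16 can be as cheap as taking the top 16 bits
    // (truncation; real designs usually add rounding logic).
    function automatic bf16_t f32_to_bf16(input logic [31:0] f32);
        return bf16_t'(f32[31:16]);
    endfunction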

Architecture

Big Picture:

[diagram: big-picture architecture]

GPU Die Architecture:

[diagram: GPU die architecture]

Single Core:

[diagram: single-core architecture]

Single-source compilation and runtime

[diagram: end-to-end architecture]

Simulation

Single-source C++

[screenshot: single-source C++ example code]

Compile the GPU and runtime:

[screenshot: compiling the GPU and runtime]

Compile the single-source C++, and run:

[screenshot: running the single-source example]

Planning

What direction are we thinking of going in? What works already? See:

Tech details

Our assembly language implementation and progress. Design of GPU memory, registers, and so on. See:

Verification

If we want to tape out, we need solid verification. Read more at:

Metrics

We want the GPU to run quickly, and to use minimal die area. Read how we measure timings and area at:

verigpu's People

Contributors

hughperkins


verigpu's Issues

Generation of files in "examples/direct/expected"

Hi, I am adding some more direct test ASMs to the project, so the corresponding xxxx_expected.txt files should also be added.

I'd like to know how the files in "examples/direct/expected" are generated. I didn't see any info about this in the docs.

Consider doing a tapeout using SKY130 using Google's no cost shuttle program

Have you considered doing a tape out of your VeriGPU using the SKY130 PDK and no-cost shuttle program?

My team launched the fully open source 130nm PDK with SkyWater (http://github.com/google/skywater-pdk), together with a no-cost shuttle program (https://efabless.com/open_shuttle_program). You can see more information in my talk at ESSCIRC/ESSDERC last year: slides at https://j.mp/esscxxrc21-sky130 and the recording at https://j.mp/esscxxrc21-sky130-video

While 10 mm² might be a bit small for a GPU, it could still be used to test a lot of interesting parts. Things like DFFRAM (https://github.com/Cloud-V/DFFRAM) can be used to optimize things like the register file.

Does anyone have an understanding of the trade-offs of providing an instruction pointer to each core, versus only to each compute unit?


Things that occur to me, in favor of only having a single program counter per compute unit (see the sketch after these lists):

  • one fewer register needed for each core...
  • probably easier to cache instructions
  • easier to heuristically consolidate memory requests
    • but a cache would handle that anyway: if two threads request data from the same cache line, the first thread to make the request will load that cache line, and the second one can use it too, as long as the threads don't diverge in time too much

On the other hand, in favor of separate program counters in each core:

  • increases potential parallelization; e.g. one thread might warm up the cache block when it hits the load first, and then the second thread to hit the load will see decreased latency
  • easier to handle if statements and branching: each thread just executes what it needs to execute; no need for all threads to execute an if block that only one thread actually needs, and then throw away the results
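
To make the first option concrete, here is a minimal SystemVerilog sketch of one program counter shared by all cores in a compute unit, so that the cores fetch and advance in lockstep. All module and signal names are hypothetical, not taken from the VeriGPU source:

    // Minimal sketch: one program counter per compute unit, shared by
    // NUM_CORES cores that execute in lockstep. A single instruction fetch
    // serves every core, which is what makes instruction caching and
    // memory-request consolidation easier.
    module compute_unit_pc #(
        parameter NUM_CORES = 4
    ) (
        input  logic        clk,
        input  logic        rst,
        input  logic        stall,         // e.g. waiting on a memory request
        input  logic        branch_taken,
        input  logic [31:0] branch_target,
        output logic [31:0] fetch_addr     // one fetch address for all cores
    );
        logic [31:0] pc;

        always_ff @(posedge clk) begin
            if (rst) begin
                pc <= '0;
            end else if (!stall) begin
                pc <= branch_taken ? branch_target : pc + 4;
            end
        end

        assign fetch_addr = pc;
    endmodule

With separate program counters, each core would instead own its own copy of the pc register above, at the cost of one extra 32-bit register per core.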

deviations from RISC-V

There are 31 general-purpose registers, x1 to x31, along with x0, which always reads as zeros. We use the same registers for both integers and floats. (This latter point deviates from RISC-V, because we are creating a GPU, where locality is organized around each of thousands of tiny cores, rather than around the FP unit vs the integer ALU.)

RISC-V has the "Zfinx" extension specifically for this, so if you follow that, you're not deviating.

https://github.com/riscv/riscv-zfinx
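
For illustration, a unified register file along these lines, with x0 hardwired to zero and the same 32-bit registers holding both integer and BF16 values, might look like the following minimal SystemVerilog sketch (the port names are hypothetical, not VeriGPU's actual module):

    // Minimal sketch of a unified (Zfinx-style) register file: the same
    // 32-bit registers hold integer and float values, and x0 reads as zero.
    module regfile #(
        parameter NUM_REGS = 32,
        parameter WIDTH    = 32
    ) (
        input  logic                        clk,
        input  logic                        wr_en,
        input  logic [$clog2(NUM_REGS)-1:0] wr_addr,
        input  logic [WIDTH-1:0]            wr_data,
        input  logic [$clog2(NUM_REGS)-1:0] rd_addr,
        output logic [WIDTH-1:0]            rd_data
    );
        logic [WIDTH-1:0] regs [NUM_REGS-1:0];

        // x0 always reads as zero; writes to x0 are ignored.
        assign rd_data = (rd_addr == '0) ? '0 : regs[rd_addr];

        always_ff @(posedge clk) begin
            if (wr_en && wr_addr != '0) begin
                regs[wr_addr] <= wr_data;
            end
        end
    endmodule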

What else?

DDR4 Controller

We need a DDR4 Controller, to manage global memory, which sits in DDR chips, separate from the main GPU chip.

The DDR4 Controller will be used to copy data to and from the GPU global memory, in order to populate shared memory, caches and similar on the GPU chip itself; and in order for the GPU to be able to write data back to global memory. For now, we will assume that all communications are with a single GPU Controller module on the GPU die. For example, we will assume for now that any data copied from mainboard main memory to GPU global memory will pass via the GPU controller.
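
As a sketch of the kind of interface the GPU controller might present to such a DDR4 controller, here is a minimal request/response handshake in SystemVerilog. All names and widths are hypothetical; this is not the interface of any existing DDR4 IP:

    // Minimal sketch of a request/response interface between the GPU
    // controller and a DDR4 controller. Names and widths are illustrative.
    interface mem_ctrl_if #(
        parameter ADDR_WIDTH = 32,
        parameter DATA_WIDTH = 256   // one burst's worth of data
    );
        logic                  req_valid;   // GPU controller starts a transfer
        logic                  req_write;   // 1 = write to global memory, 0 = read
        logic [ADDR_WIDTH-1:0] req_addr;
        logic [DATA_WIDTH-1:0] req_wdata;
        logic                  req_ready;   // DDR4 controller can accept a request

        logic                  resp_valid;  // read data (or write ack) available
        logic [DATA_WIDTH-1:0] resp_rdata;

        modport gpu  (output req_valid, req_write, req_addr, req_wdata,
                      input  req_ready, resp_valid, resp_rdata);
        modport dram (input  req_valid, req_write, req_addr, req_wdata,
                      output req_ready, resp_valid, resp_rdata);
    endinterface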

What we need for VeriGPU:

  • comprehensive verification, ideally including formal verification
  • clear documentation on how to use and integrate with VeriGPU
    • how to use the interface to write to GPU global memory?
    • how to use the interface to read from GPU global memory?
  • ideally, a PR that integrates the DDR4 controller into VeriGPU


Bear in mind that tape-out at 5nm costs $250M or so, so we want things to work first time. Therefore verification is important :)

Network on a Chip (NoC) implementation

We need a network-on-a-chip implementation.

First, what is a network on a chip? Buses are becoming spaghetti, so chips nowadays use an internal packet-switching network instead. See https://amstel.estec.esa.int/tecedm/NoC_workshop/GinosarNOC_Tutorial.pdf

[slide from the presentation linked above]
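
To make the packet-switching idea concrete, NoCs typically move data as small "flits" (flow-control units), each carrying its own routing information. A minimal SystemVerilog sketch of a flit format (field names and widths are purely illustrative):

    // Minimal sketch of a NoC flit: instead of a shared bus, data moves as
    // small packets, each carrying its own routing information.
    typedef enum logic [1:0] {
        FLIT_HEAD, FLIT_BODY, FLIT_TAIL, FLIT_SINGLE
    } flit_kind_t;

    typedef struct packed {
        flit_kind_t  kind;     // position of this flit within a packet
        logic [3:0]  dest_x;   // destination router coordinates,
        logic [3:0]  dest_y;   //   e.g. in a 2D mesh topology
        logic [1:0]  vc;       // virtual channel, used to avoid deadlock
        logic [63:0] payload;  // data carried by this flit
    } flit_t;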

Then the tasks for NoC for VeriGPU are:

  • what implementations currently exist?
  • what do similar projects use?
  • for any existing NoC implementations:
    - good points?
    - anything missing, or not quite working yet?
    - how recently maintained are they? (i.e. are there recent commits?)
    - have they already been used in actual taped-out ASICs?
    - if it looks good, could we get a full independent verification? (you would create this, in a new repo, your own project :) )
    - ideally, including formal verification (again, your own repo, your own project :) )
  • of course, you could provide your own implementation too, but NoC is complex, so even just verifying someone else's NoC is hard. Bear in mind that 80% of semiconductor development effort in industry is spent on verification.
  • what configuration(s) do you recommend for a GPU? (Please let me know if you need more information on what logical architecture I'm envisaging).
  • Clear documentation on how to integrate the NoC with VeriGPU
  • Ideally, PR that integrates the NoC with VeriGPU (could be multiple PRs)

CMake issues

Hello, I'm having trouble building the project, as the documentation is not that clear.

What is happening is that during execution of the CMake verilate command, the flag "--make" is not found, and hence it doesn't create the further .cmake files (e.g., gpu_card_copy.cmake).

Is this issue related to the version of Verilator being used? If so, could you please tell me which version of Verilator you are using? Thanks.

PCIe 4+ interface

We need a PCIe 4+ interface, for communications between the main computer's CPUs and the GPU board.

The PCIe interface will be used to copy data to and from the mainboard memory; to receive kernel launch requests from a mainboard CPU core; and to inform the mainboard CPU core once the kernel has finished running.
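
As a sketch of the kernel-launch part of this, the PCIe interface could expose a small memory-mapped mailbox to the host: the host writes the kernel address and an arguments pointer, rings a doorbell, and then polls (or is interrupted on) a done flag. A minimal SystemVerilog illustration, with all names hypothetical:

    // Minimal sketch of a kernel-launch mailbox behind the PCIe interface.
    // Register writes are assumed to be decoded from PCIe TLPs upstream.
    module kernel_mailbox (
        input  logic        clk,
        input  logic        rst,
        // Host-side register writes.
        input  logic        host_wr_en,
        input  logic [3:0]  host_wr_reg,   // 0 = kernel addr, 1 = args ptr, 2 = doorbell
        input  logic [31:0] host_wr_data,
        // GPU side.
        output logic        launch_valid,  // pulses when the doorbell is rung
        output logic [31:0] kernel_addr,
        output logic [31:0] args_ptr,
        input  logic        kernel_done,   // GPU asserts when the kernel finishes
        output logic        status_done    // sticky flag the host can poll
    );
        always_ff @(posedge clk) begin
            if (rst) begin
                launch_valid <= 1'b0;
                status_done  <= 1'b0;
            end else begin
                launch_valid <= 1'b0;
                if (kernel_done) status_done <= 1'b1;
                if (host_wr_en) begin
                    case (host_wr_reg)
                        4'd0: kernel_addr <= host_wr_data;
                        4'd1: args_ptr    <= host_wr_data;
                        4'd2: begin
                            launch_valid <= 1'b1;  // doorbell: start the kernel
                            status_done  <= 1'b0;
                        end
                        default: ;
                    endcase
                end
            end
        end
    endmodule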

What we need for VeriGPU:

  • comprehensive verification of the PCIe interface
    • ideally including formal verification
  • need clear documentation on how to integrate the PCIe interface with VeriGPU
    • how to use the interface to copy data to and from mainboard main memory?
    • how to receive data/instructions from a mainboard CPU?
    • how to send data/responses to a mainboard CPU?
  • ideally need a PR to integrate the PCIe 4+ interface with VeriGPU


Introduction to PCIe https://pcisig.com/sites/default/files/files/PCI_Express_Basics_Background.pdf

[slide from the linked presentation]

Bear in mind that taping out at 5nm costs $250M or so, so we want things to work first time. Therefore verification is important :)

Any specific environment?

Hi, I got an error message while building the GPU runtime with make.

error: ‘class llvm::ElementCount’ has no member named ‘getFixedValue’
         int elementCount = vectorType->getElementCount().getFixedValue();

The environment is Ubuntu 20 with Clang 10.
