
gpgpu-sim_uvmsmart's Introduction

UVM Smart

UVM Smart is the first repository to provide both functional and timing simulation support for Unified Virtual Memory. The framework extends GPGPU-Sim v3.2 from UBC. Currently, it supports cudaMallocManaged, cudaDeviceSynchronize, and cudaMemPrefetchAsync. It includes 10 benchmarks from various benchmark suites (Rodinia, Parboil, Lonestar, HPC Challenge), modified to use the UVM APIs.

If you use or build on this framework, please cite the following papers based on the functionalities you are leveraging.

  1. Please cite the following paper when using prefetches and page eviction policies.

    Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem. 2019. Interplay between hardware prefetcher and page eviction policy in CPU-GPU unified virtual memory. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA '19). ACM, New York, NY, USA, 224-235.

  2. Please cite the following paper when using access counter-based delayed migration, LFU eviction, cold vs hot data structure classification, and page migration and pinning.

    Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem. 2020. Adaptive Page Migration for Irregular Data-intensive Applications under GPU Memory Oversubscription. In Proceedings of the 34th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2020). IEEE, New Orleans, Louisiana, USA, 451-461.

  3. Please cite the following paper when using the adaptive runtime to detect patterns in CPU-GPU interconnect traffic, and the policy engine to choose and dynamically employ memory management policies.

    Debashis Ganguly, Rami Melhem, and Jun Yang. 2021. An Adaptive Framework for Oversubscription Management in CPU-GPU Unified Memory. In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE 2021).

Features

  1. A fully-associative last-level TLB, so a TLB lookup completes in a single core cycle,
  2. A multi-threaded page table walker (last level shared) with configurable page table walk latency,
  3. A workflow for replayable far-fault management (configurable far-fault handling latency),
  4. PCIe transfer latency based on an equation derived by curve fitting transfer latency against transfer size,
  5. PCIe read and write stage queues and transactions (serialized transfers and queueing delay for transaction processing),
  6. Prefetchers (Tree-based neighbourhood, Sequential-local 64KB, Random 4KB, On-demand migration),
  7. Page replacement policies (Tree-based neighbourhood, Sequential-local 64KB, LRU 4KB, Random 4KB, LRU 2MB, LFU 2MB),
  8. 32-bit access registers per 64KB basic block,
  9. Delayed migration based on an access-counter threshold,
  10. Rounding up managed allocations and maintaining full binary trees at large-page (2MB) granularity,
  11. A runtime engine to detect the underlying pattern in CPU-GPU interconnect traffic, and a policy engine to choose and dynamically apply the best-suited memory management techniques.
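To illustrate feature 4, here is a minimal sketch of what a curve-fitted PCIe transfer-latency model can look like. The linear form and both coefficients are hypothetical placeholders, not the values fitted in the simulator:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical curve-fitted PCIe latency model:
// latency_us = fixed per-transaction overhead + bytes / effective bandwidth.
// The coefficients below are illustrative, not the simulator's fitted values.
double pcie_transfer_latency_us(std::size_t bytes) {
    const double overhead_us = 10.0;     // per-transaction setup cost (assumed)
    const double bandwidth_gbps = 12.0;  // effective PCIe 3.0 x16 bandwidth (assumed)
    // bytes / (GB/s * 1e3) yields microseconds: bytes / (12e9 B/s) * 1e6 us/s
    return overhead_us + bytes / (bandwidth_gbps * 1e3);
}
```

In practice the simulator derives its equation from measured transfer-latency vs transfer-size data, so the fitted coefficients replace the placeholders above.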

Note that we do not currently support heterogeneous systems for CPU-GPU or multi-GPU collaborative workloads. This means the CPU page table (validation/invalidation, CPU-memory page swapping) is not simulated.

How to use?

Setup is simple and hassle-free: there is no need to manage dependencies manually. Use the Dockerfile in the root directory of the repository.

sudo docker build -t gpgpu_uvmsmart:latest .
sudo docker run --name <container_name> -it gpgpu_uvmsmart:latest
cd /root/gpgpu-sim_UVMSmart/benchmarks/Managed/<benchmark_folder>
vim gpgpusim.config
./run > <output_file>
sudo docker cp <container_name>:/root/gpgpu-sim_UVMSmart/benchmarks/Managed/<benchmark_folder>/<output_file> .

How to configure?

Currently, we provide architectural support for the GeForce GTX 1080 Ti with PCIe 3.0 16x. The additional configuration items are added to GeForceGTX1080Ti under configs. Change the respective parameters to simulate the desired configuration.

What is included?

  1. A set of micro-benchmarks to determine the semantics of the prefetcher implemented in the NVIDIA UVM kernel module (found in micro-benchmarks under the root directory).
  2. A micro-benchmark to measure transfer bandwidth for each transfer size (cudaMemcpy host to device).
  3. A set of benchmarks with both the copy-then-execute model (in Unmanaged under the benchmarks folder) and unified virtual memory (in Managed under the benchmarks folder).
  4. Specifications of the working set, iterations, and the number of kernels launched for the managed versions of the benchmarks.
  5. Output logs, plotting scripts, and the derived plots for the ISCA'19, IPDPS'20, and DATE'21 papers (in Results under the benchmarks folder).

Copyright Notice

Copyright (c) 2019 Debashis Ganguly, Department of Computer Science, School of Computing and Information, University of Pittsburgh. All rights reserved.

gpgpu-sim_uvmsmart's People

Contributors

debashisganguly, ronianz


gpgpu-sim_uvmsmart's Issues

Query in Eviction code

latency_type ltype = latency_type::PCIE_WRITE_BACK;
for (std::list<eviction_t *>::iterator it = valid_pages.begin(); it != valid_pages.end(); it++) {
    if ((*it)->addr <= iter->first && iter->first < (*it)->addr + (*it)->size) {
        if ((*it)->RW == 1) {
            ltype = latency_type::INVALIDATE;
            break;
        }
    }
}

In the part where the pages are evicted, there is this loop that checks if there are any RW pages and decides the latency type based on it. The code uses latency_type::PCIE_WRITE_BACK by default and switches to latency_type::INVALIDATE if there are any RW pages in the set of evicted pages.

Shouldn't it be the other way around? I would expect write-back latency only when there are modified pages that need to be written back, and invalidate latency when all the pages are clean.
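For reference, the behaviour the reporter expects could be sketched as follows. This is a simplified stand-alone model, not the simulator's actual code; `Page` is a hypothetical stand-in for `eviction_t`:

```cpp
#include <cassert>
#include <vector>

enum class latency_type { PCIE_WRITE_BACK, INVALIDATE };

struct Page { bool dirty; };  // hypothetical stand-in for eviction_t's RW flag

// Expected policy per the report: pay write-back latency only when at least
// one evicted page is dirty; a set of all-clean pages can simply be invalidated.
latency_type eviction_latency(const std::vector<Page>& evicted) {
    for (const Page& p : evicted)
        if (p.dirty) return latency_type::PCIE_WRITE_BACK;
    return latency_type::INVALIDATE;
}
```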

Error when running managed benchmarks in Docker image

After building the image from the Dockerfile, I am having trouble running the test cases provided in benchmarks/Managed. Specifically, the backprop benchmark outputs "ALLOC_1D_DBL: Couldn't allocate array of n floats", indicating that the cudaMallocManaged calls returned null. Running some of the other benchmarks, like sssp, gives an error saying that the CUDA driver version is insufficient for the runtime version. This is directly after having built the image, before making any modifications, aside from adding the CUDA installation to the path. What could be causing this?

is_block_evictable() returns false forever due to is_valid() condition on large page

I used a 10 MB GDDR size, the 2 MB LRU page eviction policy, and the TBN prefetcher.

When page eviction has to occur (i.e., when should_evict_page() returns true), is_block_evictable() checks whether the first page of the large page is valid.

Because the is_valid() condition keeps returning false (the first page has not been touched and so has never arrived), pages cannot be evicted.

It seems that if the first page of a large page is never touched, eviction can never happen, and the simulator runs forever looking for an evictable 2MB page.
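A sketch of the relaxed check the report implies: treat a large page as evictable if any of its small pages is valid, not only the first. This is a simplified stand-alone model, not the simulator's code:

```cpp
#include <array>
#include <cassert>

// Hypothetical model: a 2MB large page holds 512 small 4KB pages.
// The reported livelock: evictability was gated on the validity of page 0
// alone; scanning all pages avoids hanging when page 0 never arrives.
bool is_block_evictable(const std::array<bool, 512>& valid) {
    for (bool v : valid)
        if (v) return true;  // at least one resident page can be evicted
    return false;
}
```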

D2H time missing under oversubscription

In my understanding, there should be D2H memcpy under oversubscription, but not all log files under Results/Smart_Runtime/Oversub report a D2H memcpy time. Am I missing some details?

I hit an error and cannot fix it, please help me

GPGPU-Sim PTX: allocating global region for "__T20" from 0x100 to 0x135 (global memory space)
GPGPU-Sim PTX: Warning __T20 was declared previous at _4.ptx:12 skipping new declaration
GPGPU-Sim PTX: allocating stack frame region for .param "_ZN5cufhe12DataTemplateIiED1Ev_param_0" from 0x0 to 0x8
GPGPU-Sim PTX: allocating stack frame region for .param "_ZN5cufhe12DataTemplateIiED0Ev_param_0" from 0x0 to 0x8
GPGPU-Sim PTX: allocating stack frame region for .param "func_retval0" from 0x0 to 0x8
GPGPU-Sim PTX: allocating stack frame region for .param "_ZNK5cufhe12DataTemplateIiE8SizeDataEv_param_0" from 0x8 to 0x10
GPGPU-Sim PTX: allocating stack frame region for .param "_ZN5cufhe11LWESample_TIiED1Ev_param_0" from 0x0 to 0x8
GPGPU-Sim PTX: allocating stack frame region for .param "_ZN5cufhe11LWESample_TIiED0Ev_param_0" from 0x0 to 0x8
GPGPU-Sim PTX: allocating stack frame region for .param "func_retval0" from 0x0 to 0x8
GPGPU-Sim PTX: allocating stack frame region for .param "_ZNK5cufhe11LWESample_TIiE8SizeDataEv_param_0" from 0x8 to 0x10
GPGPU-Sim PTX: allocating stack frame region for .param "free_param_0" from 0x0 to 0x8
GPGPU-Sim PTX: Warning __T20 was declared previous at _4.ptx:12 skipping new declaration
GPGPU-Sim PTX: allocating global region for "_ZTVN5cufhe12DataTemplateIiEE" from 0x180 to 0x1a8 (global memory space)
_4.ptx:60: Syntax error:

.global .align 8 .u64 _ZTVN5cufhe12DataTemplateIiEE[5] = {0, 0, _ZN5cufhe12DataTemplateIiED1Ev, _ZN5cufhe12DataTemplateIiED0Ev, _ZNK5cufhe12DataTemplateIiE8SizeDataEv};
^

GPGPU-Sim PTX: parser error detected, exiting... but first extracting .ptx to "_ptx_errors_wFRTmT"
Aborted (core dumped)

Build From Source Errors

I am not using Docker but building from source. My environment is CentOS 6.10 with gcc 5.4 and CUDA 8.0. When running benchmarks, an exception occurs at src/option_parser.cc:102 because the argument str is an empty string while the type of m_variable is integer. The cause is src/gpgpu-sim/gpu-sim.cc:171, so I changed the 6th parameter from "" to "0". After recompiling, the benchmark runs fine.

cudaMallocManaged addresses which are not multiple of page size do not work correctly

UVM makes the base address of managed malloc allocations a multiple of 256.

This causes two basic blocks within a large page to share the same page.

For example, assuming each basic block corresponds to 64 KB, the first basic block owns 17 pages and the second basic block owns 17 pages; this means that the two basic blocks share the same page.

As a result, the same pages can be requested over PCIe. In addition, when evicting the pages, the algorithm implemented in the source code does not work correctly (valid_pages_erase() erases the wrong pages, fragmentation, etc.).
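The fix the report implies is to round the allocation base up to a page-size boundary so that no two basic blocks share a page. A minimal sketch, with constants chosen to match the 4 KB pages and 64 KB basic blocks described above (not the simulator's actual code):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t PAGE_SIZE = 4096;  // 4 KB small page (assumed)

// Round an address up to the next multiple of a power-of-two alignment,
// so consecutive managed allocations never straddle a shared page.
constexpr std::uint64_t align_up(std::uint64_t addr, std::uint64_t align) {
    return (addr + align - 1) & ~(align - 1);
}
```

With page-aligned bases, each 64 KB basic block spans exactly 16 whole pages, so two blocks can never share one.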

cudaMallocManaged support

Can the simulator support over-GPU-memory allocation using cudaMallocManaged? Suppose GPU memory is 4GB and I want to allocate 6GB on the host first, have the GPU access it by on-demand paging, and thereby trigger oversubscription. I found that cudaMallocManaged in cuda_runtime_api.cc and the subsequent call to gpu_mallocmanaged seem to have a 4GB size limit.

Duplicated calls on reserve_pages_remove()

If reserve_pages_remove() is called when a cache MISS or HIT_RESERVED occurs,

the reserve_pages_remove() from 'case 4' in writeback() over-decreases the reserved-page counter, which finally causes an assertion failure.

reserve_pages_remove() should only be called for the mem_fetch object returned from the lower level (e.g., L2).
The locations are 'case 4' in writeback() and where WRITE_ACK is checked.

Am I missing something?
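One common way to guard against such double decrements is to make the counter tolerate duplicate remove calls. This is only an illustrative pattern; the class and method names below are hypothetical, not the simulator's:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Illustrative reserved-page counter: each page records how many outstanding
// reservations it has, and remove() only decrements while the count is
// positive, so a duplicate call is detected instead of underflowing.
class ReservedPages {
public:
    void add(std::uint64_t page) { ++counts_[page]; }
    bool remove(std::uint64_t page) {
        auto it = counts_.find(page);
        if (it == counts_.end() || it->second == 0) return false;  // duplicate call
        if (--it->second == 0) counts_.erase(it);
        return true;
    }
private:
    std::unordered_map<std::uint64_t, unsigned> counts_;
};
```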

prmt_impl instruction

I'm kind of new to this, but how can we add our own implementations of instructions? Are there any useful links for learning that?
