nvshmem-db's Introduction

nvshmem-db

Tested Environment:

  • Compiler: /usr/bin/gcc-11 and /usr/bin/g++-11 on c01
  • CUDA compiler: /usr/local/cuda-12/bin/nvcc on c01
  • CMake CLI variables: -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12/bin/nvcc -DNVSHMEM_PREFIX=/opt/nvshmem/ -DMPI_HOME=/usr/hpcx/ompi -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11 -DCMAKE_BUILD_TYPE=Debug
  • NVSHMEM library version 2.9.0, located at /opt/nvshmem/lib/ on c01
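
For reference, the variables above can be combined into a single configure invocation (paths as listed above; the build directory name is arbitrary):

cmake -S . -B build \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12/bin/nvcc \
  -DNVSHMEM_PREFIX=/opt/nvshmem/ \
  -DMPI_HOME=/usr/hpcx/ompi \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-11 \
  -DCMAKE_BUILD_TYPE=Debug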

nvshmem-db's People

Contributors

jonas-wessner, thecodeinator, alexstaeding, alex2804

Watchers

Lasse Thostrup

nvshmem-db's Issues

Eliminate Sendbuffer

Implement a version of the shuffle that sends tuples directly one by one using nvshmem_put.
For this, we need a thread-level histogram to calculate thread-level write offsets on the destination PEs.
We also need to make sure that the input data is in symmetric device memory. We can register it as symmetric memory if possible, or allocate symmetric device memory and copy the data over.
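
A minimal device-side sketch of this variant, assuming a fixed-size Tuple struct as in #4 and illustrative names (thread_offsets, symm_out) that do not exist in the codebase:

// Hedged sketch: tuple-by-tuple puts with precomputed thread-level write offsets.
struct Tuple { uint64_t key; uint8_t payload[56]; };               // assumed tuple layout

__device__ void shuffle_tuple_by_tuple(const Tuple *local_tuples, uint64_t n_local,
                                       uint64_t *thread_offsets,   // [n_threads * n_pes], from the histogram pass
                                       Tuple *symm_out) {          // symmetric device memory on every PE
    const uint64_t tid    = blockIdx.x * blockDim.x + threadIdx.x;
    const uint64_t stride = (uint64_t)gridDim.x * blockDim.x;
    const int n_pes = nvshmem_n_pes();
    for (uint64_t i = tid; i < n_local; i += stride) {
        int dest = local_tuples[i].key % n_pes;                    // destination PE of this tuple
        uint64_t slot = thread_offsets[tid * n_pes + dest]++;      // thread-private offset, no atomics
        nvshmem_putmem_nbi(symm_out + slot, &local_tuples[i], sizeof(Tuple), dest);
    }
    nvshmem_quiet();                                               // wait for all puts to complete
}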

Presentation

The presentation is important for grading.
Should show a nice story and focus on the most interesting things.

Storyboard/Ideas:

  • Motivate the topic on one or two slides -> directly at the start: what is the issue, what do we hope to gain
  • Short overview -> after the listener's interest has been captured, briefly introduce what we want to talk about
  • Introduce shuffle theory -> how does shuffling work in general
  • Refer back to the motivation -> why NVSHMEM, which advantages do we gain
  • Theory of the GPU-initiated shuffle -> how is it different from the classic shuffle
  • TBD: start with implementation details, or start by introducing our benchmarks to lead into our implementation?

Install MPI

Ask the supervisors to install MPI, because it is required for some of the examples and for the reference benchmarks with GPUDirect.

Data shuffling with histograms

Implement data shuffling with histograms.

  • Each PE gets a vector of elements as input. Each element consists of a random id (the shuffling key) and a configurable but runtime-constant number of bytes (the remaining tuple data, whose contents are not relevant for our benchmark and can be arbitrary)

  • Each PE scans the local data and precomputes a local histogram of the number of tuples that it has for each PE.

  • Each PE broadcasts the local histogram to the other PEs

  • Each PE accumulates the received histograms to get a global histogram

  • Each PE allocates symmetric memory big enough to receive all data.

  • Each PE scans the local tuples again and uses asynchronous sends to put the data to the remote PEs. The histograms are used to compute offsets such that all PEs write to distinct memory locations and therefore need no synchronization (see the offset sketch after this list).

  • At the end of the shuffle, all PEs need to call quiet, fence or barrier (consider which is best) to finish the shuffle operation

  • Note that we can assume that all data fits in RAM. The benchmark should be configured in a way that it does not exceed the RAM capacity.
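
A minimal host-side sketch of the offset computation, assuming global_hist is the accumulated n_pes x n_pes histogram (all names are illustrative):

// Hedged sketch: global_hist[src * n_pes + dst] = number of tuples PE src holds for PE dst.
// PE my_pe writes its tuples for PE dst after everything that lower-ranked PEs send to dst.
uint64_t write_offset_on(int dst, int my_pe, int n_pes, const uint64_t *global_hist) {
    uint64_t offset = 0;
    for (int src = 0; src < my_pe; ++src)
        offset += global_hist[src * n_pes + dst];      // exclusive prefix sum over source PEs
    return offset;                                     // first free tuple slot for my_pe on PE dst
}

// The symmetric receive buffer on each PE must hold the sum over src of global_hist[src][my_pe]
// tuples; since nvshmem_malloc is collective, allocate the maximum over all PEs.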

Implement #4 without nvshmem

The idea of this task is to implement the logic described in #4 without NVSHMEM (i.e. using GPUDirect RDMA) as a baseline for comparison.

This ticket consists of two steps:

  • Research GPUDirect and prepare a short presentation + minimum working example for the group
  • Take the implementation of #4 and port it to GPUDirect

different dest version of 21

Implement the multithreaded part of issue #21 with multiple receiver PEs (maybe 3 receivers and 1 sender) and send each put call to a different destination PE. We want to send to different PEs from different threads, and to different PEs in different iterations of the same thread, to come close to a realistic shuffle scenario. We want to see whether nvshmem_put can coalesce calls in a multithreaded context even if the data being sent to the same PE is not contiguous in memory on the sender side.
This shall then be compared to the results of ticket #21.
pseudocode example:

// with 8 threads and 4 PEs; the input is processed in chunks of 64 bytes,
// and each thread sends 8 bytes per chunk
const uint32_t thread_offset = thread_id * 8;
for (uint32_t chunk = 0; chunk < num_bytes / 64; ++chunk) {
      // send to a different dest from each thread and to a different dest in each iteration
      int dest = (thread_id + chunk) % n_pes;
      nvshmem_putmem_nbi(symm_dst /* + matching offset on the receiver */,
                         start_of_array + chunk * 64 + thread_offset,
                         8 /* bytes */, dest);
}
nvshmem_quiet(); // wait for all sends to finish

Thread block level atomic insert into send buffers

Like with issue #9, use atomic offsets for inserting into the send buffers. But since atomics are possibly more expensive when used across thread blocks, create thread-block-local send buffers that can be written to using block-level atomics.

However, we should test first how much performance we could gain through micro-benchmarks comparing GPU-level atomics with thread block-level atomics. Refer to issue #12 for that.
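
A minimal sketch of the block-level variant, assuming per-block send buffers and an offset array kept in shared memory (all names illustrative):

// Hedged sketch: the offsets live in __shared__ memory, so the atomics only contend
// within one thread block; Tuple is the struct assumed in #4.
__device__ void insert_block_local(const Tuple &t, int dest,
                                   Tuple *block_send_buf,        // this block's buffers, n_pes * buf_capacity
                                   uint32_t buf_capacity,
                                   uint32_t *block_offsets) {    // __shared__ array, one counter per PE
    uint32_t slot = atomicAdd(&block_offsets[dest], 1u);         // block-level atomic
    block_send_buf[dest * buf_capacity + slot] = t;              // write into the block-local buffer
}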

Parallel writing of tuples to send buffer with fine grained offsets per thread

Same idea as #9 but without the need for atomic adds for send buffer insert

  • When computing the histograms in the first kernel, handle the whole vector of local tuples in batches. Each batch has the size of the send buffer that is used later.
  • For each batch, track how many tuples for each destination have been scanned by each GPU thread and save that for later
  • When sending the tuples out to the destinations in the second kernel, send them out in batches of the send buffer size used earlier.
  • To write into the send buffers without synchronization (one buffer per destination), use the number of tuples that each thread scanned for this destination in this send iteration to calculate a per-thread offset into the send buffers (see the sketch below).

To clarify: if the total number of local elements is n and the send buffer size is k, each batch processes k elements and there are n/k iterations. Since we have multiple destinations, there is one send buffer per destination (number of PEs n_pes).
In the worst case (all tuples of one batch go to the same destination PE), all elements of the batch still fit into one send buffer before it is sent out and can eventually be reused.
For each of the n/k batches, we have n_pes histograms with n_threadsPerPE elements each that store how many elements each thread has to send to the destination in this batch, i.e. we need n/k * n_pes * n_threadsPerPE values to store all offsets.
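
A minimal sketch of the per-thread offset lookup, assuming the per-thread counts from the first kernel are stored in a flat counts array as described above (names illustrative):

// Hedged sketch: offset of thread tid into the send buffer for destination dest in batch b.
// counts holds n/k * n_pes * n_threads entries recorded during the histogram pass.
__device__ uint32_t thread_write_offset(const uint32_t *counts, uint32_t batch, int dest,
                                        uint32_t tid, uint32_t n_threads, int n_pes) {
    const uint32_t *c = counts + ((size_t)batch * n_pes + dest) * n_threads;
    uint32_t offset = 0;
    for (uint32_t t = 0; t < tid; ++t)       // exclusive prefix sum over lower-ranked threads
        offset += c[t];
    return offset;                           // first free slot for this thread in this batch/destination buffer
}

In practice the prefix sum would be computed once per batch with a parallel scan rather than a per-thread loop.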

Note that this thread-level statistic can be kept in node-local memory and does NOT need to be shared between nodes, because it is only used to write into the local send buffers (as opposed to the node-level statistics, which are needed for the remote writes).

Get used to Nsight Systems / Nsight Compute

Test out the two profiling tools with our existing examples and benchmarks.
Everyone should have basic knowledge of the tools. This should help us in the later project development.

  • Jonas
  • Luis
  • Alex Muth
  • Alex Städing

Tuple Scan benchmark

This benchmark should show how many tuples a certain number of GPU threads can scan and copy to a send buffer per unit of time. Outcome: a plot with the number of threads on the x-axis and the tuple throughput on the y-axis.
We should test this with different tuple sizes to see if there are interesting effects when the data to be copied per tuple is larger.

Knowing the results, we would be able to calculate roughly how large the tuple size must be and how many threads we would need to fill the send buffers fast enough to reach a throughput near BW in the shuffle.
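
A minimal sketch of such a scan-and-copy kernel (illustrative names; the destination computation and buffer cycling are left out):

// Hedged sketch: each thread scans tuples with a grid stride and copies them into a
// send buffer; throughput = n_tuples * sizeof(Tuple) / elapsed_time.
__global__ void scan_copy(const Tuple *in, Tuple *send_buf, uint64_t n_tuples) {
    const uint64_t stride = (uint64_t)gridDim.x * blockDim.x;
    for (uint64_t i = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; i < n_tuples; i += stride)
        send_buf[i] = in[i];
}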

compare nvshmem_put granularity

Implement the following and compare the two versions to each other under different configurations.

For a fixed number of bytes transmitted per loop iteration, see the following pseudocode:

// with multiple threads
const uint32_t send_size = bytes_per_iter / n_threads;
const uint32_t thread_offset = thread_id * send_size;
for (uint64_t i = 0; i < num_bytes; i += bytes_per_iter) {
      nvshmem_putmem_nbi(symm_dst + i + thread_offset,        // symmetric destination
                         start_of_array + i + thread_offset,  // local source
                         send_size, dest_pe);
}
nvshmem_quiet(); // wait for sending to finish

// with one thread
// we send in one call what the different threads send separately in the other version;
// the number of loop iterations stays identical
for (uint64_t i = 0; i < num_bytes; i += bytes_per_iter) {
      nvshmem_putmem_nbi(symm_dst + i, start_of_array + i, bytes_per_iter, dest_pe);
}
nvshmem_quiet(); // wait for sending to finish

Run existing benchmarks

Find existing NVSHMEM benchmarks and run them:

  • Clearly document the configuration for building and running
  • Run on one node (sanity check)
  • Run on two nodes with one GPU per node (use host file for configuration)
  • Document the benchmark results
  • Finally, compare the results to those of our own benchmarks and check whether our benchmark results are reasonable.

Micro Benchmarks

To produce meaningful performance measurements and results for the evaluation report, we need to microbenchmark different aspects of the implementation.

  • 1. Check whether nbi_put buffers calls in the background (compare one big nbi_put vs. multiple small ones)
  • 2. Compare global atomics vs. local atomics (grid level vs. block level)
  • 3. (Higher level) Compare the default stream vs. decomposed input on multiple streams (hide synchronization cost)
  • 4. Use multiple threads for sending in the put-coalescing benchmark with different data, i.e. multiple threads iterate over the input and send different elements in parallel using NVSHMEM
  • 5. Investigate what message size is required to approach link BW with nvshmem_put_nbi: call nvshmem_put_nbi in a for-loop using the same data and measure the time until all puts complete, then calculate the throughput. Execute for different message sizes and plot; the throughput should approach the network BW as the message size is increased. Implement with one GPU thread and with multiple GPU threads (see the sketch below).
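
A minimal single-thread sketch for item 5, assuming symm_dst was allocated with nvshmem_malloc (names illustrative):

// Hedged sketch: issue n_msgs non-blocking puts of msg_size bytes each and wait once.
// Launched with <<<1, 1>>>; the host times the kernel and computes
// throughput = n_msgs * msg_size / elapsed_seconds.
__global__ void put_bandwidth(uint8_t *symm_dst, const uint8_t *src,
                              size_t msg_size, int n_msgs, int dest_pe) {
    for (int i = 0; i < n_msgs; ++i)
        nvshmem_putmem_nbi(symm_dst, src, msg_size, dest_pe);   // reusing the same data is fine here
    nvshmem_quiet();                                            // wait until all puts have completed
}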

compare one kernel launch with nvshmem vs. multiple kernel launches with RDMA

Compare the following two things:

  • Allocate memory for the input data

(1)
Do the following in a loop:

  • compute a part of the big input array (sleep might be sufficient)
  • return the address of the computed data from the kernel launch to the CPU
  • wait for kernel to finish
  • send data using rdma to other node
  • continue loop with next part input data

(2)

  • Start a kernel
  • do the following in a loop:
    • compute part of the data (sleep sufficient)
    • send the computed part of the data
    • wait for send operation to complete (nvshmem_quiet)
    • repeat for next portion of data

Time the runtime of both approaches. This should show the potential of one kernel launch with NVSHMEM in comparison to multiple kernel launches with device RDMA (a sketch of approach (2) follows at the end of this section).

NOTE: For simplicity, the data can be all written to the same destination, thereby overwriting the data from previous send operations. In reality, we would have to flush the data from the receive buffer to some other storage medium.

NOTE: For simplicity, it is sufficient to have only one sending PE and one receiving PE.

NOTE: We might implement a second version of NVSHMEM sending that uses double buffering like our shuffle does. This allows us to evaluate the effect of double buffering independently of the rest of the shuffle code.
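
A minimal sketch of approach (2), using a sleep as a stand-in for the compute phase (names illustrative; __nanosleep needs compute capability 7.0 or newer):

// Hedged sketch: one kernel alternates between "computing" a chunk and sending it.
__global__ void compute_and_send(uint8_t *symm_dst, const uint8_t *local_data,
                                 size_t chunk_size, int n_chunks, int dest_pe) {
    for (int c = 0; c < n_chunks; ++c) {
        __nanosleep(1000);                                  // stand-in for the compute phase
        nvshmem_putmem_nbi(symm_dst,                        // same destination, overwritten each chunk
                           local_data + (size_t)c * chunk_size,
                           chunk_size, dest_pe);
        nvshmem_quiet();                                    // wait for this chunk before the next one
    }
}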

Decouple node number and partition number (low priority)

The number of nodes should be independent of the number of partitions. This allows for more fine-grained assignment of partitions to nodes. If the partition number is chosen to be equal to the number of nodes or a multiple of the number of nodes, each node can get an equal number of the subsets.

  • Take input as one chunk of local data (no partitioning of input)
  • Set output partition count to a multiple of the PE count
  • Map tuples to partitions and create histograms based on that
  • Once the size of each partition is known from the histograms, map partitions to node ids such that each node id is assigned a similar number of tuples. Each node might end up with a different number of partitions, since the output partition sizes can be unequal due to skewed data.
  • For assigning the partitions to node ids, an algorithm similar to the following can be used (see the sketch after this list):
    • (1) Sort the partitions by size
    • (2) Starting from the biggest partition, assign them to nodes as follows:
    • a partition is always assigned to the node that has the fewest tuples assigned to it so far
    • This way, we get a maximally balanced mapping from partitions to nodes
  • Calculate write offsets based on the partitions that each node will receive. E.g. the write offset from PE 3 to PE 1 is the number of tuples that the nodes with smaller IDs than PE 3 (i.e. PEs 0, 1, and 2) have for PE 1 across all partitions that belong to PE 1.
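
A minimal host-side sketch of the greedy assignment (illustrative, assumes the partition sizes are already known from the histograms):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hedged sketch: assign partitions to nodes, largest partition first,
// always to the currently least-loaded node.
std::vector<int> assign_partitions(const std::vector<uint64_t> &part_sizes, int n_nodes) {
    std::vector<size_t> order(part_sizes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return part_sizes[a] > part_sizes[b]; });   // (1) sort by size
    std::vector<uint64_t> load(n_nodes, 0);
    std::vector<int> owner(part_sizes.size());
    for (size_t p : order) {                                                        // (2) biggest first
        int node = std::min_element(load.begin(), load.end()) - load.begin();       // least-loaded node
        owner[p] = node;
        load[node] += part_sizes[p];
    }
    return owner;                                                                   // partition -> node id
}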

Create Benchmarking Tools

The idea of this issue is to write benchmarking tools that facilitate the evaluation of our implementations. This includes visualization via plots as well as storing and displaying the data in a format usable for reports.

Ideas:

  • Use python3 + matplotlib for plotting
  • Data can be written to file (CSV style?) by benchmarks and data is read in by python script
  • Important discussion: What do we want to visualize?
  • Check what Nvidia tools already provide, do not implement redundantly

Micro Benchmark: Compare nvshmem_ptr to other nvshmem functions

nvshmem_ptr returns a pointer to a remote object that is locally accessible, which means loads and stores through this pointer are turned into remote memory accesses by the backend.
While this sounds very convenient, it is so far unclear at what cost this functionality comes.
Tasks:

  • Figure out nvshmem interface functions that provide similar functionalities to those provided by loads and stores via nvshmem_ptr
  • Test whether nvshmem_ptr guarantees sequential consistency, which is not guaranteed by other nvshmem interface functions if fence or quiet is not used.
  • Write a benchmark that compares the explicit use of data-transferring nvshmem functions to the implicit use via nvshmem_ptr. It is e.g. interesting how large numbers of sequential and random accesses are handled by nvshmem_ptr. We would expect that normal nvshmem_put or similar functions can coalesce consecutive accesses because they do not guarantee sequential consistency, while nvshmem_ptr might have to execute all accesses one after another. The benchmark should compare the performance with regard to this property (see the sketch below).
  • Document the run configuration and benchmark results
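
A minimal sketch of the two ways to express the same remote write (illustrative names); note that nvshmem_ptr only returns a usable pointer when the target PE is reachable via load/store (e.g. P2P over NVLink), otherwise it returns NULL:

// Hedged sketch: (a) implicit stores through nvshmem_ptr vs. (b) an explicit put.
__device__ void write_remote(uint64_t *symm_buf, const uint64_t *src, size_t n, int pe) {
    uint64_t *remote = (uint64_t *)nvshmem_ptr(symm_buf, pe);    // locally dereferenceable pointer, or NULL
    if (remote != nullptr) {
        for (size_t i = 0; i < n; ++i)
            remote[i] = src[i];                                  // (a) plain stores over the interconnect
    } else {
        nvshmem_uint64_put_nbi(symm_buf, src, n, pe);            // (b) explicit NVSHMEM put
        nvshmem_quiet();
    }
}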

Parallel writing of tuples to send buffer with atomic add

This issue builds on #4 and aims to make the writing to the send buffers parallel.

  • Parallel scan of local tuple vector (get destination)
  • Parallel insert into the send buffer via an atomic increment of the offset (see the sketch below)
  • Clean up code
  • Document most important functions
  • verify correctness
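
A minimal sketch of the atomic insert, using one global (grid-level) counter per destination PE (names illustrative; the block-local variant is sketched in the issue above):

// Hedged sketch: grid-level atomic offsets in global memory; Tuple as assumed in #4.
__device__ void insert_atomic(const Tuple &t, int dest,
                              Tuple *send_buf,                  // n_pes * buf_capacity tuples
                              uint32_t buf_capacity,
                              unsigned int *global_offsets) {   // one counter per destination PE
    unsigned int slot = atomicAdd(&global_offsets[dest], 1u);   // grid-level atomic
    send_buf[dest * buf_capacity + slot] = t;                   // distinct slot, no further sync needed
}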

Draw Conclusions from Microbenchmarks

  • Replot all relevant microbenchmarks
  • Sum up everything we have learned about NVSHMEM through microbenchmarking in bullet points (in the discussion section of the report)
