Giter Site home page Giter Site logo

spec's Introduction

spec

A repository that contain some benchmarks for partial evaluation of different GPU application scenarios leveraging AnyDSL framework and CUDA. Benchmarks are available as Python notebooks.

Implementations

  • Multiple string matching:
    • Three naive implementations in CUDA using global, constant and shared memory respectively to store the patterns
    • Implementation utilizing partial evaluation to "store" the patterns
  • Image separable convolution
    • Separable convolution taken from NVIDIA's SDK with constant memory to store the filter
    • Implementation utilizing partial evaluation to "store" the filter

Partial evaluation

Partial evaluation or program specialization is a well-known optimization technique that given a program and part of its input data, called static, specializes the program with respect to the data, producing another, optimized, program which if given only the remaining part of input data, called dynamic, yeilds the same results as the original program would have produced being executed given both parts of the input data.

Considering optimization of GPU memory accesses, partial evaluation is able to embed statically known array values directly into the code. More precisely, it allows the values to be accessed via instruction cache rather than through issuing load instructions.

For example, the following CUDA kernel:

__global__ void match_multy(char* patterns, int* p_sizes, int p_number, char* text, long text_size, char* result_buf) {

    long t_id = threadId();

    if(t_id < text_size){
        int p_offset = 0;
        int matched = 1;
        
        result_buf[t_id] = 0;

        for(int i = 0; i < p_number; i++) {//for each pattern
            matched = 1;
            if(t_id < text_size - p_sizes[i] + 1) {
                for(int j = 0; j < p_sizes[i]; j++) {
                
                    if(text[t_id + j] != patterns[j+p_offset]) {
                        matched = -1;
                        break;
                    }
                } 
            
                if(matched == 1) {
                    result_buf[t_id] = i+1; // 0 stands for missmatch
                }
            }
            p_offset += p_sizes[i];
        }             
    }
}

compiles to something like this:

alt text

However, since *patterns are fixed at runtime, partial evaluation could be applied resulting in the following residual compiled code:

alt text

The observed transformation results in the following performance of naive GPU pattern matching (evaluated using 16 file signatures as patterns and raw disk data as a subject string on GTX 1070):

alt text

The performance gain occured due to poor access patterns to each memory space aggravated by thread divergence. Poor access patterns induce either sequential loading or fetching redundant memory to satisfy alignment and fetch size. E.g. actual amount of memory transered for GTX 1070 matching with patterns in global memory and for partially evaluated version are depicted below (subject string size is 2.9GB raw disk data, patterns total length is about 160B)

  • Global memory version:

alt text

  • Partially evaluated one:

alt text

Thus, in case when e.g. constant memory achieves its best perforamance, partial evaluation hardly makes the application significantly faster. For example, considering Separable Convolution Filter implementation from NVIDIA's SDK, where filter kernel resides in constant memory and has the best access pattern, the relocation of the kernel performed by partial evaluation has the following effect:

alt text

Build instructions

Dependencies

  • Working installation of AnyDSL framework

    AnyDSL should be installed with RUNTIME_JIT:=true option set in config.sh

  • Working installation of NVIDIA CUDA with all paths set up

  • Convolution filtering depends on libjpeg, e.g. sudo apt-get install libjpeg-dev

  • Python 3

  • Python Jupyter notebook to view benchmarks

  • Python matplotlib, e.g. pip3 install -r requirements.txt

Build

To build the test applications simply run:

mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DAnyDSL_runtime_DIR="${PATH_TO_ANYDSL_FOLDER}/runtime/build/share/anydsl/cmake/" && make

or

cmake .. -DAnyDSL_runtime_DIR="${PATH_TO_ANYDSL_FOLDER}/runtime/build/share/anydsl/cmake/"

if cmake --version >= 3.11

To run the coresponding benchmarks, move compiled applications to coresponding benchmarking folder, run jupyter notebook -> Cell->Run All. E.g. for multiple pattern matching:

mv build/matching/cpp/match.out ../matching/benchmarking && mv build/matching/impala/match_spec.out ../matching/benchmarking

Then open matching/benchmarking/MatchingBenchmarking.ipynb, choose a file with a subject string and pass it to Runner, e.g. one can provide /dev/sda1 as input: runner = Runner(...,filename='/dev/sda1',offset=0...) and click Cell -> Run All

spec's People

Contributors

tiltedprogrammer avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.