komputeproject / kompute Goto Github PK

General purpose GPU compute framework built on Vulkan to support 1000s of cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usecases. Backed by the Linux Foundation.

Home Page: http://kompute.cc/

License: Apache License 2.0

C++ 73.05% Dockerfile 1.47% Python 6.99% CMake 14.72% Makefile 2.06% Shell 1.72%

vulkan vulkan-compute vulkan-example vulkan-tutorial vulkan-demos vulkan-compute-tutorial vulkan-compute-framework vulkan-compute-example cpp machine-learning

kompute's Introduction

Kompute

The general purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends)

Blazing fast, mobile-enabled, asynchronous, and optimized for advanced GPU acceleration usecases.

💬 Join the Discord & Community Calls 🔋 Documentation 💻 Blog Post ⌨ Examples 💾

Kompute is backed by the Linux Foundation as a hosted project by the LF AI & Data Foundation.

Principles & Features

Flexible Python module with C++ SDK for optimizations
Asynchronous & parallel processing support through GPU family queues
Mobile enabled with examples via Android NDK across several architectures
BYOV: Bring-your-own-Vulkan design to play nice with existing Vulkan applications
Explicit relationships for GPU and host memory ownership and memory management
Robust codebase with 90% unit test code coverage
Advanced use-cases on machine learning 🤖, mobile development 📱 and game development 🎮.
Active community with monthly calls, discord chat and more

Projects using Kompute ❤️ 🤖

GPT4ALL - An ecosystem of open-source on-edge large language models that run locally on your CPU and nearly any GPU.
llama.cpp - Port of Facebook's LLaMA model in C/C++.
tpoisonooo/how-to-optimize-gemm - row-major matmul optimization.
vkJAX - JAX interpreter for Vulkan.

Getting Started

Below you can find a GPU multiplication example using the C++ and Python Kompute interfaces.

You can join the Discord for questions / discussion, open a github issue, or read the documentation.

Your First Kompute (C++)

The C++ interface provides low level access to the native components of Kompute, enabling for advanced optimizations as well as extension of components.

void kompute(const std::string& shader) {

    // 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    kp::Manager mgr; 

    // 2. Create and initialise Kompute Tensors through manager

    // Default tensor constructor simplifies creation of float values
    auto tensorInA = mgr.tensor({ 2., 2., 2. });
    auto tensorInB = mgr.tensor({ 1., 2., 3. });
    // Explicit type constructor supports uint32, int32, double, float and bool
    auto tensorOutA = mgr.tensorT<uint32_t>({ 0, 0, 0 });
    auto tensorOutB = mgr.tensorT<uint32_t>({ 0, 0, 0 });

    std::vector<std::shared_ptr<kp::Tensor>> params = {tensorInA, tensorInB, tensorOutA, tensorOutB};

    // 3. Create algorithm based on shader (supports buffers & push/spec constants)
    kp::Workgroup workgroup({3, 1, 1});
    std::vector<float> specConsts({ 2 });
    std::vector<float> pushConstsA({ 2.0 });
    std::vector<float> pushConstsB({ 3.0 });

    auto algorithm = mgr.algorithm(params,
                                   // See documentation shader section for compileSource
                                   compileSource(shader),
                                   workgroup,
                                   specConsts,
                                   pushConstsA);

    // 4. Run operation synchronously using sequence
    mgr.sequence()
        ->record<kp::OpTensorSyncDevice>(params)
        ->record<kp::OpAlgoDispatch>(algorithm) // Binds default push consts
        ->eval() // Evaluates the two recorded operations
        ->record<kp::OpAlgoDispatch>(algorithm, pushConstsB) // Overrides push consts
        ->eval(); // Evaluates only last recorded operation

    // 5. Sync results from the GPU asynchronously
    auto sq = mgr.sequence();
    sq->evalAsync<kp::OpTensorSyncLocal>(params);

    // ... Do other work asynchronously whilst GPU finishes

    sq->evalAwait();

    // Prints the first output which is: { 4, 8, 12 }
    for (const float& elem : tensorOutA->vector()) std::cout << elem << "  ";
    // Prints the second output which is: { 10, 10, 10 }
    for (const float& elem : tensorOutB->vector()) std::cout << elem << "  ";

} // Manages / releases all CPU and GPU memory resources

int main() {

    // Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    // files). This shader shows some of the main components including constants, buffers, etc
    std::string shader = (R"(
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initalization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    )");

    // Run the function declared above with our raw string shader
    kompute(shader);
}

Your First Kompute (Python)

The Python package provides a high level interactive interface that enables for experimentation whilst ensuring high performance and fast development workflows.

from .utils import compile_source # using util function from python/test/utils

def kompute(shader):
    # 1. Create Kompute Manager with default settings (device 0, first queue and no extensions)
    mgr = kp.Manager()

    # 2. Create and initialise Kompute Tensors through manager

    # Default tensor constructor simplifies creation of float values
    tensor_in_a = mgr.tensor([2, 2, 2])
    tensor_in_b = mgr.tensor([1, 2, 3])
    # Explicit type constructor supports uint32, int32, double, float and bool
    tensor_out_a = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    tensor_out_b = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))

    params = [tensor_in_a, tensor_in_b, tensor_out_a, tensor_out_b]

    # 3. Create algorithm based on shader (supports buffers & push/spec constants)
    workgroup = (3, 1, 1)
    spec_consts = [2]
    push_consts_a = [2]
    push_consts_b = [3]

    # See documentation shader section for compile_source
    spirv = compile_source(shader)

    algo = mgr.algorithm(params, spirv, workgroup, spec_consts, push_consts_a)

    # 4. Run operation synchronously using sequence
    (mgr.sequence()
        .record(kp.OpTensorSyncDevice(params))
        .record(kp.OpAlgoDispatch(algo)) # Binds default push consts provided
        .eval() # evaluates the two recorded ops
        .record(kp.OpAlgoDispatch(algo, push_consts_b)) # Overrides push consts
        .eval()) # evaluates only the last recorded op

    # 5. Sync results from the GPU asynchronously
    sq = mgr.sequence()
    sq.eval_async(kp.OpTensorSyncLocal(params))

    # ... Do other work asynchronously whilst GPU finishes

    sq.eval_await()

    # Prints the first output which is: { 4, 8, 12 }
    print(tensor_out_a)
    # Prints the first output which is: { 10, 10, 10 }
    print(tensor_out_b)

if __name__ == "__main__":

    # Define a raw string shader (or use the Kompute tools to compile to SPIRV / C++ header
    # files). This shader shows some of the main components including constants, buffers, etc
    shader = """
        #version 450

        layout (local_size_x = 1) in;

        // The input tensors bind index is relative to index in parameter passed
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Kompute supports push constants updated on dispatch
        layout(push_constant) uniform PushConstants {
            float val;
        } push_const;

        // Kompute also supports spec constants on initalization
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint( in_a[index] * in_b[index] );
            out_b[index] += uint( const_one * push_const.val );
        }
    """

    kompute(shader)

Interactive Notebooks & Hands on Videos

You are able to try out the interactive Colab Notebooks which allow you to use a free GPU. The available examples are the Python and C++ examples below:

Try the interactive C++ Colab from Blog Post	Try the interactive Python Colab from Blog Post

You can also check out the two following talks presented at the FOSDEM 2021 conference.

Both videos have timestamps which will allow you to skip to the most relevant section for you - the intro & motivations for both is almost the same so you can skip to the more specific content.

Watch the video for C++ Enthusiasts	Watch the video for Python & Machine Learning Enthusiasts

Architectural Overview

The core architecture of Kompute includes the following:

Kompute Manager - Base orchestrator which creates and manages device and child components
Kompute Sequence - Container of operations that can be sent to GPU as batch
Kompute Operation (Base) - Base class from which all operations inherit
Kompute Tensor - Tensor structured data used in GPU operations
Kompute Algorithm - Abstraction for (shader) logic executed in the GPU

To see a full breakdown you can read further in the C++ Class Reference.

Full Architecture	Simplified Kompute Components
(very tiny, check the full reference diagram in docs for details)

Asynchronous and Parallel Operations

Kompute provides flexibility to run operations in an asynrchonous way through vk::Fences. Furthermore, Kompute enables for explicit allocation of queues, which allow for parallel execution of operations across queue families.

The image below provides an intuition on how Kompute Sequences can be allocated to different queues to enable parallel execution based on hardware. You can see the hands on example, as well as the detailed documentation page describing how it would work using an NVIDIA 1650 as an example.

Mobile Enabled

Kompute has been optimized to work in mobile environments. The build system enables for dynamic loading of the Vulkan shared library for Android environments, together with a working Android NDK wrapper for the CPP headers.

For a full deep dive you can read the blog post "Supercharging your Mobile Apps with On-Device GPU Accelerated Machine Learning".

You can also access the end-to-end example code in the repository, which can be run using android studio.

More examples

Simple examples

End-to-end examples

Python Package

Besides the C++ core SDK you can also use the Python package of Kompute, which exposes the same core functionality, and supports interoperability with Python objects like Lists, Numpy Arrays, etc.

The only dependencies are Python 3.5+ and Cmake 3.4.1+. You can install Kompute from the Python pypi package using the following command.

pip install kp

You can also install from master branch using:

pip install git+git://github.com/KomputeProject/kompute.git@master

For further details you can read the Python Package documentation or the Python Class Reference documentation.

C++ Build Overview

The build system provided uses cmake, which allows for cross platform builds.

The top level Makefile provides a set of optimized configurations for development as well as the docker image build, but you can start a build with the following command:

   cmake -Bbuild

You also are able to add Kompute in your repo with add_subdirectory - the Android example CMakeLists.txt file shows how this would be done.

For a more advanced overview of the build configuration check out the Build System Deep Dive documentation.

Kompute Development

We appreciate PRs and Issues. If you want to contribute try checking the "Good first issue" tag, but even using Kompute and reporting issues is a great contribution!

Contributing

Dev Dependencies

Testing
- GTest
Documentation
- Doxygen (with Dot)
- Sphynx

Development

Follows Mozilla C++ Style Guide https://www-archive.mozilla.org/hacking/mozilla-style-guide.html
- Uses post-commit hook to run the linter, you can set it up so it runs the linter before commit
- All dependencies are defined in vcpkg.json
Uses cmake as build system, and provides a top level makefile with recommended command
Uses xxd (or xxd.exe windows 64bit port) to convert shader spirv to header files
Uses doxygen and sphinx for documentation and autodocs
Uses vcpkg for finding the dependencies, it's the recommended set up to retrieve the libraries

If you want to run with debug layers you can add them with the KOMPUTE_ENV_DEBUG_LAYERS parameter as:

export KOMPUTE_ENV_DEBUG_LAYERS="VK_LAYER_LUNARG_api_dump"

Updating documentation

To update the documentation you will need to:

Run the gendoxygen target in the build system
Run the gensphynx target in the build-system
Push to github pages with make push_docs_to_ghpages

Running tests

Running the unit tests has been significantly simplified for contributors.

The tests run on CPU, and can be triggered using the ACT command line interface (https://github.com/nektos/act) - once you install the command line (And start the Docker daemon) you just have to type:

$ act

[Python Tests/python-tests] 🚀  Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ] 🚀  Start image=axsauze/kompute-builder:0.2
[C++ Tests/cpp-tests      ]   🐳  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
[Python Tests/python-tests]   🐳  docker run image=axsauze/kompute-builder:0.2 entrypoint=["/usr/bin/tail" "-f" "/dev/null"] cmd=[]
...

The repository contains unit tests for the C++ and Python code, and can be found under the test/ and python/test folder.

The tests are currently run through the CI using Github Actions. It uses the images found in docker-builders/.

In order to minimise hardware requirements the tests can run without a GPU, directly in the CPU using Swiftshader.

For more information on how the CI and tests are setup, you can go to the CI, Docker and Tests Section in the documentation.

Motivations

This project started after seeing that a lot of new and renowned ML & DL projects like Pytorch, Tensorflow, Alibaba DNN, Tencent NCNN - among others - have either integrated or are looking to integrate the Vulkan SDK to add mobile (and cross-vendor) GPU support.

The Vulkan SDK offers a great low level interface that enables for highly specialized optimizations - however it comes at a cost of highly verbose code which requires 500-2000 lines of code to even begin writing application code. This has resulted in each of these projects having to implement the same baseline to abstract the non-compute related features of the Vulkan SDK. This large amount of non-standardised boiler-plate can result in limited knowledge transfer, higher chance of unique framework implementation bugs being introduced, etc.

We are currently developing Kompute not to hide the Vulkan SDK interface (as it's incredibly well designed) but to augment it with a direct focus on the Vulkan SDK's GPU computing capabilities. This article provides a high level overview of the motivations of Kompute, together with a set of hands on examples that introduce both GPU computing as well as the core Kompute architecture.

kompute's People

Contributors

Stargazers

Watchers

Forkers

nihui bytemeow lindemeier cnheider donaldwhyte ph5 0x0f0f0f singh-ap gridl generalzzd monkeyking sfrias alexander-g jomiq dahlialia mlbo zeta1999 o7s8r6 hpgmiskin ryancargan heitaoflower stjordanis unexploredtest brunoscaglione paulicka asr-engineering-consulting thinking-tower corentin-pro vidveilarsen joyxu chenkuo trendingtechnology huangweiboy2 monishramadoss iamfork itsbasi jerryct chengwei920412 bambang mfkiwl norablackcat lopuhin wildarch mohammedjabi xiaohan2013 liuyf5231 yym064 b1060t geniusventures yehitsmee laplacekorea tpoisonooo tigrondev photon-schiesser xiaoyu1004 com8 xkey- tuskaw kdm-4 ajunlonglive arjunmurthys cherie6866 lucasgivord archie3d hungrymonkey gg-big-org icodein msnh2012 codezzg impact-of-compiler-warnings-thesis mirsadm crydsch jackylucifer yorrickslr forksnd jamestiotio brianpetkovsek nomic-ai lucasbartosch wdshin minda asbaharoon java-cds-club weblfe xenobenn magicrabbittang minisparrow caaperezag erdal-pb rulegen paulpan05 renancsales slopecraft cebtenzzre yuenxq qeu-b-458 gagliardetto arnochenfx dearborn-open-ai gmarko

kompute's Issues

Add documentation with Doxygen and Sphinx

Explore adding documentation with autodocs via doxygen, and integrate with sphinx

Enable layout to be configured dynamically within shaders

Currently layout is set up in the shader, it would be possible to explore having the algorithm configure the layout of the shader itself dynamically, so it can be set and configured on the host side even if this is something that was initially set up in the shader.

Create mocks to isolate unit tests for components

Currently the unit tests depend on the overarching components as well as underlying vulkan GPU objects. This of course is suboptimal for unit tests and would benefit from having a set of mocks for the core and vulkan classes. This may require migrating from Catch2 to Google Test.

Support imageBuffers (instead of just buffers)

Currently Tensors only use buffers to store data, we'll have to look into adding capabilities to support imageBuffer, but will also be important to identify the key usecases that this will unlock.

Use tool like gperftools for unit test memory profiling to identify/validate memory leaks

Explore tools like https://github.com/gperftools/gperftools to add as unit testing for memory profiling

Remove vulkan commandbuffer from Tensor

Currently a tensor owns a comand buffer vulkan object and performs the recordCopy as a command. Initially this was designed as the idea was to avoid exposing the buffer from the tensor itself. The command buffer is also used to create the barrier. However in the case of the barrier, it's possible to just return the barrier itself, and for the case of the recordCopy, it would be possible to just pass the commandbuffer as parameter as opposed to having it as an owned component. This is also important as a tensor shoudl not be bound to a single command buffer.

Add parallel scan sum aggregate example

Implement aggregate sum of array as an out of the box compute shader that can be used either as an example or as out of the box shader that users can leverage.

Add explicit multi-threading interfaces to ensure correctness when running in parallel

Currently the design has been built with isolation of vulkan and Kompute components in order to ensure mutiple applications can make use of the different resources available, however there is currently no explicit interface to enforce thread safety. This will require exploring as well a set of use-cases that could be performed in parallel, together with the explicit memory and computation requirements/constraints that need to be fullfilled.

Support multiple types of vk Buffer types like uniform, etc (beyond storage buffer)

Currently the storage buffer is the only one supported, it will be important to add an extra capability for buffers of types beyond storage buffer. This will have to be an initial research to identify which types to prioritise and why.

Create exported project via vcpkg for simpler installation

Explore creating an export for vcpkg and contribute it back to the listing so it's simpler to get started with vulkan kompute

Diagnose memory profiling to ensure there are no memory leaks on objects created [CPU]

Use memory profiler to perform initial assessment for any potential obvious non exhaustive memory leaks that may be left due to references to pointer objects that are not freed, as well as potentially Vulkan resources that are not freed

Expose vulkan components inside algorithm so they can be overriden by external application

Currently the kp::Algorithm doesn't allow for the internal components to be provisioned by default, and should be explored how these can be exposed so components like pipelines can be provided externally.

Handle situation where empty vector of tensors passed to AlgoOpBase initialiser

Currently if empty array of tensors is passed to the OpAlgoBase, it would cause an exception as dispatch is expected to be inferred to 1, so need to handle the situation either by erroring or providing a default.

Support granularity of configuration for descriptorsets via parameters/parameterGroup

Currently there is potential value from providing an abstraction on the way tensors are made available to algorithms via thec concept of Kompute "Parameters" or "Parameter Groups". This issue encompasses exploring the design for parameter groups to be able to expose how it can be achieved.

Make specialisation data extensible

Currently the specialization data is set automatically to the sizes for all the tensors. Explore how this could be extended both in providing further parameters as well as different types.

Remove spdlog as a required dependency

Currently spdlog is a required dependency, however for ease of use it should be possible to remove this dependency by providing optional macros that just use iostream to print the data provided. This would provide users to also have the flexibility of adding the relevant logging as required.

Create operation to copy data from local to device memory with staging

Currently the only way to copy data from the local tensor vector to device memory is when using the OpCreateTensor command, which would handle the creation of the staging tensor. Because of this limitation, the only way to continuously update a tensor for GPU processing after creation is by creating it as type TensorTypes::eStaging and run mapDataIntoHostMemory() after the vector data is updated. There should be a command that enables for the copy to device with a staging tensor, such as OpMapToDevice.

Add example showing how existing vulkan graphical engine can integrate Kompute

Currently all the examples show how to use Vulkan Kompute from scratch, but it will be important to show how vulkan kompute can be used on an existing vulkan application. A potential way to explore this may be by converting some of the Sacha Willems Vulkan examples but using Kompute for the computational aspect https://github.com/SaschaWillems/Vulkan

Expose ability to create barriers in OpTensor operations

Currently the optensor operations explicitly do not create a barrier, but there may be situations where this may be desired. Identify whether there are usecases where this would be relevant and explore best ways to expose the barrier creation (whether as a constructor parameter, or potentially as a completely separate default operation).

Add Tensor constructor that just require size instead of full data array

Currently Tensors have a constructor that allows to be created with full data array. This is sub-optimal when staging tensors or output tensors are created given that the data would be copied only to infer the size of the buffer, but the data would be replaced (that is unless the staging tensor is being used to copy from host to device). This issue encompasses adding a constructor or means to create tensors without data array.

Support multiple types for Kompute Tensors

Currently only uint32_t is supported but looking to support further types. This will also require evaluating how these are managed by "Algorithms" and "ParameterGroups"

Add specialisation data to algorithm with default tensor size

To add specialisation data point by default which would contain the size of the input tensor or tensors

Enable for compute shaders to be provided in raw form

For simplifying development iterations and initial experience explore how raw compute shaders can be provided so it doesn't require converting to compiled spirv. This would also require equal effort to ensure transitioning to the recommended way of statically compiled shaders.

Document the three types of memory ownership in classes - never, optional and always

Currently all Kompute classes have three types of conceptual ownership on vulkan resources; Never owned resources, optionally owned resources, and always owned resources. These shoudl be documented as they would provide an intuition on what are the components that are expected to always be passed (or not)

Add noexcept to all constructors to explicitly state no exception thrown

Add noexcept keyword to all constructors (except manager) to explicitly mention they don't throw exceptions.

Evaluat performance of copy command on tensor (recordCopy vs map)

Currently the OpCreateTensor performs a recordCopy command to all the tensors provided regardless whether they are hostvisible or devicevisible. The options for hosvisible would be to copy with a recordcopy command as it currently is, or alternatively perform a copy by mapping the tensor vector data. This could have performance optimizations that could be explored if this can be exposed to the user.

Explore abstracting further the Sequence begin end record eval workflow into just record and eval

Currently it requires an explicit setup of the begin end record eval workflow, however it would be possible to provide a simplified record->eval workflow where each would start and end the recording respectively. This would be based on usage.

Allow for extending memory allocations for optimization

When running in specialised hardware there are memory optimizations required on the type of memory that is used and allocated, so would be required to explore efficient ways to expose the memory types that are used (and whether the current abstractions are enough)

Add tests and documentation for loops passing data to/from device

Currently there is a complex test in the testLogisticRegression that shows how to address complex examples, but would be required to add a set of more generic tests that evaluate the key functionality around being able to perform loops of existing evaluate functions that pass data in and out of shaders. This will involve complex loops and data passing, together with documentation into the advanced examples that shows how this can be used for multiple design patterns.

Add more advanced ML implementations (starting with LR, then DL, etc)

Currently the examples (and tests) use very basic shaders. Should explore adding more advanced implementations for either data processing or machine learning training/inference. Could start with something more basic like logistic regression implementation.

Add non-copyable class to core kompute components

Currently most of the Kompute components should not be passed using assignment or copy constructor operators. Also potentially deleting the base constructor on most classes that should not be created with the required data. ie

class NonCopyable {
public:
    NonCopyable() = default;
    NonCopyable(const NonCopyable&)  = delete;
    NonCopyable(const NonCopyable&&) = delete;
    NonCopyable& operator=(const NonCopyable&) = delete;
    NonCopyable& operator=(const NonCopyable&&) = delete;
};
}

Add method to mark tensor as out-of-sync when data is modified

Currently tensors can be initialised and this can be checked with the isInit function. We should explore how to also guard from the tensors beign out-of-sync when operations that modify memory are carried out, as well as when the host memory is modified without calling maptohost function

Explore / discuss for potential ideas or improvements

Open issue to openly discuss potential ideas or improvements, whether on documentation, interfaces, examples, bug fixes, etc.

Migrate to GTest

In order to start exploring mock testing via #8 first step will be to move from catch2 to gtest

Create delete function in manager to free / destroy sequence

Currently there is only way to create a new sequence but there is no way to destroy the sequence inside the manager

See #113 for the discussion providing further context on the exploration related to this issue. For compleeteness, the conversation is added below:

@alexander-g : @axsaucedo Bump, this is quite important.
If you are unsure about asynchronous operations, it would already help to make those methods private and to only clean up anonymous sequences in synchronous ops, i.e. in evalOpDefault directly after the sequence is evaluated.

@axsaucedo: @alexander-g hmm that is actually a pretty good idea... ok I agree this is quite important, thank you @aliPMPAINT for the initial work, I will pull the branch, and do some deeper testing, as well as expose via python interface to make sure we can merge. Thanks both for driving forward this areas.

@axsaucedo: @alexander-g currently looking at this, I now remember why we don't delete the function right after the sequence is executed. The reason why this is by design is because the sequence actually acquires the GPU memory ownership of the Tensors when running the OpCreateTensors. More specifically, it is currently in the memory hierarchy for the OpBase to have a choice whether to free the Tensors it owns or not - here is the line that shows how OpCreateTensor owns the tensors it uses by setting the last value to true [... code example ...] This is actually tied to the discussion I provided in your #130 PR (#130 (comment)) where I basically mention that there may be a better way to think about the OpTensorCreate operation, and the hierarchy of the Tensor memory management overall. This is something that would require a deeper dive, primarily as I need to check if it can make sense for the memory ownership to no longer be Manager -> Sequence -> Op -> Tensor, and update it to remove Tensor as a dependency of the operation. This could require a less trivial refactor, as it's not clear what that hierarchy would be otherwise. This is teh reason why anonymous Sequences are actually kept. This could actually be change such that anonymous sequences are destroyed after execution, which is what we had at the very beginning but as you can imagine this led to tensor memory being destroyed right after a OpTensorCreate in an anonymous function. But this woudl require specific awareness of users, which would have to always run the OpTensorCreate in a non-anonymous function, and then the rest of the operations in an anonymous function.

@axsaucedo: For completeness it may be worth sharing another piece of insight to provide the full picture. Namely that when I was initially desigining the tensors, i had an idea where instead of having two tensors - a Staging and a Device tensor - all the logic would be contained as a single Tensor. Namely this single tensor would contain both relevant memories for the staging and host tensors. The reason why this was not pursued at the end is because the idea was to provide further access and granularity into how the memory is made available to the user - this way the user can actually decide to build their own OpTensor operations, which may own different tensors, sometimes perhaps destroying the staging tensor completely to avoid having the memory overhead. With this in mind, it could be revisited, but there is that disadvantage that for any device tensor, there would be always the overhead of an extra memory component that would be obscured by the user. The advantage of this is that in this case the staging tensor wouldnt' be created directly by the OpCreate operation, which means that the tensors could be owned by the top level manager. For all of these there are various tradeoffs that could be reassessed. For now, the simplest way to approach this is to enable a flag that alllows for deletion of anonymous sequences as soon as they are executed, which can be used by more advanced users that would know that using the OpCreateTensor would involve memory management, and hence should be executed in a non-anonymous function that should be managed manually (and then deleted explicitly with the deleteNamedSequence.

Enable OpCreateTensor for more than 1 tensor

Currently OpCreateTensor is limited to a single tensor, would be required to add capabilities to perform creation action to all the tensors passed. This would have to take into account tensors which will require staging tensors (ie eDevice type tensors)

Provide further granularity on handling staging tensors

Currently staging tensors are created by OpCreateTensor but they are kept for the life of the OpCreateTensor. Should explore ways to expose more granular functionality, whereby the mStaging tensors may only be needed once, for example, in the case of input tensors. There are other cases where the staging tensors would be expected to be used many times, in case that ther input tensor is to be updated multiple times from the host (to avoid creating a staging tensor every time). This may require potentially moving the staging tensor ownership into the Tensor class itslef. This would mean that the ownership for the vulkan memory of this tensor would still be owned by the life of the OpCreateTensor, but it would still be possible to have other functions like an OpMapData (as per #39) which could still leverage the staging tensor (whcih can either be already existing or created/destroyed in the single operation.

Support multi-dimensional Tensors

Currently only 1 dimensional vectors are supported, looking to add extra support for multi-dimensional vectors

Expose dst/src offsets in recordCopyFrom tensor function

Currently it's not possible to specify the region to copy from when calling the tensor.recordCopyFrom function. Should investigate the most effective way in which this can be exposed in case people may want to extend this functionality.

Validate if mapped memory should always have flush and the workflow in which it should be done

Currently the way that memory is mapped into host memory is always flushed. This in theory is only necessary if the host bit is not coherent, so should ensure this doesn't have a performance issue, or alternatively make sure it's only carried out if the memory has to be flushed (ie if the memory host bit is not coherent).

ie https://github.com/axsaucedo/vulkan-kompute/blob/af4f429d4df40e3b963bc2e106ba0756e80f1d29/src/Tensor.cpp#L202-L212

Introduce optional checks to ensure operations are being done on same device

Currently there are commands such as the tensor's recordCopyFrom command that can be potentially executed with resources that may hvae different devices. This would encompass adding functionality to check that the commands are being carried out on the same device (otherwise it would lead into segfaults)

Improve access to tensor underlying data for speed and ease of access

Currently the only access to tensor data is via the data() function, which currently returns as value making it quite inefficient. Should explore extending to pass by const ref, as well as potentially overloading the subscript operator.

Move all todos in the code into github issues

Currently there were quite a few todos added in the code, objective is to move all these into github issues for either next or the following release

Explore simplifying AccessFlags / PipelineStageFlags for recordBufferMemoryBarrier

Currently the recordBufferMemoryBarrier from the Tensor class requires raw vulkan flags to set up the barrier. There are higher level patterns and information that the tensor itself holds that could eb used to infer or simplify the provisoning of these parameters. Will be required to explore if this is something required / worth abstracting
https://github.com/axsaucedo/vulkan-kompute/blob/cb0d7f7cf38974410abb3ff76563c5772f0f24f0/src/include/kompute/Tensor.hpp#L58-L61

OpCreateTensor doesn't map data into GPU with OpCreateTensor for host tensors

Currently the opcreatetensor maps the data into the gpu for device tensors using a staging tensor, but for the host tensors it does not perform a copy, expecting the copy to be done manually (as per the TestTensor.cpp tests). This should be also carried out by the OpCreateTensor on the postSubmit function.

OpBase provide further granularity on which tensors in the array should be freed

Explore adding functionality to provide visibility on which tensors should be freed (as opposed to binary, all or none). This is in the following line:
https://github.com/axsaucedo/vulkan-kompute/blob/b91c392f5e1eaa0393bc8bef4555ff9918cdf98f/src/include/kompute/OpBase.hpp#L82

Add Android example for Kompute

One of the bigger benefits of Vulkan is its use in mobile. Should add an example on how vulkan can be used in android to enable android devs to get up and running with vulkan kompute easily on mobile.

Add function to reinitialise / resize / recreate tensor

Currently once the tensor is created, the size of the vector is fixed and there is no option to reshape or recreate. If new data is passed which is different size (or if it's required to re-configure a tensor to use a new array size), it is not possible. It should be made possible to resize and re-initialise tensors, whether directly through the OpCreateTensor or through a new operation.

Add preSubmit function to OpBase to account for multiple eval commands in parallel

Currently there is functionality for postSubmit which is executed after the submission has been carried out. However there are situations where there are commands that need to be carried out on every eval, such as in the case of the OpTensorSyncDevice, where the data should be copied and mapped into the device memory before the record command is re-executed.

Add vulkan memory allocator capabilities (or extension capabilities)

Currently it's not possible to allow users to provide custom memory allocators. Adding capabilities for memory allocators would provide further granularity and would be especially helpful for applications that bring their own vulkan systems, and use kompute for kompute specific tasks. An exploration could be to use Vulkan Memory allocator https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator which has C++ headers https://github.com/malte-v/VulkanMemoryAllocator-Hpp