m4rs-mt / ilgpu Goto Github PK

View Code? Open in Web Editor NEW

1.1K 29.0 108.0 11.13 MB

ILGPU JIT Compiler for high-performance .Net GPU programs

Home Page: http://www.ilgpu.net

License: Other

C# 98.73% PowerShell 0.10% HTML 0.73% Ruby 0.01% SCSS 0.17% JavaScript 0.12% Python 0.15%

ilgpu jit gpu compiler nvidia amd intel ptx cuda opencl

ilgpu's People

Contributors

Stargazers

Watchers

Forkers

cephdon hansen-consultancy danwbr awesomedotnetcore moftz samuto denniste kevmal linhdh chrisaliotta karayakar keroosha nguyenvuduc alllucky1996 costigan adamreeve jgiannuzzi kingland yey007 timer-t mfagerlund saimarpaka mm86133 asaurytski-exadel uzbekdev1 davidmatthews1uvm ruberik killop devbox10 marcin-krystianc lwqwag bangush nnelg losttech rpfeuti mfkl pc-crashes hengle nullandkale 76creates dystudio ljubon joey9801 leeairw sucrose0413 mpsquark star64 deng0 nitell redcool zamanashiq3 infarh mfkiwl hokb bmstrzelecki mtkaaai free-gate aluevano telgwin ericjohnsohnisc liudeke nscript-site toanlcgift mlh-fellowship naskio visiontekf ginkobalboa mayuri10000 boyfacegirl tortillazhawaii pavlovic-ivan damian-666 doctorcode-org hyamw corwinjoy ericeckstein fabiovincenzi deimi1985 hengtek stantoxt ddjerqq enricomi contlatdiabe xueshaoyu 7comploncuwo demarillacizere sunxiaotianmg incerrygit dfki-jugr vivoware profe-98 kant2002 diluculo aozora404 12345swordy darth-vader-lg zhaozhongguo95 eecsb jesperbandersen kilngod

ilgpu's Issues

LDToken/Static array is not supported

I'm having issue with creation of new array in kernel.
If I'm using something like

var sobelXKern = new short[3, 3]
{
    { -1,0,1 },
    { -2,0,2 },
    { -1,0,1 }
};
var sobelXKern = new short[9]
{
    -1,0,1,
    -2,0,2,
    -1,0,1
};

The compiler will create ldtoken instruction, which is not supported.

Is there any way to have predefined/readonly array to use in kernel?

Missing support for constant & __launch_bounds__ ?

Example __launch_bounds__

Cpp

#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     2
__global__ void
__launch_bounds__( MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP )
Kernel1() { }

LLVM-IR

define void @Kernel1() #0 {
  %1 = call i32 @cudaLaunch(i8* bitcast (void ()* @Kernel1 to i8*))
  br label %2
  ret void
}
!0 = !{void ()* @Kernel1, !"kernel", i32 1}
!1 = !{void ()* @Kernel1, !"maxntidx", i32 256}
!2 = !{void ()* @Kernel1, !"minctasm", i32 2}

CS

/// <summary>
/// LaunchBounds
/// Specify Launch Bounds for CUDA execution
/// </summary>
public struct LaunchBounds
{
    /// <summary>
    /// Max threads per block
    /// </summary>
    public readonly uint maxTperB;
    /// <summary>
    /// Min blocks per SM
    /// </summary>
    public readonly uint minBperSM;

    public LaunchBounds(uint maxT = 1024, uint minB = 1)
    {
        maxTperB = maxT;
        minBperSM = minB;
    }
}

var kernel = accelerator.LoadKernel<>(MyKernel, LaunchBounds);

Example constant

Cpp

__constant__ int constant_var;

LLVM-IR

@constant_var = internal global i32 0, align 4

Net core PublishSingleFile=true ArgumentOutOfRangeException

Hi @m4rs-mt,
ILGPU v0.6.0

I use net.core PublishSingleFile=true.

dotnet publish -r linux-x64 /p:PublishSingleFile=true /p:PublishTrimmed=true /p:AssemblyVersion=2.1.3 -c release

The application is success build. But it crashes on startup with an error:

sudo ./ConsoleApp2
Unhandled exception. System.TypeInitializationException: The type initializer for 'ILGPU.Context' threw an exception.
 ---> System.ArgumentOutOfRangeException: Length cannot be less than zero. (Parameter 'length')
   at System.String.Substring(Int32 startIndex, Int32 length)
   at ILGPU.Context..cctor()
   --- End of inner exception stack trace ---
   at ConsoleApp2.Program.Main(String[] args)
Aborted

Since as a result of building my application as PublishSingleFile without FileVersion.
When initializing the context is required an AssemblyFileVersionAttribute for the Assembly.

ILGPU/Src/ILGPU/Context.cs

Line 60 in db7bdd2

var versionString = Assembly.GetCallingAssembly().

ILGPU/Src/ILGPU/Context.cs

Line 61 in db7bdd2

GetCustomAttribute<AssemblyFileVersionAttribute>().Version;

Workaround:
As a workaround, I added flag /p:FileVersion for dotnet publish.

Thanks in advance.

Debug symbols for Roslyn-generated in-memory assembly

Using 0.3.0, when running in debug mode, with an in-memory assembly generated with Roslyn I get this:

System.ArgumentException: The path is not of a legal form. Parameter name: path at System.IO.Path.GetDirectoryName(String path) at ILGPU.Compiler.DebugInformation.DebugInformationManager.TryLoadSymbols(Assembly assembly, AssemblyDebugInformation& assemblyDebugInformation) at ILGPU.Compiler.DebugInformation.DebugInformationManager.TryLoadDebugInformation(MethodBase methodBase, MethodDebugInformation& methodDebugInformation) at ILGPU.Compiler.DebugInformation.DebugInformationManager.LoadSequencePoints(MethodBase methodBase) at ILGPU.CompilationContext.EnterMethod(MethodBase method) at ILGPU.CompileUnit.GetMethod(MethodBase methodBase, Boolean create) at ILGPU.CompileUnit.GetMethod(MethodBase methodBase) at ILGPU.Backends.PTXBackend.CreateEntry(CompileUnit unit, EntryPoint entryPoint, String& entryPointName) at ILGPU.Backends.LLVMBackend.Compile(CompileUnit unit, MethodInfo entry, KernelSpecialization specialization) at ILGPU.Backends.Backend.Compile(CompileUnit unit, MethodInfo entry)

Everything's peachy in release builds.

Maybe I'm missing something, but there doesn't seem to be a way to supply custom PDBs not from files. I'd expect to have an option to build an AssemblyDebugInformation or give DebugInformationManager a stream.

Also a related question, can I somehow compile a kernel without loading the assembly? maybe some kind of Mono.Cecil integration? I'm using core 2.0 and there doesn't seem to be a way to unload assemblies (no AppDomain, Assembly.ReflectionOnlyLoad, or collectible AssemblyLoadContext)

Partial scripting support?

Hello @m4rs-mt ,
It appears that Google forums is no longer publicly available so I will ask my question here.

I'm looking to provide partial scripting support in my application that will leverage ILGPU. The kernel definitions and data made available will all be uniform (meaning they will all have the same static method declaration and the same number of properties for each kernel); however, I'd like to allow for customization to be made within the body of these methods.

How might I be able to leverage Roslyn generated code and have it work with ILGPU?

The main purpose of this request is to allow engineers to change formulas without having to recompile the application.

Thanks again,
Chris

p.s - This is really an amazing library! Keep up the good work!

Incompatibility with custom struct

i was trying to use a custom struct with the library but it says me that it is not supported:

using SMWControlLibRendering.Interfaces;

namespace SMWControlLibRendering.Colors
{
    public struct ColorR5G5B5 : ICanBeMultibytes
    {
        /// <summary>
        /// Gets or sets the r.
        /// </summary>
        public byte R { get; set; }
        /// <summary>
        /// Gets or sets the g.
        /// </summary>
        public byte G { get; set; }
        /// <summary>
        /// Gets or sets the b.
        /// </summary>
        public byte B { get; set; }
        /// <summary>
        /// Initializes a new instance of the <see cref="ColorR5G5B5"/> class.
        /// </summary>
        /// <param name="r">The r.</param>
        /// <param name="g">The g.</param>
        /// <param name="b">The b.</param>
        public ColorR5G5B5(byte r, byte g, byte b)
        {
            R = r;
            G = g;
            B = b;
        }
        public static implicit operator int(ColorR5G5B5 col)
        {
            return (col.R << 16) | (col.G << 8) | col.B;
        }
        public static implicit operator bool(ColorR5G5B5 col)
        {
            return ((col.R & 0x07) | (col.G & 0x07) | (col.B & 0x07)) != 0;
        }

        /// <summary>
        /// Sets the byte.
        /// </summary>
        /// <param name="index">The index.</param>
        /// <param name="b">The b.</param>
        public void SetByte(int index, byte b)
        {
            switch(index)
            {
                case 0:
                    R = b;
                    break;
                case 1:
                    G = b;
                    break;
                default:
                    B = b;
                    break;
            }
        }
    }
}

using ILGPU;
using System;

namespace SMWControlLibRendering.KernelStrategies.BitmapBuffer
{
    /// <summary>
    /// The draw bitmap buffer with zoom.
    /// </summary>
    public class DrawBitmapBufferWithZoom<T> : KernelStrategy<Index2, ArrayView<T>, ArrayView<T>, int, int, int, int> where T : struct
    {
        private static readonly DrawBitmapBufferWithZoom<T> instance = new DrawBitmapBufferWithZoom<T>();
        /// <summary>
        /// Executes the.
        /// </summary>
        /// <param name="index">The index.</param>
        /// <param name="destBuffer">The dest buffer.</param>
        /// <param name="srcBuffer">The src buffer.</param>
        /// <param name="offset">The offset.</param>
        /// <param name="dstWidth">The dst width.</param>
        /// <param name="srcWidth">The src width.</param>
        /// <param name="zoom">The zoom.</param>
        public static void Execute(Index2 index, ArrayView<T> destBuffer, ArrayView<T> srcBuffer, int offset,
            int dstWidth, int srcWidth, int zoom)
        {
            instance.strategy(index, destBuffer, srcBuffer, offset, dstWidth, srcWidth, zoom);
            HardwareAcceleratorManager.GPUAccelerator.Synchronize();
        }
        /// <summary>
        /// strategies the.
        /// </summary>
        /// <param name="index">The index.</param>
        /// <param name="destBuffer">The dest buffer.</param>
        /// <param name="srcBuffer">The src buffer.</param>
        /// <param name="offset">The offset.</param>
        /// <param name="dstWidth">The dst width.</param>
        /// <param name="srcWidth">The src width.</param>
        /// <param name="zoom">The zoom.</param>
        protected override void strategy(Index2 index, ArrayView<T> destBuffer, ArrayView<T> srcBuffer, int offset,
            int dstWidth, int srcWidth, int zoom)
        {
            T color = srcBuffer[(index.Y * srcWidth) + index.X];

            if (Convert.ToBoolean(color))
                return;

            int jw = offset;
            for (int j = 0; j < zoom; j++)
            {
                jw += dstWidth;
                for (int i = 0; i < zoom; i++)
                {
                    destBuffer[jw + i] = color;
                }
            }
        }
    }
}
    /// <summary>
    /// The kernel strategy.
    /// </summary>
    public abstract class KernelStrategy<T, U, V, W, X, Y, Z> : Strategy<T, U, V, W, X, Y, Z> where T : struct, IIndex
                                                                                        where U : struct
                                                                                        where V : struct
                                                                                        where W : struct
                                                                                        where X : struct
                                                                                        where Y : struct
                                                                                        where Z : struct
    {
        /// <summary>
        /// Gets or sets the kernel.
        /// </summary>
        protected Action<T, U, V, W, X, Y, Z> kernel { get; set; }
        /// <summary>
        /// Initializes a new instance of the <see cref="KernelStrategy"/> class.
        /// </summary>
        public KernelStrategy()
        {
            kernel = HardwareAcceleratorManager.GPUAccelerator.LoadAutoGroupedStreamKernel<T, U, V, W, X, Y, Z>(strategy);
        }
    }

What i am doing wrong?

AtomicOperationCAS_Uint Failed Test for AMD

C:\Windows\TEMP\OCL13720T5.cl:69:16: error: no matching function for call to 'atomic_compare_exchange_strong'
bool var21 = atomic_compare_exchange_strong ((atomic_ulong*)var14, & var17, var20);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
c:\constructicon\builds\gfx\seven\19.50\stream\opencl\compiler\clc2\ocl-headers\build\wNow64a\B_rel\opencl20_builtins.h:13704:1: note: candidate function not viable: no known conversion from '__generic atomic_ulong *' to 'volatile __global atomic_long *' for 1st argument

ILGPU not picking up the amd gpu for the opencl accerlation

opencl information regarding my gpu:
Num OpenCL platforms: 1

CL_PLATFORM_NAME: AMD Accelerated Parallel Processing
CL_PLATFORM_VENDOR: Advanced Micro Devices, Inc.
CL_PLATFORM_VERSION: OpenCL 2.1 AMD-APP (2841.5)
CL_PLATFORM_PROFILE: FULL_PROFILE
Num devices: 1
- CL_DEVICE_NAME: Ellesmere
- CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.
- CL_DRIVER_VERSION: 2841.5
- CL_DEVICE_PROFILE: FULL_PROFILE
- CL_DEVICE_VERSION: OpenCL 1.2 AMD-APP (2841.5)
- CL_DEVICE_TYPE: GPU
- CL_DEVICE_VENDOR_ID: 0x1002
- CL_DEVICE_MAX_COMPUTE_UNITS: 36
- CL_DEVICE_MAX_CLOCK_FREQUENCY: 1360MHz
- CL_DEVICE_ADDRESS_BITS: 32
- CL_DEVICE_MAX_MEM_ALLOC_SIZE: 3145728KB
- CL_DEVICE_GLOBAL_MEM_SIZE: 3072MB
- CL_DEVICE_MAX_PARAMETER_SIZE: 1024
- CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64 Bytes
- CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 16KB
- CL_DEVICE_ERROR_CORRECTION_SUPPORT: NO
- CL_DEVICE_LOCAL_MEM_TYPE: Local (scratchpad)
- CL_DEVICE_LOCAL_MEM_SIZE: 32KB
- CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 3145728KB
- CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
- CL_DEVICE_MAX_WORK_ITEM_SIZES: [1024 ; 1024 ; 1024]
- CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
- CL_EXEC_NATIVE_KERNEL: 10707028
- CL_DEVICE_IMAGE_SUPPORT: YES
- CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
- CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
- CL_DEVICE_IMAGE2D_MAX_WIDTH: 16384
- CL_DEVICE_IMAGE2D_MAX_HEIGHT: 16384
- CL_DEVICE_IMAGE3D_MAX_WIDTH: 2048
- CL_DEVICE_IMAGE3D_MAX_HEIGHT: 2048
- CL_DEVICE_IMAGE3D_MAX_DEPTH: 2048
- CL_DEVICE_MAX_SAMPLERS: 16
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR: 4
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT: 2
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
- CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
- CL_DEVICE_EXTENSIONS: 25
- Extensions:
  - cl_khr_fp64
  - cl_amd_fp64
  - cl_khr_global_int32_base_atomics
  - cl_khr_global_int32_extended_atomics
  - cl_khr_local_int32_base_atomics
  - cl_khr_local_int32_extended_atomics
  - cl_khr_int64_base_atomics
  - cl_khr_int64_extended_atomics
  - cl_khr_3d_image_writes
  - cl_khr_byte_addressable_store
  - cl_khr_fp16
  - cl_khr_gl_sharing
  - cl_amd_device_attribute_query
  - cl_amd_vec3
  - cl_amd_printf
  - cl_amd_media_ops
  - cl_amd_media_ops2
  - cl_amd_popcnt
  - cl_khr_d3d10_sharing
  - cl_khr_d3d11_sharing
  - cl_khr_dx9_media_sharing
  - cl_khr_image2d_from_buffer
  - cl_khr_spir
  - cl_khr_gl_event
  - cl_amd_liquid_flash

ILGPU does not work on Roslyn based

Code execution is: Line 20

Translated Error:
The type initiator of 'ILGPU.LLVM.LLVMMethods' has thrown an exception.
The type initiator of 'ILGPU.Util.DLLLoader' has thrown an exception.
Could not load file or assembly 'System.Runtime.InteropServices.RuntimeInformation, Version = 4.0.2.0, Culture = neutral, PublicKeyToken = b03f5f7f11d50a3a' or one of its dependencies. The system can not find the file specified.
   + CpuAnnInterface .. ctor ()
   + VI.NumSharp.ProcessingDevice.get_CPUArrayDevice ()
   + VI.NumSharp.ProcessingDevice.set_Device (VI.NumSharp.Device)

Any idea?

What is allowed or not in kernels?

I have a question, when you do a kernel, what is allowed?

for example can kernel:

call functions of static classes?
use Convert._(param) (for example Convert.Int32(foo))
use custom classes or structs?
use local parameters if it is called into a method?
use class variables if it is called into a class?
etc...

SPIR-V backend

Hi and thanks for a very useful tool! I come from a graphics background and I love that I can debug my kernels (shaders) on the CPU. But after the debugging phase I would still like to have a hardware independent shader that I can feed directly into some graphics API.
If I understand correctly, you can only interact with graphics API via the buffer mapping (like in your SharpDX Interop), which seems to introduce some overhead each frame.
One alternative way would be to generate SPIR-V output. This would open up the possibility to cross compile the SPIR-V output to HLSL, GLSL (using SPIRV-Cross) or feed it into Vulkan or other compatible APIs. So the cross compiled shader would also run on GPUs from AMD, INTEL, ...

Cheers,
Daniel

Failed Test for AMD GPU

Failed Tests:
CLAtomicCASOperations_Debug
CLAtomicCASOperations_Release
CLAtomicOperations_Debug
CLAtomicOperations_Release
CLCompareFloatOperations_Debug
CLCompareFloatOperations_Release
CLConvertFloatOperations_Debug
CLConvertFloatOperations_Release
CLUnaryIntOperations_Debug
CLUnaryIntOperations_Release
CLWrapOperations_Debug
CLWrapOperations_Release

"SimpleKernel" example throws exception

I've tried to run SimpleKernel example and has got
ILGPU.Runtime.Cuda.CudaException: 'a PTX JIT compilation failed'

Inside CudaKernel constructor, the function CudaAPI.Current.LoadModule(...) returned
CudaError cudaStatus = CUDA_ERROR_INVALID_PTX;

string errorLog = "
_ptxas application ptx input, line 5; fatal   : Unsupported .version 6.0; current version is '5.0'
ptxas fatal   : Ptx assembly aborted due to errors_
"

Indeed, on line 5 one can see ".version 6.0":

kernel.PTXAssembly =

//
// Generated by ILGPU v0.5.1
//

.version 6.0
.target sm_50
.address_size 64


.visible .entry ILGPUKernel(
.param .align 4 .b8 _index_1318[4],
.param .align 8 .b8 _dataView_1319[16],
.param .b32 _constant_1320
)
{
	.reg .pred	%p<2>;
	.reg .b32	%r<8>;
	.reg .b64	%rd<4>;

	ld.param.b32	%r2, [_index_1318];
	ld.param.b64	%rd1, [_dataView_1319];
	ld.param.b32	%r3, [_dataView_1319+8];
	ld.param.b32	%r4, [_constant_1320];
	mov.b32	%r5, %ctaid.x;
	mov.b32	%r6, %ntid.x;
	mov.b32	%r7, %tid.x;
	mad.lo.s32	%r1, %r5, %r6, %r7;
	setp.ge.s32	%p1, %r1, %r2;
	@%p1 ret;

	L\_13170:
	mul.wide.u32	%rd3, %r1, 4;
	add.u64	%rd2, %rd1, %rd3;
	add.s32	%r7, %r1, %r4;
	st.b32	[%rd2], %r7;
	ret;
}

If it is important, my graphic card is Quadro M2000M.

Is this a bug or do I misunderstand something?
Thank you.

PTX compilation failed with bytes

Hi,
I tested today the master branch (commit cf83683).
All the bugs I reported are corrected (great work!), except this one :

    [Test]
    public void CompilationTest3()
    {
        using (var context = new Context())
        {
            using (var gpu = new CudaAccelerator(context, 0))
            {
                var method = GetType().GetMethod(nameof(Kernel3), BindingFlags.NonPublic | BindingFlags.Static);
                var compiledKernel = gpu.Backend.Compile(method, KernelSpecialization.Empty);
                var loadedKernel = gpu.LoadKernel(compiledKernel);
                loadedKernel.Dispose();
            }
        }
    }

    private static void Kernel3(GroupedIndex index, ArrayView<float> array)
    {
        byte b = 1;
        for (int i = 0; i < 2; i++)
        {
            if (b > 1)
            {
                array[0] = 1;
            }
            b = 2;
        }
    }

The exception I get : ILGPU.Runtime.Cuda.CudaException : 'a PTX JIT compilation failed'

Kernel debugging and profiling on GPU hardware

Currently, breakpoint-based kernel debugging is only possible with the CPU accelerator. Debugging with the software emulation layer is very convenient due to the very good properties of the .Net debugging environments. However, profiling on the real GPU hardware (currently) is very cumbersome since you cannot link your profiling insights with the actual instructions (statements) in your input language. In addition, debugging on the GPU itself would also be very convenient for larger input data.

The next ILGPU release will support basic source-line-based GPU debugging and profiling on GPU hardware. It will support debug information compatible with NVIDIA Nsight and CUDA-GDB. This functionality will be extended in the future in order to inspect local variables and memory contents.

When AMD GPUs will be supported?

I know it is not the best thing to ask about fetures and especially when they will be supported, but i was wondering because, on the roadmap i haven't seen anything about it yet and for a project i am working on it is needed, so is there any plan for it or is it considered to be supported in the near future? or [...]

Any reply would be appreciated,
Thanks in Advance,
George. (:

A PTX JIT compilation failed

I have an exception: "A PTX JIT compilation failed" at

var myKernel = accelerator.LoadAutoGroupedStreamKernel<Index, ArrayView<int>>(MyKernel);.

I have windows 10, nVidia GeForce 940MX, Visual Studio 2017

        using (var context = new Context())
        {
            using (var accelerator = new CudaAccelerator(context))
            {
                var myKernel = accelerator.LoadAutoGroupedStreamKernel<Index, ArrayView<int>>(MyKernel);

                using (var buffer2 = accelerator.Allocate<int>(1000))
                using (var buffer = accelerator.Allocate<int>(1000))
                {

                    s1.Start();

                    myKernel(buffer.Length, buffer.View);

                    accelerator.Synchronize();
                    s1.Stop();

                    var data = buffer.GetAsArray();

                    for (var i = 0; i < 100; i++)
                    {
                        Console.Write(" " + data[i]);
                    }
                }
            }
        }

OpenCL Backend: Function Generators

The new OpenCL backend still lacks some current function generators (types CLFunctionGenerator and CLKernelFunctionGenerator). They must be generated similar to the PTXFunctionGenerator and PTXKernelFunctionGenerator classes. However, the parameters and variable assignment logic must be adjusted for all kernel arguments and shared-memory arrays.

ILGPU is generating wrong opencl code with float values (on AMD GPUs)

Program source:

static void MandelbrotSet(Index index, float px, float py, ArrayView<int> output)
        {
            int Iterations = 0;
            float x = 0.0f;
            float y = 0.0f;
            while ((x * x) + (y * y) < 4 && Iterations < MaxIterations)
            {
                float tmp = x * x + y * y + px;
                y = 2 * x * y + py;
                x = tmp;
                Iterations += 1;
            }
            output[index] = Iterations;
        }

Kernel Source:

struct kernel_type_0;
struct internal_type_10;
struct internal_type_11;

struct internal_type_11
{
	int* field_0;
	int field_1;
};
struct internal_type_10
{
	int field_0;
};
struct kernel_type_0
{
	int field_0;
	int field_1;
};

kernel void ILGPUKernel(
	global int* view_0,
	struct internal_type_10 var1,
	float var2,
	float var3,
	struct kernel_type_0 var4)
{
	// Map parameters

	float var5;
	var5 = var2;
	float var6;
	var6 = var3;
	struct internal_type_11 var7;

	// Assign views

	var7.field_0 = view_0+var4.field_0;
	var7.field_1 = var4.field_1;

	// Kernel indices

	struct internal_type_10 var0;
	var0.field_0 = get_global_id( 0 );
	bool var8 = var0.field_0>=var1.field_0;
	if (var8)
		 return ;


	L_1000: ;
	float var9;
	float var10;
	int var11;
	int var13 =  0;
	float var14 =  0f;
	float var15 =  0f;
	var9 = var14;
	var10 = var15;
	var11 = var13;
	 goto L_1001;

	L_1001: ;
	bool var12;
	float var16 = (float)((float) var9 *  (float)var9);
	float var17 = (float)((float) var10 *  (float)var10);
	float var18 = (float)((float) var16 +  (float)var17);
	float var19 =  4f;
	bool var20 = (float) var18 >= (float) var19;
	if (var20)
		 goto L_1003;
	 goto L_1002;

	L_1002: ;
	int var21 =  1000;
	bool var22 = (int) var11 < (int) var21;
	var12 = var22;
	 goto L_1004;

	L_1003: ;
	bool var23 =  0;
	var12 = var23;
	 goto L_1004;

	L_1004: ;
	if (var12)
		 goto L_1006;
	 goto L_1005;

	L_1005: ;
	int var24 =  var0.field_0;
	int* var25 =  & var7.field_0[var24];
	 *  var25 =  var11;
	 return ;

	L_1006: ;
	float var26 = (float)((float) var9 *  (float)var9);
	float var27 = (float)((float) var10 *  (float)var10);
	float var28 = (float)((float) var26 +  (float)var27);
	float var29 = (float)((float) var28 +  (float)var5);
	float var30 =  2f;
	float var31 = (float)((float) var30 *  (float)var9);
	float var32 = (float)((float) var31 *  (float)var10);
	float var33 = (float)((float) var32 +  (float)var6);
	int var34 =  1;
	int var35 = (int)((int) var11 +  (int)var34);
	var9 = var29;
	var10 = var33;
	var11 = var35;
	 goto L_1001;

}

It should be float var14 = 0.0f, not float var14 = 0f

PTX JIT compilation failed (Ver. 0.6.0)

Hi @m4rs-mt,

this will return an PTX JIT compilation error on Linux but works on Windows.

_kernelRGB888PlanarToBGR888xPacked = _accelerator.LoadAutoGroupedStreamKernel<Index, int, int, int, ArrayView, ArrayView>(RGB888PlanarToBGR888xPacked);

Kernel:

private static void RGB888PlanarToBGR888xPacked(Index pixelIndex, int width, int height, int bpp, ArrayView<byte> source, ArrayView<byte> destination)
        {
            var offset = width * height;
            var r = source[pixelIndex];
            var g = source[pixelIndex + offset];
            var b = source[pixelIndex + offset * 2];

            destination[pixelIndex * bpp] = b;
            destination[pixelIndex * bpp + 1] = g;
            destination[pixelIndex * bpp + 2] = r;
        }

Only in version 0.6.0. The same code works with 0.5.1.

Error message: "An item with the same key has been added"

Hello. I'm using 0.7.0.0 and the Cuda accelerator. I'm getting a runtime error with the message "An item with the same key has been added". The error doesn't occur when using the CPU accelerator.

I don't think this is an ILGPU bug, but I'm uncertain how to debug my code given the lack of specificity in the error message.

The error is generated on this line, the first (and only) time an action is generated from my kernel.

var landing_probability_kernel = accelerator.LoadStreamKernel<Index2, ArrayView2D, ArrayView2D, int, int>(EstimateLandingProbability);

My question is 'how should I debug this message'? What sorts of things should I look for?
(I'm including the code for completeness, but I'm not asking you to look at it if you can suggest a direction to look.)

One possibility I'm unsure about is that this kernel allocates shared memory and create a 2D view of if. I don't know for sure whether creating 2D views is allowed inside kernels.

    static void EstimateLandingProbability(
        Index2 index,
        ArrayView2D<byte> global,
        ArrayView2D<short> counts,
        int start_row,
        int start_col)  // landing_radius
    {
        const int radius = landing_radius;
        var row = index.Y;
        var col = index.X;

        var shared_width = patch_width + radius * 2;
        var shared_length = shared_width * shared_width;
        Debug.Assert(shared_length <= 64000);
        var shared_linear = SharedMemory.Allocate<byte>(shared_length);
        var shared = shared_linear.As2DView(shared_width, shared_width);

        // Load the slope buffer.  This loads a square section of global memory into shared memory four times with offsets.
        var offset = radius + radius;
        var row0 = start_row - radius;
        var col0 = start_col - radius;
        LoadSquare(global, shared, index, 0, 0, row0, col0);                             // top left
        LoadSquare(global, shared, index, 0, offset, row0, col0 + offset);               // top right
        LoadSquare(global, shared, index, offset, 0, row0 + offset, col0);               // bottom left
        LoadSquare(global, shared, index, offset, offset, row0 + offset, col0 + offset); // bottom right

        // Count
        var sum = 0;
        var d2 = radius * radius;
        var high = radius + radius;
        for (var y = -radius; y <= high; y++)
            for (var x = -radius; x <= high; x++)
                sum += x * x + y * y <= d2 ? shared[row + y + radius, col + x + radius] : 0;

        // Write the resulting counts to global memory
        var r1 = start_row + row;
        var c1 = start_col + col;
        Debug.Assert(r1 >= 0);
        Debug.Assert(c1 >= 0);
        Debug.Assert(r1 < dem_width);
        Debug.Assert(c1 < dem_width);
        counts[start_row + row, start_col + col] = (short)sum;
    }

    static void LoadSquare(ArrayView2D<byte> global, ArrayView2D<byte> shared, Index2 index, int shared_row0, int shared_col0, int global_row0, int global_col0)
    {
        var row = index.Y;
        var col = index.X;
        var target_row = shared_row0 + row;
        var target_col = shared_col0 + col;
        var source_row = global_row0 + row;
        var source_col = global_col0 + col;
        shared[target_row, target_col] = source_row < 0 || source_row >= dem_height || source_col < 0 || source_col >= dem_width 
            ? (byte)0 
            : global[source_row, source_col];
    }

ILGPU 0.5: KeyNotFoundException Exception in XMath.Pow call

Consider following kernel code:

    static void kernel(Index2 index, ArrayView2D<int> inp, ArrayView2D<int> outp, int shoulder, double sigmaI, double sigmaS)
    {
        if (shoulder > index.X || shoulder > index.Y ||
            inp.Extent.X - shoulder <= index.X || inp.Extent.Y - shoulder <= index.Y)
        {
            return;
        }

        double iFiltered = 1;
        int ipp = 1;
        double gi = XMath.Pow(ipp, 2) + XMath.Pow(sigmaI, 2);// gaussian(ipp, sigmaI);
        iFiltered = iFiltered / gi;
        outp[index] = (int)iFiltered;
    }

It always fails throwing KeyNotFoundException on NVIDIA GPU ILGPU 0.5, but it works on GPU ILGPU v0.3 and on CPU with all versions of ILGPU.

Question: Is there a way to transfer jagged array

Hi, I am interested in using your lib for my project, but I am stuck on how to pass jagged array to Kernel. Now I use ArrayView2d, but it comes with lots of extra elements. Is there any more simple way to allocate such data on accelerator?

P.S. As I see it is also possible achieve some thing like jagged array by using simple ArrayView and copy jagged array to it, but it will also be needed to transfer sizes of sub-arrays, that is not really comfortable

OpenCL Backend: Structured Control-Flow Generation

Most OpenCL drivers complain about general goto commands that could easily be generated from our intermediate representation. For this purpose, we have added the IfInfos and LoopInfos analyses. They allow us to deduce high-level control flow that could be used to create good-looking if(...)and for or while loops within the CLCodeGenerator class.

The missing functionality should be implemented within the method GenerateCodeInternal in the class CLCodeGenerator.

Why .NET Standard dropped?

As part of 0.4.0-beta release, I see that .NET Standard was dropped, and replaced to .NET Core App 2.0 which is more limiting in my opinion. What are the rationale why .NET Core App is strictly needed for this library?

Use GPU and CPU Accelerator in parallel?

Is possible use an accelerator that use all CPU cores and GPU cores to process the same kernel and improve the performance?

Home page does not exist

http://www.ilgpu.net mentioned as the home page, does not work. Probably it needs to be renewed

Provide Simple Matrix Multiplication kernel?

Hi, first of all I want to thank you for all you've done!

I've decided to use ILGPU in my deep learning library project. I hope it is as fast as c++ cuda. Except compiling time, I believe that it doesn't have any latency while accessing gpu, which was the issue I am so afraid of. I made a couple tests for this.

I am still confused if I should use ILGPU, because @m4rs-mt didn't provide some basic kernels I don't know how to improve my code. I want to know how to improve the performance of SGEMM. Can you provide at least simple Matrix Multiplication kernel? I need to benchmark against CUBLAS and try to improve the performance.

ILGPU.Runtime.Cuda.CudaException: 'unknown error'

Just installed ILGPU to test out and get that error. I was getting this error:
"a PTX JIT compilation failed"

and so I updated the Nvidia drivers to latest. Then I got the titled error at this line:

using (var accelerator = Accelerator.Create(context, acceleratorId))

Seems to happen on most projects if not all. Cannot proceed to test out ILGPU.

Review of ILGPU v0.4.0-beta

Hi,
I tried to migrate an application from ILGPU 0.3 to ILGPU 0.4.0-beta.
The migration process was quite easy.
I find the new API user friendly :
- The explicit use of Cuda Streams will make the development of multi-threaded applications easier
- The new way of declaring shared memory is easy to work with
What I personally don't like in the API is not new : it is the poor naming of the GroupedIndex struct members, that makes writing and reading kernels difficult for Cuda developers.

For example : this is a typical AleaGPU kernel, that every Cuda programmer can understand :

private static void Kernel(int[] result, int[] arg1, int[] arg2)
{
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    var stride = gridDim.x * blockDim.x;
    for (var i = start; i < result.Length; i += stride)
    {
        result[i] = arg1[i] + arg2[i];
    }
}

Here is the equivalent in ILGPU 0.3 or 0.4beta :

private static void Kernel(GroupedIndex index, ArrayView<int> result, ArrayView<int> arg1, ArrayView<int> arg2)
{
	var threadIdx = index.GroupIdx;
	var blockIdx = index.GridIdx;
	var blockDim = ILGPU.Group.Dimension.X;
	var start = blockIdx * blockDim + threadIdx;
	var stride = Grid.Dimension.X * blockDim;

	for (var i = start; i < result.Length; i += stride)
	{
		result[i] = arg1[i] + arg2[i];
	}
}

Why the hell is the thread index called "GroupIdx", and the block index "GridIdx" ?

After migrating to ILGPU 0.4beta, I wasn't able to make the application working, because several kernels cannot compile any more. Here is an example of a simple kernel that doesn't work any more (ILGPU.Runtime.Cuda.CudaException : 'a PTX JIT compilation failed') :

[Test]
public void CompilationTest()
{
	using (var context = new Context())
	{
		using (var gpu = new CudaAccelerator(context, 0))
		{
			var kernel = gpu.LoadKernel<GroupedIndex, ArrayView<float>, float>(MyFirstKernel);
		}
	}
}

private static void MyFirstKernel(GroupedIndex index, ArrayView<float> array, float value)
{
	array[index.ComputeGlobalIndex()] =  value / 2;
}

Anyway, thank you for the hard work - ILGPU is a very good CUDA framework!

Debug Assertion Failed: Invalid Structure Type on LoadAutoGroupedStreamKernel

I am trying to implement a terrain erosion algorithm in which a decent chunk of variables and ArrayView2Ds need to be used. Unfortunately, when I try to run the code, an exception is thrown.

The main window in VS2019 shows that kernel32.pdb is not loaded, however I found out that the real stack trace is in the output window.

'SculptorCore.exe' (CoreCLR: clrhost): Loaded 'ILGPURuntime'. 
'SculptorCore.exe' (Win32): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\Microsoft.DiaSymReader.Native.amd64.dll'. 
'SculptorCore.exe' (Win32): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Diagnostics.StackTrace.dll'. 
'SculptorCore.exe' (CoreCLR: clrhost): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Diagnostics.StackTrace.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'SculptorCore.exe' (Win32): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Reflection.Metadata.dll'. 
'SculptorCore.exe' (CoreCLR: clrhost): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Reflection.Metadata.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
---- DEBUG ASSERTION FAILED ----
---- Assert Short Message ----
Invalid structure type
---- Assert Long Message ----

   at ILGPU.IR.Types.IRTypeContext.CreateType(Type type, MemoryAddressSpace addressSpace)
   at ILGPU.IR.IRContext.CreateType(Type type, MemoryAddressSpace addressSpace)
   at ILGPU.IR.Construction.IRBuilder.CreateType(Type type, MemoryAddressSpace addressSpace)
   at ILGPU.IR.Construction.IRBuilder.CreateType(Type type)
   at ILGPU.Frontend.Block.PopMethodArgs(MethodBase methodBase, Value instanceValue)
   at ILGPU.Frontend.CodeGenerator.MakeCall(Block block, IRBuilder builder, MethodBase target)
   at ILGPU.Frontend.CodeGenerator.TryGenerateCode(Block block, IRBuilder builder, ILInstruction instruction)
   at ILGPU.Frontend.CodeGenerator.GenerateCode()
   at ILGPU.Frontend.CodeGenerationPhase.GenerateCodeInternal(MethodBase method, Boolean isExternalRequest, HashSet`1 detectedMethods, Method& generatedMethod)
   at ILGPU.Frontend.ILFrontend.DoWork()
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()

The relevant code can be found here. I have made sure to link the code at the current commit so that future changes do not impact this question.

Through some trial and error with breakpoints I have figured out that the actual exception occurs at line 137, which is the compilation of the kernel.

I imagine that there is something I am doing wrong with my own datatypes, due to the Invalid structure type message, but I can't figure out what exactly that is.

Problems with ulong XOR using CudaAccelerator

Hello, I have weird issues using CudaAccelerator.
I wrote a very simple (perhaps naive) implementation of the DES ECB encryption, and with the CPUAccelerator it works as expected. But when I change the CPUAccelerator to the CudaAccelerator, I get wrong results. Here is my full code:

using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.CPU;
using ILGPU.Runtime.Cuda;
using System;
using System.Linq;
using System.Diagnostics;

namespace VVatashi.Cryptography
{
  public static class Program
  {
    private struct RoundKeys
    {
      public ulong Key0;
      public ulong Key1;
      public ulong Key2;
      public ulong Key3;
      public ulong Key4;
      public ulong Key5;
      public ulong Key6;
      public ulong Key7;
      public ulong Key8;
      public ulong Key9;
      public ulong Key10;
      public ulong Key11;
      public ulong Key12;
      public ulong Key13;
      public ulong Key14;
      public ulong Key15;
    }

    private static byte[] PC1 = new byte[] {
      56, 48, 40, 32, 24, 16,  8,  0,
      57, 49, 41, 33, 25, 17,  9,  1,
      58, 50, 42, 34, 26, 18, 10,  2,
      59, 51, 43, 35, 62, 54, 46, 38,
      30, 22, 14,  6, 61, 53, 45, 37,
      29, 21, 13,  5, 60, 52, 44, 36,
      28, 20, 12,  4, 27, 19, 11,  3,
    };

    private static byte[] PC2 = new byte[] {
      13, 16, 10, 23,  0,  4,  2, 27,
      14,  5, 20,  9, 22, 18, 11,  3,
      25,  7, 15,  6, 26, 19, 12,  1,
      40, 51, 30, 36, 46, 54, 29, 39,
      50, 44, 32, 47, 43, 48, 38, 55,
      33, 52, 45, 41, 49, 35, 28, 31,
    };

    private static byte[] IP = new byte[] {
      57, 49, 41, 33, 25, 17,  9, 1,
      59, 51, 43, 35, 27, 19, 11, 3,
      61, 53, 45, 37, 29, 21, 13, 5,
      63, 55, 47, 39, 31, 23, 15, 7,
      56, 48, 40, 32, 24, 16,  8, 0,
      58, 50, 42, 34, 26, 18, 10, 2,
      60, 52, 44, 36, 28, 20, 12, 4,
      62, 54, 46, 38, 30, 22, 14, 6,
    };

    private static byte[] IIP = new byte[] {
      39, 7, 47, 15, 55, 23, 63, 31,
      38, 6, 46, 14, 54, 22, 62, 30,
      37, 5, 45, 13, 53, 21, 61, 29,
      36, 4, 44, 12, 52, 20, 60, 28,
      35, 3, 43, 11, 51, 19, 59, 27,
      34, 2, 42, 10, 50, 18, 58, 26,
      33, 1, 41,  9, 49, 17, 57, 25,
      32, 0, 40,  8, 48, 16, 56, 24,
    };

    private static byte[] E = new byte[] {
      31,  0,  1,  2,  3,  4,
       3,  4,  5,  6,  7,  8,
       7,  8,  9, 10, 11, 12,
      11, 12, 13, 14, 15, 16,
      15, 16, 17, 18, 19, 20,
      19, 20, 21, 22, 23, 24,
      23, 24, 25, 26, 27, 28,
      27, 28, 29, 30, 31,  0,
    };

    private static byte[] P = new byte[] {
      15,  6, 19, 20, 28, 11, 27, 16,
       0, 14, 22, 25,  4, 17, 30,  9,
       1,  7, 23, 13, 31, 26,  2,  8,
      18, 12, 29,  5, 21, 10,  3, 24,
    };

    private static byte[] S = new byte[] {
      // S0
      14,  0,  4, 15, 13,  7,  1,  4,  2, 14, 15, 2, 11, 13,  8,  1,
       3, 10, 10,  6,  6, 12, 12, 11,  5,  9,  9, 5,  0,  3,  7,  8,
       4, 15,  1, 12, 14,  8,  8,  2, 13,  4,  6, 9,  2,  1, 11,  7,
      15,  5, 12, 11,  9,  3,  7, 14,  3, 10, 10, 0,  5,  6,  0, 13,
      // S1
      15,  3,  1, 13,  8,  4, 14,  7,  6, 15, 11,  2,  3,  8,  4, 14,
       9, 12,  7,  0,  2,  1, 13, 10, 12,  6,  0,  9,  5, 11, 10,  5,
       0, 13, 14,  8,  7, 10, 11,  1, 10,  3,  4, 15, 13,  4,  1,  2,
       5, 11,  8,  6, 12,  7,  6, 12,  9,  0,  3,  5,  2, 14, 15,  9,
      // S2
      10, 13,  0,  7,  9,  0, 14,  9,  6,  3,  3,  4, 15,  6, 5, 10,
       1,  2, 13,  8, 12,  5,  7, 14, 11, 12,  4, 11,  2, 15, 8,  1,
      13,  1,  6, 10,  4, 13,  9,  0,  8,  6, 15,  9,  3,  8, 0,  7,
      11,  4,  1, 15,  2, 14, 12,  3,  5, 11, 10,  5, 14,  2, 7, 12,
      // S3
       7, 13, 13,  8, 14, 11,  3,  5,  0,  6,  6, 15, 9,  0, 10,  3,
       1,  4,  2,  7,  8,  2,  5, 12, 11,  1, 12, 10, 4, 14, 15,  9,
      10,  3,  6, 15,  9,  0,  0,  6, 12, 10, 11,  1, 7, 13, 13,  8,
      15,  9,  1,  4,  3,  5, 14, 11,  5, 12,  2,  7, 8,  2,  4, 14,
      // S4
       2, 14, 12, 11,  4,  2,  1, 12,  7,  4, 10,  7, 11, 13,  6,  1,
       8,  5,  5,  0,  3, 15, 15, 10, 13,  3,  0,  9, 14,  8,  9,  6,
       4, 11,  2,  8,  1, 12, 11,  7, 10,  1, 13, 14,  7,  2,  8, 13,
      15,  6,  9, 15, 12,  0,  5,  9,  6, 10,  3,  4,  0,  5, 14,  3,
      // S5
      12, 10,  1, 15, 10,  4, 15,  2,  9, 7,  2, 12,  6,  9,  8,  5,
       0,  6, 13,  1,  3, 13,  4, 14, 14, 0,  7, 11,  5,  3, 11,  8,
       9,  4, 14,  3, 15,  2,  5, 12,  2, 9,  8,  5, 12, 15,  3, 10,
       7, 11,  0, 14,  4,  1, 10,  7,  1, 6, 13,  0, 11,  8,  6, 13,
      // S6
       4, 13, 11,  0,  2, 11, 14,  7, 15,  4,  0,  9, 8,  1, 13, 10,
       3, 14, 12,  3,  9,  5,  7, 12,  5,  2, 10, 15, 6,  8,  1,  6,
       1,  6,  4, 11, 11, 13, 13,  8, 12,  1,  3,  4, 7, 10, 14,  7,
      10,  9, 15,  5,  6,  0,  8, 15,  0, 14,  5,  2, 9,  3,  2, 12,
      // S7
      13,  1,  2, 15,  8, 13,  4,  8,  6, 10, 15,  3, 11, 7, 1,  4,
      10, 12,  9,  5,  3,  6, 14, 11,  5,  0,  0, 14, 12, 9, 7,  2,
       7,  2, 11,  1,  4, 14,  1,  7,  9,  4, 12, 10, 14, 8, 2, 13,
       0, 15,  6, 12, 10,  9, 13,  0, 15,  3,  3,  5,  5, 6, 8, 11,
    };

    private static ulong Permutate(ulong value, ArrayView<byte> table)
    {
      ulong result = 0;
      for (int i = table.Length - 1; i >= 0; --i) {
        result >>= 1;
        if ((value & (0x8000000000000000UL >> table[i])) != 0) {
          result |= 0x8000000000000000UL;
        }
      }

      return result;
    }

    private static ulong ShiftHalfKey(ulong key, byte count)
    {
      key = (key << count) | (key >> (28 - count));
      return key & 0xFFFFFFF000000000UL;
    }

    private static RoundKeys CreateRoundKeys(ulong key, ArrayView<byte> pc1, ArrayView<byte> pc2)
    {
      key = Permutate(key, pc1);
      ulong c0 = key & 0xFFFFFFF000000000UL;
      ulong d0 = (key << 28) & 0xFFFFFFF000000000UL;
      return new RoundKeys() {
        Key0 = Permutate(ShiftHalfKey(c0, 1) | (ShiftHalfKey(d0, 1) >> 28), pc2),
        Key1 = Permutate(ShiftHalfKey(c0, 2) | (ShiftHalfKey(d0, 2) >> 28), pc2),
        Key2 = Permutate(ShiftHalfKey(c0, 4) | (ShiftHalfKey(d0, 4) >> 28), pc2),
        Key3 = Permutate(ShiftHalfKey(c0, 6) | (ShiftHalfKey(d0, 6) >> 28), pc2),
        Key4 = Permutate(ShiftHalfKey(c0, 8) | (ShiftHalfKey(d0, 8) >> 28), pc2),
        Key5 = Permutate(ShiftHalfKey(c0, 10) | (ShiftHalfKey(d0, 10) >> 28), pc2),
        Key6 = Permutate(ShiftHalfKey(c0, 12) | (ShiftHalfKey(d0, 12) >> 28), pc2),
        Key7 = Permutate(ShiftHalfKey(c0, 14) | (ShiftHalfKey(d0, 14) >> 28), pc2),
        Key8 = Permutate(ShiftHalfKey(c0, 15) | (ShiftHalfKey(d0, 15) >> 28), pc2),
        Key9 = Permutate(ShiftHalfKey(c0, 17) | (ShiftHalfKey(d0, 17) >> 28), pc2),
        Key10 = Permutate(ShiftHalfKey(c0, 19) | (ShiftHalfKey(d0, 19) >> 28), pc2),
        Key11 = Permutate(ShiftHalfKey(c0, 21) | (ShiftHalfKey(d0, 21) >> 28), pc2),
        Key12 = Permutate(ShiftHalfKey(c0, 23) | (ShiftHalfKey(d0, 23) >> 28), pc2),
        Key13 = Permutate(ShiftHalfKey(c0, 25) | (ShiftHalfKey(d0, 25) >> 28), pc2),
        Key14 = Permutate(ShiftHalfKey(c0, 27) | (ShiftHalfKey(d0, 27) >> 28), pc2),
        Key15 = Permutate(c0 | (d0 >> 28), pc2),
      };
    }

    private static ulong Encrypt(
      ulong key,
      ulong value,
      ArrayView<byte> e,
      ArrayView<byte> s,
      ArrayView<byte> p
    ) {
      value = Permutate(value, e) ^ key;

      ulong b0 = (value >> 58) & 0x3FUL;
      ulong b1 = (value >> (58 - 6)) & 0x3FUL;
      ulong b2 = (value >> (58 - 6 * 2)) & 0x3FUL;
      ulong b3 = (value >> (58 - 6 * 3)) & 0x3FUL;
      ulong b4 = (value >> (58 - 6 * 4)) & 0x3FUL;
      ulong b5 = (value >> (58 - 6 * 5)) & 0x3FUL;
      ulong b6 = (value >> (58 - 6 * 6)) & 0x3FUL;
      ulong b7 = (value >> (58 - 6 * 7)) & 0x3FUL;

      ulong sb0 = s[(int)b0];
      ulong sb1 = s[(int)b1 + 64];
      ulong sb2 = s[(int)b2 + 64 * 2];
      ulong sb3 = s[(int)b3 + 64 * 3];
      ulong sb4 = s[(int)b4 + 64 * 4];
      ulong sb5 = s[(int)b5 + 64 * 5];
      ulong sb6 = s[(int)b6 + 64 * 6];
      ulong sb7 = s[(int)b7 + 64 * 7];

      ulong result = (sb7 << 60);
      result = (sb6 << 60) | (result >> 4);
      result = (sb5 << 60) | (result >> 4);
      result = (sb4 << 60) | (result >> 4);
      result = (sb3 << 60) | (result >> 4);
      result = (sb2 << 60) | (result >> 4);
      result = (sb1 << 60) | (result >> 4);
      result = (sb0 << 60) | (result >> 4);

      return Permutate(result, p);
    }

    private static void Kernel(
      Index index,
      ArrayView<byte> pc1,
      ArrayView<byte> pc2,
      ArrayView<byte> ip,
      ArrayView<byte> iip,
      ArrayView<byte> e,
      ArrayView<byte> s,
      ArrayView<byte> p,
      ArrayView<ulong> keys,
      ArrayView<ulong> blocks,
      ArrayView<ulong> output
    ) {
      RoundKeys roundKeys = CreateRoundKeys(keys[index], pc1, pc2);

      ulong block = Permutate(blocks[index], ip);

      ulong a0 = block & 0xFFFFFFFF00000000UL;
      ulong b0 = (block << 32) & 0xFFFFFFFF00000000UL;

      ulong b1 = Encrypt(roundKeys.Key0, b0, e, s, p) ^ a0;
      ulong b2 = Encrypt(roundKeys.Key1, b1, e, s, p) ^ b0;
      ulong b3 = Encrypt(roundKeys.Key2, b2, e, s, p) ^ b1;
      ulong b4 = Encrypt(roundKeys.Key3, b3, e, s, p) ^ b2;
      ulong b5 = Encrypt(roundKeys.Key4, b4, e, s, p) ^ b3;
      ulong b6 = Encrypt(roundKeys.Key5, b5, e, s, p) ^ b4;
      ulong b7 = Encrypt(roundKeys.Key6, b6, e, s, p) ^ b5;
      ulong b8 = Encrypt(roundKeys.Key7, b7, e, s, p) ^ b6;
      ulong b9 = Encrypt(roundKeys.Key8, b8, e, s, p) ^ b7;
      ulong b10 = Encrypt(roundKeys.Key9, b9, e, s, p) ^ b8;
      ulong b11 = Encrypt(roundKeys.Key10, b10, e, s, p) ^ b9;
      ulong b12 = Encrypt(roundKeys.Key11, b11, e, s, p) ^ b10;
      ulong b13 = Encrypt(roundKeys.Key12, b12, e, s, p) ^ b11;
      ulong b14 = Encrypt(roundKeys.Key13, b13, e, s, p) ^ b12;
      ulong b15 = Encrypt(roundKeys.Key14, b14, e, s, p) ^ b13;
      ulong b16 = Encrypt(roundKeys.Key15, b15, e, s, p) ^ b14;

      ulong result = b16 | (b15 >> 32);
      output[index] = Permutate(result, iip);
    }

    private static void Main(string[] args)
    {
      using (var context = new Context())
      using (var accelerator = new CPUAccelerator(context))
      {
        const int count = 10;
        var kernel = accelerator.LoadAutoGroupedStreamKernel<
          Index,
          ArrayView<byte>,
          ArrayView<byte>,
          ArrayView<byte>,
          ArrayView<byte>,
          ArrayView<byte>,
          ArrayView<byte>,
          ArrayView<byte>,
          ArrayView<ulong>,
          ArrayView<ulong>,
          ArrayView<ulong>
        >(Kernel);
        using (var pc1Buffer = accelerator.Allocate<byte>(PC1.Length))
        using (var pc2Buffer = accelerator.Allocate<byte>(PC2.Length))
        using (var ipBuffer = accelerator.Allocate<byte>(IP.Length))
        using (var iipBuffer = accelerator.Allocate<byte>(IIP.Length))
        using (var eBuffer = accelerator.Allocate<byte>(E.Length))
        using (var sBuffer = accelerator.Allocate<byte>(S.Length))
        using (var pBuffer = accelerator.Allocate<byte>(P.Length))
        using (var keysBuffer = accelerator.Allocate<ulong>(count))
        using (var blocksBuffer = accelerator.Allocate<ulong>(count))
        using (var outputBuffer = accelerator.Allocate<ulong>(count))
        {
          pc1Buffer.CopyFrom(PC1, 0, 0, PC1.Length);
          pc2Buffer.CopyFrom(PC2, 0, 0, PC2.Length);
          ipBuffer.CopyFrom(IP, 0, 0, IP.Length);
          iipBuffer.CopyFrom(IIP, 0, 0, IIP.Length);
          eBuffer.CopyFrom(E, 0, 0, E.Length);
          sBuffer.CopyFrom(S, 0, 0, S.Length);
          pBuffer.CopyFrom(P, 0, 0, P.Length);

          // Just for testing.
          ulong[] keys = Enumerable.Range(0, count).Select(x => (ulong)x).ToArray();
          keysBuffer.CopyFrom(keys, 0, 0, keys.Length);
          ulong[] blocks = Enumerable.Range(0, count).Select(x => (ulong)x).ToArray();
          blocksBuffer.CopyFrom(blocks, 0, 0, blocks.Length);

          var sw = new Stopwatch();
          sw.Start();
          kernel(
            count,
            pc1Buffer.View,
            pc2Buffer.View,
            ipBuffer.View,
            iipBuffer.View,
            eBuffer.View,
            sBuffer.View,
            pBuffer.View,
            keysBuffer.View,
            blocksBuffer.View,
            outputBuffer.View
          );
          accelerator.Synchronize();
          sw.Stop();

          Console.WriteLine("Kernel executed in {0} ms", sw.ElapsedMilliseconds);

          ulong[] output = outputBuffer.GetAsArray();
          for (int i = 0; i < 10; ++i) {
            Console.WriteLine("{0} {1} {2}", keys[i], blocks[i], output[i]);
          }
        }
      }
    }
  }
}

After lots of debugging I discovered that it starts getting wrong results on the line ulong b1 = Encrypt(roundKeys.Key0, b0, e, s, p) ^ a0;
Strange that Encrypt(roundKeys.Key0, b0, e, s, p) and a0 separately calculated correctly, but b1 always evaluates to 0.

I am running it on the Arch linux 5.0.10, .NET Core 2.2.105, NVIDIA GeForce GTX650 GPU with the 418.56 driver, CUDA 10.1, if it matters.

Support for Quantum Device

How about to support a quantum devices for parallel programing, using : https://github.com/Microsoft/Quantum/

I could not compile the tag v0.7.0

Hi,

Might I get your help in compiling the code?
I have downloaded Visual Studio 2019 and checkout the tag v0.7.0
I got so many compilation errors. At first glance, I thought this might be because I have a different development environment. However I found this piece of code which make me suspect that this code is not actually ready for compilation.

As you can see, the Acclerator.DefaultStream is trying to access to a non-static field by using static access, it is weird that you can compiled it and pushed it to Nuget.

Hope to hear from you soon.
By the way, your work is awesome. I have tried cross platform GPU solutions made by Julia's CuArray, Python's Numba, and I found ILGPU is a lot easier to use.

Thank you.

ILGPU 0.5 : misaligned address with byte arrays

Hi,
I just tested IlGPU 0.5.
Some of my kernels cannot be executed with the CUDA backend, because of "misaligned address" CudaException when i'm using byte arrays.
Here is a minified sample :

[Test]
public void Kernel6Test()
{
	using (var context = new Context())
	{
		using (var gpu = new CudaAccelerator(context))
		{
			var kernel = gpu.LoadKernel<GroupedIndex, ArrayView<byte>>(Kernel6);
			var gpu_array = gpu.Allocate<byte>(2);
			kernel(gpu.DefaultStream, new GroupedIndex(1, 1), gpu_array);
			var result = gpu_array.GetAsArray(gpu.DefaultStream);
			Assert.AreEqual(1, result[1]);
		}
	}
}

private static void Kernel6(GroupedIndex index, ArrayView<byte> gpu_array)
{
	gpu_array[1] = 1;
}

OpenCL Backend: Argument Mapping

The new CLArgumentMapper maps managed values from the .Net world to the OpenCL driver world. Since we have to map all intrinsic view types (which are an abstract concept) to actual kernel types, the CLTypeGenerator class should do the job for us. However, it is not currently known whether OpenCL drivers accept structures with nested pointer declarations (they will not be passed as pointers but as constants - which should be acceptable...). We need to investigate this issue and change the classes CLArgumentMapper/CLTypeGenerator accordingly (if necessary).

[Suggestion] Indexing in 3D view via Index2

Just as suggestion, want to have ability to indexing to 3D view with Index2 and variable, instead of deconstruction of Index2 into index.X, index.Y like its done in many shader languages.

Example:

void Func(ArrayView3D<T> array)
{
    Index2 index = new Index2();
    // suggestion
    array[index, 1] = somevalue;
    array[1, index] = somevalue;
    //instead of
    array[index.X, index.Y, 1] = somevalue;
    array[1, index.X, index.Y] = somevalue;
}

ReinterpretCast in ILGPU 0.6

Hello,
In the process of migrating from ILGpu 0.3 to ILGpu 0.6, the following "trick" I used to reinterpret int to float and float to int doesn't work any more :

/// <summary>
/// https://stackoverflow.com/questions/173133/how-to-reinterpret-cast-a-float-to-an-int-is-there-a-non-static-conversion-oper
/// </summary>
public static class ReinterpretCastExtensions {
    public static unsafe float AsFloat( this int n ) => *(float*)&n;
    public static unsafe int AsInt( this float n ) => *(int*)&n;
}

class ILGpuTest
{

    [Test]
    public void ReinterpretCastTest()
    {
        using (var context = new Context())
        {
            using (var gpu = new CudaAccelerator(context, 0))
            {
                var cpuArray = new float[2];
                cpuArray[0] = 3.14f;
                cpuArray[1] = 1888888888.AsFloat();
                var gpuArray = gpu.Allocate<float>(2);
                gpuArray.CopyFrom(gpu.DefaultStream, cpuArray, 0, 0, 2);
                var kernel = gpu.LoadKernel<GroupedIndex, ArrayView<float>>(ReinterpretCast_Kernel);
                kernel(gpu.DefaultStream, new GroupedIndex(1, 1), gpuArray);
                var result = gpuArray.GetAsArray(gpu.DefaultStream);
                Assert.AreEqual(4.14f, result[0], 0.000001f);
                Assert.AreEqual(1888888889, result[1].AsInt());
            }
        }
    }

    private static void ReinterpretCast_Kernel(GroupedIndex index, ArrayView<float> array)
    {
        array[0]++;
        int myInt = array[1].AsInt();
        array[1] = (myInt + 1).AsFloat();
    }
}

This kernel was working in ILGPU 0.3 but does not compile in ILGPU 0.6.
Is there a workaround I can use to reinterpret float to int in the kernel ?
Could such a feature be implemented in a future version of ILGPU ? (Maybe like the Reinterpret methods of Alea GPU)

no enought of syncronization primitives

I can't find __syncthreads

CudaAccelerator empty result

Hi @m4rs-mt,

found another issue, the BicubicFilter works like a charm with CPU accelerator, but with Cuda it will return an zeroed destination array.

May i found some kind of limitation, but there is no error or anything else to debug further.

Any hint from your side what could be wrong?

Thanks in advance.

public static class BicubicFilter
    {
        public static void Kernel(Index2 pixelIndex, int originalWidth, int originalHeight, int bpp, int destinationWidth, int destinationHeight, float scaleFactor, ArrayView<byte> source, ArrayView<byte> destination)
        {
            int destLineStride = destinationWidth * bpp;
            int srcLineStride = originalWidth * bpp;
            int destX = pixelIndex.X;
            int destY = pixelIndex.Y;

            float v = (float)destY / (float)(destinationHeight - 1);
            float u = (float)destX / (float)(destinationWidth - 1);


            float x2 = (u * (float)originalWidth) - 0.5f;
            int xint = (int)x2;
            float xfract = x2 - XMath.Floor(x2);

            float y2 = (v * (float)originalHeight) - 0.5f;
            int yint = (int)y2;
            float yfract = y2 - XMath.Floor(y2);



            for (int i = 0; i < bpp; ++i)
            {
                // 1st row
                var p00 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint - 1, yint - 1, i);
                var p01 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 0, yint - 1, i);
                var p02 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 1, yint - 1, i);
                var p03 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 2, yint - 1, i);

                // 2nd row
                var p10 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint - 1, yint + 0, i);
                var p11 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 0, yint + 0, i);
                var p12 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 1, yint + 0, i);
                var p13 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 2, yint + 0, i);

                // 3rd row
                var p20 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint - 1, yint + 1, i);
                var p21 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 0, yint + 1, i);
                var p22 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 1, yint + 1, i);
                var p23 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 2, yint + 1, i);

                // 4th row
                var p30 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint - 1, yint + 2, i);
                var p31 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 0, yint + 2, i);
                var p32 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 1, yint + 2, i);
                var p33 = GetPixelClamped(source, originalWidth, originalHeight, bpp, srcLineStride, xint + 2, yint + 2, i);


                float col0 = CubicHermite(p00, p10, p20, p30, xfract);
                float col1 = CubicHermite(p01, p11, p21, p31, xfract);
                float col2 = CubicHermite(p02, p12, p22, p32, xfract);
                float col3 = CubicHermite(p03, p13, p23, p33, xfract);
                float value = CubicHermite(col0, col1, col2, col3, yfract);
                value = XMath.Clamp(value, 0.0f, 255.0f);
                destination[(destY * destLineStride) + (destX * bpp) + i] = (byte)value;
            }
        }

        private static byte GetPixelClamped(ArrayView<byte> src, int originalWidth, int originalHeight, int bpp, int stride, int x, int y, int colorIndex)
        {
            x = XMath.Clamp(x, 0, originalWidth - 1);
            y = XMath.Clamp(y, 0, originalHeight - 1);
            return src[(y * stride) + (x * bpp) + colorIndex];
        }

        private static float CubicHermite(float A, float B, float C, float D, float t)
        {
            float a = -A / 2.0f + (3.0f * B) / 2.0f - (3.0f * C) / 2.0f + D / 2.0f;
            float b = A - (5.0f * B) / 2.0f + 2.0f * C - D / 2.0f;
            float c = -A / 2.0f + C / 2.0f;
            float d = B;

            return a * t * t * t + b * t * t + c * t + d;
        }
    }

XMath does not include Atan2

I'm using version 5.1 installed in a c# project by NuGet.
The upgrade guide for release 5.0 says that GPUMath was replaced by XMath. In doing so, some functions were apparently omitted, including Acos, Asin, Atan and ATan2. The MathIntrinsicKind enumeration still contains that seem to correspond to these functions.

What drove the removal of these functions?
Might they be included again in a future release?
(I'm hoping for .NET Core support but need these functions.)

Btw, this project is very nice to use. Thank you.

Won;t compile on VS2017

Hi, thank you for making this fabulous project. I downloaded master branch and v.0.3.0 tagged code. When I tried to compile it I've got many errors related to class visibility and using non-nullable value type in templates.

1>------ Build started: Project: ILGPU, Configuration: Debug Any CPU ------
1>ArrayViewExtensions.cs(242,38,242,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(242,38,242,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(254,38,254,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(254,38,254,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(264,38,264,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(264,38,264,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(308,29,308,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(308,29,308,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(152,38,152,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(152,38,152,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(173,38,173,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(173,38,173,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(163,38,163,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(163,38,163,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>ArrayViewExtensions.cs(222,29,222,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Grid.cs(70,48,70,60): error CS0246: The type or namespace name 'GroupedIndex' could not be found (are you missing a using directive or an assembly reference?)
1>ArrayViewExtensions.cs(222,29,222,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Grid.cs(80,49,80,62): error CS0246: The type or namespace name 'GroupedIndex2' could not be found (are you missing a using directive or an assembly reference?)
1>Grid.cs(70,48,70,60): error CS0246: The type or namespace name 'GroupedIndex' could not be found (are you missing a using directive or an assembly reference?)
1>Grid.cs(90,49,90,62): error CS0246: The type or namespace name 'GroupedIndex3' could not be found (are you missing a using directive or an assembly reference?)
1>Grid.cs(80,49,80,62): error CS0246: The type or namespace name 'GroupedIndex2' could not be found (are you missing a using directive or an assembly reference?)
1>Runtime\Accelerator.cs(267,32,267,40): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer' is less accessible than method 'Accelerator.Allocate(int)'
1>Runtime\Accelerator.cs(279,34,279,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer2D' is less accessible than method 'Accelerator.Allocate(Index2)'
1>Grid.cs(90,49,90,62): error CS0246: The type or namespace name 'GroupedIndex3' could not be found (are you missing a using directive or an assembly reference?)
1>Runtime\Accelerator.cs(292,34,292,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer2D' is less accessible than method 'Accelerator.Allocate(int, int)'
1>Runtime\Accelerator.cs(267,32,267,40): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer' is less accessible than method 'Accelerator.Allocate(int)'
1>Runtime\Accelerator.cs(304,34,304,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer3D' is less accessible than method 'Accelerator.Allocate(Index3)'
1>Runtime\Accelerator.cs(279,34,279,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer2D' is less accessible than method 'Accelerator.Allocate(Index2)'
1>Runtime\Accelerator.cs(318,34,318,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer3D' is less accessible than method 'Accelerator.Allocate(int, int, int)'
1>Runtime\Accelerator.cs(292,34,292,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer2D' is less accessible than method 'Accelerator.Allocate(int, int)'
1>Runtime\MemoryBufferExtensions.cs(92,30,92,54): error CS0540: 'MemoryBuffer2D.IMemoryBuffer<T, Index2>.View': containing type does not implement interface 'IMemoryBuffer<T, Index2>'
1>Runtime\MemoryBufferExtensions.cs(92,30,92,54): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'IMemoryBuffer<T, TIndex>'
1>Runtime\Accelerator.cs(304,34,304,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer3D' is less accessible than method 'Accelerator.Allocate(Index3)'
1>Runtime\MemoryBufferExtensions.cs(436,29,436,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\Accelerator.cs(318,34,318,42): error CS0050: Inconsistent accessibility: return type 'MemoryBuffer3D' is less accessible than method 'Accelerator.Allocate(int, int, int)'
1>Runtime\MemoryBufferExtensions.cs(442,29,442,41): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\MemoryBufferExtensions.cs(92,30,92,54): error CS0540: 'MemoryBuffer2D.IMemoryBuffer<T, Index2>.View': containing type does not implement interface 'IMemoryBuffer<T, Index2>'
1>Runtime\MemoryBufferExtensions.cs(92,55,92,59): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(454,30,454,54): error CS0540: 'MemoryBuffer3D.IMemoryBuffer<T, Index3>.View': containing type does not implement interface 'IMemoryBuffer<T, Index3>'
1>Runtime\MemoryBufferExtensions.cs(92,30,92,54): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'IMemoryBuffer<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(454,30,454,54): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'IMemoryBuffer<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(436,29,436,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\MemoryBufferExtensions.cs(835,29,835,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\MemoryBufferExtensions.cs(442,29,442,41): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\MemoryBufferExtensions.cs(848,29,848,41): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\MemoryBufferExtensions.cs(92,55,92,59): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(454,55,454,59): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(454,30,454,54): error CS0540: 'MemoryBuffer3D.IMemoryBuffer<T, Index3>.View': containing type does not implement interface 'IMemoryBuffer<T, Index3>'
1>Runtime\MemoryBufferExtensions.cs(24,29,24,52): error CS0540: 'MemoryBuffer.IMemoryBuffer<T, Index>.View': containing type does not implement interface 'IMemoryBuffer<T, Index>'
1>Runtime\MemoryBufferExtensions.cs(454,30,454,54): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'IMemoryBuffer<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(24,29,24,52): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'IMemoryBuffer<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(24,53,24,57): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(835,29,835,39): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\MemoryBufferExtensions.cs(561,21,561,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyFrom(T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(848,29,848,41): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView'
1>Runtime\MemoryBufferExtensions.cs(583,21,583,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyFrom(AcceleratorStream, T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(454,55,454,59): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(715,21,715,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyTo(T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(24,29,24,52): error CS0540: 'MemoryBuffer.IMemoryBuffer<T, Index>.View': containing type does not implement interface 'IMemoryBuffer<T, Index>'
1>Runtime\MemoryBufferExtensions.cs(738,21,738,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyTo(AcceleratorStream, T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(187,21,187,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyFrom(T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(24,29,24,52): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'IMemoryBuffer<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(212,21,212,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyFrom(AcceleratorStream, T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(24,53,24,57): error CS0453: The type 'T' must be a non-nullable value type in order to use it as parameter 'T' in the generic type or method 'ArrayView<T, TIndex>'
1>Runtime\MemoryBufferExtensions.cs(328,21,328,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyTo(T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(561,21,561,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyFrom(T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(351,21,351,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyTo(AcceleratorStream, T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(583,21,583,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyFrom(AcceleratorStream, T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(715,21,715,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyTo(T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(738,21,738,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer3D.CopyTo(AcceleratorStream, T[][][], Index3, Index3, Index3)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(187,21,187,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyFrom(T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(212,21,212,29): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyFrom(AcceleratorStream, T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(328,21,328,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyTo(T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Runtime\MemoryBufferExtensions.cs(351,21,351,27): warning CS3019: CLS compliance checking will not be performed on 'MemoryBuffer2D.CopyTo(AcceleratorStream, T[][], Index2, Index2, Index2)' because it is not visible from outside this assembly
1>Done building project "ILGPU.csproj" -- FAILED.
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

Invalid resource handle (CUDA_ERROR_INVALID_HANDLE)

Hi @m4rs-mt,

i try to run ILGPU (0.7.0) combined with ML.NET (Tensorflow.GPU).
Both separate are working correctly.

If i create a ILGPU context and load all my kernels and after that create a ML context it seems that ILGPU loses it's access to the GPU.

If i create the ML context first and after that the ILGPU context, everything is working.
Is there any way/workaround so that ILGPU keeps it access to the GPU?

non-nullable value type error

Hi guys,

I am trying to use ILGPU and kinect together, but having hard time.
The problem is :

kernel = gpu.LoadAutoGroupedStreamKernel<Index, CameraSpacePoint[], ArrayView<float>>(TO_GPU);

When I try to pass the frame information to the kernel, it says :

Error CS0453 The type 'CameraSpacePoint[]' must be a non-nullable value type in order to use it as parameter 'T1' in the generic type or method 'KernelLoaders.LoadAutoGroupedStreamKernel<TIndex, T1, T2>(Accelerator, Action<TIndex, T1, T2>)'

Is there a way around ?

Unexpected behaviour of GetSubView

I'm working on HyperSpectralImages, so I have many 3D arrays. And, for calculations, I need to get vector of values of some pixel.

For example:

imageView.GetSubView(Index3.Zero, new Index3(1, 1, imageView.Depth))

So, I assume, that extent with this size will return me the view like Z column, over values in [0,0,0],[0,0,1],[0,0,2] and etc. But, I getting the values in X row, not Z, like [0,0,0], [1,0,0], [2,0,0];

I assuming this as bug, rather than undocumented feature.

If it should work like this, @m4rs-mt can you provide some workaround for my case?
(because I'm using this for my diploma and it's critical for me)

Error on Geforce GTX 750 Ti: the function PowF does not have an intrinsic implementation for this backend

I am using ILGPU v0.7.1 and ILGPU.Algorithms v0.7.0. I got this exception when ILGPU compiles my code that have a call to XMath.Pow method.

Please kindly note that I have called context.EnableAlgorithm(), you can see the Intrinsic implementation of PowF function is inside the IntrinsicManager.

Thank you.

Incorrect calculations

Error occurs on Cuda accelerator, CPU is working perfectly fine.
Below calculations were exact on version 3. After upgrade to current version of ILGPU they are no longer correct.

Example 1

Function:

static void SigmoidKernel(Index2 index, ArrayView2D<double> input, ArrayView2D<double> output)
{
    output[index] = 1.0 / (XMath.Exp(-input[index]) + 1.0);
}

Test:
Expected: Double[,] [0,5, 0,999898970806092, 0,731058578630005, 0,880797077977882, 0,952574126822433]
Actual: Double[,] [0,5, 0,999898970832629, 0,731058573739949, 0,880797079331918, 0,952574124795022]

Example 2

Kernel:

public static void SumOfSquares(Index2 index, ArrayView2D<double> weights, ArrayView<double> target)
{
    double quantity = weights[index] * weights[index];
    Atomic.Add(ref target[0], quantity);
}

Kernel usage:

public double SumOfSquares(IMatrix<double> input)
{
    using (MemoryBuffer<double> buffer = this.accelerator.Allocate<double>(1))
    {
	var sumOfSquaresKernel = this.accelerator.LoadAutoGroupedStreamKernel<Index2, ArrayView2D<double>, ArrayView<double>>(DNNKernels.SumOfSquares);
	sumOfSquaresKernel(this.CreateIndex(input), input.MatrixStorage.Cast<ILGPUMatrixStorage<double>>().Buffer, buffer.View);
	var result = buffer.GetAsArray()[0];

	return result;
    }
}

Test:
Input matrix:
(0.551760, 0.993597, 0.753529);
(0.904038, 0.947857, 0.576675);
(0.061601, 0.785984, 0.617554);

Expected result to be 4.9106901777409995, but found 4.910690177741.

CopyTo performance

Hi @m4rs-mt,

I'm using your library to write an image operation method to convert different pixel formats (for example planar/interlaced to packed).

The conversion itself cost around 1ms (3840x2160 RGB image) but the CopyTo method takes around 15ms.
CopyFrom instead is also only around 1ms.

The code is converting each pixel with a simple kernel.
Using CudaAccelerator with Nvidia Quadro M2200.

Code snipped:

private ArrayPoolItem<byte> ConvertToBGR888xPacked(Image image, bool addAlphaChannel)
        {
            var size = image.Width * image.Height;
            var planarBufferSize = size * image.BytesPerPixel;
            var packedBufferSize = size * (addAlphaChannel ? image.BytesPerPixel + 1 : image.BytesPerPixel);

// ILGPU relevant code
            _lastInputBuffer8Bit.CopyFrom(image.GetData<byte>(), 0, 0, planarBufferSize);
            var sw = Stopwatch.StartNew();
            _kernelRGB888PlanarToBGR888xPacked(image.Width * image.Height, image.Width, image.Height, addAlphaChannel ? image.BytesPerPixel + 1 : image.BytesPerPixel, _lastInputBuffer8Bit.View, _lastOutputBuffer8Bit.View);
            _accelerator.Synchronize();
            sw.Stop();
            Console.WriteLine($"Sync: {sw.ElapsedMilliseconds}");
            sw.Restart();
            var item = new ArrayPoolItem<byte>(packedBufferSize, _bytePool);
            _lastOutputBuffer8Bit.CopyTo(item.Value, 0, 0, packedBufferSize);
            sw.Stop();
            Console.WriteLine($"Copy: {sw.ElapsedMilliseconds}");
            return item;
        }

Is there any way to increase the performance of CopyTo?

Thanks in advance.

ILGPU.Algorithms library

We are currently working on a new ILGPU.Algorithms library that will replace the existing (outdated) Lightning library. It offers the same high-level functionality as Scan and RadixSort and offers additional warp-wide and group-wide functions like Reduce or AllReduce. We will also move the actual XMath functions to the new library.

This leads to significant changes in the current way intrinsic functions are implemented. I will update the entire internal intrinisc-processing pipeline and add new transformation passes to specialize device-specific functions. This also provides the ability to add custom (device-specific) functions from the outside via user-defined code generator hooks.

Note that ILGPU.algorithms will be fully compatible with the upcoming ILGPU v0.7 version and provide full cross-platform compatibility.

Rework static analysis

As we have just switched to VS2019 (312280c) we should definitely rework the internal static-program analysis pipeline. The current ILGPU project relies on an (now) obsolete version of the well known Post-build Code Analysis. We should definitely upgrade the analysis to a newer version.

m4rs-mt / ilgpu Goto Github PK

ilgpu's People

Contributors

Stargazers

Watchers

Forkers

ilgpu's Issues

Example __launch_bounds__

Example __constant__

Example 1

Example 2

Recommend Projects

Recommend Topics

Recommend Org

Example constant