juliagpu / AMDGPU.jl
AMD GPU (ROCm) programming in Julia
License: Other
To fix issues as raised in JuliaGPU/ROCArrays.jl#4 (comment)
It should be possible to support non-bitstype arguments and possibly on-device allocations with a bit of elbow grease, as long as we allocate all non-bitstype structures entirely on HSA finegrained memory blocks (even when they reference other device memory blocks). We'll probably need to provide:
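As an illustration only (this is not the list of items the author left unstated, and every name below is hypothetical, not AMDGPU.jl API), uploading a non-bitstype object entirely onto fine-grained memory might be sketched as:

```julia
# Hypothetical pseudocode; `hsa_alloc_finegrained` is an assumed helper.
# The key invariant: the struct AND everything it references live in HSA
# fine-grained memory, so the pointers it carries stay valid on the device.
function upload_nonbits(agent, obj::T) where T
    buf = hsa_alloc_finegrained(agent, sizeof(T))
    for field in fieldnames(T)
        val = getfield(obj, field)
        isbits(val) || (val = upload_nonbits(agent, val))  # recurse into references
        # store `val` (or its device-visible pointer) at the field's offset in `buf`
    end
    return buf
end
```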
Hey!
I wanted to try AMDGPU.jl, but I fail at the first step when running a simple test code.
The problem:
I installed: ROCm install Ubuntu
Then: ]add AMDGPU
Then I tried to run this code: the Running-a-simple-kernel example
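For reference, the example in question was roughly the following (reconstructed from memory of the AMDGPU.jl documentation of that era; details may differ from the exact manual page):

```julia
using AMDGPU

# Element-wise vector addition, one work-item per element.
function vadd!(c, a, b)
    i = workitemIdx().x
    c[i] = a[i] + b[i]
    return
end

N = 32
a = rand(Float64, N); b = rand(Float64, N)
a_d = AMDGPU.HSAArray(a)     # the line that errors below
b_d = AMDGPU.HSAArray(b)
c_d = similar(a_d)

wait(@roc groupsize=N vadd!(c_d, a_d, b_d))
@assert Array(c_d) ≈ a + b
```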
The run result:
ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
[1] getproperty at ./Base.jl:33 [inlined]
[2] getindex at ./refvalue.jl:32 [inlined]
[3] HSAArray(::Array{Float64,1}) at /home/user/.julia/packages/AMDGPU/JBXsp/src/hsaarray.jl:34
[4] top-level scope at /home/user/repo/amd_test/tests/test_AMD.jl:9
[5] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1088
[6] include_string(::Module, ::String, ::String) at ./loading.jl:1096
on the first AMDGPU line:
a_d = AMDGPU.HSAArray(a)
I have Ubuntu 20.04 and am running this test on multiple Radeon VII cards.
I think the problem should be easy to reproduce, since ]test AMDGPU throws the same error.
ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
[1] getproperty at ./Base.jl:33 [inlined]
[2] getindex at ./refvalue.jl:32 [inlined]
[3] get_default_agent() at /home/user/.julia/packages/AMDGPU/JBXsp/src/agent.jl:109
[4] top-level scope at /home/user/.julia/packages/AMDGPU/JBXsp/test/runtests.jl:28
[5] include(::String) at ./client.jl:457
[6] top-level scope at none:6
in expression starting at /home/user/.julia/packages/AMDGPU/JBXsp/test/runtests.jl:28
ERROR: Package AMDGPU errored during testing
Does anyone know what I am doing wrong?
For some reason, 0-argument kernels end up with an HSA_STATUS_ERROR_INVALID_ARGUMENT error.
So that users don't accidentally specify a too-large groupsize, we should throw an error if it's greater than typemax(UInt16), and possibly even query the agent to find out the real maximum value.
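The check described above might be sketched as follows (`workgroup_max_size` is a hypothetical accessor for the agent's HSA workgroup-size limit, not an existing AMDGPU.jl function):

```julia
# Sketch of the proposed validation, to run before dispatching a kernel.
function validate_groupsize(groupsize::Integer, agent)
    groupsize <= typemax(UInt16) ||
        throw(ArgumentError("groupsize $groupsize exceeds typemax(UInt16) = $(typemax(UInt16))"))
    limit = workgroup_max_size(agent)   # assumed HSA agent query
    groupsize <= limit ||
        throw(ArgumentError("groupsize $groupsize exceeds agent limit $limit"))
    return groupsize
end
```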
My understanding is that the 580 is going out of support, but for what it's worth, here is a test run and a console session with failures.
Is there any expectation for these tests to ever pass on 580?
Let me know how I can help fix these issues (if possible). I have zero knowledge of the low-level implementation of the GPU support.
[Julia REPL startup banner elided: Version 1.6.0-beta1 (2021-01-08), official https://julialang.org/ release]
julia> using AMDGPU; using LinearAlgebra
julia> N = 100;
julia> m = rand(Float64, N, N); a = rand(Float64, N); b = rand(Float64, N);
julia> m_g = ROCArray(m); a_g = ROCArray(a); b_g = ROCArray(b);
julia> versioninfo()
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 7 1700 Eight-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.0 (ORCJIT, znver1)
julia> mul!(b_g, m_g, a_g)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
(the two warnings above were repeated four times)
Memory access fault by GPU node-1 (Agent handle: 0x19b4290) on address 0x640000. Reason: Page not present or supervisor privilege.
signal (6): Aborted
in expression starting at REPL[4]:1
Allocations: 34952292 (Pool: 34939863; Big: 12429); GC: 39
fish: “~/localcompiles/julia-1.6.0-bet…” terminated by signal SIGABRT (Abort)
Test Summary: | Pass Error Broken Total
AMDGPU | 932 15 81 1028
Core | 1 1
HSA | 16 6 22
HSA Status Error | 1 1
Agent | 5 5
Memory | 10 6 16
Pointer-based | 3 3
Array-based | 2 2
Type-based | 1 1
Pointer information | 1 1
Page-locked memory (OS allocations) | 5 5
Exceptions | 3 3
Mutable structs | 1 1
Codegen | 3 3
Device Functions | 175 77 252
ROCArray | 737 9 3 749
GPUArrays test suite | 737 9 746
math | 8 8
indexing scalar | 249 249
input output | 5 5
value constructors | 36 36
indexing multidimensional | 25 9 34
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
empty array | 8 7 15
1D | 1 1 2
2D with other index Colon() | 2 2 4
2D with other index 1:5 | 2 2 4
2D with other index 5 | 2 2 4
GPU source | 2 1 3
CPU source | 2 1 3
JuliaGPU/CUDA.jl#461: sliced setindex | 1 1
interface | 7 7
conversions | 72 72
constructors | 335 335
ROCm External Libraries | 3 3
ERROR: LoadError: Some tests did not pass: 932 passed, 0 failed, 15 errored, 81 broken.
in expression starting at /home/stefan/.julia/packages/AMDGPU/UpYiP/test/runtests.jl:29
ERROR: Package AMDGPU errored during testing
~> /opt/rocm/bin/rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 1700 Eight-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 1700 Eight-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3000
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32878744(0x1f5b098) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32878744(0x1f5b098) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx803
Uuid: GPU-XX
Marketing Name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26591(0x67df)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1366
BDFID: 2304
Internal Node ID: 1
Compute Unit: 36
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8388608(0x800000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
~> /opt/rocm/opencl/bin/clinfo
Number of platforms: 2
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.1 Mesa 20.3.4 - kisak-mesa PPA
Platform Name: Clover
Platform Vendor: Mesa
Platform Extensions: cl_khr_icd
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3212.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: Clover
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Max compute units: 36
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 1366Mhz
Address bits: 64
Max memory allocation: 6871947673
Image support: No
Max size of kernel argument: 1024
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 27487790692
Constant buffer size: 67108864
Max number of constant args: 16
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 0
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7f589bdbab60
Name: Radeon RX 580 Series (POLARIS10, DRM 3.40.0, 5.4.0-65-generic, LLVM 11.0.1)
Vendor: AMD
Device OpenCL C version: OpenCL C 1.1
Driver version: 20.3.4 - kisak-mesa PPA
Profile: FULL_PROFILE
Version: OpenCL 1.1 Mesa 20.3.4 - kisak-mesa PPA
Extensions: cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
Device Topology: PCI[ B#9, D#0, F#0 ]
Max compute units: 36
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1366Mhz
Address bits: 64
Max memory allocation: 7301444400
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 8192
Max samplers within kernel: 26591
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8589934592
Constant buffer size: 7301444400
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 3006477104
Max global variable size: 7301444400
Max global variable preferred total size: 8589934592
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7f589388acf0
Name: gfx803
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3212.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 1.2
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
The short of it: our GC doesn't know when a kernel holds a reference to HSA-allocated objects, so long-running kernels could end up seeing their resources finalized from under them. Fixing this should be reasonably simple, if slightly more expensive: we add references to them (either directly in the HSAQueue object, or even globally) within @roc, and then all wait calls will remove those references once the kernel holding them completes. This would imply that direct usages of rocfunction+roccall will have to manually preserve such objects, but I can't think of any easy way around that.
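A minimal sketch of this rooting scheme, with hypothetical names (the real implementation would hang off @roc and wait):

```julia
# signal => arguments rooted for the lifetime of the kernel launch.
const _kernel_roots = IdDict{Any,Vector{Any}}()

function root_args!(signal, args...)
    _kernel_roots[signal] = collect(Any, args)   # keep them GC-visible
end

function wait_and_unroot!(signal)
    wait(signal)                                 # kernel has completed
    delete!(_kernel_roots, signal)               # roots may now be finalized
end
```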
This is slightly tricky/undefined since, when two or more HSAArrays are broadcast together, one could allocate the new HSAArray on any of their devices. Maybe we should just fall back to Array if not operating in-place?
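One possible policy, sketched with a hypothetical `device_of` helper: keep the result on the destination's device when all operands agree, and otherwise bail out to the CPU.

```julia
# Everything here is a sketch, not AMDGPU.jl API.
function broadcast_result_type(dest, srcs...)
    if any(s -> device_of(s) != device_of(dest), srcs)  # mismatched devices
        return Array            # conservative fallback to host memory
    end
    return typeof(dest)         # all on one device: stay there
end
```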
julia> versioninfo()
Julia Version 1.5.4-pre.0
Commit 599ecd8210* (2020-11-10 10:50 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 7 3700X 8-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, znver2)
Environment:
JULIA_PKG_DEVDIR = /home/vchuravy/src
Status `/tmp/jl_Mx1zez/Manifest.toml`
[21141c5a] AMDGPU v0.2.1
[621f4979] AbstractFFTs v0.5.0
[79e6a3ab] Adapt v2.3.0
[56f22d72] Artifacts v1.3.0
[b99e7846] BinaryProvider v0.5.10
[fa961155] CEnum v0.4.1
[34da2185] Compat v3.25.0
[e66e0078] CompilerSupportLibraries_jll v0.3.4+0
[187b0558] ConstructionBase v1.0.0
[864edb3b] DataStructures v0.18.8
[7a1cc6ca] FFTW v1.2.4
[f5851436] FFTW_jll v3.3.9+6
[1a297f60] FillArrays v0.10.1
[0c68f7d7] GPUArrays v5.1.0 ⚲
[61eb1bfa] GPUCompiler v0.8.3
[1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
[692b3bcd] JLLWrappers v1.1.3
[929cbde3] LLVM v3.5.1
[856f044c] MKL_jll v2020.2.254+0
[1914dd2f] MacroTools v0.5.6
[ca575930] NetworkOptions v1.2.0
[efe28fd5] OpenSpecFun_jll v0.5.3+4
[bac558e1] OrderedCollections v1.3.2
[189a3867] Reexport v0.2.0
[ae029012] Requires v1.1.1
[6c6a2e73] Scratch v1.0.3
[efcf1570] Setfield v0.7.0
[276daf66] SpecialFunctions v1.1.0
[a759f4b9] TimerOutputs v0.5.7
[2a0f44e3] Base64
[ade2ca70] Dates
[8bb1440f] DelimitedFiles
[8ba89e20] Distributed
[9fa8497b] Future
[b77e0a4c] InteractiveUtils
[76f85450] LibGit2
[8f399da3] Libdl
[37e2e46d] LinearAlgebra
[56ddb016] Logging
[d6f4376e] Markdown
[a63ad114] Mmap
[44cfe95a] Pkg
[de0858da] Printf
[3fa0cd96] REPL
[9a3f8284] Random
[ea8e919c] SHA
[9e88b42a] Serialization
[1a1011a3] SharedArrays
[6462fe0b] Sockets
[2f01184e] SparseArrays
[10745b16] Statistics
[8dfed614] Test
[cf7118a7] UUIDs
[4ec0a83e] Unicode
unhandled address space
UNREACHABLE executed at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Target/AMDGPU/SIISelLowering.cpp:1201!
signal (6): Aborted
in expression starting at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:29
gsignal at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
llvm_unreachable_internal at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Support/ErrorHandling.cpp:209
isLegalAddressingMode at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Target/AMDGPU/SIISelLowering.cpp:1201 [inlined]
isLegalAddressingMode at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Target/AMDGPU/SIISelLowering.cpp:1124
isLegalAddressingMode at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/include/llvm/CodeGen/BasicTTIImpl.h:234 [inlined]
getGEPCost at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/include/llvm/Analysis/TargetTransformInfoImpl.h:765
getUserCost at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Analysis/TargetTransformInfo.cpp:209
getUserCost at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/include/llvm/Analysis/TargetTransformInfo.h:330 [inlined]
isFreeInLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:1284 [inlined]
isNotUsedOrFreeInLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:1312 [inlined]
sinkRegion at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:535
runOnLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:382
runOnLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:219 [inlined]
runOnLoop at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Transforms/Scalar/LICM.cpp:202
runOnFunction at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/Analysis/LoopPass.cpp:225
runOnFunction at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1648
runOnModule at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1685
runOnModule at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1750 [inlined]
run at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/LegacyPassManager.cpp:1863
LLVMRunPassManager at /home/vchuravy/src/julia/deps/srccache/llvm-9.0.1/lib/IR/Core.cpp:4022
macro expansion at /home/vchuravy/.julia/packages/LLVM/MZvb3/src/util.jl:114 [inlined]
LLVMRunPassManager at /home/vchuravy/.julia/packages/LLVM/MZvb3/lib/libLLVM_h.jl:2881 [inlined]
run! at /home/vchuravy/.julia/packages/LLVM/MZvb3/src/passmanager.jl:39
#62 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/optim.jl:24 [inlined]
ModulePassManager at /home/vchuravy/.julia/packages/LLVM/MZvb3/src/passmanager.jl:33
unknown function (ip: 0x7f55b990b811)
optimize! at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/optim.jl:19
macro expansion at /home/vchuravy/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
macro expansion at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:114 [inlined]
macro expansion at /home/vchuravy/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
#codegen#87 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:106
codegen##kw at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:63 [inlined]
#compile#85 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
compile at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
#rocfunction_compile#226 at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:327 [inlined]
rocfunction_compile##kw at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:324
unknown function (ip: 0x7f5484ed104c)
#check_cache#106 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
unknown function (ip: 0x7f5484ed0d63)
check_cache##kw at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:15
unknown function (ip: 0x7f5484ecee53)
broadcast_kernel at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/host/broadcast.jl:60 [inlined]
#cached_compilation#107 at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:0
unknown function (ip: 0x7f5484ecec23)
cached_compilation##kw at /home/vchuravy/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65
unknown function (ip: 0x7f5484ecea3c)
#rocfunction#223 at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:316
rocfunction at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:314 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/execution.jl:184 [inlined]
#gpu_call#263 at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/array.jl:15
unknown function (ip: 0x7f5484ece6e8)
gpu_call##kw at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/src/array.jl:14 [inlined]
#gpu_call#1 at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/device/execution.jl:67 [inlined]
gpu_call##kw at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/device/execution.jl:46 [inlined]
copyto! at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/host/broadcast.jl:68 [inlined]
copyto! at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/src/host/broadcast.jl:76 [inlined]
materialize! at ./broadcast.jl:848 [inlined]
materialize! at ./broadcast.jl:845
unknown function (ip: 0x7f5484ece305)
#18 at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/test/testsuite/gpuinterface.jl:4
#176#test_interface at /home/vchuravy/.julia/packages/GPUArrays/eVYIC/test/testsuite.jl:43
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:77 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:77 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:65 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
macro expansion at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:63 [inlined]
macro expansion at /home/vchuravy/builds/julia-1.5/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1115 [inlined]
top-level scope at /home/vchuravy/.julia/packages/AMDGPU/lrlUy/test/runtests.jl:31
jl_toplevel_eval_flex at /home/vchuravy/src/julia/src/toplevel.c:834
jl_parse_eval_all at /home/vchuravy/src/julia/src/ast.c:913
jl_load_rewrite at /home/vchuravy/src/julia/src/toplevel.c:914
include at ./client.jl:457
jl_apply at /home/vchuravy/src/julia/src/julia.h:1690 [inlined]
do_call at /home/vchuravy/src/julia/src/interpreter.c:117
eval_value at /home/vchuravy/src/julia/src/interpreter.c:206
eval_stmt_value at /home/vchuravy/src/julia/src/interpreter.c:157 [inlined]
eval_body at /home/vchuravy/src/julia/src/interpreter.c:552
jl_interpret_toplevel_thunk at /home/vchuravy/src/julia/src/interpreter.c:660
jl_toplevel_eval_flex at /home/vchuravy/src/julia/src/toplevel.c:840
jl_toplevel_eval_flex at /home/vchuravy/src/julia/src/toplevel.c:790
jl_toplevel_eval_in at /home/vchuravy/src/julia/src/toplevel.c:883
eval at ./boot.jl:331
exec_options at ./client.jl:272
_start at ./client.jl:506
jfptr__start_31419 at /home/vchuravy/builds/julia-1.5/usr/lib/julia/sys.so (unknown line)
jl_apply at /home/vchuravy/src/julia/ui/../src/julia.h:1690 [inlined]
true_main at /home/vchuravy/src/julia/ui/repl.c:106
main at /home/vchuravy/src/julia/ui/repl.c:227
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
_start at /home/vchuravy/builds/julia-1.5/usr/bin/julia (unknown line)
Allocations: 227026595 (Pool: 226953636; Big: 72959); GC: 246
ERROR: Package AMDGPU errored during testing (received signal: 6)
It should start at a bit longer than the minimum possible kernel launch-and-complete latency, and then go up to a user-defined maximum.
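The idea above can be sketched as a simple exponential backoff loop (the constants are placeholders, not measured launch latencies):

```julia
# Sketch of the proposed wait loop.
function backoff_wait(done::Function; min_sleep=5e-6, max_sleep=1e-3)
    t = min_sleep                   # start near the minimum launch+complete latency
    while !done()
        sleep(t)
        t = min(2t, max_sleep)      # grow up to the user-defined maximum
    end
end
```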
If I try to build AMDGPU:
(@v1.5) pkg> build AMDGPU
Building AMDGPU → `~/.julia/packages/AMDGPU/nnddY/deps/build.log`
┌ Error: Error building `AMDGPU`:
│ WARNING: redefinition of constant config_path. This may fail, cause incorrect answers, or produce other errors.
│ WARNING: redefinition of constant previous_config_path. This may fail, cause incorrect answers, or produce other errors.
│ Inconsistency detected by ld.so: dl-close.c: 223: _dl_close_worker: Assertion `(*lp)->l_idx >= 0 && (*lp)->l_idx < nloaded' failed!
└ @ Pkg.Operations ~/Documents/languages/julia/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:949
Now
julia> using BinaryProvider
shell> cat .julia/packages/AMDGPU/nnddY/deps/rocm-external/ext.jl
# autogenerated file, do not edit
const ext_libs_configured = false
julia> include("/home/chriselrod/.julia/packages/AMDGPU/nnddY/deps/build.jl")
paths = ["/opt/rocm/hsa/lib"]
[ Info: Found useable ld.lld at /opt/rocm/llvm/bin/ld.lld
WARNING: redefinition of constant config_path. This may fail, cause incorrect answers, or produce other errors.
WARNING: redefinition of constant previous_config_path. This may fail, cause incorrect answers, or produce other errors.
WARNING: replacing module Previous.
shell> cat .julia/packages/AMDGPU/nnddY/deps/rocm-external/ext.jl
# autogenerated file, do not edit
const librocfft = "/opt/rocm/lib/librocfft.so"
const librocalution = "/opt/rocm/lib/librocalution.so"
const libmiopen = "/opt/rocm/lib/libMIOpen.so"
const ext_libs_configured = true
const libhip = "libamdhip64"
const librocrand = "/opt/rocm/rocrand/lib/librocrand.so"
const librocsparse = "/opt/rocm/lib/librocsparse.so"
const librocblas = "/opt/rocm/lib/librocblas.so"
If I don't first run `using BinaryProvider`, the script exits on the `include` without generating a more complete ext.jl.
As noted in ROCm/ROCgdb#5, we load libhsa-runtime64.so, which could be any major version of ROCR (a bad idea). We should instead explicitly load libhsa-runtime64.so.1, which will then let us use ROCgdb to debug AMDGPU.jl!
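The fix is essentially a one-liner; a sketch of loading the versioned soname via Libdl:

```julia
using Libdl

# Load the versioned soname rather than the unversioned symlink, so we bind
# to a known ROCR major version and ROCgdb's interception keeps working.
const libhsa = Libdl.dlopen("libhsa-runtime64.so.1")
```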
For some reason the tests allocate memory indefinitely. I suspect that it's mostly coming from cached executables, which should probably be kept in an LRU cache.
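A sketch of LRU-style eviction for such an executable cache (names and the capacity are hypothetical):

```julia
const MAX_CACHED = 128
const exe_cache = Vector{Pair{UInt,Any}}()   # oldest first, newest last

function cache_exe!(key::UInt, exe)
    filter!(p -> p.first != key, exe_cache)  # refresh position if already present
    push!(exe_cache, key => exe)
    # evict the least recently used entry once over capacity
    length(exe_cache) > MAX_CACHED && popfirst!(exe_cache)
    return exe
end
```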
Read-only memory can be allocated via the HSA runtime, and can potentially be much faster for reads than regular global memory. We should support working with this memory via the Mem API.
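As a sketch only: the call shapes below loosely follow a CUDAdrv-style Mem module, and the `readonly` flag is an assumption about how the feature might be exposed, not an existing keyword.

```julia
# Hypothetical extension of the Mem API.
buf = Mem.alloc(agent, sizeof(Float64) * N; readonly=true)  # assumed flag
Mem.upload!(buf, host_data)   # populate before the first kernel reads it
```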
As pointed out in #68 (comment), objects like RuntimeEvent{HSAStatusSignal} print as some monstrosity that can easily be mistaken for an error. We should make sure that all user-facing objects print decently.
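The usual fix is a compact `Base.show` method on the wrapper type; a sketch (the printed form is illustrative, not a settled design):

```julia
# Compact printer so the default field dump never reaches the user.
Base.show(io::IO, ev::RuntimeEvent{T}) where T =
    print(io, "RuntimeEvent{", nameof(T), "}(...)")
```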
Initially I was told that no HSA library was available, so I installed https://aur.archlinux.org/packages/hsa-rocr/, then:
(@v1.4) pkg> build AMDGPU
Building AMDGPU → `~/.julia/packages/AMDGPU/6zgIY/deps/build.log`
┌ Error: Error building `AMDGPU`:
│ [ Info: libhsa-runtime64.so: true
│ ERROR: LoadError: LoadError: could not load library "/opt/rocm/hsa/lib/libhsa-runtime64.so"
│ /home/akako/Downloads/julia-1.4.1/bin/../lib/julia/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /opt/rocm/hsa/lib/libhsa-runtime64.so)
│ Stacktrace:
Since, at the moment, many of the dependencies required by AMDGPU.jl are not distributed as artifacts, and system configuration can be further complicated by version/architecture mismatches, it would be useful to have a Docker image that includes Julia with AMDGPU preinstalled (alternatively, users can use their own Julia environment by exposing ~/.julia to the container as a volume).
Since JuliaGPU/AMDGPUnative.jl#62 was merged, we should document the fact that math intrinsics exist and show how to use them in a kernel.
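An illustrative kernel using a device math intrinsic. Whether the intrinsic is spelled `AMDGPU.sin` or lives in a submodule is an assumption here; the documentation proposed above would pin this down.

```julia
# Each work-item applies the device-side sin to its element (name assumed).
function sin_kernel!(a)
    i = workitemIdx().x
    @inbounds a[i] = AMDGPU.sin(a[i])
    return
end
```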
Wavefronts could race when making the hostcall or when filling the buffer.
When a kernel traps or otherwise does a bad thing, it will (usually) inactivate its associated queue. However, the soft-wait wait() call will just keep spinning, waiting for a dead queue to signal it. We should explicitly check the queue status while spinning and, if the queue becomes inactivated, throw an appropriate error.
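A sketch of the fix; `signal_done` and `queue_active` stand in for the corresponding HSA signal/queue queries and are hypothetical names:

```julia
function wait_checked(signal, queue)
    while !signal_done(signal)
        # A trapped kernel inactivates its queue; surface that instead of spinning.
        queue_active(queue) ||
            error("HSA queue became inactive while waiting on a kernel")
        sleep(0.001)   # soft-wait
    end
end
```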
It looks like the new Buildkite CI isn't testing the rocBLAS, rocFFT, and rocRAND bindings (probably my fault). We should ensure that, if we're running under CI, all of those libraries are available.
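One way to fail fast, sketched below: probe each library with Libdl when the standard `CI` environment variable is set (the library list is an assumption about what the bindings need).

```julia
using Libdl

if get(ENV, "CI", "false") == "true"
    for lib in ("librocblas.so", "librocfft.so", "librocrand.so")
        Libdl.dlopen(lib; throw_error=false) === nothing &&
            error("$lib is required for CI runs but was not found")
    end
end
```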
The errors in question: (minus all the file calls)
Got exception outside of a @test
Conversion of boxed type Array{Float32(Also Int32,Int64,Complex{Float32}),1} is not allowed
Test threw exception
Expression: compare((a->begin
a[view(i, 1, :), :]
end), AT, a)
GPU compilation of kernel index_kernel(AMDGPU.ROCKernelContext, ROCDeviceArray{Float64,2,AMDGPU.AS.Global}, ROCDeviceArray{Float64,2,AMDGPU.AS.Global}, Tuple{Int64,Int64}, Tuple{SubArray{Int64,1,Array{Int64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true},Base.Slice{Base.OneTo{Int64}}}) failed
KernelError: passing and using non-bitstype argument
Argument 6 to your kernel function is of type Tuple{SubArray{Int64,1,Array{Int64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true},Base.Slice{Base.OneTo{Int64}}}, which is not isbits:
.1 is of type SubArray{Int64,1,Array{Int64,2},Tuple{Int64,Base.Slice{Base.OneTo{Int64}}},true} which is not isbits.
.parent is of type Array{Int64,2} which is not isbits.
Expression: typeof(A[other, []]) == typeof(AT(Ac[other, []]))
HSA error (code #4097, HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.)
Test threw exception
Expression: x[:, :, 2] == y
Not implemented
Test result:
Test Summary: | Pass Error Broken Total
AMDGPU | 889 22 71 982
Core | 20 20
HSA | 32 1 33
Codegen | 3 3
Device Functions | 110 68 178
ROCArray | 723 22 2 747
GPUArrays test suite | 723 22 745
math | 8 8
indexing scalar | 243 6 249
errors and warnings | 12 12
getindex with Float32 | 34 34
getindex with Float64 | 34 34
getindex with Int32 | 34 34
getindex with Int64 | 34 34
getindex with Complex{Float32} | 34 34
getindex with Complex{Float64} | 34 34
setindex! with Float32 | 1 1 2
setindex! with Float64 | 1 1 2
setindex! with Int32 | 1 1 2
setindex! with Int64 | 1 1 2
setindex! with Complex{Float32} | 1 1 2
setindex! with Complex{Float64} | 1 1 2
issue #42 with Float32 | 3 3
issue #42 with Float64 | 3 3
issue #42 with Int32 | 3 3
issue #42 with Int64 | 3 3
issue #42 with Complex{Float32} | 3 3
issue #42 with Complex{Float64} | 3 3
get/setindex! | 3 3
input output | 5 5
value constructors | 36 36
indexing multidimensional | 17 16 33
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
sliced setindex, CPU source | 1 1
empty array | 8 7 15
1D | 1 1 2
2D with other index Colon() | 2 2 4
2D with other index 1:5 | 2 2 4
2D with other index 5 | 2 2 4
GPU source | 3 3
CPU source | 3 3
interface | 7 7
conversions | 72 72
constructors | 335 335
ROCm External Libraries | 2 2
GPU: RX 480
ROCm: 3.9.x, built from source (might've messed up)
rocminfo output:
*******
Agent 2
*******
Name: gfx803
Uuid: GPU-XX
Marketing Name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26591(0x67df)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1266
BDFID: 256
Internal Node ID: 1
Compute Unit: 36
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8388608(0x800000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Running on Void Linux (glibc). AMD RX 570 8 GB Sapphire.
Using my fork, which uses the Yggdrasil HSA artifacts: https://github.com/0x0f0f0f/AMDGPU.jl/tree/artifacts
gen(x) = rand(x,x)
ROCArray(gen(10)) * ROCArray(gen(10))
Memory access fault by GPU node-1 (Agent handle: 0x556f0ed38c80) on address 0xa0000. Reason: Page not present or supervisor privilege.
signal (6): Aborted
in expression starting at REPL[9]:1
Allocations: 40373585 (Pool: 40359774; Big: 13811); GC: 45
Aborted
https://github.com/ROCmSoftwarePlatform/MIOpen
This one should be doable in pieces, as MIOpen is a pretty large library. Having even partial MIOpen support will make FluxML/Flux.jl#938 more useful.
Currently only special globals are initialized during executable compilation and linking. We should allow users to provide key-value pairs to @roc that map a global name to an initialization function.
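The key-value idea above can be sketched in plain Julia. This is a hypothetical shape for the API, not the real AMDGPU.jl interface: global names map to initializer functions that produce the bytes the runtime would copy into each global's device allocation at link time.

```julia
# Hypothetical sketch: map a global's name to a function producing its
# initial contents (names like :my_counter are illustrative only).
global_inits = Dict{Symbol,Function}(
    :my_counter => () -> zeros(UInt32, 1),
    :my_flag    => () -> UInt8[0],
)

# At link time the runtime would look up each declared global and invoke
# its initializer to fill the corresponding device-side allocation.
for name in sort!(collect(keys(global_inits)))
    bytes = global_inits[name]()
    println(name, " => ", sizeof(bytes), " bytes")
end
```

Something like `@roc globals=(:my_counter => init_fn,)` could then thread these pairs through to the linker step.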
We should be able to guess how well a given kernel can occupy a given piece of hardware. We should then be able to allow @roc groupsize=auto ... to automatically select a groupsize when the choice is irrelevant to the given kernel.
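One possible occupancy heuristic, as a hedged plain-Julia sketch: every number and name below is illustrative (the real limits come from the agent's ISA info), but the shape of the calculation is the usual one — derive how many waves the register budget allows, convert that to work-items, and round down to a power of two.

```julia
# Hypothetical groupsize heuristic; regs_per_simd/max_group are stand-in
# limits, not values queried from real hardware.
function auto_groupsize(kernel_regs::Int; wavefront = 64, max_group = 1024,
                        regs_per_simd = 256)
    # How many waves can coexist on one SIMD given the kernel's VGPR usage?
    waves_per_simd = max(1, regs_per_simd ÷ max(kernel_regs, 1))
    candidate = min(max_group, waves_per_simd * wavefront)
    # Round down to a power of two for friendlier scheduling.
    1 << floor(Int, log2(candidate))
end

println(auto_groupsize(32))   # light register use -> large groups (512)
println(auto_groupsize(200))  # heavy register use -> small groups (64)
</imports>
```

A real implementation would read the register count from the compiled code object and the limits from `rocminfo`-style agent queries.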
Void Linux, glibc. Built libhsakmt.so and libhsa-runtime64.so manually.
(@v1.5) pkg> build AMDGPU
Building AMDGPU → `~/.julia/packages/AMDGPU/lrlUy/deps/build.log`
(@v1.5) pkg> ^C
julia> using AMDGPU
FATAL ERROR: Symbol "ccalllib_libhsa-runtime64445"not found
signal (6): Aborted
in expression starting at REPL[2]:1
raise at /builddir/glibc-2.30/signal/../sysdeps/unix/sysv/linux/raise.c:51
abort at /builddir/glibc-2.30/stdlib/abort.c:79
unknown function (ip: 0x7f0b696b417a)
unknown function (ip: 0x7f0b697b9a31)
unknown function (ip: 0x7f0b697b9712)
unknown function (ip: 0x7f0b697ba5f5)
unknown function (ip: 0x7f0b697bb458)
unknown function (ip: 0x7f0b6973c28c)
unknown function (ip: 0x7f0b696ee04f)
unknown function (ip: 0x7f0b69704a05)
unknown function (ip: 0x7f0b69705cc8)
unknown function (ip: 0x7f0b69706051)
unknown function (ip: 0x7f0b69709669)
unknown function (ip: 0x7f0b6970fa2f)
unknown function (ip: 0x7f0b6970fb7d)
unknown function (ip: 0x7f0b697103a9)
unknown function (ip: 0x7f0b697ba02d)
unknown function (ip: 0x7f0b697bb458)
unknown function (ip: 0x7f0b6973c28c)
jl_apply_generic at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b697709a9)
jl_init_restored_modules at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b5a3d3fb3)
unknown function (ip: 0x7f0b5a4d83cf)
unknown function (ip: 0x7f0b5a60d63d)
unknown function (ip: 0x7f0b5a558c1e)
unknown function (ip: 0x7f0b5a5597bc)
unknown function (ip: 0x7f0b697705fd)
unknown function (ip: 0x7f0b69771b32)
unknown function (ip: 0x7f0b69771c98)
jl_toplevel_eval_in at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b5a2d9c81)
unknown function (ip: 0x7f0b5a03162e)
unknown function (ip: 0x7f0b5a032270)
unknown function (ip: 0x7f0b5a03239a)
unknown function (ip: 0x7f0b5a048ae7)
run_repl at /builddir/julia-1.5.2/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
unknown function (ip: 0x7f0b5a1dbfdf)
unknown function (ip: 0x7f0b5a1dc0b8)
unknown function (ip: 0x7f0b697498c0)
jl_f__apply_latest at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f0b5a2054a7)
unknown function (ip: 0x7f0b5a210b07)
unknown function (ip: 0x7f0b5a2123ae)
unknown function (ip: 0x7f0b5a212505)
unknown function (ip: 0x5600ebef4755)
unknown function (ip: 0x5600ebef4332)
__libc_start_main at /builddir/glibc-2.30/csu/../csu/libc-start.c:308
unknown function (ip: 0x5600ebef43d9)
Allocations: 4838761 (Pool: 4836877; Big: 1884); GC: 6
Aborted
Dear @jpsamaroo,
What is the best source for setting up this library? I tried to follow the basic setup mentioned in the docs, which is a little bit confusing for me; I don't know whether I missed something, but I've been trying hard for an hour already without success. 😞
Could you provide a cleaner list of install instructions?
I would be glad if I could use it; it looks damn promising!
Currently, we only support using HSARuntime.jl as our device runtime, which currently only allows use of this package by Linux users with recent kernels. However, @vchuravy has made the suggestion that we could use OpenCL.jl as an alternative backend, since it supports loading arbitrarily-generated GPU object code. Because AMD OpenCL drivers are available on Windows and Mac systems (and potentially others, like the BSDs), we should be able to use this as a workaround for users on such systems until ROCm is made available.
So that other calls can wait for completion.
Reported by @chriselrod
LLVM and the ROCm device libs expose the necessary functions to access the owning queue for a kernel and place packets on it. We should implement the equivalent of CUDAnative's dynamic_cufunction to make use of this.
I get the following build error on Manjaro:
I added the rocm libraries from arch4edu and replaced the libstdc++.so provided by Julia with the one provided by the system by copying it over. I get the same error with the Julia 1.5.2 binaries and current Julia master.
I set up the environment as follows:
LD_LIBRARY_PATH="/opt/rocm/hsa/lib/"
LD_LIBRARY_PATH="/opt/rocm/lib/:$LD_LIBRARY_PATH"
PATH="/opt/rocm/llvm/bin/:$PATH"
julia startup banner: Version 1.6.0-DEV.1046 (2020-09-25), Commit 55aeb2ff01 (6 days old master)
(@v1.6) pkg> add AMDGPU#master
...
(@v1.6) pkg> build AMDGPU
Building AMDGPU → `~/.julia/packages/AMDGPU/ztzIl/deps/build.log`
ERROR: Error building `AMDGPU`:
: CommandLine Error: Option 'disable-symbolication' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
To be thread-safe, we need at least one lock around all runtime operations which mutate global state (such as DEFAULT_AGENT/DEFAULT_QUEUE).
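A minimal sketch of that locking pattern in plain Julia — `DEFAULT_AGENT` and `get_default_agent!` here are stand-ins for the real runtime state, not the actual AMDGPU.jl internals:

```julia
# Guard lazy initialization of a mutable global default behind a lock so
# concurrent callers cannot race on the check-then-set.
const AGENT_LOCK    = ReentrantLock()
const DEFAULT_AGENT = Ref{Any}(nothing)

function get_default_agent!(make_agent)
    lock(AGENT_LOCK) do
        if DEFAULT_AGENT[] === nothing
            DEFAULT_AGENT[] = make_agent()   # runs at most once
        end
        DEFAULT_AGENT[]
    end
end

# All concurrent callers observe the same, single initialization:
tasks = [Threads.@spawn get_default_agent!(() -> :agent0) for _ in 1:8]
println(all(fetch(t) === :agent0 for t in tasks))
```

The same pattern extends to DEFAULT_QUEUE and any other runtime operation that mutates global state.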
We should be able to pass equivalents of most of CUDAnative's tests, so let's do that!
To support running operations on either the GPU or CPU, since HSAArrays already use "unified memory" by default. Ref: https://github.com/JuliaGPU/CuArrays.jl/issues/303
Hope you complete this package - as I'm sure you know, it fails to compile on OSX.
Error: Error building AMDGPUnative:
│ ERROR: LoadError: Your platform ("x86_64-apple-darwin18.6.0", parsed as "x86_64-apple-darwin14-gcc8-cxx03") is not supported by this package!
│ Stacktrace:
│ [1] error(::String) at ./error.jl:33
│ [2] top-level scope at /Users/cbrown/.julia/packages/AMDGPUnative/gE2NQ/deps/build.jl:26
│ [3] include(::String) at ./client.jl:439
│ [4] top-level scope at none:5
│ in expression starting at /Users/cbrown/.julia/packages/AMDGPUnative/gE2NQ/deps/build.jl:22
└ @ Pkg.Operations /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.4/Pkg/src/Operations.jl:892
The LLVM AMDGPU target has features like XNACK that we might want to enable in certain cases, like wavefront debugging. We should document each known feature and provide a way to set them, globally and/or per-kernel.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers. Please see this post on Discourse for instructions and more details. If you'd like for me to do this for you, comment TagBot fix on this issue. I'll open a PR within a few hours, please be patient!
Typically one cannot statically determine the size and shape (or even type) of the returned value of a function, thus making it hard to trivially support returning values from kernels. Additionally, kernels (in the hardware sense) do not have return values; they just have arguments, which might be mutated. However, with the use of hostcall and some LLVM magic, we should be able to allow kernels to optionally return a value into a buffer which is dynamically allocated at runtime, and returned to the host once the kernel finishes executing.
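The return-buffer idea can be illustrated on the host side with plain Julia (no GPU, no hostcall): the runtime allocates a buffer, the kernel writes its result into it as if it were one more mutated argument, and the host reads it back after completion. `launch_with_return` is a hypothetical name for illustration.

```julia
# Sketch: kernels have no return values, only arguments they may mutate,
# so the "return value" becomes a runtime-allocated buffer argument.
function launch_with_return(kernel!, ::Type{T}, n) where T
    ret = Vector{T}(undef, n)  # stands in for a dynamically allocated device buffer
    kernel!(ret)               # the kernel writes its result into the buffer
    ret                        # handed back to the host once the kernel finishes
end

out = launch_with_return(r -> (r .= 1:4), Int, 4)
println(out)  # [1, 2, 3, 4]
```

The hard part the issue describes is that the real size, shape, and type of `ret` are only known at runtime, which is where hostcall-driven dynamic allocation would come in.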
For debugging purposes, it would be helpful to keep executables around so that they can be inspected.
Currently we invoke ld.lld manually to link our kernel's .o file into the final executable. We should figure out how to embed lld (or call the appropriate C++ functions in LLVM) so that we don't rely on the lld binary existing on the user's system.
For interactions with HIP, which uses an implicit, incrementing device ID similar to CUDA, we should provide functions that can map from HSA's agents to HIP's integer device IDs, and back.
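Since HIP numbers devices 0, 1, 2, ... in enumeration order, a stable ordering of HSA agents yields the mapping in both directions. A hedged plain-Julia sketch (the symbols stand in for real HSA agent handles, and the function names are assumptions):

```julia
# Hypothetical agent <-> HIP device-ID mapping over a stable agent ordering.
agents = [:agent_a, :agent_b, :agent_c]            # stand-ins for HSA agents

hip_id(agent) = findfirst(==(agent), agents) - 1   # agent -> 0-based HIP ID
hip_agent(id) = agents[id + 1]                     # HIP ID -> agent

println(hip_id(:agent_b))   # 1
println(hip_agent(2))       # agent_c
```

A real implementation would need to enumerate agents in exactly the order HIP does (and skip CPU agents) for the IDs to line up.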
When GPU kernels crash or don't do what we expect them to do, it can be very frustrating to figure out what went wrong (especially given that there isn't currently a non-deprecated debugger for ROCm). However, when a kernel crashes, it's possible to still execute code in the kernel context via trap handlers. We should implement support for loading code into the trap handler, to allow the user to get extra information on what went wrong. This would handle cases that software exception handling can't catch, like how gdb can debug a program that just generated a segmentation fault.