
Cricket


Cricket is a virtualization layer for CUDA applications that enables remote execution and checkpoint/restart without the need to recompile applications. Cricket isolates CUDA applications from the CUDA APIs using ONC Remote Procedure Calls, so user code and the CUDA APIs execute in separate processes.

(Figure: the Cricket virtualization layer)

For Cricket to be able to insert the virtualization layer, the CUDA application has to link dynamically against the CUDA APIs. For this, pass -cudart shared to nvcc during linking.
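As a minimal sketch (the source and binary names below are placeholders), linking the CUDA runtime dynamically and checking the result could look like this:

```shell
# Link the CUDA runtime dynamically; app.cu and app are placeholder names
nvcc -cudart shared -o app app.cu

# libcudart should now appear as a shared-library dependency
ldd ./app | grep libcudart
```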

  • For experimental pytorch support see here.
  • For using Cricket from Rust see here.

Supported transports for cudaMemcpy:

  • TCP (slow; works with pageable memory)
  • InfiniBand (fast; requires pinned memory)
  • Shared memory (fastest; requires pinned memory and no remote execution)

Dependencies

Cricket requires

  • CUDA Toolkit (e.g., CUDA 12.1)
  • rpcbind
  • libcrypto
  • libtirpc

libtirpc is built as part of the main Makefile.

On the system where the Cricket server runs, the appropriate NVIDIA drivers must be installed.

Building

git clone https://github.com/RWTH-ACS/cricket.git
cd cricket && git submodule update --init
LOG=INFO make

Environment variables for Makefile:

  • LOG: Log level. One of DEBUG, INFO, WARNING, ERROR.
  • WITH_IB: If set to YES, build with InfiniBand support.
  • WITH_DEBUG: Use gcc debug flags for compilation.
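For example, a debug-friendly build with InfiniBand support might be invoked as follows (assuming the InfiniBand libraries are installed on the build system):

```shell
# Combine Makefile environment variables in one invocation
LOG=DEBUG WITH_IB=YES WITH_DEBUG=YES make -j"$(nproc)"
```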

Running a CUDA Application

By default, Cricket uses TCP/IP as the transport for the Remote Procedure Calls. This enables both remote execution, where server and client run on different systems, and local execution, where both run on the same system. For Cricket to work, the CUDA libraries must be linked dynamically to the CUDA application. For the runtime library, this can be done using the '-cudart shared' flag of nvcc.

The Cricket client library has to be preloaded into the CUDA application. To start the server:

<path-to-cricket>/bin/cricket-rpc-server [optional rpc id]

The client can be started like this:

CRICKET_RPCID=[optional rpc id] REMOTE_GPU_ADDRESS=<address-of-server> LD_PRELOAD=<path-to-cricket>/bin/cricket-client.so <cuda-binary>

Example: Running a test application locally

/opt/cricket/bin/cricket-rpc-server
REMOTE_GPU_ADDRESS=127.0.0.1 LD_PRELOAD=/opt/cricket/bin/cricket-client.so /opt/cricket/tests/test_kernel

Example: Running the nbody CUDA sample using Cricket on a remote system

Compile the application

cd /nfs_share/cuda/samples/5_Simulations/nbody
make NVCCFLAGS="-m64 -cudart shared" GENCODE_FLAGS="-arch=sm_61"

Start the Cricket server

/opt/cricket/bin/cricket-rpc-server

Run the application

REMOTE_GPU_ADDRESS=remoteSystem.my-domain.com LD_PRELOAD=/nfs_share/cricket/bin/cricket-client.so /nfs_share/cuda/samples/5_Simulations/nbody/nbody -benchmark

Contributing

File structure

  • cpu: The virtualization layer
  • gpu: Experimental in-kernel checkpoint/restart
  • submodules: Submodules are located here.
    • cuda-gdb: Modified GDB for use with CUDA; only required for in-kernel checkpoint/restart
    • libtirpc: Transport Independent Remote Procedure Calls; required for the virtualization layer
  • tests: Various CUDA applications to test Cricket
  • utils: A Dockerfile for our CI

Please agree to the DCO by signing off your commits.
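Signing off is done with git's -s/--signoff flag, which appends a Signed-off-by trailer (taken from your git config) to the commit message; the message below is a placeholder:

```shell
# -s appends "Signed-off-by: Your Name <your@email>" to the commit message
git commit -s -m "cpu: fix typo in log message"
```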

Publications

Eiling et al.: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support. Concurrency and Computation: Practice and Experience. 2022. https://doi.org/10.1002/cpe.6474

Eiling et al.: Checkpoint/Restart for CUDA Kernels. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W '23). 2023. ACM. https://doi.org/10.1145/3624062.3624254

Eiling et al.: GPU Acceleration in Unikernels Using Cricket GPU Virtualization. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W '23). 2023. ACM. https://doi.org/10.1145/3624062.3624236

Eiling et al.: An Open-Source Virtualization Layer for CUDA Applications. In Euro-Par 2020: Parallel Processing Workshops. 2021. Lecture Notes in Computer Science, vol 12480. Springer. https://doi.org/10.1007/978-3-030-71593-9_13

Acknowledgments

Funded by the European Union—NextGenerationEU. Sponsored by the Federal Ministry of Education and Research.


cricket's Issues

Segmentation violation during checkpoint restore (cpu mode)

Hi!

I want to try Cricket for C/R in CPU mode (no in-kernel checkpointing). However, restore fails with a segfault.

(gdb) r
Starting program: /home/alexndrfrolov/cricket/cpu/cricket-rpc-server 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
welcome to cricket!
+03:00:00.000003 INFO:  restoring previous state was enabled by setting CRICKET_RESTORE
+03:00:00.000146 DEBUG: restoring rpc_id from ckp/rpc_id
+03:00:00.000189 DEBUG: using prog=99, vers=1   in cpu-server.c:220
+03:00:00.000200 INFO:  using TCP...
+03:00:00.000766 INFO:  listening on port 49338
+03:00:00.001007 DEBUG: sched_none_init
[New Thread 0x7fffb47ff000 (LWP 2666702)]
+03:00:00.673881 DEBUG: restoring api records from ckp/api_records
+03:00:00.673948 DEBUG: function: 50 

Thread 1 "cricket-rpc-ser" received signal SIGSEGV, Segmentation fault.
0x00007fffb8381b1d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffb8381b1d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffb824dd31 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000055555557c922 in loggf (level=3 '\003', formatstr=0x5555555973d8 "rpc_register_function(fatCubinHandle: %p, hostFun: %p, deviceFun: %s, deviceName: %s, thread_limit: %d)") at log.c:98
#3  0x000055555557a38f in rpc_register_function_1_svc (fatCubinHandle=94419555140752, hostFun=94419554144212, deviceFun=0x56340424fd90 <error: Cannot access memory at address 0x56340424fd90>, 
    deviceName=0x5634044f2e90 <error: Cannot access memory at address 0x5634044f2e90>, thread_limit=-1, result=0x555555975c50, rqstp=0x7fffffffdfc0) at cpu-server-driver.c:111
#4  0x0000555555564913 in _rpc_register_function_1 (argp=0x555555972740, result=0x555555975c50, rqstp=0x7fffffffdfc0) at cpu_rpc_prot_svc_mod.c:46
#5  0x0000555555583534 in cr_call_record (record=0x555555974830) at cr.c:714
#6  0x0000555555583889 in cr_restore_resources (path=0x5555555963fb "ckp", record=0x555555974830, rm_memory=0x5555555a5d60 <rm_memory>, rm_streams=0x5555555a5ac0 <rm_streams>, rm_events=0x5555555a5c40 <rm_events>, 
    rm_arrays=0x5555555a61a0 <rm_arrays>, rm_cusolver=0x5555555a5be0 <rm_cusolver>, rm_cublas=0x5555555a5e80 <rm_cublas>) at cr.c:772
#7  0x0000555555583d55 in cr_restore (path=0x5555555963fb "ckp", rm_memory=0x5555555a5d60 <rm_memory>, rm_streams=0x5555555a5ac0 <rm_streams>, rm_events=0x5555555a5c40 <rm_events>, rm_arrays=0x5555555a61a0 <rm_arrays>, 
    rm_cusolver=0x5555555a5be0 <rm_cusolver>, rm_cublas=0x5555555a5e80 <rm_cublas>) at cr.c:870
#8  0x00005555555710c1 in server_runtime_restore (path=0x5555555963fb "ckp") at cpu-server-runtime.c:141
#9  0x0000555555570e3b in server_runtime_init (restore=1) at cpu-server-runtime.c:87
#10 0x000055555556ed54 in cricket_main (prog_num=99, vers_num=1) at cpu-server.c:284
#11 0x0000555555592752 in main (argc=1, argv=0x7fffffffe3d8) at server-exe.c:11

After a little debugging, I found that the problem comes from using rpc_register_function_1_svc during the restore process (see the gdb trace). The comments say it does not support checkpoint/restore, but I have not found a way to avoid it, because it is called from __cudaRegisterFunction on the client side.

Does this mean that C/R does not currently work in Cricket's CPU mode? Thank you!

svc_register failure

I have compiled the code. However, when I run the following command, the program fails in svc_register:

LD_LIBRARY_PATH=/data/cricket/submodules/libtirpc/install/lib LD_PRELOAD=./bin/cricket-server.so ./tests/bin/matrixMul
[root@ef4ef0c6e8b4 cricket]# LD_LIBRARY_PATH=/data/cricket/submodules/libtirpc/install/lib LD_PRELOAD=./bin/cricket-server.so ./tests/bin/matrixMul
05/28/23 15:34:42.373892 (INFO):        using TCP...
05/28/23 15:34:42.374555 (INFO):        listening on port 56469
05/28/23 15:34:42.374728   ERROR: unable to register (RPC_PROG_PROG, RPC_PROG_VERS), ret: 0.    in cpu-server.c(245)
svc_register: Cannot assign requested address # this line is printed by my perror function

I ran the command in a privileged docker container.

Generating bindings for Cricket - missing headers

Hello,

I'm working on generating Go bindings for Cricket, but I can't do anything without access to some headers that are listed in .gitignore, namely cpu_rpc_prot.h.

Out of curiosity, why are these excluded? Also, can I work around this by using cpu_rpc_prot.x? I'm not really sure what the .x file extension is or what the file itself does.

Thanks!
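For context, .x files are ONC RPC interface definitions consumed by rpcgen, and headers like cpu_rpc_prot.h are generated from them at build time, which is presumably why they are gitignored. A sketch of the generation step, with flags mirroring how the project's build invokes rpcgen:

```shell
# Generate the C header and the XDR (de)serialization code from the RPC spec
rpcgen -C -M -N -h -o cpu_rpc_prot.h cpu_rpc_prot.x
rpcgen -C -M -N -c -o cpu_rpc_prot_xdr.c cpu_rpc_prot.x
```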

run llama.cpp failed

Using Cricket, I can run nbody successfully, but llama.cpp fails in a virtual machine.

liucong@liucong:~$ ldd nbody 
	linux-vdso.so.1 (0x00007ffed65af000)
	libGL.so.1 => /lib/x86_64-linux-gnu/libGL.so.1 (0x000074b5e1d24000)
	libGLU.so.1 => /lib/x86_64-linux-gnu/libGLU.so.1 (0x000074b5e1cce000)
	libglut.so.3 => /lib/x86_64-linux-gnu/libglut.so.3 (0x000074b5e1c81000)
	libcudart.so.11.0 => /lib/x86_64-linux-gnu/libcudart.so.11.0 (0x000074b5e1800000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000074b5e1400000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000074b5e1b98000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000074b5e1b78000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000074b5e1000000)
	libGLdispatch.so.0 => /lib/x86_64-linux-gnu/libGLdispatch.so.0 (0x000074b5e1ac0000)
	libGLX.so.0 => /lib/x86_64-linux-gnu/libGLX.so.0 (0x000074b5e17cc000)
	libOpenGL.so.0 => /lib/x86_64-linux-gnu/libOpenGL.so.0 (0x000074b5e17a0000)
	libX11.so.6 => /lib/x86_64-linux-gnu/libX11.so.6 (0x000074b5e1660000)
	libXi.so.6 => /lib/x86_64-linux-gnu/libXi.so.6 (0x000074b5e1aaa000)
	libXxf86vm.so.1 => /lib/x86_64-linux-gnu/libXxf86vm.so.1 (0x000074b5e1aa3000)
	/lib64/ld-linux-x86-64.so.2 (0x000074b5e1e07000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000074b5e165b000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000074b5e1656000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x000074b5e1651000)
	libxcb.so.1 => /lib/x86_64-linux-gnu/libxcb.so.1 (0x000074b5e13d6000)
	libXext.so.6 => /lib/x86_64-linux-gnu/libXext.so.6 (0x000074b5e163a000)
	libXau.so.6 => /lib/x86_64-linux-gnu/libXau.so.6 (0x000074b5e1634000)
	libXdmcp.so.6 => /lib/x86_64-linux-gnu/libXdmcp.so.6 (0x000074b5e162c000)
	libbsd.so.0 => /lib/x86_64-linux-gnu/libbsd.so.0 (0x000074b5e13be000)
	libmd.so.0 => /lib/x86_64-linux-gnu/libmd.so.0 (0x000074b5e13af000)

liucong@liucong:~$ REMOTE_GPU_ADDRESS=10.42.143.229 LD_PRELOAD=/home/liucong/cricket-client.so ./nbody --benchmark
+08:00:00.000007 INFO:	connection to host "10.42.143.229"
+08:00:00.000239 INFO:	connecting via TCP...
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ada" with compute capability 8.9

> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4060 Ti]
34816 bodies, total time for 10 iterations: 5123.033 ms
= 2.366 billion interactions per second
= 47.322 single-precision GFLOP/s at 20 flops per interaction
+08:00:05.780127 INFO:	api-call-cnt: 76
+08:00:05.780172 INFO:	memcpy-cnt: 3342360
liucong@liucong:~$ cd llama.cpp_cuda/
liucong@liucong:~/llama.cpp_cuda$ ldd build/bin/llama-server 
	linux-vdso.so.1 (0x00007ffcaf1d4000)
	libllama.so => /home/liucong/llama.cpp_cuda/build/src/libllama.so (0x00007e91d900c000)
	libggml.so => /home/liucong/llama.cpp_cuda/build/ggml/src/libggml.so (0x00007e91cb000000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007e91cac00000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007e91d8f16000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007e91d8ef6000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007e91ca800000)
	/lib64/ld-linux-x86-64.so.2 (0x00007e91d931c000)
	libcudart.so.11.0 => /lib/x86_64-linux-gnu/libcudart.so.11.0 (0x00007e91ca400000)
	libcublas.so.11 => /lib/x86_64-linux-gnu/libcublas.so.11 (0x00007e91c0800000)
	libcuda.so.1 => /lib/x86_64-linux-gnu/libcuda.so.1 (0x00007e91bea00000)
	libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007e91d8eaa000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007e91d8ea3000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007e91d8e9e000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007e91d8e99000)
	libcublasLt.so.11 => /lib/x86_64-linux-gnu/libcublasLt.so.11 (0x00007e91a9400000)

liucong@liucong:~/llama.cpp_cuda$ REMOTE_GPU_ADDRESS=10.42.143.229 LD_PRELOAD=/home/liucong/cricket-client.so ./build/bin/llama-server -m /home/liucong/tinygrad/weights/gemma-1.1-7b-it.Q4_K_M.gguf -ngl 33 --host 0.0.0.0
+17:24:50.965811 ERROR: section .nv.info. not found	in cpu-elf2.c:687
+17:24:50.965964 ERROR: get_parm_for_kernel failed for kernel _Z7acc_f32PKfS0_Pfiiiiiii	in cpu-elf2.c:993
corrupted size vs. prev_size
Aborted (core dumped)
liucong@liucong:~/llama.cpp_cuda$ 

Question about gpu checkpoint/restore

Hi, the CPU part records cuda_launch_kernel, and Cricket can relaunch it in restore mode. Why does the GPU code part need to checkpoint/restore per-SM/warp state such as the PC and call stack?
If we only checkpoint/restore GPU global memory and relaunch the kernel, would that theoretically work?

compile error

There are a lot of errors like this:
src/main.c:790:1: error: jump to label ‘detach’
790 | detach:
| ^~~~~~
src/main.c:263:14: note: from here
263 | goto detach;
| ^~~~~~
src/main.c:302:10: note: crosses initialization of ‘bool found_callstack’
302 | bool found_callstack = false;
| ^~~~~~~~~~~~~~~
src/main.c:299:14: note: crosses initialization of ‘uint64_t start_address’
299 | uint64_t start_address = 0;
| ^~~~~~~~~~~~~
src/main.c:298:14: note: crosses initialization of ‘uint64_t cur_address’
298 | uint64_t cur_address = 0;
| ^~~~~~~~~~~
src/main.c:297:14: note: crosses initialization of ‘uint32_t predicate’
297 | uint32_t predicate = 1;
| ^~~~~~~~~
src/main.c:790:1: error: jump to label ‘detach’
790 | detach:

Pytorch not working with CUDA 11.2 and CUDA 11.7

Hi, I have some problems running Cricket with PyTorch. I have pulled the latest code and built PyTorch locally with the changes mentioned in the docs.

My CUDA is 11.2 and cuDNN is 8.9.2 on a Tesla P4, but I get this problem:

server:
+08:01:00.423212 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445168 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.445403 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.447247 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:00.448076 WARNING: duplicate resource! The first resource will be overwritten in resource-mg.c:145
+08:01:07.164339 ERROR: cuda_device_prop_result size mismatch in cpu-server-runtime.c:367
+08:02:22.370950 INFO: RPC deinit requested.
+08:08:54.324012 INFO: have a nice day!
client:
+08:00:36.417392 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.418684 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
+08:00:36.420058 WARNING: could not find .nv.info section. This means this binary does not contain any kernels. in cpu-elf2.c:922
call failed: RPC: Timed out
call failed: RPC: Timed out
call failed: RPC: Timed out
+08:02:01.851255 ERROR: something went wrong in cpu-client-runtime.c:444
Traceback (most recent call last):
File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/init.py", line 242, in _lazy_init
queued_call()
File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/init.py", line 125, in _check_capability
capability = get_device_capability(d)
File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/init.py", line 357, in get_device_capability
prop = get_device_properties(device)
File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/init.py", line 375, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 39, in
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/init.py", line 246, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error:

CUDA call was originally invoked at:

[' File "/home/lwh/cricket/tests/test_apps/pytorch_minimal.py", line 31, in \n import torch\n', ' File "", line 991, in _find_and_load\n', ' File "", line 975, in _find_and_load_unlocked\n', ' File "", line 671, in _load_unlocked\n', ' File "", line 843, in exec_module\n', ' File "", line 219, in _call_with_frames_removed\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/init.py", line 798, in \n _C._initExtension(manager_path())\n', ' File "", line 991, in _find_and_load\n', ' File "", line 975, in _find_and_load_unlocked\n', ' File "", line 671, in _load_unlocked\n', ' File "", line 843, in exec_module\n', ' File "", line 219, in _call_with_frames_removed\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/init.py", line 179, in \n _lazy_call(_check_capability)\n', ' File "/root/anaconda3/envs/py3.8/lib/python3.8/site-packages/torch/cuda/init.py", line 177, in _lazy_call\n _queued_calls.append((callable, traceback.format_stack()))\n']
+08:02:27.007890 ERROR: call failed. in cpu-client.c:213
+08:02:27.012036 INFO: api-call-cnt: 14
+08:02:27.012051 INFO: memcpy-cnt: 0

Is my CUDA version wrong, or is there another reason?

Originally posted by @Tlhaoge in #6 (comment)

error registering fatbin: 222 in cpu-client.c:380

Hello, the following error occurred when I tried to run the test program cricket.testapp.
(screenshot omitted)
But when I run it directly without using the server, it runs successfully.
Is it a problem with my driver version?
(screenshots omitted)
Thank you so much!

checkpoint/restore problem

When I use the test script for C/R, the output is:

[New Thread 0x7f37c96c5000 (LWP 38397)]
[New Thread 0x7f37a12e0000 (LWP 38398)]
[Detaching after fork from child process 38399]
[Thread 0x7f37c96c5000 (LWP 38397) exited]

Thread 1 "kernel.testapp" received signal SIGURG, Urgent I/O condition.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]                                                                                    
0x00007f37a2fae0a0 in kernel(unsigned short*, unsigned short*, unsigned short*, char, short, int, long long)<<<(1,1,1),(32,1,1)>>> ()                                                        
[DEBUG] gdb_init out.
Cuda api initialized and attached!
got API
Device "NVIDIA A100-SXM4-80GB":
        index: 0
        type: "GA100GL-A"
        SM type: "sm_80"
        lanes: 32
        predicates 8
        registers: 255
        SMs: 108
        warps: 64

checkpointing kernel with name: "_Z6kernelPtS_S_csix"
stack-size: 336, param-addr: 352, param-size: 40, param-num: 7
SM 0: 1 - 0000000000000000000000000000000000000000000000000000000000000001
03/12/23 10:42:43.888585   DEBUG: relative 6a0, virtual 7f37a2fae0a0    in /cricket-cr.c(670)
SM 0 warp 0 (active): 55555555 - 01010101010101010101010101010101
SM 0 warp 0 (valid): ffffffff - 11111111111111111111111111111111
03/12/23 10:42:43.888637   DEBUG: function "_Z6kernelPtS_S_csix" has no room (0 slots)  in /cricket-cr.c(831)                                                                                
03/12/23 10:42:43.888647   ERROR: There is no room in the top level function (i.e. the kernel). This kernel can thus never be restored!        in /cricket-cr.c(835)                         
cricket-checkpoint: could not make checkpointable.

Thread 1 "kernel.testapp" received signal SIGURG, Urgent I/O condition.
0x00007ffed552baea in clock_gettime ()
[Inferior 1 (process 38378) detached]

Compile error: invalid application of ‘sizeof’ to incomplete type ‘struct cudaArraySparseProperties’

hi, I get the following error when compiling:
...
make -C cpu
make[1]: Entering directory '/root/bihui/cricket/cpu'
rpcgen -C -M -N -c -o cpu_rpc_prot_xdr.c cpu_rpc_prot.x
rpcgen -C -M -N -h -o cpu_rpc_prot.h cpu_rpc_prot.x
gcc -std=gnu99 -I../submodules/libtirpc/install/include/tirpc -I/usr/local/cuda/include -c -fpic -o cpu_rpc_prot_xdr.o cpu_rpc_prot_xdr.c -L../submodules/libtirpc/install/lib -L/usr/local/cuda/lib64 -ltirpc -ldl -lcrypto
rpcgen -C -M -N -m -o cpu_rpc_prot_svc.c cpu_rpc_prot.x
./generate_dispatch.sh
gcc -std=gnu99 -I../submodules/libtirpc/install/include/tirpc -I/usr/local/cuda/include -c -fpic -o cpu_rpc_prot_svc_mod.o cpu_rpc_prot_svc_mod.c -L../submodules/libtirpc/install/lib -L/usr/local/cuda/lib64 -ltirpc -ldl -lcrypto
gcc -std=gnu99 -I../submodules/libtirpc/install/include/tirpc -I/usr/local/cuda/include -c -fpic -o cpu-server.o cpu-server.c -L../submodules/libtirpc/install/lib -L/usr/local/cuda/lib64 -ltirpc -ldl -lcrypto
gcc -std=gnu99 -I../submodules/libtirpc/install/include/tirpc -I/usr/local/cuda/include -c -fpic -o cpu-utils.o cpu-utils.c -L../submodules/libtirpc/install/lib -L/usr/local/cuda/lib64 -ltirpc -ldl -lcrypto
gcc -std=gnu99 -I../submodules/libtirpc/install/include/tirpc -I/usr/local/cuda/include -c -fpic -o cpu-server-runtime.o cpu-server-runtime.c -L../submodules/libtirpc/install/lib -L/usr/local/cuda/lib64 -ltirpc -ldl -lcrypto
cpu-server-runtime.c: In function ‘cuda_device_get_texture_lmw_1_svc’:
cpu-server-runtime.c:235:19: warning: implicit declaration of function ‘cudaDeviceGetTexture1DLinearMaxWidth’ [-Wimplicit-function-declaration]
 result->err = cudaDeviceGetTexture1DLinearMaxWidth(&result->u64_result_u.u64,
cpu-server-runtime.c: In function ‘cuda_event_record_with_flags_1_svc’:
cpu-server-runtime.c:677:15: warning: implicit declaration of function ‘cudaEventRecordWithFlags’; did you mean ‘cudaEventCreateWithFlags’? [-Wimplicit-function-declaration]
 *result = cudaEventRecordWithFlags(
cpu-server-runtime.c: In function ‘cuda_array_get_sparse_properties_1_svc’:
cpu-server-runtime.c:931:14: error: invalid application of ‘sizeof’ to incomplete type ‘struct cudaArraySparseProperties’
 sizeof(struct cudaArraySparseProperties);
cpu-server-runtime.c:933:14: error: invalid application of ‘sizeof’ to incomplete type ‘struct cudaArraySparseProperties’
 sizeof(struct cudaArraySparseProperties));
cpu-server-runtime.c:936:19: warning: implicit declaration of function ‘cudaArrayGetSparseProperties’; did you mean ‘cuda_array_get_sparse_properties_1’? [-Wimplicit-function-declaration]
 result->err = cudaArrayGetSparseProperties(
make[1]: *** [cpu-server-runtime.o] Error 1
make[1]: Leaving directory '/root/bihui/cricket/cpu'
make: *** [cpu] Error 2

How to solve this?

Question about cuda test samples

Firstly, thank you for this awesome project!
I have some doubts about the CUDA samples. In the first example, LD_PRELOAD=/opt/cricket/bin/cricket-server.so /opt/cricket/tests/test_kernel, you start the server with test_kernel. However, I can't find any info about test_kernel. Is it a generic placeholder for any test case, or can we obtain it some other way?

compile error

hi, I get the following error when compiling:
In file included from ../../gdb/python/py-arch.c:24:0:
../../gdb/python/python-internal.h:232:63: error: ‘PyObject* (*PyUnicode_FromFormat)(const char*, ...)’ redeclared as different kind of symbol
 extern PyObject * (gdbpy_StringFromFormat) (const char *, ...);
                                                               ^
In file included from /root/anaconda3/include/python3.7m/Python.h:80:0,
                 from ../../gdb/python/python-internal.h:88,
                 from ../../gdb/python/py-arch.c:24:
/root/anaconda3/include/python3.7m/unicodeobject.h:879:24: error: previous declaration of ‘PyObject* PyUnicode_FromFormat(const char*, ...)’
 PyAPI_FUNC(PyObject *) PyUnicode_FromFormat(

How to solve this?

Intrinsic kernel issued error in CUDA 12 environment.

First of all, this is a great project, and I'm doing some experiments based on Cricket in a CUDA 12 environment. Thank you for all the contributions.
Recently, I got an error when using the __dsqrt_rn() intrinsic in my kernel.
Here is my error message:

+16:00:00.210589 ERROR: section .nv.info.__internal_7_ not found        in cpu-elf2.c:687
+16:00:00.210594 ERROR: get_parm_for_kernel failed for kernel $__internal_7_$__cuda_sm20_dsqrt_rn_f64_mediumpath_v1     in cpu-elf2.c:984
+16:00:00.210597 ERROR: error getting parameter info    in cpu-elf2.c:508
+16:00:00.210598 ERROR: error getting fatbin info       in cpu-client.c:366

I dug in a little and found that it's because __dsqrt_rn creates the symbol $__internal_7_$__cuda_sm20_dsqrt_rn_f64_mediumpath_v1 under the .nv.info section, just like a noinline device function. But there is no section like .nv.info.__internal_7.......; it seems my nvcc only creates sections with the .nv.info prefix for __global__ functions, not for CUDA intrinsics or device functions.
This issue is caused by https://github.com/RWTH-ACS/cricket/blob/master/cpu/cpu-elf2.c#L636

My current workaround is to skip the get_parm_for_kernel call in cpu-elf2.c for kernel names starting with $, as in line https://github.com/RWTH-ACS/cricket/blob/master/cpu/cpu-elf2.c#L983:

        if((ki->name)[0]!='$'){
            if (get_parm_for_kernel(elf, ki, memory, memsize) != 0) {
                LOGE(LOG_ERROR, "get_parm_for_kernel failed for kernel %s", kernel_str);
                goto cleanup;
            }
        }

but I'm not sure if this is reasonable, because I'm new to this project and don't know whether it's okay to have no parameter information for a CUDA intrinsic. After this workaround, my code works anyway.

So, I'm not sure if this is caused by the ELF file format in newer versions of NVCC, or by some other reason. I don't have another CUDA environment to test with.

Following is my CUDA codes:

#include <stdio.h>

#define TYPE double
__device__ void my_func(TYPE* a) {
        atomicAdd(a, 1);
        __syncthreads();
}
__global__ void cuda_hello(TYPE* a){
        //printf("Hello World from GPU!\n");
        *a = 99.0;
        atomicAdd(a, 1);
        *a = __dsqrt_rn(*a);
        my_func(a);
         //__syncthreads();
}

int main() {
        printf("Hellow World from CPU\n");
        TYPE *a;
        TYPE *d_a;
        a = (TYPE*)malloc(sizeof(TYPE) * 1);
        *a = 0;
        printf("before: %f\n", *a);
        cudaMalloc((void**)&d_a, sizeof(TYPE) * 1);
        cudaMemcpy(d_a, a, sizeof(TYPE) * 1, cudaMemcpyHostToDevice);

        printf("-----launching kernel----\n");
        cuda_hello<<<1,1>>>(d_a);
        cudaMemcpy(a, d_a, sizeof(TYPE) * 1, cudaMemcpyDeviceToHost);

        printf("after: %f\n", *a);
        return 0;
}

How to get and use CUDA Debugger API

The in-kernel checkpoint/restart relies on the CUDA Debugger API, so I tried to read the Debugger API section of the CUDA Toolkit Documentation (nvidia.com), but found it unclearly documented, especially how to initialize the API and how to attach to and detach from a program. I made some attempts but never successfully initialized the CUDA Debugger API. I also looked into the code of cuda-gdb and found that it doesn't use the CUDA Debugger API, but implements its own set of debugger APIs by communicating directly with the driver and filling its own functions into the CUDBGAPI_st struct.

I looked into the code of this project and found that it calls gdb_init and then just calls cudbgGetAPI. I'm not sure what is happening here. How do you integrate a part of cuda-gdb into your project and get the CUDA Debugger API? Could you please explain this (and anything about the CUDA Debugger API) for me? Is the raw CUDA Debugger API the same as the cuda-gdb implementation, or can it not work without cuda-gdb?

Segmentation fault for a test cuda program

I was testing the following CUDA program:

// add.cu
#include <iostream>
#include <math.h>

// function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for(int i=index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1)/blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << " N=" <<N<<std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
nvcc add.cu -cudart shared

But there's a segmentation fault when I run it:

LD_PRELOAD=./bin/cricket-server.so ./a.out
01/20/23 14:02:01.480782 (INFO):	using TCP...
01/20/23 14:02:01.482162 (INFO):	listening on port 57051
01/20/23 14:02:01.482537 (INFO):	found CUDA initialization function at 0x401222
01/20/23 14:02:01.699124 (INFO):	waiting for RPC requests...
01/20/23 14:02:37.322563 (INFO):	string: "hello"
REMOTE_GPU_ADDRESS=127.0.0.1 LD_PRELOAD=./bin/cricket-client.so ./a.out
01/20/23 14:02:37.319732 (INFO):	connection to host "127.0.0.1"
01/20/23 14:02:37.319890    INFO: test
	in cpu-client.c(75)
01/20/23 14:02:37.320907 (INFO):	connecting via TCP...
__cudaRegisterFunction(fatCubinHandle=(nil), hostFun=0x40119e, devFunc=_Z3addiPfS_, deviceName=_Z3addiPfS_, thread_limit=-1, tid=[(nil)], bid=[(nil)], bDim=[(nil)], gDim=[(nil)], wSize=(nil))
01/20/23 14:02:37.325969   ERROR: stat on (null) failed.	in cpu-utils.c(337)
Segmentation fault

I am able to successfully run the ./tests/bin/nbody test program. Wondering what might be the issue here?

Hard-coded path in Checkpoint/Restart initialization method.

Though it doesn't seem to crash the application immediately, there are some hard-coded paths in the debug code that initializes the checkpoint/restart module. For other users, this could cause undefined behavior or crash the application.

execute_command("set exec-wrapper env 'LD_PRELOAD=/home/eiling/projects/cricket/bin/libtirpc.so.3:/home/eiling/projects/cricket/cpu/cricket-server.so'", !batch_flag);

It should be possible to move the debug library directory to an environment variable pretty easily. Or is there any reason to keep this path hard-coded?

Failed to run test program

Hello, I have compiled successfully, but an error occurred when running the test program. I need your help.

CUDA version: 11.8
cuDNN version: 8.9.03

server: ./bin/cricket-rpc-server

welcome to cricket!
+08:00:00.000003 INFO:  using TCP...
+08:00:00.000967 INFO:  listening on port 56951
+08:00:00.121489 INFO:  waiting for RPC requests...
+08:00:08.068937 INFO:  string: "hello"

+08:00:08.070386 ERROR: cuModuleLoadData failed: 209    in cpu-server-driver.c:47
+08:00:08.070760 ERROR: (nil) not found in resource manager - we cannot call a function from an unknown module. in cpu-server-driver.c:115
+08:00:08.071011 INFO:  RPC deinit requested.

client: REMOTE_GPU_ADDRESS=127.0.0.1 LD_PRELOAD=./bin/cricket-client.so ./bin/tests/kernel.testapp

+08:00:00.000007 INFO:  connection to host "127.0.0.1"
+08:00:00.000364 INFO:  connecting via TCP...
+08:00:00.003835 ERROR: error registering fatbin: 209   in cpu-client.c:380
+08:00:00.004118 ERROR: error registering function: -1  in cpu-client.c:344
+08:00:00.004158 WARNING: fatCubinHandle is NULL - so we have nothing to unload. (This is okay if this binary does not contain a kernel.)in cpu-client.c:398
+08:00:00.004376 INFO:  api-call-cnt: 0
+08:00:00.004409 INFO:  memcpy-cnt: 0

When I compile and test on two servers, I get the same error. Is there any solution for this situation? Thank you so much

Question about c/r in gpu part

I have read through the GPU-part code and found that the binary file is patched to create a jump table during the restore phase.

I think this is for restoring the CRS stack of the warps at the point where the partially executed kernel was checkpointed. Is that right?

compile and runtime dockerfile

I failed to compile this project on my local machine with CUDA.
Could you provide a Dockerfile that shows the compile prerequisites?

restore kernel.testapp but result mismatch

Hi, thank you for this awesome project!
I am testing the GPU checkpoint/restore functionality.
CUDA version: 11.4, GPU: Tesla P4

I can successfully create a GPU checkpoint using the script ./tests/gpu/checkpoint.sh.

When restoring with ./restore.sh /root/cricket/tests/gpu/../../tests/test_apps/kernel.testapp, the results do not match.
Could you help me figure out what went wrong? Thanks.

[root@VM-104-13-tencentos gpu]# ./restore.sh /root/cricket/tests/gpu/../../tests/test_apps/kernel.testapp
using /root/cricket/tests/gpu/../../tests/test_apps/kernel.testapp
kernel.testapp: no process found
patched_binary: no process found
10/08/23 21:01:10.897894 WARNING: no room in ".text._Z15kernel_no_paramv" available     in /cricket-elf.c(1037)
patch /tmp/cricket-ckp/patched_binary @ 12496 with data of size 144
Temporary breakpoint 1 at 0x401b62
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 230609]
[New Thread 0x7fffc297e000 (LWP 230618)]
elapsed time: 0.051518
found 1 devices
A[2048]={15183, 5909, 39324, 16294, 25624, 63281, 1133, 9314, 52216, 12031, 2445, 21597, 13509, 26926, 26717, 17449, 43697, 9177, 38203, 37564, 12105, 14935, 31848, 46153, 29089, 43698, 2093, 2036, 22994, 54249, 10114, 26076, ... }
x[64]={17933, 27638, 54613, 24189, 9017, 1175, 50149, 57193, 19645, 55487, 22299, 17105, 43083, 11672, 34953, 11345, 33499, 59732, 33792, 8338, 37813, 38540, 51229, 58687, 14390, 5712, 18699, 57188, 8470, 21585, 22370, 26403, }
About to initialize CUDA context...
[New Thread 0x7fffc217d000 (LWP 230619)]
warning: Breakpoint address adjusted from 0x00a5a5a0 to 0x00a5a5a8.
warning: Breakpoint address adjusted from 0x00a5a5a0 to 0x00a5a5a8.
dev_A = 0x7fff9d200000
CUDA context initialized
Mallocs done
Kernel Test: Create a checkpoint now!
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
warning: Breakpoint 1 address previously adjusted from 0x00a5a5a0 to 0x00a5a5a8.

Thread 1 "patched_binary" hit Temporary breakpoint 1, 0x0000000000a5a5a8 in kernel(unsigned short*, unsigned short*, unsigned short*, char, short, int, long long)<<<(1,1,1),(32,1,1)>>> ()
cricket: identified device:
Device "Tesla P4":
        index: 0
        type: "GP104GL-A"
        SM type: "sm_61"
        lanes: 32
        predicates 8
        registers: 256
        SMs: 20
        warps: 64

cricket: all warps are now stopped!
start_address: 50, virt: a5f1f8, relative: d38
rb a5a5a8
        rb address: a5a5a8
sm 0: resuming warps 1 until a5a5e8
        rb address: a5a5e8
SMs at jmptable
sm 0
        warp 0
        cur_address: 50, rb address: 50
                c_level: 0, fn: _Z6kernelPtS_S_csix
                restoring ssy
10/08/23 21:01:11.663916 (INFO):        restored ssy
        cur_address: 78, rb address: 78
                success (ssy)
        success (sync) (d8 = d38) 
        success (sync2) (d8 = d8) 
valid: ffffffff, active: aaaaaaaa, goal: aaaaaaaa
        jumping to checkpointed PC d38
        success (pc) (d38 = d38) 
sm 1
sm 2
sm 3
sm 4
sm 5
sm 6
sm 7
sm 8
sm 9
sm 10
sm 11
sm 12
sm 13
sm 14
sm 15
sm 16
sm 17
sm 18
sm 19
cricket: stack-size: 336, param-addr: 320, param-size: 40
cricket: restored global variables
10/08/23 21:01:11.718882 WARNING: no checkpoint file for parameter 0    in /cricket-cr.c(1123)
10/08/23 21:01:11.718909 WARNING: no checkpoint file for parameter 1    in /cricket-cr.c(1123)
10/08/23 21:01:11.718928 WARNING: no checkpoint file for parameter 2    in /cricket-cr.c(1123)
10/08/23 21:01:11.718946 WARNING: no checkpoint file for parameter 6    in /cricket-cr.c(1123)
cricket: restored parameters
cricket: restored shared memory
server pid: 230604
cricket: restored warp D0S0W0
complete time:
        PROFILE patch: 0.000453 s
        PROFILE runattach: 0.703087 s
        PROFILE tojmptbl: 0.716825 s
        PROFILE jmptbl: 0.103585 s
        PROFILE globals: 0.000549 s
        PROFILE inkernel 0.863229 s
        PROFILE complete: 1.684641 s
resuming device...
0x00007fffe7d3f9e0 in ?? () from /lib64/libcuda.so.1
[Inferior 1 (process 230604) detached]
sudo: /home/eiling/tmp/criu/criu/criu: command not found
[root@VM-104-13-tencentos gpu]# dev_res: 0x7fff9d200a00
continuing
Result:
41978 103 0 0 0 0 0 0 56 0 1 0 1 0 1 0 44576 103 0 0 35040 103 0 0 0 0 0 0 3216 103 0 0 
mismatch: res[0]=41978 != expected_result[0]=41184
mismatch: res[1]=103 != expected_result[1]=11744
mismatch: res[2]=0 != expected_result[2]=60768
mismatch: res[3]=0 != expected_result[3]=1760
mismatch: res[4]=0 != expected_result[4]=60896
mismatch: res[5]=0 != expected_result[5]=4832
mismatch: res[6]=0 != expected_result[6]=61792
mismatch: res[7]=0 != expected_result[7]=49376
mismatch: res[8]=56 != expected_result[8]=45920
mismatch: res[9]=0 != expected_result[9]=30176
mismatch: res[10]=1 != expected_result[10]=41312
mismatch: res[11]=0 != expected_result[11]=42336
mismatch: res[12]=1 != expected_result[12]=14432
mismatch: res[13]=0 != expected_result[13]=53088
mismatch: res[14]=1 != expected_result[14]=50016
mismatch: res[15]=0 != expected_result[15]=4576
mismatch: res[16]=44576 != expected_result[16]=63456
mismatch: res[17]=103 != expected_result[17]=2016
mismatch: res[18]=0 != expected_result[18]=23008
mismatch: res[19]=0 != expected_result[19]=46944
mismatch: res[20]=35040 != expected_result[20]=25824
mismatch: res[21]=103 != expected_result[21]=18016
mismatch: res[22]=0 != expected_result[22]=34016
mismatch: res[23]=0 != expected_result[23]=48480
mismatch: res[24]=0 != expected_result[24]=37984
mismatch: res[25]=0 != expected_result[25]=26976
mismatch: res[26]=0 != expected_result[26]=64992
mismatch: res[27]=0 != expected_result[27]=51424
mismatch: res[28]=3216 != expected_result[28]=51040
mismatch: res[29]=103 != expected_result[29]=55904
mismatch: res[30]=0 != expected_result[30]=20576
mismatch: res[31]=0 != expected_result[31]=42336
---> result incorrect!
elapsed time: 11.382555

CUDA Version error in container.

Hi, I built a Docker image from your Dockerfile.
I start it with the command docker run --gpus=all --network=host --rm --name test --privileged --shm-size 32G -it <docker_image_name> bash
However, when running nvidia-smi inside the container, the CUDA version is reported as an error. Do you have the same problem?
Other GPU images do not have this problem.

g++ goto statement: jump to label error

Hello,

Could you tell us which g++ version you used for compiling cricket?
We use g++ 9.4.0 (Ubuntu), but it reports an "error: jump to label" error on each "goto" statement.

Thanks!

Execute the test program and report an error: result incorrect!

I executed the following command on my machine and got an error. (Machine configuration: CentOS 7 / kernel 3.10.0-693.2.2.el7.x86_64 / CUDA 11.3 / NVIDIA Tesla P100)

$ LD_PROLOAD=/root/bihui/cricket/bin/cricket-server.so /root/bihui/cricket/tests/test_apps/kernel.testapp
elapsed time: 0.294615
found 1 devices
A[2048]={15183, 5909, 39324, 16294, 25624, 63281, 1133, 9314, 52216, 12031, 2445, 21597, 13509, 26926, 26717, 17449, 43697, 9177, 38203, 37564, 12105, 14935, 31848, 46153, 29089, 43698, 2093, 2036, 22994, 54249, 10114, 26076, ... }
x[64]={17933, 27638, 54613, 24189, 9017, 1175, 50149, 57193, 19645, 55487, 22299, 17105, 43083, 11672, 34953, 11345, 33499, 59732, 33792, 8338, 37813, 38540, 51229, 58687, 14390, 5712, 18699, 57188, 8470, 21585, 22370, 26403, }
About to initialize CUDA context...
dev_A = 0x7f6a5e800000
CUDA context initialized
Mallocs done
Kernel Test: Create a checkpoint now!
dev_res: 0x7f6a5e800a00
continuing
Result:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
mismatch: res[0]=0 != expected_result[0]=41184
mismatch: res[1]=0 != expected_result[1]=11744
mismatch: res[2]=0 != expected_result[2]=60768
mismatch: res[3]=0 != expected_result[3]=1760
mismatch: res[4]=0 != expected_result[4]=60896
mismatch: res[5]=0 != expected_result[5]=4832
mismatch: res[6]=0 != expected_result[6]=61792
mismatch: res[7]=0 != expected_result[7]=49376
mismatch: res[8]=0 != expected_result[8]=45920
mismatch: res[9]=0 != expected_result[9]=30176
mismatch: res[10]=0 != expected_result[10]=41312
mismatch: res[11]=0 != expected_result[11]=42336
mismatch: res[12]=0 != expected_result[12]=14432
mismatch: res[13]=0 != expected_result[13]=53088
mismatch: res[14]=0 != expected_result[14]=50016
mismatch: res[15]=0 != expected_result[15]=4576
mismatch: res[16]=0 != expected_result[16]=63456
mismatch: res[17]=0 != expected_result[17]=2016
mismatch: res[18]=0 != expected_result[18]=23008
mismatch: res[19]=0 != expected_result[19]=46944
mismatch: res[20]=0 != expected_result[20]=25824
mismatch: res[21]=0 != expected_result[21]=18016
mismatch: res[22]=0 != expected_result[22]=34016
mismatch: res[23]=0 != expected_result[23]=48480
mismatch: res[24]=0 != expected_result[24]=37984
mismatch: res[25]=0 != expected_result[25]=26976
mismatch: res[26]=0 != expected_result[26]=64992
mismatch: res[27]=0 != expected_result[27]=51424
mismatch: res[28]=0 != expected_result[28]=51040
mismatch: res[29]=0 != expected_result[29]=55904
mismatch: res[30]=0 != expected_result[30]=20576
mismatch: res[31]=0 != expected_result[31]=42336
---> result incorrect!
elapsed time: 0.090015

cudaFuncGetAttributes does not work but cuFuncGetAttribute does

As mentioned in #15

cudaFuncGetAttributes fails on the server with cudaErrorInvalidDeviceFunction; replacing it with cuFuncGetAttribute works.

This likely occurs because runtime API objects do not exist as such on the server side but are translated to driver API cuModule and cuFunction objects. We need to perform this translation for every runtime API function that references modules or functions (if there are any others).

Fail to start the provided docker

This is a great project. Thank you for all the contributions.
I used the Dockerfile in the utils folder to build a Docker image (only replacing "nvidia-driver-NVML-530.30.02" with "nvidia-driver-NVML" so that it would build) and ran it with the following commands.

docker build -f Dockerfile -t cricket .
docker run -itd --gpus all --net host --ipc host --name cricket cricket
docker attach cricket

However, after I built cricket with the following commands inside the container:

git clone https://github.com/RWTH-ACS/cricket.git
cd cricket && git submodule update --init
LOG=INFO make

and started the cricket server via ./bin/cricket-rpc-server, the following error showed up:

welcome to cricket!
+00:00:00.000002 INFO:  using TCP...
+00:00:00.000612 INFO:  listening on port 61869
+00:00:00.000687 ERROR: unable to register (RPC_PROG_PROG, RPC_PROG_VERS).      in cpu-server.c:257

It seems that registering the RPC service failed, but there is no problem with the container's network connection. Could you please tell me whether I started the container or cricket in the wrong way?

Thanks a lot

run cricket/tests/gpu/checkpoint.sh get error

Hello, I ran the cricket/tests/gpu/checkpoint.sh script and got an error, like this:
0x00007ffff7ddc140 in _start () from /lib64/ld-linux-x86-64.so.2

[Inferior 1 (process 16963) detached]

04/20/22 15:06:25.475429 (DEBUG): the command is 'kernel.testapp'

04/20/22 15:06:25.475654 DEBUG: using prog=-246232657, vers=674482919, derived from "/proc/self/exe" in cpu-server.c(200)

04/20/22 15:06:25.475671 (INFO): using TCP...

04/20/22 15:06:25.476136 (INFO): listening on port 60942

04/20/22 15:06:25.476267 (DEBUG): opening '/home/eiling/projects/cricket/tests/bin/kernel.testapp'

04/20/22 15:06:25.476433 (INFO): found CUDA initialization function at 0x4018aa

04/20/22 15:06:25.476481 (INFO): waiting for RPC requests...

server pid: 16963

client pid: 16983

04/20/22 15:06:27.138071 (INFO): connection to host "localhost"

04/20/22 15:06:27.138345 (DEBUG): the command is "kernel.testapp"

04/20/22 15:06:27.138356 DEBUG: using prog=-246232657, vers=674482919 in cpu-client.c(87)

04/20/22 15:06:27.138366 (INFO): connecting via TCP...

04/20/22 15:06:27.139252 (INFO): string: "hello"

04/20/22 15:06:27.142453 (DEBUG): found kernel "_Z15kernel_no_paramv" [param_num: 0, param_size: -1445615080]

04/20/22 15:06:27.142590 (DEBUG): found kernel "_Z6kernelPtS_S_csix" [param_num: 7, param_size: 40]

04/20/22 15:06:27.144101 (DEBUG): child exit code: 0

04/20/22 15:06:27.144126 DEBUG: __cudaRegisterFatBinary in cpu-client.c(270)

__cudaRegisterFunction(fatCubinHandle=(nil), hostFun=0x401812, devFunc=_Z15kernel_no_paramv, deviceName=_Z15kernel_no_paramv, thread_limit=-1, tid=[(nil)], bid=[(nil)], bDim=[(nil)], gDim=[(nil)], wSize=(nil))

04/20/22 15:06:27.144141 DEBUG: searching for _Z15kernel_no_paramv in 2 entries in cpu-utils.c(193)

04/20/22 15:06:27.144147 DEBUG: _Z15kernel_no_paramv in cpu-utils.c(198)

__cudaRegisterFunction(fatCubinHandle=(nil), hostFun=0x4016cd, devFunc=_Z6kernelPtS_S_csix, deviceName=_Z6kernelPtS_S_csix, thread_limit=-1, tid=[(nil)], bid=[(nil)], bDim=[(nil)], gDim=[(nil)], wSize=(nil))

04/20/22 15:06:27.144156 DEBUG: searching for _Z6kernelPtS_S_csix in 2 entries in cpu-utils.c(193)

04/20/22 15:06:27.144161 DEBUG: _Z15kernel_no_paramv in cpu-utils.c(198)

04/20/22 15:06:27.144166 DEBUG: _Z6kernelPtS_S_csix in cpu-utils.c(198)

04/20/22 15:06:27.144234 DEBUG: cudaGetDeviceCount in cpu-server-runtime.c(302)

elapsed time: 0.151057

found 1 devices

A[2048]={15183, 5909, 39324, 16294, 25624, 63281, 1133, 9314, 52216, 12031, 2445, 21597, 13509, 26926, 26717, 17449, 43697, 9177, 38203, 37564, 12105, 14935, 31848, 46153, 29089, 43698, 2093, 2036, 22994, 54249, 10114, 26076, ... }

x[64]={17933, 27638, 54613, 24189, 9017, 1175, 50149, 57193, 19645, 55487, 22299, 17105, 43083, 11672, 34953, 11345, 33499, 59732, 33792, 8338, 37813, 38540, 51229, 58687, 14390, 5712, 18699, 57188, 8470, 21585, 22370, 26403, }

About to initialize CUDA context...

04/20/22 15:06:27.295370 DEBUG: cudaMalloc in cpu-server-runtime.c(1164)

dev_A = 0x7fff95e00000

CUDA context initialized

04/20/22 15:06:27.436070 DEBUG: cudaMalloc in cpu-server-runtime.c(1164)

04/20/22 15:06:27.436169 DEBUG: cudaMalloc in cpu-server-runtime.c(1164)

04/20/22 15:06:27.436273 DEBUG: cudaMalloc in cpu-server-runtime.c(1164)

Mallocs done

04/20/22 15:06:27.436385 DEBUG: cudaMemcpyHtoD in cpu-server-runtime.c(1326)

04/20/22 15:06:27.436531 DEBUG: cudaMemcpyHtoD in cpu-server-runtime.c(1326)

04/20/22 15:06:27.436597 DEBUG: cudaMemcpyHtoD in cpu-server-runtime.c(1326)

04/20/22 15:06:27.436660 DEBUG: cudaLaunchKernel(0x4016cd) in cpu-client-runtime.c(936)

04/20/22 15:06:27.436684 (DEBUG): calling kernel "_Z6kernelPtS_S_csix" (param_size: 40, param_num: 7)

04/20/22 15:06:27.436734 DEBUG: arg: 0x7fff95e00000 (-1780482048) in cpu-server-runtime.c(850)

04/20/22 15:06:27.436745 DEBUG: arg: 0x7fff95e00800 (-1780480000) in cpu-server-runtime.c(850)

04/20/22 15:06:27.436750 DEBUG: arg: 0x7fff95e00a00 (-1780479488) in cpu-server-runtime.c(850)

04/20/22 15:06:27.436755 DEBUG: arg: (nil) (0) in cpu-server-runtime.c(850)

04/20/22 15:06:27.436761 DEBUG: arg: (nil) (0) in cpu-server-runtime.c(850)

04/20/22 15:06:27.436766 DEBUG: arg: (nil) (0) in cpu-server-runtime.c(850)

04/20/22 15:06:27.436770 DEBUG: arg: (nil) (0) in cpu-server-runtime.c(850)

04/20/22 15:06:27.436775 DEBUG: cudaLaunchKernel(func=0x4016cd, gridDim=[1,1,1], blockDim=[32,1,1], args=0xd625a0, sharedMem=0, stream=(nil)) in cpu-server-runtime.c(853)

04/20/22 15:06:27.439502 DEBUG: cudaLaunchKernel result: 0 in cpu-server-runtime.c(864)

Kernel Test: Create a checkpoint now!

04/20/22 15:06:27.439597 DEBUG: cudaDeviceSynchronize in cpu-server-runtime.c(287)

local still running: 0/128

local still running: 1/128

local still running: 2/128

local still running: 3/128

local still running: 4/128

local still running: 5/128

local still running: 6/128

local still running: 7/128

local still running: 8/128

local still running: 9/128

local still running: 10/128

local still running: 11/128

local still running: 12/128

local still running: 13/128

local still running: 14/128

local still running: 15/128

local still running: 16/128

local still running: 17/128

local still running: 18/128

criu doing

c1.sh: line 47: 16983 Killed LD_PRELOAD=${CRICKET_CLIENT}:/opt/cricket/bin/libtirpc.so ${CUDA_APP}

criu done

cricket checkpoint doing

[New LWP 16989]

[New LWP 16990]

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib64/libthread_db.so.1".

0x00007ffff7ffa7c2 in clock_gettime ()

local still running: 19/128

local still running: 20/128

local still running: 21/128

[Detaching after fork from child process 17041]

04/20/22 15:06:30.558914 (DEBUG): the command is 'cudbgprocess'

04/20/22 15:06:30.558961 (DEBUG): skipping RPC server

Thread 1 "kernel.testapp" received signal SIGURG, Urgent I/O condition.

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

0x0000000000d5d960 in kernel(unsigned short*, unsigned short*, unsigned short*, char, short, int, long long)<<<(1,1,1),(32,1,1)>>> ()

Cuda api initialized and attached!

got API

Device "Tesla V100-SXM2-16GB":

    index: 0

    type: "GV100GL-A"

    SM type: "sm_70"

    lanes: 32

    predicates 7

    registers: 256

    SMs: 80

    warps: 64

checkpointing kernel with name: "_Z6kernelPtS_S_csix"

stack-size: 336, param-addr: 352, param-size: 40, param-num: 7

SM 0: 1 - 0000000000000000000000000000000000000000000000000000000000000001

SM 0 warp 0 (active): 55555555 - 01010101010101010101010101010101

SM 0 warp 0 (valid): ffffffff - 11111111111111111111111111111111

04/20/22 15:06:30.811030 ERROR: There is no room in the top level function (i.e. the kernel). This kernel can thus never be restored! in /cricket-cr.c(835)

cricket-checkpoint: could not make checkpointable.

local still running: 22/128

Thread 1 "kernel.testapp" received signal SIGURG, Urgent I/O condition.

0x00007ffff7ffa7c2 in clock_gettime ()

[Inferior 1 (process 16963) detached]

cricket done

How can I solve this problem?
