
vcuda-controller's Issues

nvprof hangs in vcuda

Running nvprof under vcuda hangs; the log loops as follows:

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit
/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex
/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses
/tmp/cuda-control/src/hijack_call.c:380 pid: 14777
/tmp/cuda-control/src/hijack_call.c:380 pid: 14801
/tmp/cuda-control/src/hijack_call.c:380 pid: 14817
/tmp/cuda-control/src/hijack_call.c:385 read 3 items from /etc/vcuda/5d522d78e5429b9d305a4fbab92e203a6dd777f1dd985f30309ca907c031be5c/pids.config
/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization
/tmp/cuda-control/src/hijack_call.c:348 try to find 1920151404 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 1601205536 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 980579683 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 0 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 536865808 from pid tables
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown
/tmp/cuda-control/src/hijack_call.c:151 delta: 2228224, curr: 2228224
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 90,  share: 2228224, cur: 2228224
/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit
/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex
/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses
/tmp/cuda-control/src/hijack_call.c:380 pid: 14777
/tmp/cuda-control/src/hijack_call.c:380 pid: 14801
/tmp/cuda-control/src/hijack_call.c:380 pid: 14817

The Dockerfile is shown below; I build the image as tensorflow/tf-mul:nvprof.

FROM tensorflow/tensorflow:latest-gpu

ADD mul.py /mul.py
ENTRYPOINT ["nvprof", "python", "/mul.py"]
CMD []

The pod YAML is:

apiVersion: v1
kind: Pod
metadata:
  name: vcudal-prof
  labels:
    gpu-model: 2080t
spec:
  restartPolicy: Never
  enableServiceLinks: false
  containers:
  - name: test1
    image: tensorflow/tf-mul:nvprof
    securityContext:
      privileged: true
    env:
    - name: LOGGER_LEVEL
      value: "10"
    resources:
      limits:
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 8

If I change tencent.com/vcuda-core to 100 (exclusive mode), everything works fine.
Any help is welcome!

Failed to parse cgroups path

The cgroup paths inside the container are:

root@vcuda-test-5cdd4c8cb8-bjn2l:/notebooks# cat /proc/self/cgroup
12:hugetlb:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
11:blkio:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
10:cpuset:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
9:pids:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
8:net_cls,net_prio:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
7:freezer:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
6:cpu,cpuacct:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
5:perf_event:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
4:memory:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
3:devices:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
2:rdma:/
1:name=systemd:/kubepods.slice/kubepods-pod019c1fe8_0d92_4aa0_b61c_4df58bdde71c.slice/cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope
0::/system.slice/containerd.service

loader.c

int get_cgroup_data(const char *pid_cgroup, char *pod_uid, char *container_id,
                    size_t size) {
  // ...

  /**
   * remove unnecessary chars from $container_id and $pod_uid
   */
  if (is_systemd) {
    prune_pos = strstr(container_id, "-");
    if (!prune_pos) {
      LOGGER(4, "no - prefix");
      goto DONE;
    }
    memmove(container_id, prune_pos + 1, strlen(container_id));

    prune_pos = strstr(pod_uid, "-pod");
    if (!prune_pos) {
      LOGGER(4, "no pod string");
      goto DONE;
    }
    prune_pos += strlen("-pod");
    memmove(pod_uid, prune_pos, strlen(prune_pos));
    pod_uid[strlen(prune_pos)] = '\0';
    prune_pos = pod_uid;
    while (*prune_pos) {
      if (*prune_pos == '_') {
        *prune_pos = '-';
      }
      ++prune_pos;
    }
  } else {
    memmove(pod_uid, pod_uid + strlen("/pod"), strlen(pod_uid));
  }

The function get_cgroup_data cannot handle cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98: it strips only up to the first '-', so the containerd- part is left in the id.
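
One possible fix, sketched here under my own assumption (this is not the project's actual patch): cut the systemd unit name at the last '-' instead of the first one, so runtime prefixes that themselves contain a dash, such as cri-containerd-, are handled as well.

#include <stdio.h>
#include <string.h>

/* Sketch only (not the project's patch): strip a runtime prefix such as
 * "docker-" or "cri-containerd-" from a systemd-style cgroup unit name by
 * cutting at the LAST '-' before the hash, then dropping any ".scope". */
static int strip_container_id(char *container_id) {
  char *prune_pos = strrchr(container_id, '-');   /* last '-', not the first */
  if (!prune_pos) {
    return -1;                                    /* no prefix found at all */
  }
  memmove(container_id, prune_pos + 1, strlen(prune_pos + 1) + 1);

  char *scope = strstr(container_id, ".scope");   /* drop a trailing ".scope" */
  if (scope) {
    *scope = '\0';
  }
  return 0;
}

int main(void) {
  char id[] = "cri-containerd-9e073649debeec6d511391c9ec7627ee67ce3a3fb508b0fa0437a97f8e58ba98.scope";
  if (strip_container_id(id) == 0) {
    printf("%s\n", id);   /* prints the bare 64-character container id */
  }
  return 0;
}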

get "invalid device context" or "segmentation fault" problems in aarch64 machine

Thanks for your excellent code and open-source spirit.

Environment Info
gpu-manager version: built on master
vcuda version: built on master
nvidia driver: 470.199.02 or 470.42.01 or 460.106.00
cpu : aarch64
gpu: Tesla T4

Details
These days I am getting "invalid device context" or "segmentation fault" errors on my aarch64 machine.
When any app initializes, it reports that 5 functions are not found:
(screenshot)
When I run the CUDA samples:
(screenshot)
When I run a PyTorch demo:
(screenshot)
When I change to a 460.x driver, it reports a segmentation fault instead.

However, it works if I give the whole GPU to one pod (set vcore=100).
Finally, the same setup works fine on an x86 machine (same 470.x driver and T4 GPU card).

So, are there any differences between the aarch64 and x86 drivers?
Any advice would be appreciated.

Which GPU are you using ?

Hi! Thanks for the great work. I'm using a Tesla T4 to test gpu-manager and get low GPU utilization. I found that you mention GPU-specific modifications in #12, so I wonder which GPU you are using.

Need an update for CUDA 11.4?

I tried to use vcuda with Driver Version 470.57.02, and the program may fail without warning. Does it need to be updated for CUDA 11.4? Thanks!

some questions about initialization

  1. Why does initialization() call cuInit() while cuInit() also calls initialization()? (A sketch of the pattern follows this list.)
  2. Why is initialization() called in many places when only one call is needed (and g_init_once is even used to ensure that)?
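
For reference, the structure both questions are about presumably looks like the sketch below (an illustration of the usual hook pattern, not a quote from the repository): the exported cuInit is the hijacked entry point, the real driver function lives behind a dlsym-resolved pointer, and pthread_once makes every extra initialization() call a no-op.

#include <pthread.h>
#include <cuda.h>

/* Illustrative sketch, not the repository's code. "real_cuInit" stands in
 * for the driver's cuInit resolved via dlsym into the entry table. */
static pthread_once_t g_init_once = PTHREAD_ONCE_INIT;
static CUresult g_init_ret = CUDA_SUCCESS;
extern CUresult (*real_cuInit)(unsigned int flags);

static void initialization(void) {
  /* one-time setup: read the config, register with the manager, ...
   * then initialize the real driver exactly once */
  g_init_ret = real_cuInit(0);
}

/* The hijacked cuInit exported to applications. Every hooked entry point
 * may call pthread_once(), but initialization() only ever runs once. */
CUresult cuInit(unsigned int flags) {
  (void)flags;   /* simplified: a real hook would forward the flags */
  pthread_once(&g_init_once, initialization);
  return g_init_ret;
}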

untraceable GPU memory allocation

Describe the bug

When I was testing triton inference server 19.10, GPU memory usage increases when the following two functions are called:

  1. cuCtxGetCurrent
  2. cuModuleGetFunction

It seems that when a CUDA module is loaded, some data is transferred into GPU memory without any of the calls described in the Memory Management section of the driver API.

Although any subsequent cuMemAlloc call will be rejected once this untraceable GPU memory allocation has pushed usage past the user-set limit, it still seems like a flaw that actual GPU memory usage may exceed the limit.
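
One possible mitigation, sketched here under my own assumptions (this is not what vcuda-controller currently does), is to ask the driver for its own view of memory usage via cuMemGetInfo before approving an allocation, so memory consumed implicitly by module loading is also counted:

#include <cuda.h>

/* Sketch only: compare the driver's reported usage against a per-container
 * limit. "gpu_memory_limit" is a hypothetical value playing the role of
 * g_vcuda_config.gpu_memory. A CUDA context must be current, and the
 * numbers are device-wide rather than per-process. */
static int would_exceed_limit(size_t request_size, size_t gpu_memory_limit) {
  size_t free_bytes = 0, total_bytes = 0;
  if (cuMemGetInfo(&free_bytes, &total_bytes) != CUDA_SUCCESS) {
    return 0;   /* cannot tell; fall back to the book-kept accounting */
  }
  size_t used_bytes = total_bytes - free_bytes;
  return used_bytes + request_size > gpu_memory_limit;
}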

Environment
OS: Linux kube-node-zw 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

GPU Info: NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2

how to view logs output by vcuda-controller

I've noticed that vcuda-controller is compiled into a .so library, which is mounted into pods that use the GPU.

Also, some logs are printed to stderr, but how can I check them? I set "LOGGER_LEVEL" to "6" as an env var on the pod, but it made no difference.

ECC status shows ERR! in vcuda container

Test environment:
GPU GN7 (T4) on Tencent Cloud, k8s 1.16, gpu-manager image ccr.ccs.tencentyun.com/tkeimages/gpu-manager:latest

nvidia-smi result:
(screenshot)

nvidia-smi -q -i 0 result:
(screenshot)

More details:
root@gg-d678bdd9d-5wnfh:/usr/local/apache2# nvidia-smi -q -i 0

==============NVSMI LOG==============

Timestamp : Mon Jul 20 08:48:45 2020
Driver Version : 418.67
CUDA Version : 10.1

Attached GPUs : 1
GPU 00000000:00:08.0
Product Name : Tesla T4
Product Brand : Tesla
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0325118127917
GPU UUID : GPU-276df694-0844-f357-45d1-e6a1ee805e17
Minor Number : 0
VBIOS Version : 90.04.38.00.03
MultiGPU Board : No
Board ID : 0x8
GPU Part Number : 900-2G183-0000-001
Inforom Version
Image Version : G183.0200.00.02
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : Pass-Through
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x08
Domain : 0x0000
Device Id : 0x1EB810DE
Bus Id : 00000000:00:08.0
Sub System Id : 0x12A210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 15079 MiB
Used : 0 MiB
Free : 15079 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Function Not Found
Pending : Function Not Found
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 32 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 85 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 9.54 W
Power Limit : 70.00 W
Default Power Limit : 70.00 W
Enforced Power Limit : 70.00 W
Min Power Limit : 60.00 W
Max Power Limit : 70.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Default Applications Clocks
Graphics : 585 MHz
Memory : 5001 MHz
Max Clocks
Graphics : 1590 MHz
SM : 1590 MHz
Memory : 5001 MHz
Video : 1470 MHz
Max Customer Boost Clocks
Graphics : 1590 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

Failed to compile

bash hack/build.sh manager client

When executing the above command to compile the package, the following error occurs:

(screenshot of the compile error)

How to use the vcuda by docker run

I tried to build vcuda by running IMAGE_FILE={xx} ./build-img.sh and got /usr/bin/nvml-monitor and /usr/lib64/libcuda-control.so. Now I want to run the container with docker run xx; what should I do, and should I set --runtime=nvidia?

[Question] Why is `blocks` not actually used in `rate_limiter`?

Hi, I have been diving into the vcuda library code recently, and some details are a bit difficult to understand.

Why is blocks not actually used in rate_limiter? It seems it is only used for a log line:

static void rate_limiter(int grids, int blocks) {
  int before_cuda_cores = 0;
  int after_cuda_cores = 0;
  int kernel_size = grids;

  LOGGER(5, "grid: %d, blocks: %d", grids, blocks);
  LOGGER(5, "launch kernel %d, curr core: %d", kernel_size, g_cur_cuda_cores);
  if (g_vcuda_config.enable) {
    do {
CHECK:
      before_cuda_cores = g_cur_cuda_cores;
      LOGGER(8, "current core: %d", g_cur_cuda_cores);
      if (before_cuda_cores < 0) {
        nanosleep(&g_cycle, NULL);
        goto CHECK;
      }
      after_cuda_cores = before_cuda_cores - kernel_size;
    } while (!CAS(&g_cur_cuda_cores, before_cuda_cores, after_cuda_cores));
  }
}
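
For readers of the snippet: CAS is presumably a thin wrapper around a compiler builtin (an assumption on my part, not a quote from hijack_call.c), which makes the loop an atomic "take kernel_size tokens, sleeping while the bucket is negative"; blocks itself only appears in the log line.

/* Assumed definition (GCC/Clang builtin), not copied from the repository: */
#define CAS(ptr, old_val, new_val) \
  __sync_bool_compare_and_swap((ptr), (old_val), (new_val))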

Usage is not clear

Hi! I want to try and use your project; however, it is not clear to me how to use the vcuda-controller.
Should the image I get from the script ./build-img.sh be used as a base image for my GPU application? Should it be deployed on my k8s cluster?

I tried to use the vcuda-controller as a base image for a simple GPU CUDA stress test
using this Dockerfile:

FROM nvidia/cuda:8.0-devel as build

RUN apt-get update && apt-get install -y --no-install-recommends \
        wget && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /root
RUN wget http://wili.cc/blog/entries/gpu-burn/gpu_burn-0.7.tar.gz && tar xzf gpu_burn-0.7.tar.gz && make




FROM tkestack.io/gaia/vcuda:latest

COPY --from=build /root/gpu_burn /root/gpu_burn
ENTRYPOINT [ "/root/gpu_burn" ]
CMD [ "10" ]   # burn for 10 secs

and I get this error:

/root/gpu_burn: error while loading shared libraries: libcublas.so.8.0: cannot open shared object file: No such file or directory

Also, when trying to run the vcuda-controller image as-is on my k8s cluster (GPU-manager and GPU-admission are also present) using the example YAML from the GPU-manager repo:


apiVersion: v1
kind: Pod
metadata:
  name: test
  labels:
    app: test
spec:
  containers:
    - name: test
      image: razbne/vcuda
      command: ['/usr/local/nvidia/bin/nvidia-smi']
      resources:
        requests:
          tencent.com/vcuda-core: 10
          tencent.com/vcuda-memory: 10
        limits:
          tencent.com/vcuda-core: 10
          tencent.com/vcuda-memory: 10

Verifying that the GPU is attached:

[root@test/]# lspci
0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
0001:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

I cannot get nvidia-smi to work; it just hangs without any output.
I would be happy if you could share more information about how to use the vcuda-controller.

A question about the function nvmlDeviceGetProcessUtilization

Hi, I have tested this across multiple processes (2 processes), and I found that nvmlDeviceGetProcessUtilization always returns a result for only one process at a time, and the reported process alternates; that is, one call shows one CUDA program as running while the other appears stopped. Is that right? Could that cause the control to behave incorrectly?
(screenshot)

support cuda 11.7

We are using vcuda for GPU sharing, but our CUDA version is 11.7. We would like to know how to use it with CUDA 11.7. Thanks!

Problems caused by launching multiple pods at the same time

Why do I get an error when I start multiple GPU-resource pods simultaneously (concurrently) using vcuda?

In vcuda's loader.c I added an ferror check to print the errno-related error message, and I get this:

(screenshot)

But when I start the pods sequentially, I don't have this problem, so I suspect it is caused by a race between kubelet starting the container and gpu-manager placing the libcuda.so file.

can't find function libcuda.so.440.100 in cuEGLInit

So what specific driver and cuda version should I install?

Also, this log message seems incorrect (the arguments appear swapped):

original

  if (unlikely(!cuda_library_entry[idx].fn_ptr)) {
    LOGGER(4, "can't find function %s in %s", cuda_filename,
           cuda_library_entry[idx].name);
  }

can't find function libcuda.so.440.100 in cuEGLInit

fix

  if (unlikely(!cuda_library_entry[idx].fn_ptr)) {
    LOGGER(4, "can't find function '%s' in %s", cuda_library_entry[idx].name, 
           cuda_filename);
  }

can't find function 'cuEGLInit' in libcuda.so.440.100

(pid=343086) WARNING:root:remote calling tf.config.list_physical_devices('GPU')
(pid=343086) 2020-09-11 07:52:40.449785: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
(pid=343086) /tmp/cuda-control/src/loader.c:941 config file: /etc/vcuda/19e8c670ddea331f7e12e46fa90a0fdb8510192d98ac939c35bbc25df2e17b07/vcuda.config
(pid=343086) /tmp/cuda-control/src/loader.c:942 pid file: /etc/vcuda/19e8c670ddea331f7e12e46fa90a0fdb8510192d98ac939c35bbc25df2e17b07/pids.config
(pid=343086) /tmp/cuda-control/src/loader.c:946 register to remote: pod uid: 98cd4783-221f-4080-9aec-f6448dc9bbc2, cont id: 19e8c670ddea331f7e12e46fa90a0fdb8510192d98ac939c35bbc25df2e17b07
(pid=343086) /tmp/cuda-control/src/loader.c:1044 pod uid          : 98cd4783-221f-4080-9aec-f6448dc9bbc2
(pid=343086) /tmp/cuda-control/src/loader.c:1045 limit            : 50
(pid=343086) /tmp/cuda-control/src/loader.c:1046 container name   : nvidia2
(pid=343086) /tmp/cuda-control/src/loader.c:1047 total utilization: 50
(pid=343086) /tmp/cuda-control/src/loader.c:1048 total gpu memory : 2684354560
(pid=343086) /tmp/cuda-control/src/loader.c:1049 driver version   : 440.100
(pid=343086) /tmp/cuda-control/src/loader.c:1050 hard limit mode  : 0
(pid=343086) /tmp/cuda-control/src/loader.c:1051 enable mode      : 1
(pid=343086) /tmp/cuda-control/src/loader.c:767 Start hijacking
(pid=343086) /tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.440.100 in cuEGLInit
(pid=343086) /tmp/cuda-control/src/hijack_call.c:481 cuInit error no CUDA-capable device is detected
(pid=343086) *** Aborted at 1599810760 (unix time) try "date -d @1599810760" if you are using GNU date ***
(pid=343086) PC: @                0x0 (unknown)
(pid=343086) *** SIGABRT (@0x53c2e) received by PID 343086 (TID 0x7f419a0f4740) from PID 343086; stack trace: ***
(pid=343086)     @     0x7f4199cd18a0 (unknown)
(pid=343086)     @     0x7f419990cf47 gsignal
(pid=343086)     @     0x7f419990e8b1 abort
(pid=343086)     @     0x7f418a733d01 google::LogMessage::Flush()
(pid=343086)     @     0x7f418a733dd1 google::LogMessage::~LogMessage()
(pid=343086)     @     0x7f418a71ed29 ray::RayLog::~RayLog()
(pid=343086)     @     0x7f418a3bc32d ray::CoreWorkerProcess::~CoreWorkerProcess()
(pid=343086)     @     0x7f418a3bc3da std::unique_ptr<>::~unique_ptr()
(pid=343086)     @     0x7f41999110f1 (unknown)
(pid=343086)     @     0x7f41999111ea exit
(pid=343086)     @     0x7f410af2b497 initialization
(pid=343086)     @     0x7f4199cce827 __pthread_once_slow
(pid=343086)     @     0x7f410af2ce3b cuInit
(pid=343086)     @     0x7f412fd55da0 cuInit
(pid=343086)     @     0x7f412fc8f19f stream_executor::gpu::(anonymous namespace)::InternalInit()
(pid=343086)     @     0x7f412fc8f42d stream_executor::gpu::GpuDriver::Init()
(pid=343086)     @     0x7f4146f77162 stream_executor::gpu::CudaPlatform::VisibleDeviceCount()
(pid=343086)     @     0x7f414694852f tensorflow::BaseGPUDeviceFactory::CacheDeviceIds()
(pid=343086)     @     0x7f414694862f tensorflow::BaseGPUDeviceFactory::ListPhysicalDevices()
(pid=343086)     @     0x7f4146a74e9d tensorflow::DeviceFactory::ListAllPhysicalDevices()
(pid=343086)     @     0x7f415c47a35d tensorflow::TF_ListPhysicalDevices()
(pid=343086)     @     0x7f415c474646 _ZZN8pybind1112cpp_function10initializeIRPFNS_6objectEvES2_JEJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESM_
(pid=343086)     @     0x7f415c49e239 pybind11::cpp_function::dispatcher()
(pid=343086)     @     0x556b0ee23c94 _PyMethodDef_RawFastCallKeywords
(pid=343086)     @     0x556b0ee23db1 _PyCFunction_FastCallKeywords
(pid=343086)     @     0x556b0ee8f5be _PyEval_EvalFrameDefault
(pid=343086)     @     0x556b0ee2320b _PyFunction_FastCallKeywords
(pid=343086)     @     0x556b0ee8ae70 _PyEval_EvalFrameDefault
(pid=343086)     @     0x556b0edd3b00 _PyEval_EvalCodeWithName
(pid=343086)     @     0x556b0ee23497 _PyFunction_FastCallKeywords
(pid=343086)     @     0x556b0ee8ae70 _PyEval_EvalFrameDefault
(pid=343086)     @     0x556b0edd32b9 _PyEval_EvalCodeWithName

Add build guide and update the image name

Describe the bug

  1. There is no build guide in the README.
  2. The build result is a Docker image named tkestack.io/gaia/vcuda, which needs to be renamed.

Environment

  • OS: Linux VM_149_11_centos 3.10.107-1-tlinux2_kvm_guest-0049 #1 SMP Tue Jul 30 23:46:29 CST 2019 x86_64 x86_64 x86_64 GNU/Linux
  • golang: go version go1.12.4 linux/amd64

Doubts about cuArray3DCreate_helper calculating memory usage

I see this function:

static CUresult cuArray3DCreate_helper(
    const CUDA_ARRAY3D_DESCRIPTOR *pAllocateArray) {
  size_t used = 0;
  size_t base_size = 0;
  size_t request_size = 0;
  CUresult ret = CUDA_SUCCESS;
  if (g_vcuda_config.enable) {
    base_size = get_array_base_size(pAllocateArray->Format);
    request_size = base_size * pAllocateArray->NumChannels *
                   pAllocateArray->Height * pAllocateArray->Width;
    atomic_action(pid_path, get_used_gpu_memory, (void *) &used);
    if (unlikely(used + request_size > g_vcuda_config.gpu_memory)) {
      ret = CUDA_ERROR_OUT_OF_MEMORY;
      goto DONE;
    }
  }

So, should it be changed to:

    request_size = base_size * pAllocateArray->NumChannels *
                   pAllocateArray->Height * pAllocateArray->Width * pAllocateArray->Depth;
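
One caveat with that change, noted as my own observation rather than a confirmed patch: for 1D and 2D arrays created through cuArray3DCreate the unused Height/Depth fields are 0, so a plain multiplication would collapse request_size to 0. A small sketch that guards against this:

#include <cuda.h>

/* Sketch: bytes requested for a CUDA array descriptor, treating zero
 * Height/Depth (1D/2D arrays) as 1 so the product does not become 0.
 * "base_size" is the per-element size from get_array_base_size(). */
static size_t array3d_request_size(const CUDA_ARRAY3D_DESCRIPTOR *desc,
                                   size_t base_size) {
  size_t height = desc->Height ? desc->Height : 1;
  size_t depth  = desc->Depth  ? desc->Depth  : 1;
  return base_size * desc->NumChannels * desc->Width * height * depth;
}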

nvmlDeviceGetProcessUtilization disappears from 440

When I check the NVML API in driver 440, which is the latest version, it seems the function nvmlDeviceGetProcessUtilization no longer exists (or maybe it was relocated to another section?).

Will this break the functionality of vcuda-controller?
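
A quick way to check is to probe the installed driver's NVML at runtime instead of relying on the documentation; a small sketch of my own (not from the repository):

#include <dlfcn.h>
#include <stdio.h>

/* Sketch: report whether the running driver's NVML still exports the
 * symbol. Build with: cc probe_nvml.c -ldl */
int main(void) {
  void *nvml = dlopen("libnvidia-ml.so.1", RTLD_LAZY);
  if (!nvml) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  void *fn = dlsym(nvml, "nvmlDeviceGetProcessUtilization");
  printf("nvmlDeviceGetProcessUtilization: %s\n", fn ? "present" : "missing");
  dlclose(nvml);
  return 0;
}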

P4 GPU cannot reach sufficient GPU utilization

I created a pod using 33 vcuda-core on a P4 GPU, but the GPU utilization stays under 3%.

After setting LOGGER_LEVEL to 6, I can see these logs:

...
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 17190
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 17189
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 17188
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 1
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 1
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4, curr core: 19275
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4, curr core: 19271
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 2, curr core: 19267
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 19265
...
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 11371
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 6763
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 2155
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: -2453
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 1
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 1
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 14667
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 10059
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 5451
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 843
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: -3765
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 1
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 1
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 14667
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 10059
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 5451
...
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 677
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 4608, curr core: 676
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: -3932
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 5, curr core: 20497
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 1, curr core: 20492
/tmp/cuda-control/src/hijack_call.c:168 launch kernel 2, curr core: 20491

The curr core value keeps cycling from 20497 down to negative, and utilization never rises above 2%.

After checking the code, I also found that the P4 has 20 SMs and the increment is calculated as follows:

(screenshot)

Here are my questions:

  1. Why is the part in the red box used for the increment calculation?
  2. Have you seen this before? I noticed that in the green box you enlarge the increment.
  3. How can we fix this so that the P4 can reach sufficient GPU utilization?

Some issues on CUDA 11.4

I want to ask a question, and I look forward to your reply.
cuGetProcAddress was added in CUDA 11.4 and can prevent our hijacking code from being executed. Have you solved this problem? If so, how?
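
For context, cuGetProcAddress lets an application obtain driver function pointers directly from libcuda, bypassing a dlsym-based hook. The usual countermeasure, sketched below under my own assumptions (the hooked_* wrappers and the real_cuGetProcAddress pointer are hypothetical stand-ins for the existing hijack entries), is to interpose cuGetProcAddress itself and return the wrappers for the symbols being controlled:

#include <cuda.h>
#include <string.h>

/* Hypothetical wrappers standing in for the existing hijacked entry points. */
extern CUresult hooked_cuInit(unsigned int flags);
extern CUresult hooked_cuMemAlloc(CUdeviceptr *dptr, size_t bytesize);

/* Real driver entry point, resolved once via dlsym by the loader. */
extern CUresult (*real_cuGetProcAddress)(const char *symbol, void **pfn,
                                         int cudaVersion, cuuint64_t flags);

/* Sketch: intercept cuGetProcAddress so queries for controlled symbols
 * return the wrappers instead of the raw driver functions. */
CUresult cuGetProcAddress(const char *symbol, void **pfn, int cudaVersion,
                          cuuint64_t flags) {
  if (strcmp(symbol, "cuInit") == 0) {
    *pfn = (void *)hooked_cuInit;
    return CUDA_SUCCESS;
  }
  if (strcmp(symbol, "cuMemAlloc") == 0) {
    *pfn = (void *)hooked_cuMemAlloc;
    return CUDA_SUCCESS;
  }
  /* Everything else falls through to the real driver lookup. */
  return real_cuGetProcAddress(symbol, pfn, cudaVersion, flags);
}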
