Comments (13)

iamdroppy commented on July 4, 2024

Update: it's all working now. Tomorrow I'll edit this comment to explain how I got it working.

from deepdetect.

iamdroppy commented on July 4, 2024

As promised (apologies for the delay).

/code/gpu

.env:

DD_PLATFORM=./../..
DD_SERVER_TAG=latest
DD_SERVER_IMAGE=gpu_torch
DD_PLATFORM_UI_TAG=latest
DD_JUPYTER_TAG=latest
DD_FILEBROWSER_TAG=latest

docker-compose.yml:

version: '2.3'
services:

  #
  # Platform Data
  #
  # Get data from dockerhub to run various services
  #

  platform_data:
    image: jolibrain/platform_data:latest
    user: ${CURRENT_UID}
    volumes:
      - ${DD_PLATFORM}:/platform


  #
  # Deepdetect
  #

  deepdetect:
    image: jolibrain/deepdetect_${DD_SERVER_IMAGE}:${DD_SERVER_TAG}
    runtime: nvidia
    restart: always
    volumes:
      - ${DD_PLATFORM}:/opt/platform

  #
  # Platform UI
  #
  # modify port 80 to change facade port
  #

  platform_ui:
    image: jolibrain/platform_ui:${DD_PLATFORM_UI_TAG}
    restart: always
    ports:
      - '${DD_PORT:-1912}:80'
    links:
      - jupyter:jupyter
      - deepdetect:deepdetect
      - gpustat_server
      - filebrowser
      - dozzle
    volumes:
      - ./config/nginx/nginx.conf:/etc/nginx/nginx.conf
      - ${DD_PLATFORM}:/opt/platform
      - ./config/platform_ui/config.json:/usr/share/nginx/html/config.json
      - ./.env:/usr/share/nginx/html/version

  #
  # Jupyter notebooks
  #

  jupyter:
    image: jolibrain/jupyter_dd_notebook:${DD_JUPYTER_TAG}
    runtime: nvidia
    user: root
    environment:
      - JUPYTER_LAB_ENABLE=yes
      - NB_UID=${MUID}
    volumes:
      - ${DD_PLATFORM}:/opt/platform
      - ${DD_PLATFORM}/notebooks:/home/jovyan/work

  #
  # gpustat-server
  #
  gpustat_server:
    image:  jolibrain/gpustat_server
    runtime: nvidia

  #
  # filebrowser
  #
  filebrowser:
    image: jolibrain/filebrowser:${DD_FILEBROWSER_TAG}
    restart: always
    user: ${CURRENT_UID}
    volumes:
      - ${DD_PLATFORM}/data:/srv/data

  #
  # real-time log viewer for docker containers
  #
  dozzle:
    image: amir20/dozzle
    restart: always
    environment:
      - DOZZLE_BASE=/docker-logs
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

Update to dd_widgets:

# Open a shell inside the Jupyter container
docker exec -it jupyter_dd_notebook /bin/bash
# Inside the container: replace the packaged dd_widgets with the latest master
apt update -y; apt install vim git -y; \
    git clone https://github.com/jolibrain/dd_widgets; \
    rm -rf /opt/conda/lib/python3.8/site-packages/dd_widgets/; \
    mv dd_widgets/dd_widgets/ /opt/conda/lib/python3.8/site-packages/

---

My setup: GPUID = 0 and Engine set to DEFAULT.

Bycob commented on July 4, 2024

Hi, this is a dd_widgets issue that has been fixed on the latest master, but the packages have not been updated yet. Sorry for the inconvenience. How did you install dd_widgets? Until we update the packages, you can fix the issue by installing the latest dd_widgets master.

iamdroppy commented on July 4, 2024

The second error message still remains, though. I've updated the container directly.

Apologies for posting in the wrong repo; I was unsure. I'd guess the second error belongs to this one?

If you happen to have any clue what's going on, I'll be immensely thankful. I've been trying to deploy all morning, but the error messages aren't really helpful.

Kind regards,
Lucca Ferri

Bycob commented on July 4, 2024

The second message is a DD error indicating that something is wrong with the GPU. You can run nvidia-smi to check for problems, make sure you run the DD GPU Docker image with nvidia-docker installed, and check that there is enough memory available on your GPU.
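As a rough way to run the memory check suggested above, the sketch below calls nvidia-smi with its CSV query options and reports free memory per GPU. The helper names `parse_gpu_csv` and `query_gpus` are made up for this example, and it assumes nvidia-smi is on the PATH of whichever environment (host or container) you run it in.

```python
import subprocess

# nvidia-smi query mode: one CSV row per GPU, memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows into dicts (hypothetical helper)."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split from both ends so a comma inside the GPU name is tolerated.
        index, rest = line.split(",", 1)
        name, used, total = rest.rsplit(",", 2)
        used_mib, total_mib = int(used), int(total)
        gpus.append({
            "index": int(index),
            "name": name.strip(),
            "used_mib": used_mib,
            "total_mib": total_mib,
            "free_mib": total_mib - used_mib,
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and parse its output (hypothetical helper)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['free_mib']} MiB free")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi is not usable here: {exc}")
```

Running it both on the host and inside the deepdetect container shows whether the container sees the same GPUs as the host.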

iamdroppy commented on July 4, 2024

@Bycob sorry to take up your time, but I'm really lost on one thing:

nvidia-smi returns OK, and nvidia-docker is also installed.

My question is: is an RTX 3080 enough to support this for testing and small datasets?

Kind regards, and once again, thanks for the support. I can't thank you enough for pointing me in the right direction!

Bycob commented on July 4, 2024

An RTX 3080 should be perfectly fine. I will try to reproduce and get back to you.
Any logs or system information would be helpful, especially the deepdetect server logs.

iamdroppy commented on July 4, 2024

GPU

My current issues with DeepDetect (note that I will update this as I keep trying):

dd_widgets

I'm updating dd_widgets with:

$ docker exec -it jupyter_dd /bin/bash
$ ~: apt update -y; apt install vim git -y; \
       git clone https://github.com/jolibrain/dd_widgets; \
       rm -rf /opt/conda/lib/python3.8/site-packages/dd_widgets/; \
       mv dd_widgets/dd_widgets/ /opt/conda/lib/python3.8/site-packages/

deepdetect

DeepDetect logs the following:

deepdetect_1      | [2022-08-09 17:19:40.732] [api] [error] service not found: "word_mnist"
deepdetect_1      | [2022-08-09 17:19:40.732] [api] [error] HTTP/1.1 "GET //services/word_mnist" <n/a> 404 0ms
deepdetect_1      | [2022-08-09 17:19:40.738] [word_mnist] [info] Using GPU 1
deepdetect_1      | [2022-08-09 17:19:40.738] [word_mnist] [error] service creation call failed: Dynamic exception type: CaffeErrorException
deepdetect_1      | std::exception::what: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
deepdetect_1      | 
deepdetect_1      | [2022-08-09 17:19:40.738] [api] [error] HTTP/1.1 "PUT //services/word_mnist" <n/a> 500 0ms
deepdetect_1      | [2022-08-09 17:19:41.335] [api] [info] HTTP/1.1 "GET /info" <n/a> 200 0ms
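As a rough aid for reading logs like the excerpt above, the sketch below keeps only the error-level entries. It assumes the `[timestamp] [logger] [level] message` layout and the optional docker-compose prefix (`deepdetect_1 |`) seen here; `parse_line` and `errors_only` are hypothetical helper names, not part of DeepDetect.

```python
import re

# Optional "container | " prefix, then three bracketed fields and the message.
LINE_RE = re.compile(
    r"^(?:\S+\s*\|\s*)?"
    r"\[(?P<time>[^\]]+)\]\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\[(?P<level>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line, or None for continuation lines."""
    match = LINE_RE.match(line.rstrip())
    return match.groupdict() if match else None


def errors_only(lines):
    """Keep only the entries whose level is 'error'."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "error"]


if __name__ == "__main__":
    sample = [
        'deepdetect_1      | [2022-08-09 17:19:40.738] [word_mnist] [info] Using GPU 1',
        'deepdetect_1      | [2022-08-09 17:19:40.738] [api] [error] HTTP/1.1 "PUT //services/word_mnist" <n/a> 500 0ms',
    ]
    for entry in errors_only(sample):
        print(entry["time"], entry["logger"], entry["message"])
```

Lines that do not start with the bracketed fields, such as the `std::exception::what:` continuation, are skipped rather than attached to the preceding entry, which keeps the sketch short.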

Meanwhile, this stack trace is shown in the UI:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
    124                 )
    125             )
--> 126             raise RuntimeError(
    127                 "Error code {code}: {msg}".format(
    128                     code=c.json()["status"]["dd_code"],

RuntimeError: Error code 1007: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)

CPU

Currently having issues with the preloaded models:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
    124                 )
    125             )
--> 126             raise RuntimeError(
    127                 "Error code {code}: {msg}".format(
    128                     code=c.json()["status"]["dd_code"],

RuntimeError: Error code 1006: Service Bad Request Error: using template while model prototxt and network weights exist, remove 'template' from 'mllib' or remove prototxt files instead instead ?

But even after fixing that, it says it's training, yet metrics.json keeps returning the same value:

(base) root@9dcc26872ff5:/opt/platform/models/training/examples/words_mnist# cat metrics.json 
{"status":{"code":200,"msg":"OK"},"head":{"method":"/train","job":1,"status":"running","time":65.0},"body":{"sname":"word_mnist","mltype":"ctc","measure_hist":{"train_loss_hist":[70.4117202758789,71.0136489868164,50.909339904785159],"elapsed_time_ms_hist":[16774.0,32144.0,49401.0],"learning_rate_hist":[0.00009999999747378752,0.00009999999747378752,0.00009999999747378752]},"description":"word_mnist","measure_sampling":{},"measure":{"test_names":{},"iteration":3.0,"elapsed_time_ms":49401.0,"remain_time_str":"1d:23h:33m:15s","train_loss":50.909339904785159,"flops":2041736704,"iter_time":17123.0,"iteration_duration_ms":17123.0,"remain_time":171195.765625,"learning_rate":0.00009999999747378752},"model":{"repository":"/opt/platform/models/training/examples/words_mnist"}}}

Beautified:

{
  "status": {
    "code": 200,
    "msg": "OK"
  },
  "head": {
    "method": "/train",
    "job": 1,
    "status": "running",
    "time": 65
  },
  "body": {
    "sname": "word_mnist",
    "mltype": "ctc",
    "measure_hist": {
      "train_loss_hist": [
        70.4117202758789,
        71.0136489868164,
        50.909339904785156
      ],
      "elapsed_time_ms_hist": [
        16774,
        32144,
        49401
      ],
      "learning_rate_hist": [
        0.00009999999747378752,
        0.00009999999747378752,
        0.00009999999747378752
      ]
    },
    "description": "word_mnist",
    "measure_sampling": {},
    "measure": {
      "test_names": {},
      "iteration": 3,
      "elapsed_time_ms": 49401,
      "remain_time_str": "1d:23h:33m:15s",
      "train_loss": 50.909339904785156,
      "flops": 2041736704,
      "iter_time": 17123,
      "iteration_duration_ms": 17123,
      "remain_time": 171195.765625,
      "learning_rate": 0.00009999999747378752
    },
    "model": {
      "repository": "/opt/platform/models/training/examples/words_mnist"
    }
  }
}

No updates happen there. It may be normal; I'm still learning the software and hope to get a good grasp of it soon.


Thanks again for your time!
Kind Regards,
Lucca Ferri

Bycob commented on July 4, 2024

CPU training is really slow, which is why you may not see metrics.json change often. From what you sent, it looks like the model is training normally.
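To make that easier to check, here is a minimal sketch that pulls the progress fields out of a metrics.json payload like the one posted above. The field names come from that payload; `summarize_metrics` and `has_progressed` are hypothetical helpers, and `SAMPLE` is a trimmed copy of the posted file.

```python
import json

# Trimmed copy of the metrics.json content posted earlier in the thread.
SAMPLE = """
{"head": {"method": "/train", "job": 1, "status": "running", "time": 65.0},
 "body": {"measure": {"iteration": 3.0, "train_loss": 50.909339904785159,
                      "iteration_duration_ms": 17123.0,
                      "remain_time_str": "1d:23h:33m:15s"}}}
"""


def summarize_metrics(raw):
    """Extract the fields that show whether training is advancing."""
    doc = json.loads(raw)
    measure = doc["body"]["measure"]
    return {
        "job_status": doc["head"]["status"],
        "iteration": int(measure["iteration"]),
        "train_loss": measure["train_loss"],
        "seconds_per_iteration": measure["iteration_duration_ms"] / 1000.0,
        "remaining": measure["remain_time_str"],
    }


def has_progressed(previous, current):
    """True if the iteration counter moved between two summaries."""
    return current["iteration"] > previous["iteration"]


if __name__ == "__main__":
    summary = summarize_metrics(SAMPLE)
    print(summary)
```

With the posted values, one iteration takes about 17 seconds, so two reads of metrics.json taken a few seconds apart can legitimately be identical. Comparing two summaries taken a minute apart with `has_progressed` gives a clearer signal.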

For GPU training, do you have multiple GPUs on your machine? If not, try changing gpuid to 0.

iamdroppy commented on July 4, 2024

For some reason, the same happens with gpuid 0...

My nvidia-smi:

Wed Aug 10 11:54:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   42C    P5    29W / 320W |    599MiB / 10240MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     12411      G   /usr/lib/xorg/Xorg                332MiB |
|    0   N/A  N/A     12581      G   /usr/bin/gnome-shell               61MiB |
|    0   N/A  N/A    134001      G   ...0/usr/lib/firefox/firefox      204MiB |
+-----------------------------------------------------------------------------+

nvidia-smi through nvidia-docker:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   43C    P8    23W / 320W |    601MiB / 10240MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Bycob commented on July 4, 2024

Exactly the same error? Do you have the DD logs?

iamdroppy commented on July 4, 2024

Hello again @Bycob, yes, same error. Logs:

deepdetect_1      | [2022-08-10 18:04:19.858] [api] [error] service not found: "word_mnist"
deepdetect_1      | [2022-08-10 18:04:19.858] [api] [error] HTTP/1.1 "GET //services/word_mnist" <n/a> 404 0ms
deepdetect_1      | [2022-08-10 18:04:19.863] [word_mnist] [info] Using GPU 0
deepdetect_1      | [2022-08-10 18:04:19.863] [word_mnist] [error] service creation call failed: Dynamic exception type: CaffeErrorException
deepdetect_1      | std::exception::what: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
deepdetect_1      | 
deepdetect_1      | [2022-08-10 18:04:19.863] [api] [error] HTTP/1.1 "PUT //services/word_mnist" <n/a> 500 0ms

UI:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
     45             self.output.clear_output()
     46             with self.output:
---> 47                 res = fun(*args, **kwargs)
     48                 try:
     49                     print(json.dumps(res, indent=2))

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
    138 
    139     def run(self, *_) -> JSONType:
--> 140         self._create()
    141         return self.train(resume=False)
    142 

/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
    124                 )
    125             )
--> 126             raise RuntimeError(
    127                 "Error code {code}: {msg}".format(
    128                     code=c.json()["status"]["dd_code"],

RuntimeError: Error code 1007: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)

When I try to publish the archived service:

Error while publishing service

InternalError: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)

Strange thing is, it appears as an Archived Job! Progress!


Edit: deepserver/info:

{
  "dd_msg": null,
  "status": null,
  "head": {
    "method": "/info",
    "build-type": "dev",
    "version": "v0.21.0-dirty",
    "branch": "heads/v0.21.0",
    "commit": "385122d4eace490ab95fa7a7b9ed92121af1414e",
    "compile_flags": "USE_CAFFE2=OFF USE_TF=OFF USE_NCNN=OFF USE_TORCH=OFF USE_HDF5=ON USE_CAFFE=ON USE_TENSORRT=OFF USE_TENSORRT_OSS=OFF USE_DLIB=OFF USE_CUDA_CV=OFF USE_SIMSEARCH=ON USE_ANNOY=OFF USE_FAISS=ON USE_COMMAND_LINE=ON USE_JSON_API=ON USE_HTTP_SERVER=OFF USE_CUDA_CV=OFF",
    "deps_version": "OPENCV_VERSION=4.2.0 CUDA_VERSION=11.1 CUDNN_VERSION=8.0.5 TENSORRT_VERSION=",
    "services": []
  },
  "body": null
}

Edit: when creating a service, it says "No gpu found for deepdetect server.", but a GPU is detected on the right side of the panel (along with its temperature, etc.).
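The `compile_flags` string in the /info response above is easier to read once split into key/value pairs. The sketch below does that and lists which backends this build reports as enabled. The `COMPILE_FLAGS` value is copied from that response; `parse_flags`, `enabled_backends` and the list of backend flags to check are choices made for this example only.

```python
# compile_flags as reported by the /info response posted above.
COMPILE_FLAGS = (
    "USE_CAFFE2=OFF USE_TF=OFF USE_NCNN=OFF USE_TORCH=OFF USE_HDF5=ON "
    "USE_CAFFE=ON USE_TENSORRT=OFF USE_TENSORRT_OSS=OFF USE_DLIB=OFF "
    "USE_CUDA_CV=OFF USE_SIMSEARCH=ON USE_ANNOY=OFF USE_FAISS=ON "
    "USE_COMMAND_LINE=ON USE_JSON_API=ON USE_HTTP_SERVER=OFF USE_CUDA_CV=OFF"
)

# Backend flags to inspect (a selection made for this example).
BACKEND_FLAGS = ("USE_CAFFE", "USE_TORCH", "USE_TF", "USE_NCNN", "USE_TENSORRT", "USE_DLIB")


def parse_flags(flag_string):
    """Split 'KEY=VALUE KEY=VALUE ...' into a dict; empty values are kept."""
    flags = {}
    for token in flag_string.split():
        key, _, value = token.partition("=")
        flags[key] = value
    return flags


def enabled_backends(flags, candidates=BACKEND_FLAGS):
    """Return the candidate flags that are set to ON."""
    return [name for name in candidates if flags.get(name) == "ON"]


if __name__ == "__main__":
    print(enabled_backends(parse_flags(COMPILE_FLAGS)))
```

For the response above this reports only the Caffe backend as enabled, while the .env that ended up working sets DD_SERVER_IMAGE=gpu_torch. The same `parse_flags` helper also works on the `deps_version` string, where `TENSORRT_VERSION=` simply yields an empty value.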

iamdroppy commented on July 4, 2024

Update: now testing with Ubuntu 20.04 LTS; I will report back with the results.
