Comments (13)
Update: it's all working now. Tomorrow I'll edit this with how I managed to accomplish it.
from deepdetect.
As promised (apologies for the delay).
Files under /code/gpu:
.env:
DD_PLATFORM=./../..
DD_SERVER_TAG=latest
DD_SERVER_IMAGE=gpu_torch
DD_PLATFORM_UI_TAG=latest
DD_JUPYTER_TAG=latest
DD_FILEBROWSER_TAG=latest
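For anyone reusing this setup: docker-compose fills `${VAR}` references from `.env` or the shell environment, and otherwise falls back to the `${VAR:-default}` form used in the compose file below (e.g. `${DD_PORT:-1912}`). A minimal Python sketch of that fallback rule; the `interpolate` helper is hypothetical, only the variable names come from these files:

```python
import os

def interpolate(name: str, default: str = "") -> str:
    """Emulate docker-compose's ${NAME:-default} lookup: use the
    environment value when it is set and non-empty, else the default."""
    value = os.environ.get(name, "")
    return value if value else default

# DD_PORT is not defined in .env, so the UI would fall back to port 1912:
print(interpolate("DD_PORT", "1912"))
```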
docker-compose.yml:
version: '2.3'
services:
  #
  # Platform Data
  #
  # Get data from dockerhub to run various services
  #
  platform_data:
    image: jolibrain/platform_data:latest
    user: ${CURRENT_UID}
    volumes:
      - ${DD_PLATFORM}:/platform

  #
  # Deepdetect
  #
  deepdetect:
    image: jolibrain/deepdetect_${DD_SERVER_IMAGE}:${DD_SERVER_TAG}
    runtime: nvidia
    restart: always
    volumes:
      - ${DD_PLATFORM}:/opt/platform

  #
  # Platform UI
  #
  # modify port 80 to change facade port
  #
  platform_ui:
    image: jolibrain/platform_ui:${DD_PLATFORM_UI_TAG}
    restart: always
    ports:
      - '${DD_PORT:-1912}:80'
    links:
      - jupyter:jupyter
      - deepdetect:deepdetect
      - gpustat_server
      - filebrowser
      - dozzle
    volumes:
      - ./config/nginx/nginx.conf:/etc/nginx/nginx.conf
      - ${DD_PLATFORM}:/opt/platform
      - ./config/platform_ui/config.json:/usr/share/nginx/html/config.json
      - ./.env:/usr/share/nginx/html/version

  #
  # Jupyter notebooks
  #
  jupyter:
    image: jolibrain/jupyter_dd_notebook:${DD_JUPYTER_TAG}
    runtime: nvidia
    user: root
    environment:
      - JUPYTER_LAB_ENABLE=yes
      - NB_UID=${MUID}
    volumes:
      - ${DD_PLATFORM}:/opt/platform
      - ${DD_PLATFORM}/notebooks:/home/jovyan/work

  #
  # gpustat-server
  #
  gpustat_server:
    image: jolibrain/gpustat_server
    runtime: nvidia

  #
  # filebrowser
  #
  filebrowser:
    image: jolibrain/filebrowser:${DD_FILEBROWSER_TAG}
    restart: always
    user: ${CURRENT_UID}
    volumes:
      - ${DD_PLATFORM}/data:/srv/data

  #
  # real-time log viewer for docker containers
  #
  dozzle:
    image: amir20/dozzle
    restart: always
    environment:
      - DOZZLE_BASE=/docker-logs
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
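Three of the services above (deepdetect, jupyter, gpustat_server) declare `runtime: nvidia`, so the host needs a working driver and the NVIDIA container runtime before `docker-compose up`. A small preflight sketch; this check is my own addition, not part of the platform:

```python
import shutil
import subprocess

# Quick preflight before `docker-compose up`: services with
# `runtime: nvidia` will fail to start if the host driver is missing.
smi = shutil.which("nvidia-smi")
if smi is None:
    print("nvidia-smi not found: GPU services (deepdetect, jupyter, "
          "gpustat_server) will fail to start")
else:
    # `nvidia-smi -L` lists the visible GPUs, one per line.
    out = subprocess.run([smi, "-L"], capture_output=True, text=True)
    print(out.stdout or out.stderr)
```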
Update to dd_widgets:
docker exec -it jupyter_dd_notebook /bin/bash
apt update -y; apt install vim git -y; \
git clone https://github.com/jolibrain/dd_widgets; \
rm -rf /opt/conda/lib/python3.8/site-packages/dd_widgets/; \
mv dd_widgets/dd_widgets/ /opt/conda/lib/python3.8/site-packages/
---
My setup: GPUID = 0 and Engine set to DEFAULT.
from deepdetect.
Hi, it's an issue related to dd_widgets that has been fixed on the latest master, but the packages have not been updated, sorry for the inconvenience. How did you install dd_widgets? Until we update the packages you can fix your issue by installing the latest dd_widgets master.
from deepdetect.
The second error message still remains though... I've updated the container directly.
Apologies for the wrong repo, I was unsure; I'd guess the second one belongs to this one?
If you happen to have any clue what's going on I'll be immensely thankful. I've been trying to deploy all morning, but the error messages aren't really helpful...
Kind regards,
Lucca Ferri
from deepdetect.
The second message is a DD error indicating there is something wrong with the GPU. You can try nvidia-smi to see if anything is wrong, ensure that you run the DD GPU docker with nvidia-docker installed, and check that there is enough memory available on your GPU.
from deepdetect.
@Bycob sorry to take up your time, but I'm really lost on one thing:
nvidia-smi returns OK, and nvidia-docker is installed as well.
My question is: is an RTX 3080 enough to support this for testing and small datasets?
Kind regards and once again, thanks for the support, I can't thank you enough to point me in the right direction!
from deepdetect.
An RTX 3080 should be perfectly fine. I will try to reproduce and get back to you. Any log or system information would be helpful, especially the deepdetect server logs.
from deepdetect.
GPU
My current issues with DeepDetect. Note that I will update this post as I keep trying.
dd_widgets
I'm updating dd_widgets with:
$ docker exec -it jupyter_dd /bin/bash
# apt update -y; apt install vim git -y; \
  git clone https://github.com/jolibrain/dd_widgets; \
  rm -rf /opt/conda/lib/python3.8/site-packages/dd_widgets/; \
  mv dd_widgets/dd_widgets/ /opt/conda/lib/python3.8/site-packages/
deepdetect
DeepDetect logs the following:
deepdetect_1 | [2022-08-09 17:19:40.732] [api] [error] service not found: "word_mnist"
deepdetect_1 | [2022-08-09 17:19:40.732] [api] [error] HTTP/1.1 "GET //services/word_mnist" <n/a> 404 0ms
deepdetect_1 | [2022-08-09 17:19:40.738] [word_mnist] [info] Using GPU 1
deepdetect_1 | [2022-08-09 17:19:40.738] [word_mnist] [error] service creation call failed: Dynamic exception type: CaffeErrorException
deepdetect_1 | std::exception::what: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
deepdetect_1 |
deepdetect_1 | [2022-08-09 17:19:40.738] [api] [error] HTTP/1.1 "PUT //services/word_mnist" <n/a> 500 0ms
deepdetect_1 | [2022-08-09 17:19:41.335] [api] [info] HTTP/1.1 "GET /info" <n/a> 200 0ms
Meanwhile, this stack trace is shown in the UI:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
45 self.output.clear_output()
46 with self.output:
---> 47 res = fun(*args, **kwargs)
48 try:
49 print(json.dumps(res, indent=2))
/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
138
139 def run(self, *_) -> JSONType:
--> 140 self._create()
141 return self.train(resume=False)
142
/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
124 )
125 )
--> 126 raise RuntimeError(
127 "Error code {code}: {msg}".format(
128 code=c.json()["status"]["dd_code"],
RuntimeError: Error code 1007: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
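For context on the RuntimeError above: dd_widgets re-raises the JSON error body returned by the DD server. A sketch of pulling the code out of such a body; the body below is reconstructed from the traceback, and per the DeepDetect API docs dd_code 1007 is the generic internal ML-library error, here surfaced by Caffe/CUDA:

```python
import json

# Reconstructed error body of the kind dd_widgets parses; field names
# follow the DD API "status" object seen in the traceback above.
body = json.loads("""
{"status": {"code": 500, "msg": "InternalError", "dd_code": 1007,
 "dd_msg": "src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)"}}
""")

status = body["status"]
print("Error code {}: {}".format(status["dd_code"], status["dd_msg"]))
```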
CPU
Currently having issues with the preloaded models:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
45 self.output.clear_output()
46 with self.output:
---> 47 res = fun(*args, **kwargs)
48 try:
49 print(json.dumps(res, indent=2))
/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
138
139 def run(self, *_) -> JSONType:
--> 140 self._create()
141 return self.train(resume=False)
142
/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
124 )
125 )
--> 126 raise RuntimeError(
127 "Error code {code}: {msg}".format(
128 code=c.json()["status"]["dd_code"],
RuntimeError: Error code 1006: Service Bad Request Error: using template while model prototxt and network weights exist, remove 'template' from 'mllib' or remove prototxt files instead instead ?
But even when that is fixed, it says it's training while metrics.json keeps returning the same value:
(base) root@9dcc26872ff5:/opt/platform/models/training/examples/words_mnist# cat metrics.json
{"status":{"code":200,"msg":"OK"},"head":{"method":"/train","job":1,"status":"running","time":65.0},"body":{"sname":"word_mnist","mltype":"ctc","measure_hist":{"train_loss_hist":[70.4117202758789,71.0136489868164,50.909339904785159],"elapsed_time_ms_hist":[16774.0,32144.0,49401.0],"learning_rate_hist":[0.00009999999747378752,0.00009999999747378752,0.00009999999747378752]},"description":"word_mnist","measure_sampling":{},"measure":{"test_names":{},"iteration":3.0,"elapsed_time_ms":49401.0,"remain_time_str":"1d:23h:33m:15s","train_loss":50.909339904785159,"flops":2041736704,"iter_time":17123.0,"iteration_duration_ms":17123.0,"remain_time":171195.765625,"learning_rate":0.00009999999747378752},"model":{"repository":"/opt/platform/models/training/examples/words_mnist"}}}
Beautified:
{
"status": {
"code": 200,
"msg": "OK"
},
"head": {
"method": "/train",
"job": 1,
"status": "running",
"time": 65
},
"body": {
"sname": "word_mnist",
"mltype": "ctc",
"measure_hist": {
"train_loss_hist": [
70.4117202758789,
71.0136489868164,
50.909339904785156
],
"elapsed_time_ms_hist": [
16774,
32144,
49401
],
"learning_rate_hist": [
0.00009999999747378752,
0.00009999999747378752,
0.00009999999747378752
]
},
"description": "word_mnist",
"measure_sampling": {},
"measure": {
"test_names": {},
"iteration": 3,
"elapsed_time_ms": 49401,
"remain_time_str": "1d:23h:33m:15s",
"train_loss": 50.909339904785156,
"flops": 2041736704,
"iter_time": 17123,
"iteration_duration_ms": 17123,
"remain_time": 171195.765625,
"learning_rate": 0.00009999999747378752
},
"model": {
"repository": "/opt/platform/models/training/examples/words_mnist"
}
}
}
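One way to read a metrics.json snapshot like the one above: the server only rewrites the file every few iterations, so a shrinking train_loss_hist is the quickest sign that training is actually moving. A small sketch using the values above:

```python
# Values copied from the metrics.json above. A falling loss history is
# the simplest sign of progress even when the file updates rarely.
loss_hist = [70.4117202758789, 71.0136489868164, 50.909339904785156]
elapsed_ms = [16774, 32144, 49401]

iters = len(loss_hist)
print("iterations recorded:", iters)
print("loss falling:", loss_hist[-1] < loss_hist[0])
print("avg iteration time (s):", round(elapsed_ms[-1] / iters / 1000, 1))
```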
No updates happen there. It may be normal; I'm still researching the software and hope to get a good grasp of it soon.
Thanks again for your time!
Kind Regards,
Lucca Ferri
from deepdetect.
CPU training is really slow; that's why you may not see metrics.json change often. From what you sent, it looks like the model is training normally.
For GPU training, do you have multiple GPUs on your machine? If not, try changing gpuid to 0.
from deepdetect.
For some reason, the same happens on gpuid 0...
My nvidia-smi:
Wed Aug 10 11:54:50 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 42C P5 29W / 320W | 599MiB / 10240MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 12411 G /usr/lib/xorg/Xorg 332MiB |
| 0 N/A N/A 12581 G /usr/bin/gnome-shell 61MiB |
| 0 N/A N/A 134001 G ...0/usr/lib/firefox/firefox 204MiB |
+-----------------------------------------------------------------------------+
nvidia-smi inside Docker (via nvidia-docker):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 43C P8 23W / 320W | 601MiB / 10240MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
from deepdetect.
Exactly the same error? Do you have DD logs?
from deepdetect.
Hello again @Bycob , yes, same error, logs:
deepdetect_1 | [2022-08-10 18:04:19.858] [api] [error] service not found: "word_mnist"
deepdetect_1 | [2022-08-10 18:04:19.858] [api] [error] HTTP/1.1 "GET //services/word_mnist" <n/a> 404 0ms
deepdetect_1 | [2022-08-10 18:04:19.863] [word_mnist] [info] Using GPU 0
deepdetect_1 | [2022-08-10 18:04:19.863] [word_mnist] [error] service creation call failed: Dynamic exception type: CaffeErrorException
deepdetect_1 | std::exception::what: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
deepdetect_1 |
deepdetect_1 | [2022-08-10 18:04:19.863] [api] [error] HTTP/1.1 "PUT //services/word_mnist" <n/a> 500 0ms
UI:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/dd_widgets/widgets.py in fun_wrapper(*args, **kwargs)
45 self.output.clear_output()
46 with self.output:
---> 47 res = fun(*args, **kwargs)
48 try:
49 print(json.dumps(res, indent=2))
/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in run(self, *_)
138
139 def run(self, *_) -> JSONType:
--> 140 self._create()
141 return self.train(resume=False)
142
/opt/conda/lib/python3.8/site-packages/dd_widgets/core.py in _create(self, *_)
124 )
125 )
--> 126 raise RuntimeError(
127 "Error code {code}: {msg}".format(
128 code=c.json()["status"]["dd_code"],
RuntimeError: Error code 1007: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
When I try to publish the archived service:
Error while publishing service
InternalError: src/caffe/common.cpp:164 / Check failed (custom): (error) == (cudaSuccess)
Strange thing is, it appears as an Archived Job! Progress!
Edit: the deepdetect server /info response:
{
"dd_msg": null,
"status": null,
"head": {
"method": "/info",
"build-type": "dev",
"version": "v0.21.0-dirty",
"branch": "heads/v0.21.0",
"commit": "385122d4eace490ab95fa7a7b9ed92121af1414e",
"compile_flags": "USE_CAFFE2=OFF USE_TF=OFF USE_NCNN=OFF USE_TORCH=OFF USE_HDF5=ON USE_CAFFE=ON USE_TENSORRT=OFF USE_TENSORRT_OSS=OFF USE_DLIB=OFF USE_CUDA_CV=OFF USE_SIMSEARCH=ON USE_ANNOY=OFF USE_FAISS=ON USE_COMMAND_LINE=ON USE_JSON_API=ON USE_HTTP_SERVER=OFF USE_CUDA_CV=OFF",
"deps_version": "OPENCV_VERSION=4.2.0 CUDA_VERSION=11.1 CUDNN_VERSION=8.0.5 TENSORRT_VERSION=",
"services": []
},
"body": null
}
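The compile_flags field above is a space-separated list of KEY=ON/OFF pairs; parsing it (string abridged from the response above) shows this build only has the Caffe backend enabled, and deps_version reports CUDA 11.1 while the host driver above advertises CUDA 11.7:

```python
# compile_flags abridged from the /info response above; splitting on
# whitespace and "=" recovers which backends this build enables.
compile_flags = ("USE_CAFFE2=OFF USE_TF=OFF USE_NCNN=OFF USE_TORCH=OFF "
                 "USE_HDF5=ON USE_CAFFE=ON USE_TENSORRT=OFF USE_SIMSEARCH=ON "
                 "USE_FAISS=ON USE_COMMAND_LINE=ON USE_JSON_API=ON")

enabled = sorted(key for key, value in
                 (pair.split("=") for pair in compile_flags.split())
                 if value == "ON")
print("enabled backends/features:", enabled)
```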
Edit: when creating a service it says "No gpu found for deepdetect server.", yet a GPU is detected on the right side of the panel (alongside its temperature etc.).
from deepdetect.
Update: testing now with Ubuntu 20.04 LTS; I will report back with the results.
from deepdetect.