FER not working with GPU (issue in fer, open)

justinshenk commented on May 23, 2024
FER not working with GPU

from fer.

Comments (9)

kormoczi commented on May 23, 2024

Dear @Saran-nns and @JustinShenk,

My results were similar...
If there is any problem with the GPU initialization (similar to your example, @Saran-nns), then the system falls back to using the CPU, so everything works (or at least appears to).
But if the GPU initialization is OK, then errors occur afterwards.

So I think the question is still pending...
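Because the CPU fallback described above is silent, it can help to ask TensorFlow directly which devices it registered. This is a minimal sketch, assuming TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available); the helper name `visible_gpus` is made up for illustration and is not part of FER.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow registered.

    Returns None when TensorFlow is not importable, and an empty list when
    TensorFlow is installed but fell back to the CPU.
    """
    try:
        import tensorflow as tf  # imported lazily so the sketch runs without TF
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("No GPU registered: TensorFlow will run on the CPU")
else:
    print("GPUs registered:", gpus)
```

An empty list here corresponds to the "Skipping registering GPU devices..." message that TensorFlow prints when some CUDA libraries cannot be loaded.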

Best regards,
Csaba

kormoczi commented on May 23, 2024

Hi @Saran-nns,

Thanks for your suggestions, I will check them...
But I have two questions:

  1. You wrote this: "From your log, it is clear that the CUDA couldn't reach the cudnn .dll files."
    Which part of my log shows this?
  2. You suggested using cuda 11.x, cudnn 8.x, but as I stated in the beginning, I am already using cuda 11.0.3 / cudnn 8.0.5,
    so this should be OK... No?

Best regards

kormoczi commented on May 23, 2024

Hi @Saran-nns,

I have checked the project again, and to my great surprise, after I rebuilt the Docker image (without any modification), the example now works without any problem!
I can't tell yet what has changed, but most probably not the FER library and not CUDA/cuDNN...

Best regards

Saran-nns commented on May 23, 2024

Thanks for reporting the issue.

I suspect possible compatibility issues between the OS and the CUDA/cuDNN versions.

I ran example.py in the following environment:

OS: Windows 10
Python: 3.6
TF: 2.4
CUDA: 10.2 with 11.0 dll
cuDNN: 8.2

The script ran without issues, as seen below:

2021-08-28 14:07:48.781767: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
WARNING:tensorflow:From C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term
2021-08-28 14:08:42.383958: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-08-28 14:08:47.564227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1060 with Max-Q Design computeCapability: 6.1 coreClock: 1.3415GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2021-08-28 14:08:47.602994: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-08-28 14:08:47.743832: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublas64_11.dll'; dlerror: cublas64_11.dll not found
2021-08-28 14:08:47.801707: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublasLt64_11.dll'; dlerror: cublasLt64_11.dll not found
2021-08-28 14:08:48.108700: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-08-28 14:08:48.163044: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-08-28 14:08:48.172109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusolver64_11.dll'; dlerror: cusolver64_11.dll not found
2021-08-28 14:08:48.180278: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cusparse64_11.dll'; dlerror: cusparse64_11.dll not found
2021-08-28 14:08:48.189189: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2021-08-28 14:08:48.195744: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2021-08-28 14:08:48.347281: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-28 14:08:48.421196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-28 14:08:48.435452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]
2021-08-28 14:08:51.170786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1060 with Max-Q Design computeCapability: 6.1 coreClock: 1.3415GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2021-08-28 14:08:51.183624: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2021-08-28 14:08:53.758128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-28 14:08:53.765005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-08-28 14:08:53.769173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
WARNING:tensorflow:From C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\keras\layers\normalization.py:534: _colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
28-08-2021:14:08:54,419 WARNING [deprecation.py:336] From C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\keras\layers\normalization.py:534: _colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\keras\engine\training.py:2426: UserWarning: Model.state_updates will be removed in a future version. This property should not be used in TensorFlow 2.0, as updates are applied automatically.
  warnings.warn('Model.state_updates will be removed in a future version. '
[{'box': (83, 83, 200, 200), 'emotions': {'angry': 0.0, 'disgust': 0.0, 'fear': 0.0, 'happy': 0.97, 'sad': 0.0, 'surprise': 0.0, 'neutral': 0.03}}]
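A log like the one above is easier to read once the loader messages are tallied. The following is a rough sketch that scans log text for the two `dso_loader` message forms seen in this thread; the function name `summarize_loader_messages` is made up for illustration, and the sample lines are copied from the log above.

```python
import re

# Three loader lines copied from the log quoted in this thread.
SAMPLE_LOG = """\
2021-08-28 14:07:48.781767: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-08-28 14:08:47.743832: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cublas64_11.dll'; dlerror: cublas64_11.dll not found
2021-08-28 14:08:48.189189: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
"""

LOADED = re.compile(r"Successfully opened dynamic library (\S+)")
MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def summarize_loader_messages(log_text):
    """Return (loaded, missing) library names found in the log text."""
    loaded = sorted(set(LOADED.findall(log_text)))
    missing = sorted(set(MISSING.findall(log_text)))
    return loaded, missing


loaded, missing = summarize_loader_messages(SAMPLE_LOG)
print("loaded: ", loaded)
print("missing:", missing)
```

Any name in the `missing` list means TensorFlow skipped GPU registration, so a result printed at the end of such a run was computed on the CPU.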

May I know:

  1. Which OS are you using?
  2. Do you have multiple versions of CUDA installed?

Please try upgrading to cudnn==8.2 and let us know if the error persists.

JustinShenk commented on May 23, 2024

Saran-nns commented on May 23, 2024

Hi @kormoczi. Thanks for the update.

I updated CUDA and cuDNN and found that example.py ran successfully on the GPU.

Logs:

(tfgpu) N:\fer>python example.py
2021-09-02 17:16:47.723348: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
WARNING:tensorflow:From C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\compat\v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term
2021-09-02 17:18:06.955330: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-09-02 17:18:13.564614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1060 with Max-Q Design computeCapability: 6.1 coreClock: 1.3415GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2021-09-02 17:18:13.576812: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-09-02 17:18:16.750779: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-09-02 17:18:16.756477: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-09-02 17:18:17.022092: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-09-02 17:18:17.790750: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-09-02 17:18:18.722597: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-09-02 17:18:19.191502: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-09-02 17:18:21.192585: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-09-02 17:18:22.231133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-02 17:18:23.184233: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-02 17:18:23.578463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1060 with Max-Q Design computeCapability: 6.1 coreClock: 1.3415GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2021-09-02 17:18:23.592416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-02 17:18:47.042383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-02 17:18:47.071966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-09-02 17:18:47.076344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-09-02 17:18:47.330953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4484 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
2021-09-02 17:18:49.850172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce GTX 1060 with Max-Q Design computeCapability: 6.1 coreClock: 1.3415GHz coreCount: 10 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 178.99GiB/s
2021-09-02 17:18:49.861020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-09-02 17:18:49.865012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-02 17:18:49.870792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-09-02 17:18:49.873996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-09-02 17:18:49.877882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4484 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\keras\layers\normalization.py:534: _colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
02-09-2021:17:18:50,945 WARNING [deprecation.py:336] From C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\keras\layers\normalization.py:534: _colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
C:\Users\saran\Anaconda3\envs\tfgpu\lib\site-packages\tensorflow\python\keras\engine\training.py:2426: UserWarning: Model.state_updates will be removed in a future version. This property should not be used in TensorFlow 2.0, as updates are applied automatically.
  warnings.warn('Model.state_updates will be removed in a future version. '
2021-09-02 17:18:56.204265: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-09-02 17:19:15.136577: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8202
2021-09-02 17:19:48.728937: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-09-02 17:19:57.370451: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-09-02 17:20:06.030969: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "GeForce GTX 1060 with Max-Q Design" frequency: 1341 num_cores: 10 environment { key: "architecture" value: "6.1" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 1572864 shared_memory_size_per_multiprocessor: 98304 memory_size: 4702352179 bandwidth: 192192000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
[{'box': (83, 83, 200, 200), 'emotions': {'angry': 0.0, 'disgust': 0.0, 'fear': 0.0, 'happy': 0.97, 'sad': 0.0, 'surprise': 0.0, 'neutral': 0.03}}]
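The log above reports versions as integer codes ("Loaded cuDNN version 8202", "cuda" value "11020", "cudnn" value "8100"). As a small sketch, they can be decoded with the encodings NVIDIA uses in its headers: `major*1000 + minor*100 + patch` for cuDNN up to 8.x, and `major*1000 + minor*10` for CUDA. The function names are made up for illustration.

```python
def decode_cudnn_version(code):
    """Decode a cuDNN version code (cuDNN 8.x and earlier encoding)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def decode_cuda_version(code):
    """Decode a CUDA_VERSION-style code (major*1000 + minor*10)."""
    major, rest = divmod(code, 1000)
    return f"{major}.{rest // 10}"


print(decode_cudnn_version(8202))  # 8.2.2
print(decode_cudnn_version(8100))  # 8.1.0
print(decode_cuda_version(11020))  # 11.2
```

Read this way, the run loaded cuDNN 8.2.2 at runtime, while the cost-estimator entry mentions CUDA 11.2 and cuDNN 8.1.0.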

From your log, it is clear that the CUDA couldn't reach the cudnn .dll files.

Please make sure that:

  1. You have the right version of cudnn. I suggest cuda 11.x, cudnn 8.x
  2. You have added cudnn to your system path
  3. You have copied the dll files from the CUDNN bin folder to the CUDA bin folder, as described in the user guide
  4. You have removed any old versions of CUDA and cudnn from your system paths to avoid path conflicts
  5. You have restarted your system
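Steps 2 to 4 above can be partly checked from Python. This is only a sketch, assuming the DLLs are resolved through the directories listed in `PATH`; the helper `find_on_path` and the list `REQUIRED_DLLS` are made up for illustration, with the file names taken from the "Could not load dynamic library" messages in this thread.

```python
import os
from pathlib import Path

# DLL names reported as missing in the log quoted in this thread.
REQUIRED_DLLS = [
    "cublas64_11.dll",
    "cublasLt64_11.dll",
    "cusolver64_11.dll",
    "cusparse64_11.dll",
    "cudnn64_8.dll",
]


def find_on_path(filenames, path_value=None):
    """Map each file name to the PATH directories that contain it."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    directories = [Path(p) for p in path_value.split(os.pathsep) if p]
    found = {name: [] for name in filenames}
    for directory in directories:
        for name in filenames:
            if (directory / name).is_file():
                found[name].append(str(directory))
    return found


for name, locations in find_on_path(REQUIRED_DLLS).items():
    if not locations:
        print(f"{name}: not found on PATH")
    elif len(locations) > 1:
        print(f"{name}: found in several places, possible conflict: {locations}")
    else:
        print(f"{name}: {locations[0]}")
```

A name that is not found points at steps 2 and 3, while a name found in several directories points at step 4.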

Hope this helps

Saran-nns commented on May 23, 2024

@kormoczi Also, 5. Restart your system

Saran-nns commented on May 23, 2024

@kormoczi
Great that it works through Docker.
cuda 11.x is not packaged with cublas. cudnn provides this functionality for ML frameworks like tf, pytorch or keras to generate (initialize) any cublas handles. Even if you have the right versions installed, the error can still be thrown if their (cuda and cudnn) paths (including those in the Python environment) are not well defined.

Saran-nns commented on May 23, 2024

Thanks for the issue again and hope you enjoy ferr'ing :)
