
Comments (17)

mirh commented on May 30, 2024

Mhhh... it seems to be something on the Eigen side of things.
https://bitbucket.org/eigen/eigen/src/c2947c341c686c88e966836dcabfd26f0b77bd5b/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceSycl.h?at=default&fileviewer=file-view-default#TensorDeviceSycl.h-102
https://bitbucket.org/eigen/eigen/commits/379751fdec12ed6dbc3c0894b9ea187c98f48bb4
https://bitbucket.org/eigen/eigen/commits/e014909
And... yeah, you seem to be right.

Though it was a colleague of yours who added that code, no TF magic.
I'd close this then - though I'm not sure whether you were testing against fglrx (AMD's old driver), AMDGPU-PRO (the newer one), or ROCm (not technically a driver, but tl;dr it's another codebase).

mirh commented on May 30, 2024

Yeah sure, but I'm totally missing where this code would live in TF. As far as I can tell, it gets downloaded by bazel once you start building, to bazel-tensorflow-dev-amd_gpu/external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceSycl.h

EDIT: building is going, see ya tomorrow

DuncanMcBain commented on May 30, 2024

The Intel GPU is generally fine but seemed to have some threading-related issues with TensorFlow. I believe all the SDK stuff works fine, though.

@lissyx Ah, you're using Beignet - we did look into that, but I seem to remember hearing that Intel wouldn't be making any more major releases of Beignet. I think that might also be the reason why you are seeing this "Module with no kernel" error - their SPIR support is not quite up to par, and they cannot reliably read the SPIR we produce (the closed-source driver from Intel at least should be able to parse it).

DuncanMcBain commented on May 30, 2024

If I remember correctly, the SYCL code in Tensorflow does some filtering of devices, as that error message seems to suggest. It's possible that it is being discarded at that stage, though I am not sure of the criteria required to be a supported device.

That said, I remember that AMD's CPU implementation was... failing lots of our tests internally, and was not receiving updates. It's possible that, given this history of failures, the device is not considered suitable for use with Eigen and TensorFlow. ComputeCpp will likely still tell you it exists, but maybe TF is removing it for that reason?
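
(As a sanity check, something like the following standalone listing - just the standard SYCL 1.2.1 platform/device queries, nothing TensorFlow-specific - should show every device ComputeCpp itself can see, which you can then compare against what TF ends up keeping:)

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // Walk every OpenCL platform the SYCL runtime exposes and print its devices.
  for (const auto& platform : cl::sycl::platform::get_platforms()) {
    auto platform_name = platform.get_info<cl::sycl::info::platform::name>();
    for (const auto& device : platform.get_devices()) {
      std::cout << "platform: " << platform_name
                << ", device: " << device.get_info<cl::sycl::info::device::name>()
                << ", type: "
                << (device.is_gpu() ? "GPU" : device.is_cpu() ? "CPU" : "other")
                << std::endl;
    }
  }
  return 0;
}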

DuncanMcBain commented on May 30, 2024

Aha, you're correct, that's the code. I will suggest to my colleagues that we consider removing this, because I think it's starting to cause more problems than it solves. Maybe it could emit a warning when using untrusted devices? At least a warning that devices are being discarded would be nice. @Rbiessy, would you consider something like this?

DuncanMcBain commented on May 30, 2024

Actually, it would be cool if you tried removing the line with "if(!unsuported_condition)" so that all devices are added to the list - that way you could try out your implementation. Would you mind trying that? We're always interested in seeing what our platform support is like.
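
(For context, the filter in Eigen's TensorDeviceSycl.h looks roughly like the sketch below. This is a paraphrase from the linked revision, so the names and the exact contents of the condition may be slightly off; the point is that dropping the unsuported_condition check makes every device ComputeCpp reports land in the list:)

// Rough paraphrase of Eigen's get_sycl_supported_devices() from
// unsupported/Eigen/CXX11/src/Tensor/TensorDeviceSycl.h - not the exact code.
#include <CL/sycl.hpp>
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

static std::vector<cl::sycl::device> get_sycl_supported_devices() {
  std::vector<cl::sycl::device> supported_devices;
  for (const auto& platform : cl::sycl::platform::get_platforms()) {
    auto platform_name = platform.get_info<cl::sycl::info::platform::name>();
    std::transform(platform_name.begin(), platform_name.end(),
                   platform_name.begin(), ::tolower);
    for (const auto& device : platform.get_devices()) {
      // Devices on known-problematic platforms (e.g. the AMD OpenCL CPU
      // device) are filtered out here; removing the check below adds
      // every device to the list.
      bool unsuported_condition =
          device.is_cpu() && platform_name.find("amd") != std::string::npos;
      if (!unsuported_condition) {
        // This appears to be where the "Platform name ..." line seen in the
        // logs further down this thread comes from.
        std::cout << "Platform name " << platform_name << std::endl;
        supported_devices.push_back(device);
      }
    }
  }
  return supported_devices;
}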

mirh commented on May 30, 2024

Ok so... for the life of me, I couldn't get *any* edit of the Eigen source code to actually make its way into the compiled tensorflow.
I even cleaned up all the various bazel caches and whatnot, but no luck.

OTOH, I eventually resolved to hex-edit libamdocl64.so, changing the reported platform name from AMD Accelerated Parallel Processing to 'APD whatever'.
That worked and the device showed up.
Unfortunately, after some time it segfaulted (with a different stacktrace than #77, but still).
Which leads me to ask how you managed to "confirm support" for fglrx, with all these nuisances.

traps: python[21994] general protection ip:7f3f871a6030 sp:7f3f61b7fe78 error:0
Stack trace of thread 21994:
#0  0x00007f3f871a6030 n/a (n/a)
#1  0x00007f3f44711edd n/a (libamdocl64.so)
#2  0x00007f3f44712521 n/a (libamdocl64.so)
#3  0x00007f3f44712bcd n/a (libamdocl64.so)
#4  0x00007f3f4469e15f n/a (libamdocl64.so)
#5  0x00007f3f4470e15c n/a (libamdocl64.so)
#6  0x00007f3f8b04908a start_thread (libpthread.so.0)
#7  0x00007f3f8ad8042f __clone (libc.so.6)
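
(For anyone wanting to reproduce the hex-edit check: a minimal sketch using the plain OpenCL host API - nothing ComputeCpp-specific, link with -lOpenCL - that prints the platform name string the driver actually reports, so you can confirm the edit took:)

#include <CL/cl.h>
#include <iostream>
#include <vector>

int main() {
  // Ask the ICD loader how many OpenCL platforms are installed.
  cl_uint num_platforms = 0;
  clGetPlatformIDs(0, nullptr, &num_platforms);
  std::vector<cl_platform_id> platforms(num_platforms);
  if (num_platforms > 0) {
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);
  }
  // Print the CL_PLATFORM_NAME string each platform reports.
  for (cl_uint i = 0; i < num_platforms; ++i) {
    char name[256] = {0};
    clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, nullptr);
    std::cout << "platform " << i << ": " << name << std::endl;
  }
  return 0;
}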

DuncanMcBain commented on May 30, 2024

Hi, I should have mentioned that editing the Eigen code used in TensorFlow is a bit of a pain. I can explain if you'd like, but the short story is that Bazel downloads a tarball of Eigen from online in the name of producing reproducible builds.

I've checked with our computecpp_info tool; it appears that we support two AMD device families across two driver versions. Yes, we have had some pain points when using this older driver. I'm afraid I can't really offer any assistance for the CPU device AMD provides with its driver - it's in the category of "should work", but we don't test it internally. Intel's CPU device is a target we support, though.

mirh commented on May 30, 2024

but the short story is that Bazel downloads a tarball of Eigen from online in the name of producing reproducible builds.

Yes, I saw - but indeed I was editing the file after it had been decompressed into the aforementioned folder.

I'm afraid that I can't really offer any assistance for the CPU device AMD provides with its driver

Sure - though a warning that it's skipped, and/or some documented way to override that (should one want to), would be welcome.

DuncanMcBain commented on May 30, 2024

OK, I will pass that on to our team here, since that'd make what's going on clearer. Thanks!

(The long story is that after untarring, bazel then copies the files to a totally different location on your hard drive, in the bazel cache. If you edit the headers in there, then your changes will be visible!)

lissyx commented on May 30, 2024

Thanks @DuncanMcBain and @mirh, I was struggling to get TensorFlow built against ComputeCpp to see my laptop's GPU device and perform computation on it, always ending in an error like this:

alex@portable-alex:~/tmp/deepspeech/sycl$ ./deepspeech ../models/output_graph.pb ../audio/2830-3980-0043.wav ../models/alphabet.txt 
2017-12-21 00:49:28.967609: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2017-12-21 00:49:29.047875: W ./tensorflow/core/common_runtime/sycl/sycl_device.h:49] No OpenCL GPU found that is supported by ComputeCpp, trying OpenCL CPU
2017-12-21 00:49:29.047906: F ./tensorflow/core/common_runtime/sycl/sycl_device.h:63] No OpenCL GPU nor CPU found that is supported by ComputeCpp
Abandon (core dumped)
alex@portable-alex:~/tmp/deepspeech/sycl$ 

I've taken the suggested approach of disabling the platform filtering in Eigen, and now it's able to pick up the GPU :):

alex@portable-alex:~/tmp/deepspeech/sycl_eigen_hack$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../audio/2830-3980-0043.wav ../models/alphabet.txt
2017-12-21 01:31:01.407356: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Platform name intel gen ocl driver
2017-12-21 01:31:01.476690: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:66] Found following OpenCL devices:
2017-12-21 01:31:01.476719: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:68] id: 0, type: GPU, name: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2, vendor: Intel, profile: FULL_PROFILE
One module without kernel function!
terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0101] Failed to create kernel ((Kernel Name: SYCL_cac2b3592d2272412db5415963f17f08_0))
Abandon (core dumped)
alex@portable-alex:~/tmp/deepspeech/sycl_eigen_hack$ 

It does fail, but I guess that's now unrelated (and I see something similar when trying to run on an NVIDIA GTX 1080) :).

DuncanMcBain commented on May 30, 2024

Hi @lissyx, that's a different error: the runtime is saying that the kernel does not exist in your program. The default build options should target the Intel GPU without issue. It is true, however, that for the NVIDIA device you would need to make some changes (see here). I've not seen the DeepSpeech model before - is it possible that it uses TensorFlow nodes not yet implemented in SYCL?

Rbiessy commented on May 30, 2024

Even if a node is not implemented in SYCL, it will run using the default implementation on CPU so that shouldn't be an issue.

lissyx commented on May 30, 2024

@DuncanMcBain Sure, but I don't want to hijack the thread any further with unrelated issues - I just wanted to highlight that it also makes the Intel GPU visible when using the Beignet driver. I had been trying to find that information for days, without any luck :)

mirh commented on May 30, 2024

(The long story is that after untarring, bazel then copies the files to a totally different location on your hard drive, in the bazel cache. If you edit the headers in there, then your changes will be visible!)

(yes, that's what I had noticed after about the first try and a half - yet even after checking that the cache had the patched headers too, it didn't change anything)

Anyway, these are the ctest results:

The following tests FAILED:
   4 - example-sycl-application (SEGFAULT)
  11 - opencl-c-interop (Child aborted)
  12 - parallel-for (Failed)
  18 - simple-vector-add (Failed)
  20 - template-function-object (Failed)
  21 - using-function-objects (Failed)
  22 - vptr (Failed)
Errors while running CTest

I can totally see your point in blacklisting it (not so sure about the Intel GPU though; perhaps that's only a "light" bug)

mirh commented on May 30, 2024

I see you have relaxed the rules quite a bit now.
Still, it would be helpful if the user were notified of the blacklisting.

DuncanMcBain commented on May 30, 2024

I think that, in order not to lose track of issues like this, you'd be better off creating an issue for Eigen itself with the requested change. Maybe they could implement a documentation change to say that the AMD CPU implementation is unsupported (though honestly, at this stage, Codeplay documents that this is a known-broken device that cannot run ComputeCpp).
