
Comments (17)

mirh commented on May 30, 2024

Mhhh... it seems to be something on the Eigen side of things.
https://bitbucket.org/eigen/eigen/src/c2947c341c686c88e966836dcabfd26f0b77bd5b/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceSycl.h?at=default&fileviewer=file-view-default#TensorDeviceSycl.h-102
https://bitbucket.org/eigen/eigen/commits/379751fdec12ed6dbc3c0894b9ea187c98f48bb4
https://bitbucket.org/eigen/eigen/commits/e014909
And... yeah, you seem to be right.

Though it was a colleague of yours who added that code, no TF magic.
I'd close this then - though I'm not sure whether you were testing against fglrx (AMD's old driver), AMDGPU-PRO (the newer one), or ROCm (not technically a driver, but tl;dr it's another codebase).

mirh commented on May 30, 2024

Yeah sure, but I'm totally missing where this code would live in TF. As far as I can tell, it gets downloaded by bazel once you start building, to bazel-tensorflow-dev-amd_gpu/external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceSycl.h

EDIT: building is going, see ya tomorrow

DuncanMcBain commented on May 30, 2024

The Intel GPU is generally fine but seemed to have some threading-related issues with TensorFlow. I believe all the SDK stuff works fine, though.

@lissyx Ah, you're using Beignet - we did look into that, but I seem to remember hearing that Intel wouldn't be making any more major releases of Beignet. I think that might also be the reason why you are seeing this "Module with no kernel" error - their SPIR support is not quite up to par, and they cannot reliably read the SPIR we produce (the closed-source driver from Intel at least should be able to parse it).

DuncanMcBain commented on May 30, 2024

If I remember correctly, the SYCL code in Tensorflow does some filtering of devices, as that error message seems to suggest. It's possible that it is being discarded at that stage, though I am not sure of the criteria required to be a supported device.

That said, I remember that AMD's CPU implementation was... failing lots of our tests internally, and was not receiving updates. It's possible that, given this history of failures, the device is not considered suitable for use with Eigen and TensorFlow. ComputeCpp will likely still tell you it exists, but maybe TF is removing it for that reason?
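
(As a sanity check, something like the following standalone listing - just the standard SYCL 1.2.1 platform/device queries, nothing TensorFlow-specific - should show every device ComputeCpp itself can see, which you can then compare against what TF ends up keeping:)

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // Walk every OpenCL platform the SYCL runtime exposes and print its devices.
  for (const auto& platform : cl::sycl::platform::get_platforms()) {
    auto platform_name = platform.get_info<cl::sycl::info::platform::name>();
    for (const auto& device : platform.get_devices()) {
      std::cout << "platform: " << platform_name
                << ", device: " << device.get_info<cl::sycl::info::device::name>()
                << ", type: "
                << (device.is_gpu() ? "GPU" : device.is_cpu() ? "CPU" : "other")
                << std::endl;
    }
  }
  return 0;
}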

DuncanMcBain commented on May 30, 2024

Aha, you're correct, that's the code. I will suggest to my colleagues that we consider removing this, because I think it's starting to cause more problems than it solves. Maybe it could emit a warning when using untrusted devices? At least a warning that devices are being discarded would be nice. @Rbiessy, would you consider something like this?

DuncanMcBain commented on May 30, 2024

Actually, it would be cool if you tried removing the line with "if(!unsuported_condition)" so that all devices are added to the list - that way you could try out your implementation. Would you mind trying that? We're always interested in seeing what our platform support is like.
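
(For context, the filter in Eigen's TensorDeviceSycl.h looks roughly like the sketch below. This is a paraphrase from the linked revision, so the names and the exact contents of the condition may be slightly off; the point is that dropping the unsuported_condition check makes every device ComputeCpp reports land in the list:)

// Rough paraphrase of Eigen's get_sycl_supported_devices() from
// unsupported/Eigen/CXX11/src/Tensor/TensorDeviceSycl.h - not the exact code.
#include <CL/sycl.hpp>
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

static std::vector<cl::sycl::device> get_sycl_supported_devices() {
  std::vector<cl::sycl::device> supported_devices;
  for (const auto& platform : cl::sycl::platform::get_platforms()) {
    auto platform_name = platform.get_info<cl::sycl::info::platform::name>();
    std::transform(platform_name.begin(), platform_name.end(),
                   platform_name.begin(), ::tolower);
    for (const auto& device : platform.get_devices()) {
      // Devices on known-problematic platforms (e.g. the AMD OpenCL CPU
      // device) are filtered out here; removing the check below adds
      // every device to the list.
      bool unsuported_condition =
          device.is_cpu() && platform_name.find("amd") != std::string::npos;
      if (!unsuported_condition) {
        // This appears to be where the "Platform name ..." line seen in the
        // logs further down this thread comes from.
        std::cout << "Platform name " << platform_name << std::endl;
        supported_devices.push_back(device);
      }
    }
  }
  return supported_devices;
}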

mirh commented on May 30, 2024

Ok so... for the life of me, I couldn't get *any* edit of the Eigen source code to actually make its way into the compiled tensorflow.
I even cleaned up all the various bazel caches and whatnot, but no luck.

OTOH, I eventually resolved to hex-edit libamdocl64.so, changing the reported platform name from AMD Accelerated Parallel Processing to 'APD whatever'.
That worked and the device showed up.
Unfortunately, after some time it segfaulted (with a different stacktrace than #77, but still).
Which leads me to ask how you managed to "confirm support" for fglrx, with all these nuisances.

traps: python[21994] general protection ip:7f3f871a6030 sp:7f3f61b7fe78 error:0
Stack trace of thread 21994:
#0  0x00007f3f871a6030 n/a (n/a)
#1  0x00007f3f44711edd n/a (libamdocl64.so)
#2  0x00007f3f44712521 n/a (libamdocl64.so)
#3  0x00007f3f44712bcd n/a (libamdocl64.so)
#4  0x00007f3f4469e15f n/a (libamdocl64.so)
#5  0x00007f3f4470e15c n/a (libamdocl64.so)
#6  0x00007f3f8b04908a start_thread (libpthread.so.0)
#7  0x00007f3f8ad8042f __clone (libc.so.6)
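
(For anyone wanting to reproduce the hex-edit check: a minimal sketch using the plain OpenCL host API - nothing ComputeCpp-specific, link with -lOpenCL - that prints the platform name string the driver actually reports, so you can confirm the edit took:)

#include <CL/cl.h>
#include <iostream>
#include <vector>

int main() {
  // Ask the ICD loader how many OpenCL platforms are installed.
  cl_uint num_platforms = 0;
  clGetPlatformIDs(0, nullptr, &num_platforms);
  std::vector<cl_platform_id> platforms(num_platforms);
  if (num_platforms > 0) {
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);
  }
  // Print the CL_PLATFORM_NAME string each platform reports.
  for (cl_uint i = 0; i < num_platforms; ++i) {
    char name[256] = {0};
    clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, nullptr);
    std::cout << "platform " << i << ": " << name << std::endl;
  }
  return 0;
}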

DuncanMcBain commented on May 30, 2024

Hi, I should have mentioned that editing the Eigen code used in TensorFlow is a bit of a pain. I can explain if you'd like, but the short story is that Bazel downloads a tarball of Eigen from online in the name of producing reproducible builds.

I've checked with our computecpp_info tool; it appears that we support two AMD device families across two driver versions. Yes, we have had some pain points when using this older driver. I'm afraid I can't really offer any assistance for the CPU device AMD provides with its driver - it's in the category of "should work", but we don't test it internally. Intel's CPU device is a target we support, though.

mirh commented on May 30, 2024

but the short story is that Bazel downloads a tarball of Eigen from online in the name of producing reproducible builds.

Yes, I saw - but indeed I was editing the file after it had been decompressed into the aforementioned folder.

I'm afraid that I can't really offer any assistance for the CPU device AMD provides with its driver

Sure - though a warning that it's skipped, and/or some documented way to override that (should one want to), would be welcome.

DuncanMcBain commented on May 30, 2024

OK, I will pass that on to our team here, since that'd make what's going on clearer. Thanks!

(The long story is that after untarring, bazel then copies the files to a totally different location on your hard drive, in the bazel cache. If you edit the headers in there, then your changes will be visible!)

lissyx commented on May 30, 2024

Thanks @DuncanMcBain and @mirh, I was struggling to get TensorFlow built against ComputeCpp to see my laptop's GPU device and perform computation on it, always ending in an error like this:

alex@portable-alex:~/tmp/deepspeech/sycl$ ./deepspeech ../models/output_graph.pb ../audio/2830-3980-0043.wav ../models/alphabet.txt 
2017-12-21 00:49:28.967609: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2017-12-21 00:49:29.047875: W ./tensorflow/core/common_runtime/sycl/sycl_device.h:49] No OpenCL GPU found that is supported by ComputeCpp, trying OpenCL CPU
2017-12-21 00:49:29.047906: F ./tensorflow/core/common_runtime/sycl/sycl_device.h:63] No OpenCL GPU nor CPU found that is supported by ComputeCpp
Abandon (core dumped)
alex@portable-alex:~/tmp/deepspeech/sycl$ 

I've taken the suggested approach of disabling the platform filtering in Eigen, and now it's able to pick up the GPU :):

alex@portable-alex:~/tmp/deepspeech/sycl_eigen_hack$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../audio/2830-3980-0043.wav ../models/alphabet.txt
2017-12-21 01:31:01.407356: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Platform name intel gen ocl driver
2017-12-21 01:31:01.476690: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:66] Found following OpenCL devices:
2017-12-21 01:31:01.476719: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:68] id: 0, type: GPU, name: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2, vendor: Intel, profile: FULL_PROFILE
One module without kernel function!
terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0101] Failed to create kernel ((Kernel Name: SYCL_cac2b3592d2272412db5415963f17f08_0))
Abandon (core dumped)
alex@portable-alex:~/tmp/deepspeech/sycl_eigen_hack$ 

It does fail, but I guess that's now unrelated (and I see something similar when trying to run on an NVIDIA GTX 1080) :).

DuncanMcBain commented on May 30, 2024

Hi @lissyx, that's a different error: the runtime is saying that the kernel does not exist in your program. The default build options should target the Intel GPU without issue. It is true, however, that for the NVIDIA device you would need to make some changes (see here). I've not seen the DeepSpeech model before - is it possible that it uses TensorFlow nodes not yet implemented in SYCL?

Rbiessy commented on May 30, 2024

Even if a node is not implemented in SYCL, it will run using the default implementation on CPU so that shouldn't be an issue.

lissyx commented on May 30, 2024

@DuncanMcBain Sure, but I don't want to hijack the thread any further with unrelated issues - I just wanted to highlight that it also makes the Intel GPU visible when using the Beignet driver. I had been trying to find that information for days, without any luck :)

mirh commented on May 30, 2024

(The long story is that after untarring, bazel then copies the files to a totally different location on your hard drive, in the bazel cache. If you edit the headers in there, then your changes will be visible!)

(yes, that's what I had noticed after about the first try and a half - yet even after checking that the cache had the patched headers too, it didn't change anything)

Anyway, these are the ctest results:

The following tests FAILED:
   4 - example-sycl-application (SEGFAULT)
  11 - opencl-c-interop (Child aborted)
  12 - parallel-for (Failed)
  18 - simple-vector-add (Failed)
  20 - template-function-object (Failed)
  21 - using-function-objects (Failed)
  22 - vptr (Failed)
Errors while running CTest

I can totally see your point in blacklisting it (not so sure about the Intel GPU though; perhaps that's only a "light" bug)

mirh commented on May 30, 2024

I see you have relaxed the rules quite a bit now.
Still, it would be helpful if the user were notified of the blacklisting.

DuncanMcBain commented on May 30, 2024

I think that, in order not to lose track of issues like this, you'd be better off creating an issue for Eigen itself with the requested change. Maybe they could implement a documentation change to say that the AMD CPU implementation is unsupported (though honestly, at this stage, Codeplay documents that this is a known-broken device that cannot run ComputeCpp).
