
Comments (10)

stu1130 commented on August 17, 2024

Feel free to reopen the issue if you have any other questions.

from djl.

stu1130 commented on August 17, 2024

Since this is an MXNet issue, I will close it here and update our MXNet artifact once it is fixed.


lanking520 commented on August 17, 2024

Thanks for reporting the issue! Could you please try the following:

Add the snapshot repository "https://oss.sonatype.org/content/repositories/snapshots/" to your Maven repositories.

Then use version 0.6.0-SNAPSHOT.

We recently fixed a related issue. The crash might happen because the NDArray is on the CPU and not on the GPU, so the GPU cannot find the pointer.


danhlephuoc commented on August 17, 2024

I switched to version 0.6.0-SNAPSHOT and got the same error. By the way, I also tested on my Mac Pro, which has no GPU: the YOLO models with Darknet53 and COCO work fine on the CPU.


frankfliu commented on August 17, 2024

@danhlephuoc Would it be possible for you to share your repo?
Are you using multiple GPUs for the training? Can you try limiting it to a single GPU?

The commit lanking520 mentioned is here:
0ca79f4

It only adds error checking to identify where the device mismatch happens. You still need to fix the NDArray creation.


danhlephuoc commented on August 17, 2024

@frankfliu: I've forked the repo and changed the code to reproduce the error at https://github.com/danhlephuoc/djl.git. Just clone it and run the following command:

mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"

I already limited it to 1 GPU via CUDA_VISIBLE_DEVICES, but the error persists.


stu1130 commented on August 17, 2024

I can reproduce the issue with a single GPU.
After running export MXNET_ENGINE_TYPE=NaiveEngine, I saw:

Exception in thread "main" ai.djl.engine.EngineException: MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
Stack trace:
  File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81

        at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1788)
        at ai.djl.mxnet.jna.JnaUtils.cachedOpInvoke(JnaUtils.java:1757)
        at ai.djl.mxnet.engine.CachedOp.forward(CachedOp.java:133)
        at ai.djl.mxnet.engine.MxSymbolBlock.forward(MxSymbolBlock.java:145)
        at ai.djl.nn.Block.forward(Block.java:116)
        at ai.djl.inference.Predictor.predict(Predictor.java:117)
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:157)
        at ai.djl.inference.Predictor.predict(Predictor.java:112)
        at ai.djl.examples.inference.ObjectDetection.predict(ObjectDetection.java:68)
        at ai.djl.examples.inference.ObjectDetection.main(ObjectDetection.java:47)
[18:09:57] src/resource.cc:279: Ignore CUDA Error [18:09:57] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

So it could be a problem with our symbolic model.
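The reproduction setup discussed in this thread (one visible GPU, MXNet's synchronous NaiveEngine, then the Maven command) can be sketched as a small launcher. This is only an illustration: the class name `ReproLauncher`, the `--run` flag, and the device index "0" are assumptions, and exporting the two variables in a shell before running `mvn` does the same thing.

```java
import java.util.Map;

class ReproLauncher {

    /** Builds (but does not start) the Maven command with the debugging environment. */
    static ProcessBuilder buildRepro() {
        ProcessBuilder pb = new ProcessBuilder(
                "mvn", "exec:java",
                "-Dexec.mainClass=ai.djl.examples.inference.ObjectDetection");
        Map<String, String> env = pb.environment();
        // Assumption: device index 0 is the single GPU to expose; adjust for your machine.
        env.put("CUDA_VISIBLE_DEVICES", "0");
        // NaiveEngine executes operators synchronously, which makes the failing call
        // easier to locate in the stack trace.
        env.put("MXNET_ENGINE_TYPE", "NaiveEngine");
        pb.inheritIO(); // forward Maven's output to this console
        return pb;
    }

    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = buildRepro();
        System.out.println("Command: " + String.join(" ", pb.command()));
        // Hypothetical flag: only launch Maven when explicitly asked to.
        if (args.length > 0 && args[0].equals("--run")) {
            System.exit(pb.start().waitFor());
        }
    }
}
```

Run it with `--run` from the directory that contains the example's pom.xml; without the flag it only prints the command.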


stu1130 commented on August 17, 2024

I tried the same model with the same libmxnet through MXNet Python and it worked fine. The next step is to dig deeper into our CachedOp.


stu1130 commented on August 17, 2024

@danhlephuoc Hi, after digging deeper, I found that the root cause is that the image is too large, which causes a GPU OOM. I reduced the size of the input image and it worked perfectly. I also tried MXNet Python with the same image size, and it failed as well. The PR 4629a6c fixes it.
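As a workaround along these lines, the input can be shrunk before it reaches the model. Below is a minimal sketch using only java.awt, not DJL APIs. The helper name `downscale` and the 512-pixel limit are hypothetical, since the thread does not say which size worked.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

class ImageDownscaler {

    /**
     * Returns an image whose longer side is at most maxSide pixels, keeping the
     * aspect ratio. An image that already fits is returned as the same instance.
     */
    static BufferedImage downscale(BufferedImage src, int maxSide) {
        int w = src.getWidth();
        int h = src.getHeight();
        int longer = Math.max(w, h);
        if (longer <= maxSide) {
            return src;
        }
        double scale = (double) maxSide / longer;
        int newW = Math.max(1, (int) Math.round(w * scale));
        int newH = Math.max(1, (int) Math.round(h * scale));

        BufferedImage dst = new BufferedImage(newW, newH, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bilinear interpolation gives smoother results than the default.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, newW, newH, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // Assumption: 512 is an arbitrary limit chosen for illustration.
        BufferedImage big = new BufferedImage(2000, 1000, BufferedImage.TYPE_INT_RGB);
        BufferedImage small = downscale(big, 512);
        System.out.println(small.getWidth() + " x " + small.getHeight()); // prints "512 x 256"
    }
}
```

The resized image would then go through whatever image loading path the example already uses. Any detected bounding boxes would be relative to the smaller image.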


stu1130 commented on August 17, 2024

@danhlephuoc I think I found the problem. The original GluonCV model works with images as large as 1000 * 1000. To run it in DJL, we have to hybridize the model. The hybridized model has failed to execute on the GPU since we upgraded MXNet from 1.6 to 1.7. I created a minimal reproducible script in apache/mxnet#18834. I will keep you posted.

