Comments (7)

slobentanzer commented on May 25, 2024

Ah, that clears it up. I am using a range of quantisations for benchmarking purposes. I can offer a PR to document this better, if you'd like. It would be nice to have this information in the docs, and maybe even expose it programmatically. I have not been involved with this project for long, but I intend to invest more time now, and could give feedback on usability on Apple Silicon (I have the largest M3 machine).

aresnow1 commented on May 25, 2024

If you launch a model in gguf format, it will use Metal automatically without specifying n_gpu.
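
For context, a launch through the Python client would look roughly like this (a minimal sketch; the model name, size, and quantization are placeholders, and `RESTfulClient` is taken from the Xinference docs of that period):

```python
from xinference.client import RESTfulClient

# Connect to a locally running Xinference endpoint (default port 9997).
client = RESTfulClient("http://127.0.0.1:9997")

# Launch a gguf-format model. No n_gpu argument is passed; on Apple
# Silicon the llama.cpp backend is expected to enable Metal on its own.
model_uid = client.launch_model(
    model_name="llama-2-chat",      # placeholder model name
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="Q4_K_M",          # placeholder quantization
)
```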

slobentanzer commented on May 25, 2024

Hi @aresnow1, thanks for the quick reply! It does use Metal, but does that automatically mean the GPU is used? I see heavy CPU usage during inference, and not much GPU usage. The running Xinference process also responds with this when I try to set n_gpu higher than 0:

The parameter `n_gpu` must be greater than 0 and not greater than the number of GPUs: 0 on the machine.

So it seems no GPUs are registered. Happy to help troubleshoot or file a PR if it is within my power.
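
A plausible explanation for the zero count (an assumption, not verified against the Xinference source): the GPU count comes from CUDA, which reports no devices on Apple Silicon:

```python
import torch

# Assumption: Xinference counts GPUs via CUDA. On an M-series Mac there
# is no CUDA device, which would explain "number of GPUs: 0" above.
print(torch.cuda.device_count())          # -> 0 on Apple Silicon
print(torch.backends.mps.is_available())  # -> True when Metal (MPS) works
```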

aresnow1 commented on May 25, 2024

The n_gpu parameter is provided for NVIDIA graphics card users and admittedly needs better documentation. Regarding GPU utilization, what have you observed when using llama.cpp directly? In this code snippet at line 99 of the file here, we have set n_gpu_layers to 1 for Apple users.
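
A minimal llama-cpp-python sketch of what that setting does (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_gpu_layers > 0 asks llama.cpp to offload layers to the GPU backend;
# on a Metal build, any positive value offloads to the Apple GPU.
# verbose=True prints backend setup (e.g. ggml_metal_init lines) to
# stderr, which is one way to confirm Metal is actually being used.
llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path
    n_gpu_layers=1,
    verbose=True,
)
```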

slobentanzer commented on May 25, 2024

@aresnow1 thanks for the explanation, that makes sense! Still wondering about the monitored activity, but I will do some A/B testing and get back to you.

Metal support is not well documented in llama.cpp; the version mentioned in the docs I quoted above is quite old, and back then it was only implemented for 4-bit quantised models. It is hard to find out what has happened since in terms of model support. If you have a pointer, that would be great as well.

aresnow1 commented on May 25, 2024

@slobentanzer Oh, that reminds me: only 4-bit quantizations, such as Q4_K_M, can be accelerated with Metal. What kind of quantization are you using?

aresnow1 commented on May 25, 2024

Any PR or feedback is welcome!
