Comments (7)

slobentanzer commented on May 25, 2024

Ah, that clears it up. I am using a range of quantisations for benchmarking purposes. I can offer a PR to document this better, if you'd like. It would be nice to have this information in the docs, and maybe even expose it programmatically. I have not been involved with this project for long, but I intend to invest more time now, and could give feedback on usability on Apple Silicon (I have the largest M3 machine).

aresnow1 commented on May 25, 2024

If you launch a model in gguf format, it will use Metal automatically without specifying n_gpu.
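
For context, a launch through the Python client would look roughly like this (a minimal sketch; the model name, size, and quantization are placeholders, and `RESTfulClient` is taken from the Xinference docs of that period):

```python
from xinference.client import RESTfulClient

# Connect to a locally running Xinference endpoint (default port 9997).
client = RESTfulClient("http://127.0.0.1:9997")

# Launch a gguf-format model. No n_gpu argument is passed; on Apple
# Silicon the llama.cpp backend is expected to enable Metal on its own.
model_uid = client.launch_model(
    model_name="llama-2-chat",      # placeholder model name
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="Q4_K_M",          # placeholder quantization
)
```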

slobentanzer commented on May 25, 2024

Hi @aresnow1, thanks for the quick reply! It does use Metal, but does that automatically mean the GPU is used? I see heavy CPU usage during inference, and not much GPU usage. The running Xinference process also responds with this when I try to set n_gpu higher than 0:

The parameter `n_gpu` must be greater than 0 and not greater than the number of GPUs: 0 on the machine.

So it seems no GPUs are registered. Happy to help troubleshoot or file a PR if it is within my power.
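
A plausible explanation for the zero count (an assumption, not verified against the Xinference source): the GPU count comes from CUDA, which reports no devices on Apple Silicon:

```python
import torch

# Assumption: Xinference counts GPUs via CUDA. On an M-series Mac there
# is no CUDA device, which would explain "number of GPUs: 0" above.
print(torch.cuda.device_count())          # -> 0 on Apple Silicon
print(torch.backends.mps.is_available())  # -> True when Metal (MPS) works
```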

aresnow1 commented on May 25, 2024

The n_gpu parameter is provided for NVIDIA graphics card users and admittedly needs better documentation. Regarding GPU utilization, what have you observed when using llama.cpp directly? In this code snippet at line 99 of the file here, we have set n_gpu_layers to 1 for Apple users.
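
A minimal llama-cpp-python sketch of what that setting does (the model path is a placeholder):

```python
from llama_cpp import Llama

# n_gpu_layers > 0 asks llama.cpp to offload layers to the GPU backend;
# on a Metal build, any positive value offloads to the Apple GPU.
# verbose=True prints backend setup (e.g. ggml_metal_init lines) to
# stderr, which is one way to confirm Metal is actually being used.
llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path
    n_gpu_layers=1,
    verbose=True,
)
```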

slobentanzer commented on May 25, 2024

@aresnow1 thanks for the explanation, that makes sense! Still wondering about the monitored activity, but I will do some A/B testing and get back to you.

Metal support is not well documented in llama.cpp; the version mentioned in the docs I quoted above is quite old, and back then it was only implemented for 4-bit quantised models. It is hard to find out what has happened since in terms of model support. If you have a pointer, that would be great as well.

aresnow1 commented on May 25, 2024

@slobentanzer Oh, that reminds me: only 4-bit quantizations, such as Q4_K_M, can be accelerated with Metal. What kind of quantization are you using?

aresnow1 commented on May 25, 2024

Any PR or feedback is welcome!
