Giter Site home page Giter Site logo

jjohare / comfyui-llava-next Goto Github PK

View Code? Open in Web Editor NEW

This project forked from ceruleandeep/comfyui-llava-captioner

0.0 0.0 0.0 36 KB

A ComfyUI extension for chatting with your images with LLaVA-NEXT. Runs locally, no external services, no filter.

License: GNU General Public License v3.0

Python 100.00%

comfyui-llava-next's Introduction

ComfyUI LLaVA-NEXT Captioner (ClaudeAI hack forward from upstream)

A ComfyUI extension for chatting with your images. Runs on your own system, no external services used, no filter.

Uses the LLaVA multimodal LLM so you can give instructions or ask questions in natural language. It's maybe as smart as GPT3.5, and it can see.

Try asking for:

  • captions or long descriptions
  • whether a person or object is in the image, and how many
  • lists of keywords or tags
  • a description of the opposite of the image

llava_captioner

NSFWness (FAQ #1 apparently)

The model is quite capable of analysing NSFW images and returning NSFW replies.

It is unlikely to return an NSFW response to a SFW image, in my experience. It seems like this is because (1) the model's output is strongly conditioned on the contents of the image so it's hard to activate concepts that aren't pictured and (2) the LLM has had a hefty dose of safety-training.

This is probably for the best in general. But you will not have much success asking NSFW questions about SFW images.

Installation

  1. git clone https://github.com/ceruleandeep/ComfyUI-LLaVA-Captioner into your custom_nodes folder
    • e.g. custom_nodes\ComfyUI-LLaVA-Captioner
  2. Open a console/Command Prompt/Terminal etc
  3. Change to the custom_nodes/ComfyUI-LLaVA-Captioner folder you just created
    • e.g. cd C:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-LLaVA-Captioner or wherever you have it installed
  4. Run python install.py
  5. Download models from ๐Ÿค— into models\llama:

Usage

Add the node via image -> LlavaCaptioner

Supports tagging and outputting multiple batched inputs.

  • model: The multimodal LLM model to use. People are most familiar with LLaVA but there's also Obsidian or BakLLaVA or ShareGPT4
  • mmproj: The multimodal projection that goes with the model
  • prompt: Question to ask the LLM
  • max_tokens Maximum length of response, in tokens. A token is approximately half a word.
  • temperature How much randomness to allow in the result. While a lot of people are using the text-only Llama series models with temperatures up around 0.7 and enjoying the creativity, LLaVA's accuracy seems to benefit greatly from temperatures less than 0.2.

Requirements

This is easy to install but getting it to use the GPU can be a saga.

GPU inference time is 4 secs per image on a RTX 4090 with 4GB of VRAM to spare, and 8 secs per image on a Macbook Pro M1. CPU inference time is 25 secs per image. If your inference times are closer to 25 than to 5, you're probably doing CPU inference.

Unfortunately the multimodal models in the Llama family need about a 4x larger context size than the text-only ones, so the llama.cpp promise of doing fast LLM inference on their CPUs hasn't quite arrived yet. If you have a GPU, put it to work.

See also

comfyui-llava-next's People

Contributors

jjohare avatar ceruleandeep avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.