Giter Site home page Giter Site logo

gokayfem / img2txt-comfyui-nodes Goto Github PK

View Code? Open in Web Editor NEW

This project forked from christian-byrne/img2txt-comfyui-nodes

0.0 0.0 0.0 4.03 MB

Implements some of the most popular img2txt models on HF into ComfyUI nodes. Uses questions/conditional-prompts to get descriptions that are suited for being fed back into a txt2img node.

Shell 6.02% JavaScript 5.51% Python 88.47%

img2txt-comfyui-nodes's Introduction

Auto-generate caption (BLIP Only):

alt text

Using to automate img2img process (BLIP and Llava)

alt text

Requirements/Dependencies

  • Shared with ComfyUI
    • Pillow>=8.3.2
    • torch>=2.2.1
    • torchvision>=0.17.1
  • For Llava model
    • bitsandbytes>=0.43.0
    • accelerate>=0.3.0
  • For MiniCPM model
    • transformers>=4.36.0
    • timm==0.9.10
    • sentencepiece==0.1.99
  • Python 3.10+

Installation

  • cd into ComfyUI/custom_nodes directory
  • git clone this repo
  • cd img2txt-comfyui-nodes
  • pip install -r requirements.txt
  • Models will be automatically downloaded per-use. If you never toggle a model on in the UI, it will never be downloaded.
  • To ask a list of specific questions about the image, use the Llava or MiniPCM models. The questions are separated by line in the multiline text input box.

Support for Chinese

  • The MiniCPM model works with Chinese text input without any additional configuration. The output will also be in Chinese.
    • "MiniCPM-V 2.0 supports strong bilingual multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from VisCPM"

Model Locations/Paths

  • Models are downloaded automatically using the Huggingface cache system and the transformers from_pretrained method so no manual installation of models is necessary.
  • If you really want to manually download the models, please refer to Huggingface's documentation concerning the cache system. Here is the relevant except:
    • Pretrained models are downloaded and locally cached at ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is given by C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:
      • Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
      • Shell environment variable: HF_HOME.
      • Shell environment variable: XDG_CACHE_HOME + /huggingface.

Models Implemented (so far)

  • MiniCPM (Chinese & English)
    • Title: MiniCPM-V-2 - Strong multimodal large language model for efficient end-side deployment
    • Datasets: HuggingFaceM4VQAv2, RLHF-V-Dataset, LLaVA-Instruct-150K
    • Size: ~ 6.8GB
  • Salesforce - blip-image-captioning-base
    • Title: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
    • Size: ~ 2GB
    • Dataset: COCO (The MS COCO dataset is a large-scale object detection, image segmentation, and captioning dataset published by Microsoft)
  • llava - llava-1.5-7b-hf
    • Title: LLava: Large Language Models for Vision and Language Tasks
    • Size: ~ 15GB
    • Dataset: 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP, 158K GPT-generated multimodal instruction-following data, 450K academic-task-oriented VQA data mixture, 40K ShareGPT data.
  • Coming Soon: Microsoft - Git Large coco
    • Title: GIT (short for GenerativeImage2Text) mode
    • Size: ~ 3GB
    • Dataset: COCO
  • More - to do

Prompts

This is the guide for the format of an "ideal" txt2img prompt (using BLIP). Use as the basis for the questions to ask the img2txt models.

  • Subject - you can specify region, write the most about the subject
  • Medium - material used to make artwork. Some examples are illustration, oil painting, 3D rendering, and photography. Medium has a strong effect because one keyword alone can dramatically change the style.
  • Style - artistic style of the image. Examples include impressionist, surrealist, pop art, etc.
  • Artists - Artist names are strong modifiers. They allow you to dial in the exact style using a particular artist as a reference. It is also common to use multiple artist names to blend their styles. Now let’s add Stanley Artgerm Lau, a superhero comic artist, and Alphonse Mucha, a portrait painter in the 19th century.
  • Website - Niche graphic websites such as Artstation and Deviant Art aggregate many images of distinct genres. Using them in a prompt is a sure way to steer the image toward these styles.
  • Resolution - Resolution represents how sharp and detailed the image is. Let’s add keywords highly detailed and sharp focus
  • Enviornment
  • Additional Details and objects - Additional details are sweeteners added to modify an image. We will add sci-fi, stunningly beautiful and dystopian to add some vibe to the image.
  • Composition - camera type, detail, cinematography, blur, depth-of-field
  • Color/Warmth - You can control the overall color of the image by adding color keywords. The colors you specified may appear as a tone or in objects.
  • Lighting - Any photographer would tell you lighting is a key factor in creating successful images. Lighting keywords can have a huge effect on how the image looks. Let’s add cinematic lighting and dark to the prompt.

img2txt-comfyui-nodes's People

Contributors

christian-byrne avatar haohaocreates avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.