
Comments (16)

kcz358 commented on September 4, 2024

The problem is that you are actually still processing the video with incorrect logic even though you are using image_processor to process the images. The video frames are treated as multiple images instead of as a video. You can see that the first frame has the 196 dimension but the rest are not being pooled. I have fixed the logic for reading videos in the OneVision tutorial notebook in PR #152. Here are the results I get:
(screenshot of the resulting frame feature shapes)

All the video frames are being pooled correctly. Hope this helps.
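
For reference, here is a minimal sketch of what the corrected video path amounts to, as I understand it: preprocess all frames with the image processor, stack them into a single tensor, and pass that tensor as one list entry together with modalities=["video"]. The helper and variable names (extract_frames, video_path, input_ids) are the ones from the inference script later in this thread, so treat this as an illustration rather than the exact notebook code.

frames = extract_frames(video_path, num_frames=24)    # list of PIL images
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.half().cuda()             # [num_frames, 3, 384, 384], no anyres tiling

cont = model.generate(
    input_ids,
    images=[video_tensor],                  # one stacked tensor for the whole clip
    image_sizes=[frame.size for frame in frames],
    do_sample=False,
    max_new_tokens=4096,
    modalities=["video"],                   # marks this entry as a video so its frames get pooled
)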


AmazDeng commented on September 4, 2024

@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?


kcz358 commented on September 4, 2024

I think there is a small error in the jupyter notebook. Passing modalities=['video'] should lower the token usage
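
As a rough illustration with the per-frame token counts that come up in this thread (729 tokens per frame before pooling, 196 after; the exact numbers depend on the vision tower and the pooling settings):

num_frames = 24
unpooled = num_frames * 729    # 17,496 vision tokens if the frames are handled as separate images
pooled = num_frames * 196      #  4,704 vision tokens once the video frames are pooled
print(unpooled, pooled)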


Luodian commented on September 4, 2024

@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?

Sorry, we found that we had wrongly added some video-specific logic to our llava_arch.py in commit c121c20.

We have now reverted it; please try the updated code, thanks!


AmazDeng commented on September 4, 2024

@Luodian @ZhangYuanhan-AI @kcz358
Thank you for your response.
I need to point out that the reason for the excessively high GPU memory usage during video inference is that, after the process_images method completes, the image_tensors dimensions are extremely large: for a single image the shape is [16, 3, 384, 384], and for 32 frames it becomes [512, 3, 384, 384]. This happens at the stage where image_tensors = process_images(video_frames, image_processor, model.config) is executed, not during the generate stage, so even passing modalities=["video"] to generate doesn't help.

The reason the first dimension of the process_images output for a single image is 16 is that image_aspect_ratio="anyres_max_9". The "anyres_max_9" setting is meant for single-image inference, not for video inference. I tested with the latest code you modified and the result is the same: GPU memory usage is still very high (about 57 GB for 24 frames), and the generated tensor does not have a shape of 196.
So, does the process_images method also need some modifications?
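
A quick sanity check on those shapes, just arithmetic (the exact anyres_max_9 tiling may differ, but the counts below are implied by the tensors I observed):

crops_per_frame = 512 // 32     # 16 crops per frame, from [512, 3, 384, 384] for 32 frames
print(crops_per_frame)          # 16, matching the [16, 3, 384, 384] single-image case
print(32 * crops_per_frame)     # 512 crops of [3, 384, 384] fed to the vision tower, instead of 32 frames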

Inference code:

import argparse
import torch
import sys
# print(f"before,sys.path============={sys.path}")
sys.path.append("/media/star/8T/PycharmProjects/github/gpt/LLaVA-NeXT")
# print(f"after,sys.path============={sys.path}")
import time

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings

warnings.filterwarnings("ignore")
# Load the OneVision model
pretrained = "/media/star/8T/model/gpt/llava/llava-next/lmms-lab/llava-onevision/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)

model.eval()


# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Load and process video
video_path = "/media/star/8T/tmp/gpt4v/video/zouxiu2_5/clip_135_140.mp4"
num_frames=24


print(f"num_frames={num_frames}")
video_frames = extract_frames(video_path,num_frames=num_frames)
print(f"model.config={model.config}")
image_tensors = process_images(video_frames, image_processor, model.config)


image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]
print(f"image_tensors.shape={[image_tensor.shape for image_tensor in image_tensors]}")
# Prepare conversation input
conv_template = "qwen_1_5"
question = f"{DEFAULT_IMAGE_TOKEN}\nIs the model changing clothes in the video? answer the question using a single word or phrase."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]
print(f"image_sizes={image_sizes[:2]}")
# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])


kcz358 commented on September 4, 2024

The 729 dimension
Hi, I think that is correct, because there is no pooling operation in the encode-images function. The pooling happens after the images are encoded, in get_2dPool, and after that the features should turn into 196 dimensions.

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)
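
For what it's worth, the 729 -> 196 step is just spatial pooling of the per-frame feature grid. Here is a rough sketch, assuming a 27x27 patch grid (729 tokens) from the SigLIP tower and the stride-2 bilinear pooling mode; check get_2dPool in llava_arch.py for the exact configuration:

import math
import torch
import torch.nn.functional as F

feat = torch.randn(1, 729, 1024)                          # [frames, tokens, hidden]; hidden size is arbitrary here
side = int(math.sqrt(feat.shape[1]))                      # 27
grid = feat.view(1, side, side, -1).permute(0, 3, 1, 2)   # [frames, hidden, 27, 27]
pooled = F.interpolate(grid, size=(math.ceil(side / 2),) * 2, mode="bilinear")
pooled = pooled.permute(0, 2, 3, 1).flatten(1, 2)         # [frames, 196, hidden]
print(pooled.shape)                                       # torch.Size([1, 196, 1024])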


For the wrong process_images usage for video

Yes, I agree with you. There is another error in the tutorial: it should not use process_images from mm_utils, which uses far more tokens than expected for video frames.

You should use the image processor to handle the frames instead:

image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()

Thank you for pointing it out; I will check it later and revise the notebook.


AmazDeng commented on September 4, 2024

image_features.append(self.get_2dPool(image_feat))

The code you referenced is located in the encode_multimodals method. However, in the main branch of the llava_arch.py code, encode_multimodals is commented out. @kcz358


kcz358 commented on September 4, 2024

These lines contain the processing logic, and they are not in encode_multimodals:

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)


AmazDeng commented on September 4, 2024

These lines contain the processing logic, and they are not in encode_multimodals:

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)

I did as you said and replaced process_images with image_processor. I printed out the shape after the statement image_features.append(self.get_2dPool(image_feat)), but 196 still did not appear.

I am using the llava-onevision-qwen2-7b-ov version and ran both local and online tests (https://llava-onevision.lmms-lab.com/) on the same video. The results were "yes" and "no", respectively. The prompt was "Is the model changing clothes in the video? Answer the question using a single word or phrase." Clearly, the online result was correct and the local result was wrong.
Therefore, I think there are still some issues with the code.



Luodian commented on September 4, 2024

Thank you Kaichen, it's great to see the problem has been addressed; I also tested on my side and it works.


AmazDeng commented on September 4, 2024

The problem is that you are actually still processing the video with incorrect logic even though you are using image_processor to process the images. The video frames are treated as multiple images instead of as a video. You can see that the first frame has the 196 dimension but the rest are not being pooled. I have fixed the logic for reading videos in the OneVision tutorial notebook in PR #152. Here are the results I get.

All the video frames are being pooled correctly. Hope this helps.

I understand now. In my original approach I only passed in ["video"], so only a single frame was read as a video; the subsequent frames were all processed as images.
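
To spell out the mismatch for anyone else hitting this, here is a hedged illustration; the real video_idx_in_batch construction lives in llava_arch.py.

import torch

num_frames = 24
frame_tensors = [torch.zeros(3, 384, 384) for _ in range(num_frames)]

# Original approach: one list entry per frame, but only one "video" marker,
# so only index 0 ends up in video_idx_in_batch and gets pooled to 196 tokens.
images = frame_tensors
modalities = ["video"]
video_idx_in_batch = [i for i, m in enumerate(modalities) if m == "video"]
print(video_idx_in_batch)               # [0] -> the other 23 frames are treated as single images

# Fixed approach: stack the frames so the whole clip is one batch entry,
# matched by the single "video" marker.
images = [torch.stack(frame_tensors)]   # [24, 3, 384, 384]
modalities = ["video"]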


hulianyuyy commented on September 4, 2024

Many thanks for your question. In the tutorial it works normally, but in the video inference code for the evaluation benchmarks, would it still incur huge memory costs?


kcz358 commented on September 4, 2024

Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval


hulianyuyy commented on September 4, 2024

Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval

Thanks, the lmms_eval evaluation logic is correct. But when I evaluate the 7B model, it still incurs ~70GB of memory, which is very large given that LLaVA-NeXT-Video-7B only occupies ~20GB. Maybe there is still something wrong with the inference code?


Luodian commented on September 4, 2024

I did a quick test, it runs in 20GB.

My script is:

FINAL_RUN_NAME=$1
TASKS=$2

MODEL_BASENAME=$(basename "$FINAL_RUN_NAME")

echo "MODEL_BASENAME: ${MODEL_BASENAME}"
cd /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval

python3 -m accelerate.commands.launch --num_processes 8 --main_process_port 12399 lmms_eval \
    --model llava_onevision \
    --model_args pretrained=${FINAL_RUN_NAME},conv_template=qwen_1_5,model_name=llava_qwen \
    --tasks ${TASKS} \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix ${MODEL_BASENAME} \
    --output_path ./logs

bash /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval/scripts/llava_one_vision/ov_eval.sh lmms-lab/llava-onevision-qwen2-7b-ov videomme;


hulianyuyy commented on September 4, 2024

Thanks for your reply. I will try it.

