
Comments (16)

kcz358 commented on September 4, 2024

The problem is that you are actually still processing the video with incorrect logic even though you are using image_processor to process the images. The video frames are treated as multiple images instead of as a video. You can see that the first frame has the 196 dimension but the rest are not being pooled. I have fixed the logic for reading videos in the OneVision tutorial notebook in PR #152. Here are the results I get:
(screenshot of the resulting frame feature shapes)

All the video frames are being pooled correctly. Hope this helps.
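
For reference, here is a minimal sketch of what the corrected video path amounts to, as I understand it: preprocess all frames with the image processor, stack them into a single tensor, and pass that tensor as one list entry together with modalities=["video"]. The helper and variable names (extract_frames, video_path, input_ids) are the ones from the inference script later in this thread, so treat this as an illustration rather than the exact notebook code.

frames = extract_frames(video_path, num_frames=24)    # list of PIL images
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.half().cuda()             # [num_frames, 3, 384, 384], no anyres tiling

cont = model.generate(
    input_ids,
    images=[video_tensor],                  # one stacked tensor for the whole clip
    image_sizes=[frame.size for frame in frames],
    do_sample=False,
    max_new_tokens=4096,
    modalities=["video"],                   # marks this entry as a video so its frames get pooled
)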


AmazDeng commented on September 4, 2024

@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?


kcz358 commented on September 4, 2024

I think there is a small error in the jupyter notebook. Passing modalities=['video'] should lower the token usage
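
As a rough illustration with the per-frame token counts that come up in this thread (729 tokens per frame before pooling, 196 after; the exact numbers depend on the vision tower and the pooling settings):

num_frames = 24
unpooled = num_frames * 729    # 17,496 vision tokens if the frames are handled as separate images
pooled = num_frames * 196      #  4,704 vision tokens once the video frames are pooled
print(unpooled, pooled)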


Luodian commented on September 4, 2024

@kcz358 @ZhangYuanhan-AI @Luodian Could you please take a look at this issue?

Sorry, we found that we had wrongly added some video-specific logic to our llava_arch.py in commit c121c20.

We have now reverted it; please try the updated code, thanks!


AmazDeng commented on September 4, 2024

@Luodian @ZhangYuanhan-AI @kcz358
Thank you for your response.
I need to point out that the reason for the excessively high GPU memory usage during video inference is that, after the process_images method completes, the image_tensors dimensions are extremely large: for a single image the shape is [16, 3, 384, 384], and for 32 frames it becomes [512, 3, 384, 384]. This happens at the stage where image_tensors = process_images(video_frames, image_processor, model.config) is executed, not during the generate stage, so even passing modalities=["video"] to generate doesn't help.

The reason the first dimension of the process_images output for a single image is 16 is that image_aspect_ratio="anyres_max_9". The "anyres_max_9" setting is meant for single-image inference, not for video inference. I tested with the latest code you modified and the result is the same: GPU memory usage is still very high (about 57 GB for 24 frames), and the generated tensor does not have a shape of 196.
So, does the process_images method also need some modifications?
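
A quick sanity check on those shapes, just arithmetic (the exact anyres_max_9 tiling may differ, but the counts below are implied by the tensors I observed):

crops_per_frame = 512 // 32     # 16 crops per frame, from [512, 3, 384, 384] for 32 frames
print(crops_per_frame)          # 16, matching the [16, 3, 384, 384] single-image case
print(32 * crops_per_frame)     # 512 crops of [3, 384, 384] fed to the vision tower, instead of 32 frames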

Inference code:

import argparse
import torch
import sys
# print(f"before,sys.path============={sys.path}")
sys.path.append("/media/star/8T/PycharmProjects/github/gpt/LLaVA-NeXT")
# print(f"after,sys.path============={sys.path}")
import time

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

import cv2
import numpy as np
from PIL import Image
import requests
import copy
import warnings

warnings.filterwarnings("ignore")
# Load the OneVision model
pretrained = "/media/star/8T/model/gpt/llava/llava-next/lmms-lab/llava-onevision/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)

model.eval()


# Function to extract frames from video
def extract_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Load and process video
video_path = "/media/star/8T/tmp/gpt4v/video/zouxiu2_5/clip_135_140.mp4"
num_frames=24


print(f"num_frames={num_frames}")
video_frames = extract_frames(video_path,num_frames=num_frames)
print(f"model.config={model.config}")
image_tensors = process_images(video_frames, image_processor, model.config)


image_tensors = [_image.to(dtype=torch.float16, device=device) for _image in image_tensors]
print(f"image_tensors.shape={[image_tensor.shape for image_tensor in image_tensors]}")
# Prepare conversation input
conv_template = "qwen_1_5"
question = f"{DEFAULT_IMAGE_TOKEN}\nIs the model changing clothes in the video? answer the question using a single word or phrase."

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [frame.size for frame in video_frames]
print(f"image_sizes={image_sizes[:2]}")
# Generate response
cont = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    modalities=["video"],
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs[0])


kcz358 commented on September 4, 2024

The 729 dimension
Hi, I think that is correct, because there is no pooling operation in the encode-images function. The pooling happens after the images are encoded, in get_2dPool, and after that the features should turn into 196 dimensions.

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)
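
For what it's worth, the 729 -> 196 step is just spatial pooling of the per-frame feature grid. Here is a rough sketch, assuming a 27x27 patch grid (729 tokens) from the SigLIP tower and the stride-2 bilinear pooling mode; check get_2dPool in llava_arch.py for the exact configuration:

import math
import torch
import torch.nn.functional as F

feat = torch.randn(1, 729, 1024)                          # [frames, tokens, hidden]; hidden size is arbitrary here
side = int(math.sqrt(feat.shape[1]))                      # 27
grid = feat.view(1, side, side, -1).permute(0, 3, 1, 2)   # [frames, hidden, 27, 27]
pooled = F.interpolate(grid, size=(math.ceil(side / 2),) * 2, mode="bilinear")
pooled = pooled.permute(0, 2, 3, 1).flatten(1, 2)         # [frames, 196, hidden]
print(pooled.shape)                                       # torch.Size([1, 196, 1024])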


For the wrong process_images usage for video

Yes, I agree with you. There is another error in the tutorial: it should not use process_images from mm_utils, which uses far more tokens than expected for video frames.

You should use the image processor to handle the frames instead:

image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().cuda()

Thank you for pointing it out; I will check it later and revise the notebook.


AmazDeng commented on September 4, 2024

image_features.append(self.get_2dPool(image_feat))

The code you referenced is located in the encode_multimodals method. However, in the main branch of the llava_arch.py code, encode_multimodals is commented out. @kcz358


kcz358 commented on September 4, 2024

These lines contain the processing logic, and they are not in encode_multimodals:

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)


AmazDeng commented on September 4, 2024

These lines contain the processing logic, and they are not in encode_multimodals:

for idx, image_feat in enumerate(encoded_image_features):
    if idx in video_idx_in_batch:
        image_features.append(self.get_2dPool(image_feat))
    else:
        image_features.append(image_feat)

I did as you said and replaced process_images with image_processor. I printed out the shape after the statement image_features.append(self.get_2dPool(image_feat)), but 196 still did not appear.

I am using the llava-onevision-qwen2-7b-ov version and ran both local and online tests (https://llava-onevision.lmms-lab.com/) on the same video. The results were "yes" and "no", respectively. The prompt was "Is the model changing clothes in the video? Answer the question using a single word or phrase." Clearly, the online result was correct and the local result was wrong.
Therefore, I think there are still some issues with the code.



Luodian commented on September 4, 2024

Thank you Kaichen, it's great to see the problem has been addressed; I also tested on my side and it works.


AmazDeng commented on September 4, 2024

The problem is that you are actually still processing the video with incorrect logic even though you are using image_processor to process the images. The video frames are treated as multiple images instead of as a video. You can see that the first frame has the 196 dimension but the rest are not being pooled. I have fixed the logic for reading videos in the OneVision tutorial notebook in PR #152. Here are the results I get.

All the video frames are being pooled correctly. Hope this helps.

I understand now. In my original approach I only passed in ["video"], so only a single frame was read as a video; the subsequent frames were all processed as images.
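
To spell out the mismatch for anyone else hitting this, here is a hedged illustration; the real video_idx_in_batch construction lives in llava_arch.py.

import torch

num_frames = 24
frame_tensors = [torch.zeros(3, 384, 384) for _ in range(num_frames)]

# Original approach: one list entry per frame, but only one "video" marker,
# so only index 0 ends up in video_idx_in_batch and gets pooled to 196 tokens.
images = frame_tensors
modalities = ["video"]
video_idx_in_batch = [i for i, m in enumerate(modalities) if m == "video"]
print(video_idx_in_batch)               # [0] -> the other 23 frames are treated as single images

# Fixed approach: stack the frames so the whole clip is one batch entry,
# matched by the single "video" marker.
images = [torch.stack(frame_tensors)]   # [24, 3, 384, 384]
modalities = ["video"]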


hulianyuyy commented on September 4, 2024

Many thanks for your question. In the tutorial it works normally, but in the video inference code for the evaluation benchmarks, would it still incur huge memory costs?


kcz358 commented on September 4, 2024

Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval


hulianyuyy commented on September 4, 2024

Yes, the lmms_eval evaluation logic is correct. I fixed the tutorial part using the code from lmms_eval

Thanks, the lmms_eval evaluation logic is correct. But when I evaluate the 7B model, it still incurs ~70GB of memory, which is very large given that LLaVA-NeXT-Video-7B only occupies ~20GB. Maybe there is still something wrong with the inference code?


Luodian commented on September 4, 2024

I did a quick test, it runs in 20GB.

My script is:

FINAL_RUN_NAME=$1
TASKS=$2

MODEL_BASENAME=$(basename "$FINAL_RUN_NAME")

echo "MODEL_BASENAME: ${MODEL_BASENAME}"
cd /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval

python3 -m accelerate.commands.launch --num_processes 8 --main_process_port 12399 lmms_eval \
    --model llava_onevision \
    --model_args pretrained=${FINAL_RUN_NAME},conv_template=qwen_1_5,model_name=llava_qwen \
    --tasks ${TASKS} \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix ${MODEL_BASENAME} \
    --output_path ./logs

bash /mnt/bn/vl-research/workspace/boli01/projects/lmms-eval/scripts/llava_one_vision/ov_eval.sh lmms-lab/llava-onevision-qwen2-7b-ov videomme;


hulianyuyy commented on September 4, 2024

Thanks for your reply. I will try it.

