
Comments (6)

ZhangYuanhan-AI avatar ZhangYuanhan-AI commented on September 4, 2024 1

bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-34B-DPO mistral_direct 16 2 True XXX.mp4

works well at my side

from llava-next.

ZhangYuanhan-AI avatar ZhangYuanhan-AI commented on September 4, 2024 1

Hi, since our training data also includes images, many instructions contain phrases like "in the image", so the current model sometimes generates "in the image".

We are currently focusing on solving this!


sykuann avatar sykuann commented on September 4, 2024

I am facing the same issue.


gyfastas avatar gyfastas commented on September 4, 2024

Same issue here. How can it be fixed?

Update: I modified mm_utils.py as follows, and it now works for me:

def call_for_batch(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
    offset = min(output_ids.shape[1] - self.start_len, self.max_keyword_len)
    self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
    for keyword_id in self.keyword_ids:
        # fix: if fewer tokens have been generated than the keyword is long,
        # the trailing slice is shorter than keyword_id and the element-wise
        # comparison would fail, so skip this keyword for now
        if output_ids[0, -keyword_id.shape[0]:].shape[0] != keyword_id.shape[0]:
            continue
        if (output_ids[0, -keyword_id.shape[0]:] == keyword_id).all():
            return True
    outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
    for keyword in self.keywords:
        if keyword in outputs:
            return True
    return False

Note: this might change the behavior of the stopping criteria; in my case the model started to repeat words.
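For anyone puzzled by why the guard is needed: early in generation, fewer tokens may have been produced than a stop keyword is long, so the trailing slice of `output_ids` is shorter than the keyword tensor and an element-wise comparison would raise a shape error. A minimal, self-contained sketch (the token ids are made up, and `ends_with_keyword` is a hypothetical helper, not part of LLaVA-NeXT):

```python
import torch

def ends_with_keyword(output_ids: torch.LongTensor, keyword_ids: list) -> bool:
    """Return True if the generated sequence ends with any keyword's token ids."""
    for keyword_id in keyword_ids:
        tail = output_ids[0, -keyword_id.shape[0]:]
        # guard: with fewer generated tokens than the keyword length,
        # the tail is shorter than keyword_id, so skip the comparison
        if tail.shape[0] != keyword_id.shape[0]:
            continue
        if (tail == keyword_id).all():
            return True
    return False

stop_kw = torch.tensor([2, 7, 9])                 # made-up stop-keyword token ids
short_output = torch.tensor([[7, 9]])             # only 2 tokens generated so far
long_output = torch.tensor([[5, 1, 2, 7, 9]])     # sequence ending in the keyword

print(ends_with_keyword(short_output, [stop_kw]))  # False: guard skips the short tail
print(ends_with_keyword(long_output, [stop_kw]))   # True: suffix matches
```

Without the `continue`, the short-output case would compare a length-2 tail against a length-3 keyword, which is exactly the failure the patch avoids.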


ZhangYuanhan-AI avatar ZhangYuanhan-AI commented on September 4, 2024

Hi, please share with me the command you use.


Marlod390 avatar Marlod390 commented on September 4, 2024

I hardcoded the parameters into my inference code. I changed vicuna_v1 to mistral_direct as you did, and it worked. But compared to the 7B version, the 34B model's answers contain a lot of "in the image" and "in the frame", which may not be what a video VLM should output. Do you have the same problem? If not, there may be something wrong with my code.

