x-plug / mplug-owl

mPLUG-Owl & mPLUG-Owl2: Modularized Multimodal Large Language Model

Home Page: https://www.modelscope.cn/studios/damo/mPLUG-Owl

License: MIT License

Python 99.25% Shell 0.75%
alpaca chatbot chatgpt damo dialogue gpt gpt4 gpt4-api huggingface instruction-tuning large-language-models llama mplug mplug-owl multimodal pretraining pytorch transformer video visual-recognition

mplug-owl's People

Contributors

admk, butyuhao, casiaxuelin, joehoover, l-salewski, lukeforeveryoung, magaer13, wttc-nitr, xuguohai


mplug-owl's Issues

Computing output likelihoods with the model

Hi, is it possible to get the tokenwise log-likelihood scores of different outputs from the model?

The use-case would be something like:
Given an interleaved image/text input and a list of output text candidates, we should be able to get a score for each output candidate and then return their ranked list, rather than generating the outputs directly. This would be close to how LLMs are evaluated on MCQ tasks. An example from the T0 paper Page 6 (https://arxiv.org/pdf/2110.08207.pdf):

For tasks that involve choosing the correct completion from several options (e.g. multiple choice
question answering), we follow Brown et al. (2020) and use rank classification to evaluate our
model: we compute the log-likelihood of each of the target options under the fine-tuned model and
select the option with the highest log-likelihood as the prediction. For simplicity, we do not apply
length normalization to the log-likelihoods of the target options.

Is it straightforward to do this with mPLUG-Owl? I assume that, since the LM is built with transformers, it should be possible to reuse the output-scoring utilities that are already implemented (I haven't dug into this yet)?
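
A minimal sketch of rank classification along these lines, assuming the forward pass accepts the processor outputs and returns logits aligned one-to-one with the text input_ids (the score_candidates helper is hypothetical, and if the <image> placeholder is expanded into multiple visual tokens internally, the indexing below needs adjusting):

import torch
import torch.nn.functional as F

def score_candidates(model, tokenizer, processor, prompt, image, candidates):
    # Rank classification: sum the token-wise log-probabilities of each candidate
    # completion given the (image, prompt) context, then rank the candidates.
    scores = []
    for cand in candidates:
        inputs = processor(text=[prompt + cand], images=[image], return_tensors='pt')
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = model(**inputs).logits                    # (1, seq_len, vocab_size)
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)   # logits at t predict token t+1
        targets = inputs['input_ids'][:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        cand_len = len(tokenizer(cand, add_special_tokens=False).input_ids)
        # Score only the candidate tokens; no length normalization, as in the T0 setup.
        scores.append(token_lp[0, -cand_len:].sum().item())
    return sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)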

Weird result?

(screenshots of the results omitted)

This is my result. Can you help me figure out what is wrong?
Thanks a lot!

How to process video input

I can input video in the Hugging Face demo, but I can't find any video data processing in the code. Are you just sampling 4 frames of the video in the front end and feeding them into the model as images? This is very important to me, please let me know, thanks!
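
For reference, uniform frame sampling usually looks something like the sketch below; decord and the 4-frame count are assumptions on my side, not a confirmation of what the demo front end actually does.

from decord import VideoReader
from PIL import Image
import numpy as np

def sample_frames(video_path, num_frames=4):
    # Pick num_frames indices spread evenly over the clip and return PIL images,
    # which could then be fed to the model like ordinary image inputs.
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()   # (num_frames, H, W, 3) uint8
    return [Image.fromarray(frame) for frame in frames]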

curious about Alpaca and Vicuna datasets used in the training.

Hi to the team, thanks for your hard work; mPLUG-Owl demonstrated great performance!

While reading the paper, I found in "4.1 Experimental Setup / Data and Training Details" the statement: "we gather pure text instruction data from three distinct sources: 102k data from the Alpaca [Taori et al., 2023], 90k from the Vicuna [Vicuna, 2023], and 50k from the Baize [Xu et al., 2023a]."

However, to my knowledge:

  • the Alpaca dataset only has 52k examples
  • Vicuna has not released its dataset yet due to some concerns.

would you mind sharing more information about the datasets? thanks a lot!

How to use the finetuned model

I ran train_it.sh on my own data and got the following files:

optimizer.pt rng_state_0.pth rng_state_2.pth rng_state_4.pth rng_state_6.pth scheduler.pt training_args.bin
pytorch_model.bin rng_state_1.pth rng_state_3.pth rng_state_5.pth rng_state_7.pth trainer_state.json

How do I use this model?
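
In case it helps, loading such a Trainer output directory usually looks like the sketch below. This is a minimal sketch under assumptions: the paths are placeholders, config.json and tokenizer files may have to be copied from the base checkpoint if they were not saved, and if the run used LoRA the saved weights may be adapter weights that need to be applied on top of the base model rather than loaded directly.

from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.tokenization_mplug_owl import MplugOwlTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

base_ckpt = 'MAGAer13/mplug-owl-llama-7b'   # base checkpoint used for finetuning
output_dir = 'output/your-finetune-run'     # hypothetical directory containing pytorch_model.bin

# Load the finetuned weights; fall back to the base checkpoint for the tokenizer/processor.
model = MplugOwlForConditionalGeneration.from_pretrained(output_dir)
image_processor = MplugOwlImageProcessor.from_pretrained(base_ckpt)
tokenizer = MplugOwlTokenizer.from_pretrained(base_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)
# Inference then proceeds as in the Hugging Face-style snippets further down this page.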

dataset problem

Do you use open-source data for the video dataset, or did you organize it yourselves?

Bug when trying your example

After preparing the environment following the instructions and trying:

from interface import get_model
model, tokenizer, img_processor = get_model(checkpoint_path='checkpoint path', tokenizer_path='tokenizer path')

I am facing:

Traceback (most recent call last):
  File "/mPLUG-Owl/try.py", line 3, in <module>
    model, tokenizer, img_processor = get_model(
  File "/mPLUG-Owl/interface.py", line 15, in get_model
    model = mPLUG_OwlForConditionalGeneration(config=config)
  File "/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 973, in __init__
    self.vision_model = CLIPVisionTransformer(config.vision_config)
  File "/mPLUG-Owl/clip/modeling_clip.py", line 972, in __init__
    self.pre_layernorm = MixedFusedLayerNorm(embed_dim, eps=config.layer_norm_eps)
  File "/anaconda3/envs/mplug_owl/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 212, in __init__
    super().__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=True)
  File "/anaconda3/envs/mplug_owl/lib/python3.10/site-packages/apex/normalization/fused_layer_norm.py", line 166, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File "/anaconda3/envs/mplug_owl/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 674, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /anaconda3/envs/mplug_owl/lib/python3.10/site-packages/fused_layer_norm_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c107SymBool10guard_boolEPKcl
How to fix it? Thanks for your help.

One error to fix and one improvement to reduce CUDA memory

1. One error that needs to be fixed
When installing apex, there are 4 errors about converting "unsigned long" to "long"; you need to edit:
(1) line 65 in apex_22.01_pp/csrc/mlp.cpp
auto reserved_space = at::empty({reserved_size}, inputs[0].type());
change to:
auto reserved_space = at::empty({static_cast<long>(reserved_size)}, inputs[0].type());

(2) line 138 in apex_22.01_pp/csrc/mlp.cpp
auto work_space = at::empty({work_size / sizeof(scalar_t)}, inputs[0].type());
change to:
auto work_space = at::empty({static_cast<long>(work_size / sizeof(scalar_t))}, inputs[0].type());

or you can change the compile options instead.

2. One improvement for reducing CUDA memory
When launching owl_demo.py on a GPU with 16 GB of memory, I ran into a CUDA out-of-memory error. I then edited lines 33 and 34 in interface.py:

    model = model.to(device)
    model = model.to(dtype)

change to:

    model = model.to(dtype)
    model = model.to(device)

Then, after the demo starts, memory usage is about 14 GB, so it runs well on a 16 GB GPU. Casting to the lower-precision dtype before moving the model to the GPU avoids ever materializing the full fp32 weights on the device, which is what caused the memory overflow.

Conflicting `torch` and `torchvision` versions

Unfortunately, I ran into another issue with dependency conflicts.

There's an open PR that bumps torch to 1.13.1. However, torchvision==0.13.1 is not compatible with torch==1.13.1.

What version of torchvision would you recommend?

fail to reproduce the example

conda create -n mplug_owl python=3.10
conda activate mplug_owl
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt

I strictly followed the installation instructions above, then launched the web server:

python -m serve.web_server --base-model 'pretrained_weights/mplug-owl-llama-7b' --port 8501 --bf16

But when I try to run a simple example (screenshot omitted), I get the following error:

Traceback (most recent call last):
  File "/data/pcl/proj/mPLUG-Owl/serve/model_utils.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/data/pcl/proj/mPLUG-Owl/serve/model_worker.py", line 110, in generate_with_callback
    self.model.generate(**kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 1524, in generate
    query_outputs = self.abstractor(
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 1092, in forward
    encoder_outputs = self.encoder(
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 936, in forward
    layer_outputs = layer_module(
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 874, in forward
    cross_attention_outputs = self.crossattention(
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 846, in forward
    attention_output = self.output(self_outputs[0], hidden_states)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 791, in forward
    input_tensor = input_tensor + self.mlp(self.norm2(input_tensor))
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 679, in forward
    hidden_states = self.ffn_ln(hidden_states)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/pcl/proj/mPLUG-Owl/mplug_owl/modeling_mplug_owl.py", line 158, in forward
    output = torch.nn.functional.layer_norm(
  File "/data/pcl/miniconda3/envs/ovd2/lib/python3.10/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be of same shape as normalized_shape, but got weight of shape [2816] and normalized_shape = [2048]

Any help would be greatly appreciated!

Apex Dependency

Hi, thank you for sharing your awesome work. I have been trying to run the model locally, but the apex dependency and its dependence on specific CUDA versions make the environment setup tricky on older GPUs. I would really appreciate it if you could remove the apex dependency soon. Thanks!

different results between Huggingface and colab

Hi.
Thanks for this great work.

I've used the Huggingface demo to generate descriptions for some images with the following prompt:

Describe this image as detailed as possible.

I also used the 8bits model in colab. This is the code that I used to generate the descriptions:

import torch
from PIL import Image
import requests
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.tokenization_mplug_owl import MplugOwlTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    load_in_8bit = True,
    torch_dtype =  torch.half,
    device_map= 'auto'
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = MplugOwlTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: Describe this image as detailed as possible.
AI: ''']

image_list = ["/path_to_image"]

generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}
from PIL import Image
import requests

images = [Image.open(_) for _ in image_list]
inputs = processor(text=prompts, images=images, return_tensors='pt')
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)

However, the results from the Hugging Face demo are different from those of the locally run model.
For example, Hugging Face will describe an image as:

The painting depicts a woman with her arms outstretched and wearing a red dress, standing in front of a brightly colored background with a vibrant rainbow-like design. The woman's pose appears confident and dynamic, as if she is ready to embrace the colorful surroundings.
There are several other objects in the scene, including a potted plant located on the left side of the painting, a handbag situated near the bottom right corner, and a cup placed towards the right side. Additionally, there is a bowl on a stand near her right foot and another bow on her left arm, adding to the artwork' s vivid appearance.

But when I run the model on colab, for the same image, I obtain the following description:

The image is a painting featuring a colorful dog with a purple and green background. The dog's body is in the middle of the painting, while its head appears at the left side of the picture, slightly turned to the right. Its fur is a mix of purple, green, and brown, giving it a vibrant appearance. There are a few more dogs present in the background, but their focus is not as prominent as the main subject's. The background consists of various colors, including red, blue, yellow, orange, white, and purple, creating a visually engaging and lively composition. The overall painting has a cheerful and playful mood.

The second description is wrong, as there are no dogs in the image. I noticed that many descriptions generated when running the model on Colab are completely off the mark. Is there something I am doing wrong? Could it be because the model is loaded differently?

I also noticed that even when using the Hugging Face demo, the model hallucinates and includes elements in the description that are not present in the image. For example, in the first description there are no handbags, cups, or bowls in the image, and given the image of a statue it will describe how the statue is surrounded by admiring people when there are no people or crowds in the image whatsoever.

Is there a way to control the hallucinations?
And why are the results so different when I run the model in different environments (Huggingface vs colab)?

I apologize for the long post.
Any help is greatly appreciated.
Thank you!

How to load the 8bits model with Huggingface in colab

Hi. Thanks for providing the code for huggingface.
I am trying to use the following code in Colab, but the session crashes because it runs out of RAM.
I am using Colab Pro with the high-RAM setup (25 GB of RAM) and a T4 GPU, but the session still crashes.

# Load via Huggingface Style
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.tokenization_mplug_owl import MplugOwlTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = MplugOwlTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

In the README it was mentioned that the offline demo can be run on a single 16 GB T4 GPU with 8-bit support.
How can I do this in Colab?

Thank you!
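
For reference, the 8-bit loading path mirrors the snippet quoted in the "different results between Huggingface and colab" issue above; this is a sketch that assumes bitsandbytes and accelerate are installed in the Colab runtime.

import torch
from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.tokenization_mplug_owl import MplugOwlTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b'
# load_in_8bit quantizes the language-model weights, and device_map='auto' lets
# accelerate place weights shard by shard instead of building the full model in CPU RAM first.
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    load_in_8bit=True,
    torch_dtype=torch.half,
    device_map='auto',
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = MplugOwlTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)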

Multi-image correlation modelling inference?

Thank you for the amazing work and releasing such concise code. I had a question about how to do model inference with multi-image inputs. Expanding on your provided inference code, something like this comes to mind:

# We use a human/AI template to organize the context as a multi-turn conversation.
# <image> denotes an image placeholder.
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: <image>
Human: Is the colour of the animal in the first image the same as the second image?
AI: ''']

# The image paths should be placed in image_list and kept in the same order as in the prompts.
# We support URLs, local file paths, and base64 strings. You can customize the pre-processing of images by modifying the mplug_owl.modeling_mplug_owl.ImageProcessor
image_list = ['https://xxx.com/image_1.jpg', 'https://xxx.com/image_2.jpg',]

Could you please confirm if this is the right way to do multi-image inference with the model?
Thanks!
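
For what it's worth, the rest of the call would presumably mirror the single-image Hugging Face-style snippet quoted earlier on this page, just with two images in the order of the <image> placeholders. This is a sketch that assumes model, tokenizer, and processor are loaded as in that snippet and that the processor accepts several images per prompt; the dtype cast of pixel values used in the original snippet may also be needed.

import torch
import requests
from PIL import Image

generate_kwargs = {'do_sample': True, 'top_k': 5, 'max_length': 512}

# The images must be supplied in the same order as the <image> placeholders in the prompt.
images = [Image.open(requests.get(url, stream=True).raw) for url in image_list]
inputs = processor(text=prompts, images=images, return_tensors='pt')
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
print(tokenizer.decode(res.tolist()[0], skip_special_tokens=True))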

How to filter the instruction tuning data?

According to the comment in the config file, the size of each dataset (# [50997 (alpaca), 155562 (llava), 53456 (quora), 101466 (sharegpt)] = 361481 in total) is different from the original datasets.

Is there any code or script to filter the data?

How to do the training on multiple images or image pair data?

Thank you for your contribution!

I tried to make a custom image pair dataset as:

{"image": ["image1.jpg","image2.jpg"], "text": "The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nHuman: \nHuman: \nHuman: Can you compare the different between these two images?\nAI: xxxx", "task_type": "xxx"}

However, the training loss is always NaN.
How can I train on a custom image-pair dataset, or how did you train on your video data?

Thank you so much!
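
A minimal sketch of writing one such image-pair sample as a jsonl line, following the field layout quoted above (the file name and the "task_type" value are placeholders):

import json

sample = {
    "image": ["image1.jpg", "image2.jpg"],
    "text": ("The following is a conversation between a curious human and AI assistant. "
             "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
             "Human: <image>\nHuman: <image>\n"
             "Human: Can you compare the difference between these two images?\nAI: xxxx"),
    "task_type": "xxx",
}

# Append the sample to a jsonl file, one JSON object per line.
with open("image_pair_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")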

Train/Validation splits

Dear authors,
Thank you for the great work and for open-sourcing it.
I noticed that you keep about 1k validation examples each for alpaca, llava, quora, and sharegpt (#39). Since I want to keep the same setting in my experiments, would you mind sharing the splits with me?

Conflicting requirements

There are conflicts between the dependencies specified in the README and env.yaml.

In the README, torchvision is not specified and torch is pinned at 1.13.1. Further, in env.yaml peft is not listed as a dependency.

PyTorch=1.13.1 (1.13.1 is required by the peft)

However, when I tried to run the demo like:

python -m server_mplug.owl_demo --debug --port 6363 --checkpoint_path 'your checkpoint path' --tokenizer_path 'your tokenizer path'

I got an import error for torchvision.

I then referred to env.yaml for dependencies, but there torch is pinned at 1.12.1

  - pytorch=1.12.1=py3.10_cuda11.3_cudnn8.3.2_0

This conflicts with the version specified in the README.

About the details of the visual abstractor

First of all, thanks for your great work.
From the paper, I see learnable queries in the visual abstractor. I think it may be similar to the Perceiver in Flamingo or the Q-Former in BLIP-2, but I can't find an implementation of learnable queries in your code (mPLUG_OwlVisualAbstractorEncoder and mPLUG_OwlVisualAbstractorModel in modeling_mplug_owl.py).
I am curious about the details of the visual abstractor. In other words, is it similar to the Q-Former or the Perceiver? These details are not covered in the paper and I cannot find them in the code.
Thanks again.

Could not download the weights

I appreciate the great work! But when I tried to deploy the model on my local machine, there was no response after I clicked the download button for the weights. Is it possible to host the weights on a different platform?

Inquiry about the performance difference between mPLUG-owl and other models

Hi there,

I've been using mPLUG-owl and noticed a significant difference in inference speed compared to other models such as Otter and multimodal-GPT. It also outperforms Vicuna and LLaMA in terms of speed. I'm curious to know the reason behind this performance gap.

Could you kindly shed some light on the factors contributing to the observed speed advantage of mPLUG-owl over these models? I'm curious to know what factors or optimizations contribute to its improved performance. :)

The locally deployed demo does not match expectations?

Launched via the command line: python -m server_mplug.owl_demo --debug --port 6363 --checkpoint_path 'your checkpoint path' --tokenizer_path 'your tokenizer path'

However, the demo that comes up is missing parts of the UI. From the demo code, there should be a text input box and other components.
(screenshot omitted)

Installation error for Apex

WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option / --install-option. Consider using --config-settings for more flexibility.
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at pypa/pip#11453
Processing c:\users\haose\github\mplug-owl-main\apex
  Running command python setup.py egg_info
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "C:\Users\haose\Github\mPLUG-Owl-main\apex\setup.py", line 130, in <module>
      _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
    File "C:\Users\haose\Github\mPLUG-Owl-main\apex\setup.py", line 17, in get_cuda_bare_metal_version
      raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
  TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

torch.__version__ = 2.0.0+cu117

error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: 'C:\Users\haose\anaconda3\envs\owl\python.exe' -c '<pip's setup.py shim, omitted>' egg_info --egg-base 'C:\Users\haose\AppData\Local\Temp\pip-pip-egg-info-h58h_b5f'
cwd: C:\Users\haose\Github\mPLUG-Owl-main\apex
Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

The training data in the first stage

According to the paper, the training data in the 1st stage is 104 billion tokens. Since captions are short, if we assume each caption has about 20 tokens, that gives 104B / 20 = 5.2B captions, which is an astonishing number. Maybe my calculation is wrong; would you mind clarifying the number of captions you used during the 1st training stage? Thanks in advance.

Could you add Hugging Face usage instructions to the README?

Thank you for your work and for the effort of publishing the model to Hugging Face; it is fantastic.

For reference, this is what BLIP-2 wrote in its README (see the bottom of the page): https://github.com/salesforce/LAVIS/tree/main/projects/blip2. Alternatively, could you attach instructions for calling mPLUG-Owl through Hugging Face? I searched on Hugging Face for a long time and could not find any. I only found https://huggingface.co/MAGAer13/mplug-owl-llama-7b/discussions and

from transformers import AutoModel
model = AutoModel.from_pretrained("MAGAer13/mplug-owl-llama-7b")

but nothing about how to run inference.

Installation failed

I tried to prepare the environment with conda env create -f env.yaml, but it failed. The error message is below:

Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 2.10.1 Requires-Python <3; 2.11.0 Requires-Python <3; 2.11.1 Requires-Python <3; 2.4.0 Requires-Python <3; 2.4.1 Requires-Python <3; 2.4.2 Requires-Python <3; 2.4.3 Requires-Python <3; 2.4.4 Requires-Python <3; 2.5.0 Requires-Python <3; 2.5.1 Requires-Python <3; 2.5.2 Requires-Python <3; 2.6.0 Requires-Python <3; 2.6.1 Requires-Python <3; 2.6.2 Requires-Python <3; 2.7.0 Requires-Python <3; 2.7.2 Requires-Python <3; 2.8.0 Requires-Python <3; 2.8.1 Requires-Python <3; 2.8.2 Requires-Python <3; 2.8.3 Requires-Python <3; 2.8.4 Requires-Python <3; 2.8.5 Requires-Python <3; 2.8.6 Requires-Python <3; 2.8.7 Requires-Python <3; 2.9.2 Requires-Python <3
ERROR: Could not find a version that satisfies the requirement apex==0.1 (from versions: 0.9.8dev.linux-i686, 0.9.8.dev0, 0.9.8a0.dev0, 0.9.9.dev0, 0.9.10.dev0)
ERROR: No matching distribution found for apex==0.1

failed

CondaEnvException: Pip failed

missing related projects

Just a reminder that MiniGPT-4 is missing from your related projects, even though you use some of the same test images as MiniGPT-4 and treat MiniGPT-4 as a comparison baseline.

Is the training script for the 1st stage provided?

Dear authors,

I want to do the 1st-stage training with my own caption data. Have you provided the training script for that in this repo?

I can only find instructions related to the 2nd training stage in README.md.

Thanks for your help!

Ablation on LoRA

Have you tested finetuning the whole LLaMA decoder during the finetuning stage instead of using LoRA? I'm curious what findings or insights you might have there, since I didn't see it included in the paper.
