
MiniCPM-V's Introduction

A GPT-4V Level Multimodal LLM on Your Phone

中文 | English

Join our 💬 WeChat

MiniCPM-Llama3-V 2.5 🤗 🤖 | MiniCPM-V 2.0 🤗 🤖 | Technical Blog

MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take image and text as inputs and provide high-quality text outputs. Since February 2024, we have released 4 versions of the model, aiming to achieve strong performance and efficient deployment. The most notable models in this series currently include:

  • MiniCPM-Llama3-V 2.5: 🔥🔥🔥 The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max and Claude 3 in overall performance. Equipped with enhanced OCR and instruction-following capabilities, the model also supports multimodal conversation in over 30 languages including English, Chinese, French, Spanish, German, etc. With the help of quantization, compilation optimizations, and several efficient inference techniques on CPUs and NPUs, MiniCPM-Llama3-V 2.5 can be efficiently deployed on end-side devices.

  • MiniCPM-V 2.0: The lightest model in the MiniCPM-V series. With 2B parameters, it surpasses larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It can accept image inputs of any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving performance comparable to Gemini Pro in scene-text understanding and matching GPT-4V in low hallucination rates.

News

📌 Pinned

  • [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported by llama.cpp and ollama! Please pull the latest code of our provided forks (llama.cpp, ollama). GGUF models in various sizes are available here. The MiniCPM-Llama3-V 2.5 series is not supported by the official repositories yet; we are working hard to merge the PRs. Please stay tuned!
  • [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics here.
  • [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. Click here to view more details.
  • [2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio’s official account, is available here. Come and try it out!

  • [2024.06.03] You can now run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. For more details, check this link.
  • [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it here!
  • [2024.05.24] We release the MiniCPM-Llama3-V 2.5 GGUF model, which supports llama.cpp inference and provides smooth decoding at 6~8 tokens/s on mobile phones. Try it now!
  • [2024.05.20] We open-source MiniCPM-Llama3-V 2.5. It has improved OCR capability and supports 30+ languages, representing the first end-side MLLM to achieve GPT-4V-level performance! We provide efficient inference and simple fine-tuning. Try it now!
  • [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click here to view more details.
  • [2024.04.18] We created a HuggingFace Space to host the demo of MiniCPM-V 2.0. Try it here!
  • [2024.04.17] MiniCPM-V-2.0 now supports deploying a WebUI demo!
  • [2024.04.15] MiniCPM-V-2.0 now also supports fine-tuning with the SWIFT framework!
  • [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Click here to view the MiniCPM-V 2.0 technical blog.
  • [2024.03.14] MiniCPM-V now supports fine-tuning with the SWIFT framework. Thanks to Jintao for the contribution!
  • [2024.03.01] MiniCPM-V now can be deployed on Mac!
  • [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.

Contents

MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:

  • 🔥 Leading Performance. MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max and greatly outperforms other Llama 3-based MLLMs.

  • 💪 Strong OCR Capabilities. MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a 700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.

  • 🏆 Trustworthy Behavior. Leveraging the latest RLAIF-V method (the newest technique in the RLHF-V [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), the best performance within the open-source community. Data released.

  • 🌏 Multilingual Support. Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from VisCPM, MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to over 30 languages including German, French, Spanish, Italian, Korean etc. All Supported Languages.

  • 🚀 Efficient Deployment. MiniCPM-Llama3-V 2.5 systematically employs model quantization, CPU optimizations, NPU optimizations and compilation optimizations, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a 150x acceleration in end-side MLLM image encoding and a 3x speedup in language decoding.

  • 💫 Easy Usage. MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) GGUF format quantized models in 16 sizes, (3) efficient LoRA fine-tuning with only 2 V100 GPUs, (4) streaming output, (5) quick local WebUI demo setup with Gradio and Streamlit, and (6) interactive demos on HuggingFace Spaces.

Evaluation

Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.
| Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB test (en) | MMB test (cn) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | | | | | | | | | | | | | |
| Gemini Pro | - | 680 | 74.6 | 88.1 | 62.9 | 2148.9 | 73.6 | 74.3 | 48.9 | 45.8 | 79.9 | 60.4 | - |
| GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.5 | 1771.5 | 77.0 | 74.4 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
| Open-source | | | | | | | | | | | | | |
| Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | - | - | 31.7 | - | - | - | - |
| Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 51.6 | 1860.0 | 61.8 | 56.3 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 |
| DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 54.6 | 1765.4 | 73.8 | 71.4 | 38.3 | 36.8 | 77.8 | 54.2 | - |
| Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.2 | 2050.2 | 72.4 | 70.7 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 |
| CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 54.2 | 1736.6 | 65.8 | 55.9 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 |
| TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
| Idefics2 | 8.0B | - | 73.0 | 74.0 | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
| Bunny-LLama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
| LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | - | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
| Phi-3-vision-128k-instruct | 4.2B | 639* | 70.9 | - | - | 1537.5* | - | - | 40.4 | 44.5 | 64.2* | 58.8* | - |
| MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.5 | 1650.2 | 64.1 | 62.6 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 |
| MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 54.5 | 1808.6 | 69.1 | 66.5 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 |
| MiniCPM-Llama3-V 2.5 | 8.5B | 725 | 76.6 | 84.8 | 65.1 | 2024.6 | 77.2 | 74.2 | 45.8 | 54.3 | 86.7 | 63.5 | 89.7 |
* We evaluate the officially released checkpoint by ourselves.

Evaluation results of multilingual LLaVA Bench

Examples

We deploy MiniCPM-Llama3-V 2.5 on end devices. The demo video is an unedited screen recording on a Xiaomi 14 Pro.

MiniCPM-V 2.0

Click to view more details of MiniCPM-V 2.0

MiniCPM-V 2.0 is an efficient model with promising performance for deployment. It is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features:

  • 🔥 State-of-the-art Performance.

    MiniCPM-V 2.0 achieves state-of-the-art performance on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc.) among models under 7B parameters. It even outperforms the strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Notably, MiniCPM-V 2.0 shows strong OCR capability, achieving performance comparable to Gemini Pro in scene-text understanding and state-of-the-art performance on OCRBench among open-source models.

  • 🏆 Trustworthy Behavior.

    LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images. MiniCPM-V 2.0 is the first end-side LMM aligned via multimodal RLHF for trustworthy behavior (using the recent RLHF-V [CVPR'24] series technique). This allows the model to match GPT-4V in preventing hallucinations on Object HalBench.

  • 🌟 High-Resolution Images at Any Aspect Ratio.

    MiniCPM-V 2.0 can accept images of up to 1.8 million pixels (e.g., 1344x1344) at any aspect ratio. This enables better perception of fine-grained visual information such as small objects and optical characters, achieved via a recent technique from LLaVA-UHD.

  • ⚡️ High Efficiency.

    MiniCPM-V 2.0 can be efficiently deployed on most GPU cards and personal computers, and even on end devices such as mobile phones. For visual encoding, we compress the image representations into far fewer tokens via a perceiver resampler (a minimal illustrative sketch of this idea follows this feature list). This allows MiniCPM-V 2.0 to operate with favorable memory cost and speed during inference, even when dealing with high-resolution images.

  • 🙌 Bilingual Support.

    MiniCPM-V 2.0 supports strong bilingual multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from VisCPM [ICLR'24].
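The High Efficiency item above mentions compressing image representations into far fewer tokens via a perceiver resampler. The following is a minimal, illustrative PyTorch sketch of that idea only: a fixed set of learned queries cross-attends to the vision encoder's patch features, so the LLM always receives a small, constant number of visual tokens. Module names, dimensions, and the query count are placeholders, not the repository's actual implementation.

# Illustrative sketch of a perceiver-style resampler (placeholder sizes).
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, num_queries=96, dim=1152, num_heads=8):
        super().__init__()
        # A fixed set of learned query tokens; their count bounds the number
        # of visual tokens handed to the language model.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln_q = nn.LayerNorm(dim)
        self.ln_kv = nn.LayerNorm(dim)

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, dim) from the vision encoder;
        # num_patches can be large for high-resolution inputs.
        b = patch_features.size(0)
        q = self.ln_q(self.queries).unsqueeze(0).expand(b, -1, -1)
        kv = self.ln_kv(patch_features)
        # Cross-attention: the queries attend to all patches and produce a
        # fixed-length sequence regardless of input resolution.
        out, _ = self.attn(q, kv, kv)
        return out  # (batch, num_queries, dim)

features = torch.randn(1, 1600, 1152)        # e.g., 1,600 patch features
print(PerceiverResampler()(features).shape)  # -> torch.Size([1, 96, 1152])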

Examples

We deploy MiniCPM-V 2.0 on end devices. The demo video is an unedited screen recording on a Xiaomi 14 Pro.

Legacy Models

| Model | Introduction and Guidance |
|---|---|
| MiniCPM-V 1.0 | Document |
| OmniLMM-12B | Document |

Chat with Our Demo on Gradio

We provide online and local demos powered by HuggingFace Gradio, the most popular model deployment framework nowadays. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.

Online Demo

Click here to try out the online demos of MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0 on HuggingFace Spaces.

Local WebUI Demo

You can easily build your own local WebUI demo with Gradio using the following commands.

pip install -r requirements.txt
# For NVIDIA GPUs, run:
python web_demo_2.5.py --device cuda

# For Mac with MPS (Apple silicon or AMD GPUs), run:
PYTORCH_ENABLE_MPS_FALLBACK=1 python web_demo_2.5.py --device mps
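
The provided web_demo_2.5.py is the full-featured demo. If you only need a bare-bones custom UI, the sketch below shows one way to wire Gradio to the model's chat interface (the same interface used in the Mac example later in this README); the tuple handling and sampling settings are assumptions, not the demo script's actual logic.

# Minimal Gradio wiring (sketch; not the provided web_demo_2.5.py).
import gradio as gr
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                  trust_remote_code=True,
                                  torch_dtype=torch.float16).to('cuda').eval()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                          trust_remote_code=True)

def answer(image: Image.Image, question: str) -> str:
    msgs = [{'role': 'user', 'content': question}]
    res = model.chat(image=image, msgs=msgs, context=None,
                     tokenizer=tokenizer, sampling=True)
    # Some versions of the chat interface return a tuple (answer, context, ...).
    return res[0] if isinstance(res, tuple) else res

demo = gr.Interface(fn=answer,
                    inputs=[gr.Image(type='pil'), gr.Textbox(label='Question')],
                    outputs=gr.Textbox(label='Answer'))
demo.launch()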

Install

  1. Clone this repository and navigate to the source folder
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
  2. Create a conda environment
conda create -n MiniCPM-V python=3.10 -y
conda activate MiniCPM-V
  3. Install dependencies
pip install -r requirements.txt

Inference

Model Zoo

| Model | Device | Memory | Description | Download |
|---|---|---|---|---|
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | The latest version, achieving state-of-the-art end-side multimodal performance. | 🤗 |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 5 GB | The gguf version, with lower memory usage and faster inference. | 🤗 |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, with lower GPU memory usage. | 🤗 |
| MiniCPM-V 2.0 | GPU | 8 GB | The light version, balancing performance and computation cost. | 🤗 |
| MiniCPM-V 1.0 | GPU | 7 GB | The lightest version, achieving the fastest inference. | 🤗 |
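
For reference, the quantized checkpoints in the table load through the same Hugging Face path as the full model. The snippet below is a sketch for the int4 variant; the repository id openbmb/MiniCPM-Llama3-V-2_5-int4 and the bitsandbytes-based loading behavior are assumptions.

# Sketch: load the int4-quantized checkpoint (repository id assumed).
from transformers import AutoModel, AutoTokenizer

model_id = 'openbmb/MiniCPM-Llama3-V-2_5-int4'   # assumed HF repo id
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()
# If the checkpoint ships bitsandbytes 4-bit weights, device placement is
# handled at load time, so no explicit .to('cuda') call should be needed.
# Expect roughly the 8 GB of GPU memory listed in the Model Zoo table.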

Multi-turn Conversation

Please refer to the following code to run a multi-turn conversation.

from chat import MiniCPMVChat, img2base64
import torch
import json

torch.manual_seed(0)

chat_model = MiniCPMVChat('openbmb/MiniCPM-Llama3-V-2_5')

im_64 = img2base64('./assets/airplane.jpeg')

# First round chat 
msgs = [{"role": "user", "content": "Tell me the model of this aircraft."}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

# Second round chat 
# pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Introduce something about Airbus A380."})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.chat(inputs)
print(answer)

You will get the following output:

"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."

"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."

Inference on Multiple GPUs

You can run MiniCPM-Llama3-V 2.5 on multiple low-VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. Please refer to this tutorial for detailed instructions on how to load the model and run inference using multiple low-VRAM GPUs.
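
The linked tutorial has the authoritative steps; as a rough sketch of the idea, Hugging Face Accelerate can shard the layers across the visible GPUs through a device map. The device_map='auto' choice and the per-GPU memory caps below are illustrative assumptions, not the tutorial's exact configuration.

# Sketch: spread layers over two low-VRAM GPUs (settings are assumptions).
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-Llama3-V-2_5',
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map='auto',                    # let Accelerate place the layers
    max_memory={0: '11GiB', 1: '11GiB'},  # leave headroom on 12 GB cards
)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                          trust_remote_code=True)
model.eval()
# Inference then proceeds exactly as in the single-GPU examples above.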

Inference on Mac

Click to view an example of running MiniCPM-Llama3-V 2.5 on a 💻 Mac with MPS (Apple silicon or AMD GPUs).
# test.py  Need more than 16GB memory.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]

answer, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True
)
print(answer)

Run with command:

PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py

Deployment on Mobile Phone

MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. 🚀 Click MiniCPM-Llama3-V 2.5 / MiniCPM-V 2.0 to install apk.

Inference with llama.cpp

MiniCPM-Llama3-V 2.5 can now run with llama.cpp! See our fork of llama.cpp for more details. This implementation supports smooth inference at 6~8 tokens/s on mobile phones (test environment: Xiaomi 14 Pro + Snapdragon 8 Gen 3).

Inference with vLLM

Click to see how to run inference with MiniCPM-V 2.0 using vLLM (MiniCPM-Llama3-V 2.5 support coming soon). Because our pull request to vLLM is still under review, we maintain a fork of the repository to build and test our vLLM demo. Here are the steps:
  1. Clone our version of vLLM:
git clone https://github.com/OpenBMB/vllm.git
  2. Install vLLM:
cd vllm
pip install -e .
  3. Install timm:
pip install timm==0.9.10
  4. Run our demo:
python examples/minicpmv_example.py

Fine-tuning

Simple Fine-tuning

We support simple fine-tuning with Hugging Face for MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5.

Reference Document
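
The reference document above describes the repository's own fine-tuning scripts. For orientation only, the snippet below sketches the general LoRA recipe with the PEFT library; the rank, target module names, and other hyperparameters are placeholders rather than the official configuration.

# Generic LoRA sketch with PEFT (placeholder hyperparameters).
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5',
                                  trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['q_proj', 'v_proj'],  # placeholder attention projections
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the adapter weights train
# Training then proceeds with a standard Trainer / custom loop over the
# image-text supervision described in the reference document.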

With the SWIFT Framework

We now support fine-tuning the MiniCPM-V series with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete adapter library, including techniques such as NEFTune, LoRA+ and LLaMA-PRO.

Best Practices: MiniCPM-V 1.0, MiniCPM-V 2.0

TODO

  • MiniCPM-V fine-tuning support
  • Code release for real-time interactive assistant

Model License

  • This repository is released under the Apache-2.0 License.

  • The usage of MiniCPM-V model weights must strictly follow MiniCPM Model License.md.

  • The models and weights of MiniCPM are completely free for academic research. After filling out a questionnaire for registration, they are also available for free commercial use.

Statement

As LMMs, MiniCPM-V models (including OmniLMM) generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-V models does not represent the views and positions of the model developers.

We will not be liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, dissemination, or misapplication of the models.

Institutions

This project is developed by the following institutions:

Other Multimodal Projects from Our Team

👏 Welcome to explore other multimodal projects of our team:

VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V

🌟 Star History

Citation

If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!

@article{yu2023rlhf,
  title={RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback},
  author={Yu, Tianyu and Yao, Yuan and Zhang, Haoye and He, Taiwen and Han, Yifeng and Cui, Ganqu and Hu, Jinyi and Liu, Zhiyuan and Zheng, Hai-Tao and Sun, Maosong and others},
  journal={arXiv preprint arXiv:2312.00849},
  year={2023}
}
@article{viscpm,
  title={Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages},
  author={Jinyi Hu and Yuan Yao and Chongyi Wang and Shan Wang and Yinxu Pan and Qianyu Chen and Tianyu Yu and Hanghao Wu and Yue Zhao and Haoye Zhang and Xu Han and Yankai Lin and Jiao Xue and Dahai Li and Zhiyuan Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2308.12038},
  year={2023}
}
@article{xu2024llava-uhd,
  title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
  author={Xu, Ruyi and Yao, Yuan and Guo, Zonghao and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
  journal={arXiv preprint arXiv:2403.11703},
  year={2024}
}
@article{yu2024rlaifv,
  title={RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness}, 
  author={Yu, Tianyu and Zhang, Haoye and Yao, Yuan and Dang, Yunkai and Chen, Da and Lu, Xiaoman and Cui, Ganqu and He, Taiwen and Liu, Zhiyuan and Chua, Tat-Seng and Sun, Maosong},
  journal={arXiv preprint arXiv:2405.17220},
  year={2024}
}


MiniCPM-V's Issues

Memory leak when looping inference

Hi, I got an out-of-memory error / memory leak after some calls. Here is the script:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# `prompt` is assumed to be defined earlier in the user's script.

model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, cache_dir="./cache", torch_dtype=torch.bfloat16)
model = model.to(device='cuda', dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, cache_dir="./cache")

files = ['a.jpg', 'b.jpg', 'c.jpg', ...]
for image_path in files:
        msgs = [{'role': 'user', 'content': prompt}]
        image = Image.open(image_path).convert('RGB')
        res, context, _ = model.chat(
            image=image,
            msgs=msgs,
            context=None,
            tokenizer=tokenizer,
            sampling=False,
            temperature=0.0
        )
        print(res)

fine tune

Hello, is it possible (or will it be possible) to fine-tune with PEFT LoRA and quantization?

Request for a 4bit quantization model for omnilmm 12B

I have tried multi-card inference but it failed; it seems the layers are not dispatched correctly.

        with init_empty_weights():
            model = OmniLMMForCausalLM.from_pretrained(model_name, tune_clip=True, torch_dtype=torch.bfloat16)
        model = load_checkpoint_and_dispatch(model, model_name, dtype=torch.bfloat16, 
                    device_map="balanced",  no_split_module_classes=['Eva','MistralDecoderLayer', 'ModuleList', 'Resampler']
        )

So it would be convenient to provide a quantized model for low-memory cards; it would be nice if it fits under 20 GB on a single card or 40 GB across multiple cards.

Inconsistent accuracy of the 3B model on MME

I downloaded the 3B model and ran the official MME test set using the official default prompt. The score I computed is 1378.67, which differs considerably from the reported 1452. Could the difference be caused by using a different prompt during testing?

finetune code

I noticed that there is no finetune code. Can the VisCPM code be used, or will the team release it later?

Which base model is used?

I noticed that the MiniCPM LLM has multiple versions, e.g., fp32, bf16, dpo, and sft.
Which LLM does the multimodal model use?

Hallucination in mathematical reasoning?

The mathematical reasoning is wrong.
prompt: 1+1/sqrt(2)=?
completion: 1 + 1/sqrt(2) can be turned into a mathematical expression.

First, we can simplify the 2 under the radical to √2.

Then, we can treat 1 as a constant and add it to 1/√2. To turn 1/√2 into a fraction, we can view it as a fraction containing a radical. We can square the 2 under the radical to get 2. Therefore, 1/√2 can be written as 1/(√2 * √2), i.e., 1/(2 * 2).

Now we can write 1 + 1/√2 as:

1 + 1/√2 = 1 + 1/2

Next, we can convert this expression into a mixed number:

1 + 1/√2 = 2/2 + 1/2 = 3/2

So the result of 1 + 1/√2 is 3/2, or 1 1/2.

Training Data Composition for MiniCPM-v2

Thanks for the great work!

I am wondering where we can find the training data composition for MiniCPM-v2. Specifically, is any distillation from other LMMs used?

Is Mac supported?

When deploying on a Mac, I get an error indicating that CUDA is not being used:

(OmniLMM) crz@crzdeMacBook-Air OmniLMM % pip install flash_attn

Collecting flash_attn
Downloading flash_attn-2.5.2.tar.gz (2.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.5/2.5 MB 882.4 kB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [20 lines of output]
fatal: not a git repository (or any of the parent directories): .git
/private/var/folders/fn/bz39wtv175s83wx3cm11lptm0000gn/T/pip-install-k91r8ka8/flash-attn_fccd9cf359024ff8acc3369bdc82e45f/setup.py:78: UserWarning: flash_attn was requested, but nvcc was not found. Are you sure your environment has nvcc available? If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
warnings.warn(
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/private/var/folders/fn/bz39wtv175s83wx3cm11lptm0000gn/T/pip-install-k91r8ka8/flash-attn_fccd9cf359024ff8acc3369bdc82e45f/setup.py", line 133, in
CUDAExtension(
File "/Users/crz/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
library_dirs += library_paths(cuda=True)
File "/Users/crz/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
if (not os.path.exists(_join_cuda_home(lib_dir)) and
File "/Users/crz/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

  torch.__version__  = 2.0.1
  
  
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

MiniCPM-V Quantization

Hi, is it possible to convert OmniLMM-3B (MiniCPM-V) to a quantized version (e.g., GPTQ)?

Question about details on the training procedure and data

Hi, and thanks for this fantastic work!

I am using your models for a scientific project, which requires details about the training procedure and training data. I haven't found a paper or a technical report specifically for this work. Did I miss something?

Are the training procedure and data the same as in https://arxiv.org/pdf/2308.12038.pdf?

Can you please share some details or a technical report?

Thanks :-)

Unable to get all of the text back in OCR scenarios

First of all, thank you very much for your team's excellent contribution. I intend to use the model for OCR: reading the content of an image and returning the complete text.

However, after trying multiple prompts I still cannot get the complete text output. Could you advise on a solution or the correct way to use the model?

The test OCR image is the first page of <MapReduce: Simplified Data Processing on Large Clusters>.


Benchmark results of TextVQA

We got different benchmark results when evaluating the MiniCPM-V 2.0 model provided on Hugging Face. For example, we got only 66 on TextVQA, which differs from the 74.1 reported in the technical report. We directly use the chat API in modeling_minicpmv.py like this:

        output_ids, context, _ = model.chat(
            image=image_tensor,
            msgs=msgs,
            context=None,
            tokenizer=tokenizer,
            sampling=True,
            temperature=0.7
        )

Would you please provide the detailed hyperparameters for the evaluation?

ModuleNotFoundError: No module named 'transformers_modules.MiniCPM-V-2'

(OmniLMM) [zhuguibo@GPU1 OmniLMM]$ python example.py
Traceback (most recent call last):
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/example.py", line 52, in
option1()
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/example.py", line 5, in option1
chat_model = OmniLMMChat('/mnt/data/dataProcess/LLM_Models/MiniCPM-V-2.0')
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/chat.py", line 169, in init
self.model = OmniLMM3B(model_path)
File "/mnt/data/dataProcess/LLM_Models/LLM_code/minicpm/OmniLMM/chat.py", line 141, in init
self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(dtype=torch.bfloat16)
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 526, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1091, in from_pretrained
config_class = get_class_from_dynamic_module(
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 500, in get_class_from_dynamic_module
return get_class_in_module(class_name, final_module.replace(".py", ""))
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 200, in get_class_in_module
module = importlib.import_module(module_path)
File "/mnt/data/app/anaconda3/envs/OmniLMM/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 992, in _find_and_load_unlocked
File "", line 241, in _call_with_frames_removed
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 992, in _find_and_load_unlocked
File "", line 241, in _call_with_frames_removed
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers_modules.MiniCPM-V-2'

Error when using MiniCPM-V on a Redmi K70 (16 GB)

MLCChat failed

Stack trace:
org.apache.tvm.Base$TVMError: TVMError: OpenCL build error for device=0x78b138eaa8
Error: CL_OUT_OF_HOST_MEMORY
Pass
Stack trace:
File "/Users/acha/Desktop/mlc-MiniCPM/3rdparty/tvm/src/runtime/library_module.cc", line 78

at org.apache.tvm.Base.checkCall(Base.java:173)
at org.apache.tvm.Function.invoke(Function.java:130)
at com.modelbest.mlcllm.ChatModule.image(ChatModule.java:64)
at com.modelbest.minicpm.AppViewModel$ChatState$requestImage$f$1$1.invoke(AppViewModel.kt:637)
at com.modelbest.minicpm.AppViewModel$ChatState$requestImage$f$1$1.invoke(AppViewModel.kt:637)
at com.modelbest.minicpm.AppViewModel$ChatState.callBackend(AppViewModel.kt:516)
at com.modelbest.minicpm.AppViewModel$ChatState.requestImage$lambda$4(AppViewModel.kt:637)
at com.modelbest.minicpm.AppViewModel$ChatState.$r8$lambda$-81FfOHgL43Kw6mcY6rLiYQKEuE(Unknown Source:0)
at com.modelbest.minicpm.AppViewModel$ChatState$$ExternalSyntheticLambda2.run(Unknown Source:12)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:487)
at java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
at java.lang.Thread.run(Thread.java:1012)

Error message:
TVMError: OpenCL build error for device=0x78b138eaa8
Error: CL_OUT_OF_HOST_MEMORY
Pass
Stack trace:
File "/Users/acha/Desktop/mlc-MiniCPM/3rdparty/tvm/src/runtime/library_module.cc", line 78

Object detection capability

Does the model support detection, i.e., returning bounding boxes? Many multimodal models already support this with strong performance, but in my tests MiniCPM-V does not seem to support it.
MiniCPM-V's perception of spatial relationships is quite strong, though, so why was this capability not trained?
If detection pretraining were added, and fine-tuning included bbox inputs paired with text, would that make it easier for the model to acquire new knowledge?

Installed CUDA, but there is a CUDA version mismatch; the code couldn't run without CUDA

I installed conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c
pytorch -c nvidia
but an error message showed up: RuntimeError: cutlassF: no kernel found to launch!. It indicates that my CUDA version is not correct. I downloaded the model from ModelScope.

I installed the dependencies according to requirements.txt, but when running the code the system prompted me to install CUDA. After installing CUDA it still reports an error saying my CUDA version does not match yours. I downloaded the model myself and was running it when this happened.

Evaluation method for Object HalBench

How is the Object HalBench score on the leaderboard computed? Is the computation different from the one used in RLHF-V?

Error about missing methods

Hello. I'm getting the following error when I try to run the demo of MiniCPM-V on Ubuntu 22.04:
(OmniLMM) markusc@ubuntu2:~/github/OmniLMM$ /home/markusc/anaconda3/envs/OmniLMM/bin/python /home/markusc/github/OmniLMM/infer.py
Loading checkpoint shards: 100%|████████████████████████████████████████████| 2/2 [00:01<00:00, 1.94it/s]
Traceback (most recent call last):
  File "/home/markusc/github/OmniLMM/infer.py", line 10, in <module>
    answer = chat_model.process(inputs)
AttributeError: 'OmniLMMChat' object has no attribute 'process'

Apart from importing the json module and changing the model path to openbmb/MiniCPM-V, I did not make any modifications to the original code, as follows:

from chat import OmniLMMChat, img2base64
import json
chat_model = OmniLMMChat('openbmb/MiniCPM-V')

im_64 = img2base64('./assets/worldmap_ck.jpg')

msgs = [{"role": "user", "content": "What is interesting about this image?"}]

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

msgs.append({"role": "assistant", "content": answer})
msgs.append({"role": "user", "content": "Where is China in the image"})

inputs = {"image": im_64, "question": json.dumps(msgs)}
answer = chat_model.process(inputs)
print(answer)

Thanks in advance.

Chat without image input

Apologies for asking this in the issues tab.

Is this model capable of chatting without image input? I see in the benchmarks you have some models that can take both image+text and just text. I'm looking to use this model for a project, but I'm unsure if it'll work for my needs.

Training data and training method?

  1. Could the training data be open-sourced so the results can be reproduced and benchmark gaming ruled out?
  2. Could you briefly describe the training method, e.g., how the projector is trained and whether the ViT is frozen?
